jxmorris12 / vec2text

utilities for decoding deep representations (like sentence embeddings) back to text

Sample to Invert embeddings not working (anymore) #36

Closed: zimmermannro closed this issue 6 months ago

zimmermannro commented 6 months ago

I tried to test the "Invert embeddings with invert_embeddings" sample, but it doesn't work in Colab anymore - it seems OpenAI has changed something about the CreateEmbeddingResponse?
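For reference, I believe the breaking change is that the OpenAI Python SDK (v1.x) replaced the module-level embedding call with a client-based API. A minimal sketch of the difference (the model name and input here are just examples):

```python
from openai import OpenAI

# Old style (openai<1.0), as in the original sample:
# response = openai.Embedding.create(input=["hello"], model="text-embedding-ada-002")
# vector = response["data"][0]["embedding"]

# New style (openai>=1.0): the call returns a CreateEmbeddingResponse object.
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(input=["hello"], model="text-embedding-ada-002")
vector = response.data[0].embedding  # list[float]
```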

I then tried a sample of my own with the newer OpenAI code. My goal is to calculate the difference between two embeddings and then try to get some text from this "Delta-Embedding". The code looks like this, but an assertion error is thrown in inversion.py:

+++++ CODE:

```python
import torch
import vec2text
from openai import OpenAI

client = OpenAI()

def get_embedding(text, model="text-embedding-ada-002") -> torch.Tensor:
    text = text.replace("\n", " ")
    return torch.tensor(client.embeddings.create(input=[text], model=model).data[0].embedding)

test1 = "The king is dead"
test2 = "The queen is dead"

outputEmbedding1 = get_embedding(test1, model="text-embedding-ada-002")
outputEmbedding2 = get_embedding(test2, model="text-embedding-ada-002")

difference = torch.sub(outputEmbedding1, outputEmbedding2)

print(test1, test2)
print(outputEmbedding1)
print(outputEmbedding2)
print(difference)

corrector = vec2text.load_pretrained_corrector("text-embedding-ada-002")

vec2text.invert_embeddings(
    embeddings=difference,
    corrector=corrector)
```

+++++ OUTPUT (done in Colab):

```
The king is dead The queen is dead
tensor([-0.0026, -0.0026, -0.0074,  ..., -0.0007,  0.0128, -0.0038])
tensor([-0.0150,  0.0011, -0.0189,  ...,  0.0103,  0.0027, -0.0165])
tensor([ 0.0124, -0.0038,  0.0114,  ..., -0.0110,  0.0101,  0.0127])

/usr/local/lib/python3.10/dist-packages/transformers/models/t5/tokenization_t5_fast.py:160: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5. For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with truncation is True.

AssertionError                            Traceback (most recent call last)
in <cell line: 25>()
     23 corrector = vec2text.load_pretrained_corrector("text-embedding-ada-002")
     24 
---> 25 vec2text.invert_embeddings(
     26     embeddings=difference,
     27     corrector=corrector)

3 frames

/usr/local/lib/python3.10/dist-packages/vec2text/models/inversion.py in embed_and_project(self, embedder_input_ids, embedder_attention_mask, frozen_embeddings)
    225         if frozen_embeddings is not None:
    226             embeddings = frozen_embeddings
--> 227             assert len(embeddings.shape) == 2  # batch by d
    228         elif self.embedder_no_grad:
    229             with torch.no_grad():

AssertionError:
```

jxmorris12 commented 6 months ago

I'm thinking your issue is that `difference` is a one-dimensional vector, but it needs a batch dimension. Can you try making this change to your code:

```python
vec2text.invert_embeddings(
    embeddings=difference[None],
    corrector=corrector)
```

That should add a batch dimension of 1 to the difference embedding, so `invert_embeddings` won't throw an error.
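For the why: the assert in inversion.py expects a tensor of shape (batch, d). A quick sketch of what the indexing does (assuming ada-002's 1536-dimensional embeddings):

```python
import torch

# Stand-in for the difference vector from the snippet above.
difference = torch.randn(1536)
print(difference.shape)   # torch.Size([1536]) - 1-D, fails the 2-D assert

# Indexing with None prepends an axis, same as unsqueeze(0).
batched = difference[None]
print(batched.shape)      # torch.Size([1, 1536]) - "batch by d", as asserted
assert batched.shape == difference.unsqueeze(0).shape
```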

zimmermannro commented 6 months ago

Great, that did the trick - thanks for the quick response!