jxmorris12 / vec2text

utilities for decoding deep representations (like sentence embeddings) back to text

Return intermediate hypotheses and hypothesis embeddings during generation #44

Closed lbertge closed 4 months ago

lbertge commented 4 months ago


This PR exposes a new API method that returns both the intermediate hypotheses and their corresponding embeddings during multi-step generation. If there are multiple beams, I only return the best one.

Open to suggestions/reviews. There is a fair bit of duplicate code, but I didn't think it was worth doing a refactor.

Demo

import torch
import vec2text
from vec2text.utils import get_embeddings_openai_vanilla

def compute_cosine_similarity(embeddings1, embeddings2):
    return torch.nn.functional.cosine_similarity(embeddings1, embeddings2, dim=1)

# Load the pretrained corrector for OpenAI's text-embedding-ada-002 embeddings.
corrector = vec2text.load_pretrained_corrector("text-embedding-ada-002")

text = "Hello my name is Albert"

# Embed the target text with the same model the corrector was trained against.
embed_text = get_embeddings_openai_vanilla(text, model="text-embedding-ada-002")
embed_text = torch.Tensor(embed_text).cuda()

# Invert the embedding, keeping the best-beam hypothesis string and its embedding
# from every correction step.
output_strings, hypothesis_embeddings = vec2text.invert_embeddings_and_return_hypotheses(
    embed_text, corrector, num_steps=10, sequence_beam_width=4
)

print("Original text: " + text)

# Print the best-beam hypothesis at each step and its similarity to the target embedding.
for i, hypothesis_embedding in enumerate(hypothesis_embeddings):
    print(f"Hypothesis string at step {i}: " + output_strings[i][0])
    similarity = compute_cosine_similarity(embed_text, hypothesis_embedding)
    print(f"Cosine similarity to original: {similarity.item()}")
Output:

Original text: Hello my name is Albert
Hypothesis string at step 0: Hello my name is Albert Alberto I am a Belgian born scientist and my name is Albert
Cosine similarity to original: 0.9451707005500793
Hypothesis string at step 1: Hello my name is Albert. I am Albert Hi Albert
Cosine similarity to original: 0.9766112565994263
Hypothesis string at step 2: Hello my name is Albert
Cosine similarity to original: 0.9999992251396179
Hypothesis string at step 3: Hello my name is Albert
Cosine similarity to original: 1.000000238418579
Hypothesis string at step 4: Hello my name is Albert
Cosine similarity to original: 0.9999982118606567
Hypothesis string at step 5: Hello my name is Albert
Cosine similarity to original: 1.000000238418579
Hypothesis string at step 6: Hello my name is Albert
Cosine similarity to original: 1.000000238418579
Hypothesis string at step 7: Hello my name is Albert
Cosine similarity to original: 1.000000238418579
Hypothesis string at step 8: Hello my name is Albert
Cosine similarity to original: 0.9999986290931702
Hypothesis string at step 9: Hello my name is Albert
Cosine similarity to original: 1.000000238418579
Hypothesis string at step 10: Hello my name is Albert
Cosine similarity to original: 1.000000238418579
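
Since the per-step embeddings come back alongside the strings, a natural follow-up is to scan the similarity trajectory and stop at the first step that gets close enough to the target. Here is a minimal sketch building on the demo variables above (first_step_above_threshold and the 0.999 cutoff are purely illustrative, not part of this PR's API):

import torch

def first_step_above_threshold(target_embedding, hypothesis_embeddings, threshold=0.999):
    # Stack the per-step (batch of one) embeddings into a (num_steps + 1, dim) tensor.
    stacked = torch.cat([e.float() for e in hypothesis_embeddings], dim=0)
    similarities = torch.nn.functional.cosine_similarity(
        target_embedding.float().expand_as(stacked), stacked, dim=1
    )
    # Index of the first step whose similarity reaches the threshold, or None.
    above = (similarities >= threshold).nonzero(as_tuple=True)[0]
    return int(above[0]) if above.numel() > 0 else None

# For the run above this should return 2, the first step whose hypothesis
# already reads "Hello my name is Albert".
print(first_step_above_threshold(embed_text, hypothesis_embeddings))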