rinongal / textual_inversion

MIT License
2.87k stars 278 forks

Possible to reverse the results? #122

Open kanonade opened 1 year ago

kanonade commented 1 year ago

If I understand embeddings correctly, they are locations in CLIP's latent word space that give results most similar to a given set of images.

Is it possible then to take an embedding of vector length 1 or more and pass it back through the clip_tokenizer to figure out what words the embedding translates to? Or, if it is not a real word, use a nearest-neighbor approach to find the closest word?

This is not my area, and I got about as far as loading an embedding with PyTorch and seeing that it is a collection of tensors rather than the list of token ids that the transformers CLIPTokenizer takes in its decode method. Looking for any insight you might have @rinongal. Thank you!
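For reference, a minimal sketch of inspecting such a checkpoint. The `{"string_to_param": {placeholder: tensor}}` layout is an assumption based on this repo's embedding saving code; the dict literal below is a stand-in for `torch.load("embeddings.pt")`, so print your own file's keys first:

```python
import torch

# Stand-in for ckpt = torch.load("embeddings.pt") -- the key layout
# ("string_to_param" mapping placeholder strings to tensors) is an
# assumption about this repo's checkpoint format, not a general rule.
ckpt = {"string_to_param": {"*": torch.randn(2, 768)}}

# Each entry is an (n_vectors, embedding_dim) tensor of learned
# embeddings, not a list of token ids, so tokenizer.decode() does
# not apply to it directly.
for name, tensor in ckpt["string_to_param"].items():
    print(name, tuple(tensor.shape))
```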

rinongal commented 1 year ago

Sorry for the late response. The learned embeddings do not translate to a specific word. You could try to find the nearest neighbours in the embedding space, but the learned embeddings typically reside far from real words and the neighbours may make little sense.
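A nearest-neighbour lookup of the kind mentioned here can be sketched with plain NumPy. The vocabulary and embedding matrix below are toy stand-ins; in a real run you would use CLIP's token embedding table (e.g. `text_encoder.get_input_embeddings().weight`) and the tokenizer's vocabulary, and, as noted above, the nearest real tokens may still be uninformative:

```python
import numpy as np

def nearest_tokens(embedding, vocab_embeddings, vocab, k=5):
    """Return the k vocabulary tokens whose embeddings are closest
    (by cosine similarity) to a learned embedding vector."""
    # Normalize rows so dot products equal cosine similarities
    v = vocab_embeddings / np.linalg.norm(vocab_embeddings, axis=1, keepdims=True)
    e = embedding / np.linalg.norm(embedding)
    sims = v @ e
    top = np.argsort(-sims)[:k]
    return [(vocab[i], float(sims[i])) for i in top]

# Toy stand-in for CLIP's ~49k-token embedding table
rng = np.random.default_rng(0)
vocab = [f"tok{i}" for i in range(1000)]
vocab_embeddings = rng.normal(size=(1000, 768))

# A "learned" embedding that happens to sit near tok42
learned = vocab_embeddings[42] + 0.01 * rng.normal(size=768)
print(nearest_tokens(learned, vocab_embeddings, vocab, k=3))
```

With a genuinely learned textual-inversion embedding, the top matches are often low-similarity and semantically unrelated, which is the caveat in the comment above.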

Have a look at https://twitter.com/dribnet/status/1554804574132719619 for a possible alternative approach.