Open kanonade opened 1 year ago
Sorry for the late response. The learned embeddings do not translate to a specific word. You could try to find the nearest neighbours in the embedding space, but the learned embeddings typically reside far from real words and the neighbours may make little sense.
Have a look at https://twitter.com/dribnet/status/1554804574132719619 for a possible alternative approach.
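For what it's worth, the nearest-neighbour lookup mentioned above can be sketched in a few lines of PyTorch. This is a hedged sketch, not the repo's method: `vocab_embeds` here is a random stand-in for the real CLIP token-embedding matrix (which you would get from something like `text_encoder.get_input_embeddings().weight`), and `nearest_tokens` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F

# Stand-in for the CLIP token-embedding matrix. In practice you would use
# the real weights, e.g. text_encoder.get_input_embeddings().weight,
# which for CLIP ViT-L/14 has shape (49408, 768).
torch.manual_seed(0)
vocab_embeds = torch.randn(1000, 768)  # (vocab_size, embed_dim) stand-in
learned = torch.randn(768)             # one learned pseudo-word embedding

def nearest_tokens(embedding, vocab, k=5):
    # Cosine similarity between the learned vector and every vocab row,
    # then take the k most similar token ids.
    sims = F.cosine_similarity(embedding.unsqueeze(0), vocab, dim=-1)
    topk = sims.topk(k)
    return topk.indices.tolist(), topk.values.tolist()

ids, scores = nearest_tokens(learned, vocab_embeds)
# With a real tokenizer, map ids back to strings via
# tokenizer.convert_ids_to_tokens(ids).
```

As noted above, though, learned embeddings usually sit far from any real token embedding, so even the top matches tend to have low similarity and may not be meaningful.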
If I understand embeddings correctly, they are locations in CLIP's latent word space that give results closest to a given set of images.
Is it possible, then, to take an embedding of one or more vectors and pass it back through the clip_tokenizer to figure out what words the embedding translates to? Or, if it isn't a real word, use a nearest-neighbour approach to find the closest word?
This is not my area, and I got about as far as loading an embedding with PyTorch and seeing that it is a collection of tensors rather than the list of token ids that the transformers `CLIPTokenizer` takes in its `decode` method. Looking for any insight you might have @rinongal. Thank you!