neuml / txtai

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
https://neuml.github.io/txtai
Apache License 2.0

Is it possible to extract the actual word embedding vectors when loading embeddings indexes from the Hugging Face Hub? #455

Closed namp closed 1 year ago

namp commented 1 year ago

I was following this tutorial

https://neuml.hashnode.dev/embeddings-in-the-cloud

and I'm not quite sure how to extract the actual word embedding vectors after this code:

from txtai.embeddings import Embeddings

embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-intro")

I need the actual word vectors for another task that I'd like to run in parallel.

Is it even possible?

Thanks

davidmezzetti commented 1 year ago

The embeddings index loaded from the Hugging Face Hub is the same format as other txtai indexes. The vectors are stored in Faiss. While it's possible to reconstruct the embeddings from Faiss, most of the time it's easier to call transform/batchtransform.

For example, the following snippet returns a vector per the embeddings settings.

embeddings.transform((None, "text", None))

or with a batch of text

embeddings.batchtransform([(None, "text1", None), (None, "text2", None)])
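The returned vectors are plain numpy arrays, so they drop straight into a downstream task. A small sketch, using hypothetical stand-in values in place of transform output (assumption: the vectors are L2-normalized, as txtai does when configured for cosine/dot-product similarity):

```python
import numpy as np

# Stand-ins for embeddings.transform(...) results (hypothetical values)
v1 = np.array([0.6, 0.8], dtype="float32")
v2 = np.array([0.8, 0.6], dtype="float32")

def cosine(u, v):
    # Cosine similarity; for normalized vectors this equals the dot product
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(v1, v2))  # 0.96
```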

davidmezzetti commented 1 year ago

Closing due to inactivity. Re-open or open a new issue if there are further questions.