hezarai / hezar

The all-in-one AI library for Persian, supporting a wide variety of tasks and modalities!
https://hezarai.github.io/hezar/
Apache License 2.0
834 stars 44 forks

Add text embedding models #127

Open arxyzan opened 9 months ago

arxyzan commented 9 months ago

Most text embedding models nowadays are based on sentence transformers. For this task, we must first figure out the general structure of such models and how to run inference on them. This family of models is located at hezar/models/text_emebedding, and the main backend for the models is sentence-transformers.

iamomiid commented 8 months ago

For the time being, what's the best way to retrieve an embedding for a sentence? (I was thinking of tokenising the whole sentence, retrieving the embedding of each token, and then taking the mean over the embeddings of all tokens, but I'm not sure if that's the best approach or if it even makes sense.)
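To make the idea concrete, here is a minimal sketch of that mean-over-tokens approach in plain Python. It assumes the token embeddings have already been produced by some model and are passed in as lists of floats; the `mean_pool` helper and the toy vectors are hypothetical and are not part of hezar or sentence-transformers. The attention mask is an added assumption so that padding positions do not skew the average.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, skipping padding positions."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    # Keep only the vectors of real (non-padding) tokens.
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("at least one non-padding token is required")
    dim = len(kept[0])
    # Component-wise mean over the kept token vectors.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input (hypothetical): three real tokens plus one padding position, 2-dimensional vectors.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```

With a real model, the vectors would come from the model's last hidden layer and the mask from the tokenizer; whether this works well for a model that was not trained with mean pooling is a separate question.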

arxyzan commented 7 months ago

Hi @iamomiid, I'm so sorry for the late response; I've been quite busy recently. I don't know whether your suggested method would work well compared to sentence embeddings. Right now, the best option is to use the sentence-transformers package with one of the multilingual models that support Persian.

MortezaMahdaviMortazavi commented 3 months ago

@iamomiid Hi Omid. In the sentence-transformers library, most models take the mean of the token embeddings to produce the sentence embedding. The start token or the end token can also be used (because information from all tokens is encoded into these two tokens as well), but tests show that the mean is the best choice. Some models in sentence-transformers also use the CLS token (the start token) to represent the sentence embedding.
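The three options mentioned here (start token, end token, mean) differ only in how the per-token vectors are reduced to one vector. The sketch below illustrates that difference on toy data; the `pool` function and the strategy names "cls", "last" and "mean" are hypothetical labels chosen for this example, not the sentence-transformers API, and the input is assumed to contain no padding.

```python
from typing import List


def pool(token_vectors: List[List[float]], strategy: str = "mean") -> List[float]:
    """Reduce a sequence of token vectors to a single sentence vector."""
    if not token_vectors:
        raise ValueError("at least one token vector is required")
    if strategy == "cls":
        # Use the vector of the first (start / CLS) token.
        return list(token_vectors[0])
    if strategy == "last":
        # Use the vector of the final (end) token.
        return list(token_vectors[-1])
    if strategy == "mean":
        # Component-wise average over all token vectors.
        dim = len(token_vectors[0])
        count = len(token_vectors)
        return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]
    raise ValueError(f"unknown pooling strategy: {strategy}")


# Toy token vectors (hypothetical): start token, one word token, end token.
vectors = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0]]
cls_vector = pool(vectors, "cls")
last_vector = pool(vectors, "last")
mean_vector = pool(vectors, "mean")
print(cls_vector, last_vector, mean_vector)
```

Which strategy gives the best embeddings depends on how the underlying model was trained, so a pretrained model should be used with the pooling it was trained with.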