Open arxyzan opened 11 months ago
For the time being, what's the best way to retrieve an embedding for a sentence? (I was thinking of tokenizing the whole sentence, retrieving the embedding of each token, and then taking the mean over all token embeddings, but I'm not sure whether that's the best approach or even makes sense.)
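The idea described in the question can be sketched with dummy token embeddings (NumPy, with hypothetical shapes; in practice the token vectors would come from a transformer's last hidden state):

```python
import numpy as np

# Hypothetical token embeddings for one sentence: 5 tokens, 8 dims each.
# In practice these would be a model's last hidden states.
token_embeddings = np.random.rand(5, 8)

# Attention mask: 1 for real tokens, 0 for padding (here, one padded slot).
attention_mask = np.array([1, 1, 1, 1, 0])

# Zero out padding positions, then average only over the real tokens.
masked = token_embeddings * attention_mask[:, None]
sentence_embedding = masked.sum(axis=0) / attention_mask.sum()

print(sentence_embedding.shape)  # (8,)
```

Note the attention mask: naively averaging over padded positions would drag the sentence vector toward the padding embeddings.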
Hi @iamomiid, I'm so sorry for the late response. I've been quite busy recently.
I don't know whether your suggested method would work as well as dedicated sentence embeddings.
Right now, the best option is to use the sentence-transformers package with one of the multilingual models that support Persian.
@iamomiid Hi Omid. In the sentence-transformers library, most models take the mean over the token embeddings to produce the sentence embedding. The start ([CLS]) and end tokens can also be used, since all other tokens attend to them, but tests show that mean pooling is the best choice. That said, some sentence-transformers models do use the [CLS] (start) token as the sentence embedding.
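The two pooling strategies contrasted above can be illustrated on a dummy hidden-state matrix (shapes are illustrative, not tied to any particular model):

```python
import numpy as np

# Dummy last hidden states for one sentence: [CLS] + 4 word tokens, 8 dims.
hidden_states = np.random.rand(5, 8)

# Strategy 1: CLS pooling -- take the first token's vector as-is.
cls_embedding = hidden_states[0]

# Strategy 2: mean pooling -- average over all token vectors.
mean_embedding = hidden_states.mean(axis=0)

# Both produce a fixed-size sentence vector of the same dimensionality.
print(cls_embedding.shape, mean_embedding.shape)  # (8,) (8,)
```

Which strategy a given sentence-transformers model uses is defined by its pooling module, so the library applies the right one automatically at inference time.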
Most text embedding models nowadays are based on sentence transformers. For this task, we must first figure out the general structure of such models and how to run inference on them. This family of models is located at `hezar/models/text_embedding`, and the main backend for these models is sentence-transformers.