pdrm83 / sent2vec

How to encode sentences in a high-dimensional vector space, a.k.a., sentence embedding.
MIT License

Does it support Chinese? #9

Open zfxSteven opened 3 years ago

pdrm83 commented 3 years ago

sent2vec is a wrapper around BERT and Word2Vec models, so as long as the underlying model supports Chinese, sent2vec will work with it.

almarengo commented 2 years ago

sent2vec uses "distilbert-base-uncased" as the default model, which is trained on English text. For other languages you need a multilingual model such as "bert-base-multilingual-cased". You can find the documentation here:

https://huggingface.co/bert-base-multilingual-cased

Sorry for the bad translation (I used Google Translate), but this is how you can apply sent2vec to another language:

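from scipy import spatial
from sent2vec.vectorizer import Vectorizer

# Example sentences in Chinese (approximate English meaning in the comments).
sentences = [
    "这是一本学习 NLP 的好书",  # "This is a good book for learning NLP"
    "DistilBERT 是一个了不起的 NLP 模型",  # "DistilBERT is an amazing NLP model"
    "我们可以交替使用嵌入、编码或矢量化。",  # "We can use embedding, encoding, or vectorizing interchangeably."
]

# Encode the sentences with the multilingual BERT weights instead of the default model.
vectorizer = Vectorizer()
vectorizer.bert(sentences, pretrained_weights='bert-base-multilingual-cased')
vectors = vectorizer.vectors

# Cosine distance between sentence vectors: a smaller value means more similar.
dist_1 = spatial.distance.cosine(vectors[0], vectors[1])
dist_2 = spatial.distance.cosine(vectors[0], vectors[2])
print('dist_1: {0}, dist_2: {1}'.format(dist_1, dist_2))
assert dist_1 < dist_2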

That returns the following result:

dist_1: 0.019039809703826904, dist_2: 0.029676854610443115
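For anyone unfamiliar with the metric: spatial.distance.cosine returns the cosine distance, which is 1 minus the cosine similarity, so a smaller value means the two sentence vectors point in more similar directions. Below is a minimal pure-Python sketch of the same computation. The cosine_distance helper and the 3-dimensional toy vectors are made up for illustration; they are not part of sent2vec and are not real BERT embeddings.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical toy vectors standing in for sentence embeddings.
v0 = [1.0, 0.0, 0.0]
v1 = [1.0, 0.1, 0.0]  # nearly the same direction as v0
v2 = [0.0, 1.0, 0.0]  # orthogonal to v0

d1 = cosine_distance(v0, v1)
d2 = cosine_distance(v0, v2)

# The near-parallel pair is closer than the orthogonal pair.
print(d1 < d2)
```

Under this definition, identical directions give a distance of 0 and orthogonal vectors give 1, which is why the small values printed above indicate that all three sentences sit close together in the embedding space.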

I hope this helps.