Open zfxSteven opened 3 years ago
sent2vec uses "distilbert-base-uncased" as its default model. For other languages, you need to use the "bert-base-multilingual-cased" model instead. You can find its documentation here:
https://huggingface.co/bert-base-multilingual-cased
Sorry for the bad translation (I used Google Translate), but this is how you can apply sent2vec to another language:
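from scipy import spatial
from sent2vec.vectorizer import Vectorizer

sentences = [
    "这是一本学习 NLP 的好书",  # "This is an awesome book to learn NLP."
    "DistilBERT 是一个了不起的 NLP 模型",  # "DistilBERT is an amazing NLP model."
    "我们可以交替使用嵌入、编码或矢量化。",  # "We can interchangeably use embedding, encoding, or vectorizing."
]

# Use the multilingual weights instead of the default English-only model.
vectorizer = Vectorizer()
vectorizer.bert(sentences, pretrained_weights='bert-base-multilingual-cased')
vectors = vectorizer.vectors

# Cosine distance: a smaller value means the two sentences are more similar.
dist_1 = spatial.distance.cosine(vectors[0], vectors[1])
dist_2 = spatial.distance.cosine(vectors[0], vectors[2])
print('dist_1: {0}, dist_2: {1}'.format(dist_1, dist_2))
assert dist_1 < dist_2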
That returns the following result:
dist_1: 0.019039809703826904, dist_2: 0.029676854610443115
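In case it is not obvious what those numbers mean: `spatial.distance.cosine` returns one minus the cosine similarity, so a value near 0 means the two vectors point in almost the same direction. Here is a minimal NumPy sketch of the same calculation. The `cosine_distance` helper and the toy vectors are made up for illustration; they are not part of sent2vec and not real model output.

```python
import numpy as np


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two 1-D vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy 2-D vectors standing in for sentence embeddings (assumption, not model output).
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 1.0])

d_ab = cosine_distance(a, b)  # vectors 45 degrees apart
d_ac = cosine_distance(a, c)  # orthogonal vectors
print(d_ab, d_ac)
```

With real embeddings you would pass `vectors[0]` and `vectors[1]` in place of the toy arrays; the smaller the result, the closer the two sentences are in the embedding space.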
I hope this helps.
sent2vec is a wrapper around BERT and Word2Vec models, so as long as the underlying model supports Chinese, sent2vec will work with Chinese as well.