kiwirafe / xiangsi

Chinese text similarity calculator
MIT License
116 stars 21 forks

Is this the right output for these two words? #3

Closed opticaloptical closed 2 years ago

opticaloptical commented 2 years ago

import xiangshi as xs

l = xs.cossim(["抄袭", "克隆"])
print(l)

0.0

kiwirafe commented 2 years ago

相识主要适用于较长一点的文本。 相识计算出为0是因为背后的算法是先分词后向量化再计算,而因为两个词根本不一样导致结果也就为0。 相识不是像Word Similarity一样根据同义词来计算(不过这个地方可能会在以后版本中改进)。 具体余弦相似度算法可参照:https://zhuanlan.zhihu.com/p/43396514

kiwirafe commented 2 years ago

Xiangshi is mainly intended for somewhat longer texts. It returned 0 here because the underlying algorithm first segments the text into words, then vectorizes them, and only then computes the similarity; since the two words are entirely different, the result is 0. Xiangshi does not compare words by synonymy the way Word Similarity does (though this may be improved in a future version). For details of the cosine similarity algorithm, see: https://zhuanlan.zhihu.com/p/43396514
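To illustrate the explanation above, here is a minimal sketch of cosine similarity over bag-of-words counts. It is not Xiangshi's actual implementation: the function name `cosine_similarity` is hypothetical, the inputs are assumed to be already segmented into tokens, and each of the two words from the question is assumed to survive segmentation as a single token.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words term-count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms that occur in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Assumption: each word is kept as one token by the segmenter.
single_words = cosine_similarity(["抄袭"], ["克隆"])

# Longer, pre-segmented texts that share two of their three tokens.
longer_texts = cosine_similarity(["我", "喜欢", "苹果"], ["我", "喜欢", "香蕉"])

print(single_words, longer_texts)
```

With no token in common the dot product is zero, so the two single words score 0 however close their meanings are; the longer texts score above 0 only because they share tokens.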