milvus-io / milvus-model

The embedding/reranking model zoo that helps users convert their unstructured data into embeddings
Apache License 2.0

[nltk_data] Error loading stopwords: <urlopen error [Errno 11004] #34

Closed. yidasanqian closed this 2 months ago

yidasanqian commented 2 months ago

code:

from milvus_model.sparse.bm25.tokenizers import build_default_analyzer
from milvus_model.sparse import BM25EmbeddingFunction

analyzer = build_default_analyzer(language="zh")
bm25_ef = BM25EmbeddingFunction(analyzer)

error:

[nltk_data] Error loading stopwords: <urlopen error [Errno 11004]
[nltk_data]     getaddrinfo failed>

How can I solve this?

wxywb commented 2 months ago

From a quick search, I think this issue is related to proxy settings: https://stackoverflow.com/questions/45573833/error-in-downloading-nltk-data-errno-11004-getaddrinfo-failed
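The "getaddrinfo failed" part of the message means the host name could not be resolved, which is what typically happens behind a proxy that Python does not know about. A minimal diagnostic sketch using only the standard library follows. The host name in `NLTK_INDEX_HOST` is an assumption about where NLTK's downloader looks by default, and `can_resolve` and `configured_proxies` are hypothetical helper names, not part of NLTK or milvus-model.

```python
import socket
import urllib.request

# Assumption: NLTK's downloader fetches its package index from this host by default.
NLTK_INDEX_HOST = "raw.githubusercontent.com"


def can_resolve(host, port=443):
    """Return True if the OS resolver can turn `host` into an address."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        # Same failure class as "[Errno 11004] getaddrinfo failed".
        return False


def configured_proxies():
    """Return the proxies urllib would pick up from the environment or OS settings."""
    return urllib.request.getproxies()
```

If `can_resolve(NLTK_INDEX_HOST)` returns False and `configured_proxies()` is empty, a proxy probably needs to be configured before the download is triggered, for example through the `HTTPS_PROXY` environment variable or `nltk.set_proxy(...)`.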

yidasanqian commented 2 months ago

@wxywb Where is the downloaded file located? Can I manually specify a directory?

wxywb commented 2 months ago

This is due to network conditions. The fix depends on your environment, so search for one that matches your setup, for example: https://blog.csdn.net/qq_63385279/article/details/136220118
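On the question of a manual directory: NLTK searches the directories listed in `nltk.data.path`, which also picks up the `NLTK_DATA` environment variable, so a copy of the stopwords corpus placed by hand can be made visible that way. The sketch below assumes the usual `corpora/stopwords` layout inside the data directory. `stopwords_installed` and `use_local_nltk_data` are hypothetical helper names, and whether milvus-model still attempts a download when the corpus is already present is an assumption to verify against its source.

```python
import os
from pathlib import Path


def stopwords_installed(data_dir):
    """Return True if data_dir holds the stopwords corpus, unzipped or zipped."""
    corpora = Path(data_dir) / "corpora"
    return (corpora / "stopwords").is_dir() or (corpora / "stopwords.zip").is_file()


def use_local_nltk_data(data_dir):
    """Point NLTK at a manually populated data directory (requires nltk)."""
    data_dir = str(data_dir)
    os.environ["NLTK_DATA"] = data_dir  # read by NLTK when it is first imported
    import nltk  # imported lazily so the check above works without nltk

    if data_dir not in nltk.data.path:
        nltk.data.path.insert(0, data_dir)  # search this directory first
```

One way to populate the directory is to run `nltk.download("stopwords", download_dir=...)` on a machine that has network access and copy the resulting folder over.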

yidasanqian commented 2 months ago

If my corpus is a mix of Chinese and English and I set the analyzer language to zh, will it work properly?

wxywb commented 2 months ago

In this scenario, using the Jieba tokenizer to split your sentences into English and Chinese tokens would perform worse than using an English tokenizer. This is because the English tokenizer applies a stemming algorithm, so different variants of the same word match each other. For better performance, consider using BGE-M3 or a custom tokenizer that applies a stemming algorithm to English words.
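As a rough illustration of the custom-tokenizer idea, the sketch below splits text into Chinese and English runs and sends each run to its own handler. It is not milvus-model's API: `mixed_tokenize` is a hypothetical name, and the defaults (one token per Chinese character, lowercasing for English) are stand-ins so the example runs without extra packages. In practice `jieba.lcut` and `nltk.stem.SnowballStemmer("english").stem` could be passed in instead. How to plug such a function into `BM25EmbeddingFunction` is not shown here and should be checked against the library source.

```python
import re

# Runs of CJK Unified Ideographs, or runs of ASCII letters and digits.
_TOKEN_RUNS = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z0-9]+")


def mixed_tokenize(text, segment_zh=list, stem_en=str.lower):
    """Tokenize mixed Chinese/English text.

    segment_zh: callable mapping a Chinese run to a list of tokens
                (default: one token per character; jieba.lcut is an option).
    stem_en:    callable mapping an English word to its normalized form
                (default: lowercase only; a real stemmer is an option).
    """
    tokens = []
    for run in _TOKEN_RUNS.findall(text):
        if "\u4e00" <= run[0] <= "\u9fff":
            tokens.extend(segment_zh(run))  # Chinese run
        else:
            tokens.append(stem_en(run))  # English word or number
    return tokens
```

With a stemmer supplied for `stem_en`, variants such as "search" and "searching" would map to the same token, which is the matching behaviour described above.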