milvus-io / milvus-model

The embedding/reranking model zoo helps users convert their unstructured data into embeddings
Apache License 2.0
22 stars, 17 forks

Error occurred in bm25_ef.fit(corpus) #22

Open rdyuan opened 6 months ago

rdyuan commented 6 months ago

Here is my complete code:

from milvus_model.sparse.bm25.tokenizers import build_default_analyzer
from milvus_model.sparse import BM25EmbeddingFunction

analyzer = build_default_analyzer(language="zh")
corpus = [
    "人工智能于1956年作为一门学科成立。",
    "艾伦·图灵是第一个对人工智能进行实质性研究的人。",
    "图灵出生在伦敦的梅达维尔,在英格兰南部长大。",
]
bm25_ef = BM25EmbeddingFunction(analyzer)
bm25_ef.fit(corpus)
docs = [
    "人工智能领域于1956年作为一门学术学科成立。",
    "艾伦·图灵是在人工智能领域进行重大研究的先驱。",
    "图灵出生在伦敦的梅达维尔,在英格兰南部地区长大。",
    "1956年,人工智能作为一个学术领域出现。",
    "图灵来自伦敦梅达维尔,在英格兰南部长大。",
]
docs_embeddings = bm25_ef.encode_documents(docs)
print("Embeddings:", docs_embeddings)
print("Sparse dim:", bm25_ef.dim, list(docs_embeddings)[0].shape)

When execution reaches bm25_ef.fit(corpus), the following error is raised:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 129, in _main
    prepare(preparation_data)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 240, in prepare
  ...
    main_content = runpy.run_path(main_path,
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code

Relevant versions: Python==3.11.3 milvus_model==0.2.2

xiaofan-luan commented 6 months ago

/assign @wxywb can you help investigate this?

wxywb commented 6 months ago

This code works in my environment. It may be related to some multiprocessing problems I need to delve into. You can try the following code.

from milvus_model.sparse.bm25.tokenizers import build_default_analyzer
from milvus_model.sparse import BM25EmbeddingFunction
analyzer = build_default_analyzer(language="zh")
corpus = [ "人工智能于1956年作为一门学科成立。", "艾伦·图灵是第一个对人工智能进行实质性研究的人。", "图灵出生在伦敦的梅达维尔,在英格兰南部长大。", ]
# num_workers=1 disables multiprocessing
bm25_ef = BM25EmbeddingFunction(analyzer, num_workers=1)
bm25_ef.fit(corpus)
docs = [ "人工智能领域于1956年作为一门学术学科成立。", "艾伦·图灵是在人工智能领域进行重大研究的先驱。", "图灵出生在伦敦的梅达维尔,在英格兰南部地区长大。", "1956年,人工智能作为一个学术领域出现。", "图灵来自伦敦梅达维尔,在英格兰南部长大。" ]
docs_embeddings = bm25_ef.encode_documents(docs)
print("Embeddings:", docs_embeddings)
print("Sparse dim:", bm25_ef.dim, list(docs_embeddings)[0].shape)
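
The traceback in the original report passes through multiprocessing/spawn.py and runpy.run_path, which is the path Python takes when a spawned child process re-imports the main script. One common cause of that symptom is pool-creating code that runs at module top level without an `if __name__ == "__main__":` guard. Whether that is the cause here is an assumption; the thread does not confirm it. The sketch below uses only the standard library, and `count_tokens` and `fit_counts` are hypothetical stand-ins rather than milvus_model APIs.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor


def count_tokens(doc: str) -> int:
    """Hypothetical stand-in for per-document analysis work."""
    return len(doc.split())


def fit_counts(corpus, num_workers=1):
    """Process the corpus serially, or in a process pool when num_workers > 1."""
    if num_workers == 1:
        # Serial path: no child processes are created at all.
        return [count_tokens(doc) for doc in corpus]
    # Request the "spawn" start method explicitly (the macOS default since Python 3.8).
    ctx = mp.get_context("spawn")
    with ProcessPoolExecutor(max_workers=num_workers, mp_context=ctx) as pool:
        return list(pool.map(count_tokens, corpus))


def main():
    corpus = ["a b c", "d e", "f"]
    print(fit_counts(corpus, num_workers=2))


# With "spawn", each child re-imports this script. The guard keeps the
# children from running main() themselves and trying to start more workers.
if __name__ == "__main__":
    main()
```

If the guard is the issue, the equivalent change in a user script would be to move the analyzer setup and the fit call into a function invoked under the same guard, leaving num_workers at its default.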

wxywb commented 6 months ago

@rdyuan Could you share the full traceback? What you posted seems to be only part of it.

rdyuan commented 6 months ago

log.txt

rdyuan commented 6 months ago

Adding num_workers=1 did make it run successfully.

abellee commented 5 months ago

Has this problem still not been fixed? As soon as execution reaches fit, it goes into an infinite loop. Setting num_workers=1 works.

wxywb commented 5 months ago

What operating system are you using? Please also share the code snippet and the error output.

abellee commented 5 months ago

It is the same problem as described in this issue. The OS is macOS on an Intel chip.

wxywb commented 5 months ago

What is your Python version?

abellee commented 5 months ago

3.12