lifu963 opened 1 month ago
@lifu963 Hello! I'm Dosu. As a bot, I can help you fix bugs, answer questions, and help you become a contributor. I'll assist you all the way while you wait for a human maintainer.
To resolve the issue of the `BM25Retriever` not correctly processing Chinese text, you need to replace the tokenizer with one that handles Chinese appropriately. The default tokenizer is designed for English and is not suitable for Chinese.

Here is how you can modify the tokenizer to handle Chinese text:
1. Install the `jieba` library, a popular Chinese text segmentation library:

   ```
   pip install jieba
   ```
2. Define a new tokenizer function for Chinese text:

   ```python
   import jieba
   from typing import List


   def chinese_tokenizer(text: str) -> List[str]:
       # Use jieba to segment Chinese text
       return list(jieba.cut(text))
   ```
3. Use this new tokenizer when initializing the `BM25Retriever`:

   ```python
   from llama_index.retrievers.bm25 import BM25Retriever
   from llama_index.core import Document
   from llama_index.core.node_parser import SentenceSplitter
   from llama_index.core.response.notebook_utils import display_source_node

   documents = [
       Document(text="床前明月光"),
       Document(text="疑是地上霜"),
       Document(text="举头望明月"),
       Document(text="低头思故乡"),
   ]

   splitter = SentenceSplitter(chunk_size=1024)
   nodes = splitter.get_nodes_from_documents(documents)

   retriever = BM25Retriever.from_defaults(
       nodes=nodes,
       similarity_top_k=2,
       tokenizer=chinese_tokenizer,
   )

   nodes = retriever.retrieve("故乡")
   for node in nodes:
       display_source_node(node)
   ```
By using `chinese_tokenizer`, the `BM25Retriever` will be able to correctly tokenize Chinese text and compute meaningful similarity scores [1][2].
To continue talking to Dosu, mention @dosu.
I think the tokenizer option should be mentioned in the documentation for non-English languages.
@kun432
Do you think it is necessary to add the following code? It implements a tokenizer capable of segmenting both Chinese and English keywords:
```python
import re
from typing import List, Any

import jieba
import snowballstemmer

CHINESE_CHAR_RE = re.compile(r'[\u4e00-\u9fff]')
STEMMER = snowballstemmer.stemmer('english')

WORDS_TO_IGNORE = [
    '', '\\t', '\\n', '\\\\', '\\', '', '\n', '\t', '\\', ' ', ',', ',', ';', ';', '/', '.', '。', '-', 'is', 'are',
    'am', 'what', 'how', '的', '吗', '是', '了', '啊', '呢', '怎么', '如何', '什么', '(', ')', '(', ')', '【', '】', '[', ']', '{',
    '}', '?', '?', '!', '!', '“', '”', '‘', '’', "'", "'", '"', '"', ':', ':', '讲了', '描述', '讲', '总结', 'summarize',
    '总结下', '总结一下', '文档', '文章', 'article', 'paper', '文稿', '稿子', '论文', 'PDF', 'pdf', '这个', '这篇', '这', '我', '帮我', '那个',
    '下', '翻译', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll",
    "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers',
    'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who',
    'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
    'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during',
    'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again',
    'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few',
    'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
    's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y',
    'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
    "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't",
    'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn',
    "wouldn't", '说说', '讲讲', '介绍', 'summary'
]


def has_chinese_chars(data: Any) -> bool:
    text = f'{data}'
    return bool(CHINESE_CHAR_RE.search(text))


def string_tokenizer(text: str) -> List[str]:
    text = text.lower()
    if has_chinese_chars(text):
        _wordlist = list(jieba.lcut(text.strip()))
    else:
        _wordlist = text.strip().split()
    return STEMMER.stemWords(_wordlist)


def zh_tokenizer(text: str) -> List[str]:
    _wordlist = string_tokenizer(text)
    wordlist = []
    for x in _wordlist:
        if x in WORDS_TO_IGNORE:
            continue
        wordlist.append(x)
    return wordlist
```
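The branching in `string_tokenizer` hinges on the CJK-detection regex. Here is a tiny self-contained check of that detection (stdlib only, no jieba needed) showing it fires on any mixed text containing at least one Chinese character:

```python
import re
from typing import Any

# Same pattern as above: matches any character in the
# CJK Unified Ideographs block (U+4E00..U+9FFF)
CHINESE_CHAR_RE = re.compile(r'[\u4e00-\u9fff]')


def has_chinese_chars(data: Any) -> bool:
    text = f'{data}'
    return bool(CHINESE_CHAR_RE.search(text))


assert has_chinese_chars("故乡")          # pure Chinese
assert has_chinese_chars("BM25 检索")     # mixed text counts too
assert not has_chinese_chars("hometown")  # plain ASCII
```

Note that the pattern covers only the base ideograph block, not CJK punctuation such as `。` (U+3002), which is why the stopword list filters punctuation separately.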
Perhaps this tokenizer could replace `tokenize_remove_stopwords`, the default tokenizer of `BM25Retriever`:
```python
# The following code is from: llama_index/retrievers/bm25/base.py
...

def tokenize_remove_stopwords(text: str) -> List[str]:
    # lowercase and stem words
    text = text.lower()
    stemmer = PorterStemmer()
    words = list(simple_extract_keywords(text))
    return [stemmer.stem(word) for word in words]


class BM25Retriever(BaseRetriever):
    def __init__(
        self,
        nodes: List[BaseNode],
        tokenizer: Optional[Callable[[str], List[str]]],
        similarity_top_k: int = DEFAULT_SIMILARITY_TOP_K,
        callback_manager: Optional[CallbackManager] = None,
        objects: Optional[List[IndexNode]] = None,
        object_map: Optional[dict] = None,
        verbose: bool = False,
    ) -> None:
        self._nodes = nodes
        self._tokenizer = tokenizer or tokenize_remove_stopwords
        ...
```
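To see why the default fails on Chinese: keyword extraction built on `\w+`-style word boundaries (a rough stand-in below for `simple_extract_keywords`; this is an assumption about its behavior, not its actual implementation) returns a spaceless Chinese sentence as one single "word", so BM25 can never match a sub-phrase like 故乡:

```python
import re


def naive_extract_keywords(text: str):
    # Hypothetical stand-in for simple_extract_keywords.
    # In Python 3, \w matches CJK ideographs too, so a
    # spaceless Chinese sentence comes back as one token.
    return re.findall(r'\w+', text.lower())


print(naive_extract_keywords("BM25 retriever works"))  # three tokens
print(naive_extract_keywords("低头思故乡"))             # a single token

assert naive_extract_keywords("低头思故乡") == ['低头思故乡']
```

Stemming that single token with `PorterStemmer` changes nothing, which is why the computed similarity scores are meaningless for Chinese queries.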
@lifu963
Not sure about your case, and tokenization will differ between languages, but IMO the tokenizer function itself is simple: it just takes a string and returns a list of tokens. So you can, and may need to, implement any processing you want in your tokenizer function, such as tokenization, stemming or lemmatization, stopword removal, etc., based on the language you use, I think.
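In other words, the whole contract is `Callable[[str], List[str]]`. As a minimal illustration of that interface (my own dependency-free sketch, not llama_index code), here is a fallback that emits character unigrams for CJK and word runs for ASCII:

```python
import re
from typing import List

# One CJK character, or a run of ASCII letters/digits
# (the text is lowercased before matching)
TOKEN_RE = re.compile(r'[\u4e00-\u9fff]|[a-z0-9]+')


def mixed_unigram_tokenizer(text: str) -> List[str]:
    # Character unigrams lose precision compared to a real
    # segmenter like jieba, but still let BM25 match Chinese text
    return TOKEN_RE.findall(text.lower())


print(mixed_unigram_tokenizer("BM25检索 works"))
assert mixed_unigram_tokenizer("BM25检索") == ['bm25', '检', '索']
```

Any function with this shape can be passed as the `tokenizer` argument to `BM25Retriever.from_defaults`.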
### Bug Description

BM25Retriever cannot work on Chinese.

### Version

main

### Steps to Reproduce

### Relevant Logs/Tracebacks