run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: BM25Retriever cannot work on Chinese #13866

Open lifu963 opened 1 month ago

lifu963 commented 1 month ago

Bug Description

BM25Retriever cannot work on Chinese text.

Version

main

Steps to Reproduce

from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.response.notebook_utils import display_source_node

documents = [Document(text="床前明月光"),
             Document(text="疑是地上霜"),
             Document(text="举头望明月"),
             Document(text="低头思故乡")]

splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)
retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=2)

nodes = retriever.retrieve("故乡")
for node in nodes:
    display_source_node(node)

Relevant Logs/Tracebacks

output:

Node ID: d3d59a82-e489-47da-8229-a430c7f58c00 Similarity: 0.0 Text: 床前明月光
Node ID: 6105e407-408e-48a5-af44-a82515babb56 Similarity: 0.0 Text: 疑是地上霜
Node ID: fd059a68-d496-4a61-847f-c96e108a69b2 Similarity: 0.0 Text: 举头望明月
Node ID: f19761bc-d381-4679-bc77-f32cbad57db5 Similarity: 0.0 Text: 低头思故乡
dosubot[bot] commented 1 month ago

@lifu963 Hello! I'm Dosu. As a bot, I can help you fix bugs, answer questions, and help you become a contributor. I'll assist you while you wait for a human maintainer.

To resolve the issue of BM25Retriever not processing Chinese text correctly, you need to supply a tokenizer that handles Chinese. The default tokenizer is designed for English and is not suitable for Chinese.

Here is how you can modify the tokenizer to handle Chinese text:

  1. Install the jieba library, which is a popular Chinese text segmentation library:

    pip install jieba
  2. Define a new tokenizer function for Chinese text:

    import jieba
    from typing import List
    
    def chinese_tokenizer(text: str) -> List[str]:
        # Use jieba to segment Chinese text
        return list(jieba.cut(text))
  3. Use this new tokenizer when initializing the BM25Retriever:

    from llama_index.retrievers.bm25 import BM25Retriever
    from llama_index.core import Document
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.core.response.notebook_utils import display_source_node
    
    documents = [Document(text="床前明月光"),
                 Document(text="疑是地上霜"),
                 Document(text="举头望明月"),
                 Document(text="低头思故乡")]
    
    splitter = SentenceSplitter(chunk_size=1024)
    nodes = splitter.get_nodes_from_documents(documents)
    
    retriever = BM25Retriever.from_defaults(
        nodes=nodes,
        similarity_top_k=2,
        tokenizer=chinese_tokenizer
    )

    nodes = retriever.retrieve("故乡")
    for node in nodes:
        display_source_node(node)

By using chinese_tokenizer, the BM25Retriever will be able to correctly process Chinese text and compute meaningful similarity scores [1][2].
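
As a quick check, you can print what chinese_tokenizer produces for a document and for the query. The exact segmentation depends on jieba's dictionary, but a query token has to appear among a document's tokens for BM25 to give that document a non-zero score:

    # Inspect the tokens jieba produces (exact output depends on its dictionary)
    print(chinese_tokenizer("低头思故乡"))
    print(chinese_tokenizer("故乡"))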

To continue talking to Dosu, mention @dosu.

kun432 commented 3 weeks ago

I think the tokenizer option should be mentioned in the documentation for non-English languages.

lifu963 commented 3 weeks ago

@kun432

Do you think it would be worth adding the following code? It implements a tokenizer that can segment both Chinese and English keywords:

import re
from typing import List, Any

import jieba
import snowballstemmer

CHINESE_CHAR_RE = re.compile(r'[\u4e00-\u9fff]')
STEMMER = snowballstemmer.stemmer('english')

WORDS_TO_IGNORE = [
    '', '\\t', '\\n', '\\\\', '\\', '', '\n', '\t', '\\', ' ', ',', ',', ';', ';', '/', '.', '。', '-', 'is', 'are',
    'am', 'what', 'how', '的', '吗', '是', '了', '啊', '呢', '怎么', '如何', '什么', '(', ')', '(', ')', '【', '】', '[', ']', '{',
    '}', '?', '?', '!', '!', '“', '”', '‘', '’', "'", "'", '"', '"', ':', ':', '讲了', '描述', '讲', '总结', 'summarize',
    '总结下', '总结一下', '文档', '文章', 'article', 'paper', '文稿', '稿子', '论文', 'PDF', 'pdf', '这个', '这篇', '这', '我', '帮我', '那个',
    '下', '翻译', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll",
    "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers',
    'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who',
    'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
    'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
    'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during',
    'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again',
    'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few',
    'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
    's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y',
    'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
    "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't",
    'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn',
    "wouldn't", '说说', '讲讲', '介绍', 'summary'
]

def has_chinese_chars(data: Any) -> bool:
    text = f'{data}'
    return bool(CHINESE_CHAR_RE.search(text))

def string_tokenizer(text: str) -> List[str]:
    # Segment with jieba when the text contains Chinese characters,
    # otherwise fall back to whitespace splitting, then stem.
    text = text.lower()
    if has_chinese_chars(text):
        _wordlist = jieba.lcut(text.strip())
    else:
        _wordlist = text.strip().split()
    return STEMMER.stemWords(_wordlist)

def zh_tokenizer(text: str) -> List[str]:
    # Tokenize, then drop anything listed in WORDS_TO_IGNORE.
    _wordlist = string_tokenizer(text)
    return [x for x in _wordlist if x not in WORDS_TO_IGNORE]

Perhaps this tokenizer could replace BM25Retriever's default tokenizer, tokenize_remove_stopwords:

# The following code is from: /llama_index/retrievers/bm25/base.py
...
def tokenize_remove_stopwords(text: str) -> List[str]:
    # lowercase and stem words
    text = text.lower()
    stemmer = PorterStemmer()
    words = list(simple_extract_keywords(text))
    return [stemmer.stem(word) for word in words]

class BM25Retriever(BaseRetriever):
    def __init__(
        self,
        nodes: List[BaseNode],
        tokenizer: Optional[Callable[[str], List[str]]],
        similarity_top_k: int = DEFAULT_SIMILARITY_TOP_K,
        callback_manager: Optional[CallbackManager] = None,
        objects: Optional[List[IndexNode]] = None,
        object_map: Optional[dict] = None,
        verbose: bool = False,
    ) -> None:
        self._nodes = nodes
        self._tokenizer = tokenizer or tokenize_remove_stopwords
...
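
For reference, a minimal sketch of wiring zh_tokenizer into the retriever, assuming the same tokenizer keyword shown in the example above:

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.retrievers.bm25 import BM25Retriever

documents = [Document(text="床前明月光"),
             Document(text="疑是地上霜"),
             Document(text="举头望明月"),
             Document(text="低头思故乡")]

nodes = SentenceSplitter(chunk_size=1024).get_nodes_from_documents(documents)

# Pass the custom tokenizer in place of the default tokenize_remove_stopwords
retriever = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=2,
    tokenizer=zh_tokenizer,
)
results = retriever.retrieve("故乡")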
kun432 commented 1 week ago

@lifu963

I'm not sure about your specific case, and tokenization will differ between languages, but IMO the tokenizer function itself is simple: it just takes a string and returns a list of tokens.

So you can/may need to implement whatever processing you want or need, such as tokenization, stemming or lemmatization, stopword removal, etc., in your own tokenizer function, based on the language you use, I think.
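
For example, a rough skeleton of such a custom tokenizer (the names and steps here are purely illustrative):

from typing import List

def my_tokenizer(text: str) -> List[str]:
    # 1. split the text into tokens (whitespace, jieba, MeCab, ... depending on the language)
    tokens = text.lower().split()
    # 2. optionally stem or lemmatize each token
    # 3. optionally drop stopwords
    stopwords = {"the", "a", "an", "is", "are"}
    return [t for t in tokens if t not in stopwords]

# then pass it in, e.g. BM25Retriever.from_defaults(nodes=nodes, tokenizer=my_tokenizer)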