lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License

bm25 and cross-language searching #260

Closed: imdoge closed this issue 3 months ago

imdoge commented 3 months ago

I noticed that MiniSearch implements BM25 in JavaScript, and I'm wondering why it does not support cross-language searching. I have recently been using Python to debug RAG-related applications with libraries such as LlamaIndex and LangChain, and their BM25 searches can perform cross-language searching.

I am looking for a JavaScript implementation of BM25 search and found MiniSearch. It is an excellent library, but it doesn't support cross-language searching. Could you explain why this is the case?

P.S. For example: if the data contains "bike", searching for "vélo" should find it.

thanks~

rolftimmermans commented 3 months ago

I'll try to answer this given that I opened the original BM25/BM25+ pull request for MiniSearch.

MiniSearch searches in roughly two stages: matching and ranking. (This is a bit of a simplification; for this explanation I will ignore features like filtering and boosting.)

The first step is matching. MiniSearch implements a fuzzy search algorithm that looks for words that are textually similar to the words in the query. All documents that match the query in some way are collected.

The second step is ranking. The goal is to show the matching documents in order of relevance; which documents match best? BM25 and BM25+ are ranking algorithms. They do not generate search results, they only (re-)order them.
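To make the two stages concrete, here is a minimal toy sketch in plain JavaScript. It is not MiniSearch's actual implementation: the sample documents, the helper names (`tokenize`, `matchDocs`, `rankDocs`, `search`), the exact-match-only matching, and the textbook BM25 formula with `k1 = 1.2` and `b = 0.75` are all assumptions made for illustration.

```javascript
// Toy two-stage search. NOT MiniSearch's code; names and data are hypothetical.
const docs = [
  { id: 1, text: 'bike repair guide' },
  { id: 2, text: 'bike bike bike shop' },
  { id: 3, text: 'guide to cooking pasta' },
];

const tokenize = (text) => text.toLowerCase().split(/\s+/).filter(Boolean);

// Stage 1: matching. Collect every document containing at least one query term.
// (Exact matching only here; a real engine may also do prefix or fuzzy matching.)
function matchDocs(queryTerms, allDocs) {
  return allDocs.filter((doc) => {
    const terms = new Set(tokenize(doc.text));
    return queryTerms.some((q) => terms.has(q));
  });
}

// Stage 2: ranking. Score only the already-matched documents with textbook BM25.
function rankDocs(queryTerms, matched, allDocs, k1 = 1.2, b = 0.75) {
  const N = allDocs.length;
  const totalLen = allDocs.reduce((sum, d) => sum + tokenize(d.text).length, 0);
  const avgLen = totalLen / N;

  return matched
    .map((doc) => {
      const terms = tokenize(doc.text);
      let score = 0;
      for (const q of queryTerms) {
        const tf = terms.filter((t) => t === q).length;
        if (tf === 0) continue;
        // Document frequency: how many documents contain the term at all.
        const df = allDocs.filter((d) => tokenize(d.text).includes(q)).length;
        const idf = Math.log(1 + (N - df + 0.5) / (df + 0.5));
        score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * terms.length) / avgLen));
      }
      return { id: doc.id, score };
    })
    .sort((x, y) => y.score - x.score);
}

function search(query) {
  const queryTerms = tokenize(query);
  return rankDocs(queryTerms, matchDocs(queryTerms, docs), docs);
}

console.log(search('bike')); // documents 1 and 2, ordered by score
console.log(search('vélo')); // nothing matched, so there is nothing to rank
```

In this sketch the ranking function never sees a document that the matching stage did not return, which is why a better ranking formula alone cannot produce a result for "vélo".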

Cross-language searching (finding "bike" when you search for "vélo") needs to happen during the matching phase. Unless you provide your own translations, this is not something MiniSearch can do. MiniSearch provides fuzzy text-based matching, but the strings "bike" and "vélo" are not similar and will not match.
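One way to "provide your own translations" is to expand the query before it reaches the matching stage. The sketch below is only an illustration under stated assumptions: the `translations` table and the `expandQuery` helper are hypothetical names, not part of MiniSearch, and the table contents are made up for the example.

```javascript
// Hypothetical translation table. MiniSearch does not ship one; you supply it.
const translations = new Map([
  ['vélo', ['bike', 'bicycle']],
  ['voiture', ['car']],
]);

// Expand each query term with its known translations, keeping the original term.
function expandQuery(query, table = translations) {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const expanded = new Set();
  for (const term of terms) {
    expanded.add(term);
    for (const alt of table.get(term) ?? []) expanded.add(alt);
  }
  return [...expanded];
}

const expandedTerms = expandQuery('vélo');
console.log(expandedTerms.join(' '));

// The joined string could then be handed to the search engine, for example
// miniSearch.search(expandedTerms.join(' ')), assuming `miniSearch` is an
// existing index and that query terms are combined with OR.
```

With this approach the index itself stays unchanged; the cost is that result quality depends entirely on how complete the translation table is.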

You could:

lucaong commented 3 months ago

I do not have much else to add to @rolftimmermans's great answer.

@imdoge can I close the issue, or do you have further questions?

imdoge commented 3 months ago

No further questions, thank you for the answer.