oramasearch / orama

🌌 A complete search engine and RAG pipeline in your browser, server or edge network with support for full-text, vector, and hybrid search in less than 2kb.
https://docs.orama.com

How to create a custom stemmer? #170

Closed xieyezi closed 1 year ago

xieyezi commented 1 year ago

Awesome project!

Is there any tutorial or guide that can help me create a custom stemmer? I want to create a stemmer for Chinese. I will open a pull request after I finish it.

micheleriva commented 1 year ago

Hi @xieyezi! Chinese can be quite challenging (especially for a non-native speaker like me), so I'd really appreciate some help.

In the context of Lyra, you can always provide a custom stemmer while initializing a new instance:

import { create } from '@lyrasearch/lyra'

const db = create({
  schema: {
    foo: 'string'
  },
  tokenizer: {
    stemmingFn: (word: string): string => {
      // custom stemming logic goes here
      return word
    }
  }
})
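To illustrate the shape of such a function, here is a minimal sketch of a stemmer that could be passed as `stemmingFn`. It is a toy example (a naive English suffix stripper, not a real algorithm like Porter) — the names and suffix list are my own, not part of Lyra's API:

```typescript
// Toy stemming function for illustration only: strips a few common
// English suffixes. A real stemmer (e.g. Porter) is far more involved.
const naiveStem = (word: string): string => {
  // try longer suffixes before shorter ones so 'edly' wins over 'ed'
  for (const suffix of ['ing', 'edly', 'ed', 'es', 's']) {
    // only strip when a reasonable-length stem would remain
    if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
      return word.slice(0, -suffix.length)
    }
  }
  return word
}
```

A function with this `(word: string) => string` signature is what the `tokenizer.stemmingFn` option above expects.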

Taken from the Redis docs (we use the same stemming algorithm as Redis Search):

Indexing a Chinese document is different than indexing a document in most other languages because of how tokens are extracted. While most languages can have their tokens distinguished by separation characters and whitespace, this is not common in Chinese.

Chinese tokenization is done by scanning the input text and checking every character or sequence of characters against a dictionary of predefined terms and determining the most likely (based on the surrounding terms and characters) match.

RediSearch makes use of the Friso chinese tokenization library for this purpose. This is largely transparent to the user and often no additional configuration is required.

(source: https://redis.io/docs/stack/search/reference/stemming/)

We could try to find an open-source library that could help us tokenize, stem (and maybe remove stop-words) in Chinese.
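The dictionary-scanning approach the Redis docs describe can be sketched as greedy forward maximum matching: at each position, take the longest substring found in the dictionary, falling back to a single character. This is a simplification — real libraries like Friso also weigh surrounding context — and the function name and dictionary are my own illustration:

```typescript
// Sketch of dictionary-based Chinese tokenization via greedy
// forward maximum matching. Not how Friso works in detail, just
// the core idea: longest dictionary hit wins at each position.
const tokenize = (text: string, dict: Set<string>): string[] => {
  const tokens: string[] = []
  let i = 0
  while (i < text.length) {
    let token = text[i] // fall back to a single character
    // try the longest candidate first (cap lookahead at 8 chars)
    for (let len = Math.min(8, text.length - i); len > 1; len--) {
      const candidate = text.slice(i, i + len)
      if (dict.has(candidate)) {
        token = candidate
        break
      }
    }
    tokens.push(token)
    i += token.length
  }
  return tokens
}
```

With a dictionary containing 看见, the input 我看见你 splits into 我 / 看见 / 你 rather than four isolated characters.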

xieyezi commented 1 year ago

@micheleriva thanks for the reply! Chinese is indeed challenging! I found some open-source libraries like chinese-tokenizer and tried them, but they don't fit Lyra's custom stemmer configuration.

Other Solution

But I have another solution (although it's not perfect). We can transform Chinese into Pinyin (Pinyin is the romanization of Chinese using the Latin alphabet). For example, 看见 can be transformed to kanjian, 谢谢 can be transformed to xiexie, etc. This way, we can add a dedicated Pinyin key to the schema:

const db = create({
  schema: { KEY_PINYIN: "string", ...schema },
  defaultLanguage: "english"
});

Insert

When we insert data into the db, we can do:

const KEY_PINYIN = keyWordToPinYin(key);
insert(db, { KEY_PINYIN, ...data });

keyWordToPinYin is a function that transforms Chinese into Pinyin.
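A hypothetical sketch of what keyWordToPinYin could look like, assuming a character-to-Pinyin lookup table (a real implementation would use a full dictionary, e.g. an npm pinyin library; the tiny table here is only for illustration):

```typescript
// Toy character-to-Pinyin table; a real one covers thousands of
// characters and handles multi-reading characters by context.
const PINYIN_TABLE: Record<string, string> = {
  '看': 'kan', '见': 'jian', '谢': 'xie'
}

// Transform a Chinese string into concatenated Pinyin, passing
// through any character missing from the table unchanged.
const keyWordToPinYin = (text: string): string =>
  Array.from(text)
    .map((ch) => PINYIN_TABLE[ch] ?? ch)
    .join('')
```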

Search

And when we need to search, we can transform the keyword into Pinyin and pass it to the search function:

const KEY_PINYIN = keyWordToPinYin(keyWord);
const res = search(db, { term: KEY_PINYIN, properties: "*" });

This way, we can search Chinese via KEY_PINYIN. But it's not perfect: Chinese has many homophones, so one Pinyin string can map to several different words. For example, 例子 transforms to lizi, but 粒子 also transforms to lizi, so the match is not exact. Anyway, I will keep trying or look for a better solution.
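The homophone collision can be demonstrated directly: two different words produce the same Pinyin key, so a Pinyin-keyed index cannot distinguish them. The table below is a toy illustration, not real data:

```typescript
// Minimal table showing the collision: 例 and 粒 share the
// reading 'li', so 例子 and 粒子 both become 'lizi'.
const TOY_PINYIN: Record<string, string> = {
  '例': 'li', '粒': 'li', '子': 'zi'
}

const toPinyin = (text: string): string =>
  Array.from(text).map((ch) => TOY_PINYIN[ch] ?? ch).join('')
```

Since `toPinyin('例子')` and `toPinyin('粒子')` are identical, a search for either term retrieves both documents.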

Thanks again!

micheleriva commented 1 year ago

Closing this as we're pursuing other solutions for custom stemmers.