oramasearch / orama

🌌 A complete search engine and RAG pipeline in your browser, server or edge network with support for full-text, vector, and hybrid search in less than 2kb.
https://docs.orama.com

How to create a custom stemmer? #170

Closed xieyezi closed 1 year ago

xieyezi commented 1 year ago

Awesome project!

Is there any tutorial or guide that can help me create a custom stemmer? I want to create a stemmer for Chinese. I will open a pull request after I finish it.

micheleriva commented 1 year ago

Hi @xieyezi! Chinese can be quite challenging (especially for a non-native speaker like me), so I'd really appreciate some help.

In the context of Lyra, you can always provide a custom stemmer while initializing a new instance:

import { create } from '@lyrasearch/lyra'

const db = create({
  schema: {
    foo: 'string'
  },
  tokenizer: {
    stemmingFn: (word: string): string => {
      // custom stemming logic goes here
      return word
    }
  }
})
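To illustrate the shape of such a function, here is a minimal sketch of a stemmer that could be passed as `stemmingFn`. It is a toy example (a naive English suffix stripper, not a real algorithm like Porter) — the names and suffix list are my own, not part of Lyra's API:

```typescript
// Toy stemming function for illustration only: strips a few common
// English suffixes. A real stemmer (e.g. Porter) is far more involved.
const naiveStem = (word: string): string => {
  // try longer suffixes before shorter ones so 'edly' wins over 'ed'
  for (const suffix of ['ing', 'edly', 'ed', 'es', 's']) {
    // only strip when a reasonable-length stem would remain
    if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
      return word.slice(0, -suffix.length)
    }
  }
  return word
}
```

A function with this `(word: string) => string` signature is what the `tokenizer.stemmingFn` option above expects.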

Taken from the Redis docs (we use the same stemming algorithm as Redis Search):

Indexing a Chinese document is different than indexing a document in most other languages because of how tokens are extracted. While most languages can have their tokens distinguished by separation characters and whitespace, this is not common in Chinese.

Chinese tokenization is done by scanning the input text and checking every character or sequence of characters against a dictionary of predefined terms and determining the most likely (based on the surrounding terms and characters) match.

RediSearch makes use of the Friso chinese tokenization library for this purpose. This is largely transparent to the user and often no additional configuration is required.

(source: https://redis.io/docs/stack/search/reference/stemming/)

We could try to find an open-source library that could help us tokenize, stem (and maybe remove stop-words) in Chinese.
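The dictionary-scanning approach the Redis docs describe can be sketched as greedy forward maximum matching: at each position, take the longest substring found in the dictionary, falling back to a single character. This is a simplification — real libraries like Friso also weigh surrounding context — and the function name and dictionary are my own illustration:

```typescript
// Sketch of dictionary-based Chinese tokenization via greedy
// forward maximum matching. Not how Friso works in detail, just
// the core idea: longest dictionary hit wins at each position.
const tokenize = (text: string, dict: Set<string>): string[] => {
  const tokens: string[] = []
  let i = 0
  while (i < text.length) {
    let token = text[i] // fall back to a single character
    // try the longest candidate first (cap lookahead at 8 chars)
    for (let len = Math.min(8, text.length - i); len > 1; len--) {
      const candidate = text.slice(i, i + len)
      if (dict.has(candidate)) {
        token = candidate
        break
      }
    }
    tokens.push(token)
    i += token.length
  }
  return tokens
}
```

With a dictionary containing 看见, the input 我看见你 splits into 我 / 看见 / 你 rather than four isolated characters.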

xieyezi commented 1 year ago

@micheleriva thanks for the reply! Chinese is indeed challenging! I found some open-source libraries like chinese-tokenizer and tried them, but they don't fit Lyra's custom stemmer configuration.

Other Solution

But I have another solution (although it's not perfect). We can transform Chinese into Pinyin (Pinyin is the romanization of Chinese using the Latin alphabet). For example, 看见 can be transformed to kanjian, 谢谢 can be transformed to xiexie, etc. This way, we can add a dedicated Pinyin key to the schema:

const db = create({
  schema: { KEY_PINYIN: "string", ...schema },
  defaultLanguage: "english"
});

Insert

When we insert data into the db, we can do:

const KEY_PINYIN = keyWordToPinYin(key);
insert(db, { KEY_PINYIN, ...data });

keyWordToPinYin is a function that transforms Chinese into Pinyin.
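A hypothetical sketch of what keyWordToPinYin could look like, assuming a character-to-Pinyin lookup table (a real implementation would use a full dictionary, e.g. an npm pinyin library; the tiny table here is only for illustration):

```typescript
// Toy character-to-Pinyin table; a real one covers thousands of
// characters and handles multi-reading characters by context.
const PINYIN_TABLE: Record<string, string> = {
  '看': 'kan', '见': 'jian', '谢': 'xie'
}

// Transform a Chinese string into concatenated Pinyin, passing
// through any character missing from the table unchanged.
const keyWordToPinYin = (text: string): string =>
  Array.from(text)
    .map((ch) => PINYIN_TABLE[ch] ?? ch)
    .join('')
```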

Search

And when we need to search, we can transform the keyword into Pinyin and pass it to the search function:

const KEY_PINYIN = keyWordToPinYin(keyWord);
const res = search(db, { term: KEY_PINYIN, properties: "*" });

This way, we can search Chinese via KEY_PINYIN. But it's not perfect: Chinese has many homophones, so one Pinyin string can map to several different words. For example, 例子 transforms to lizi, but 粒子 also transforms to lizi, so the match is not exact. Anyway, I will keep trying or look for a better solution.
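The homophone collision can be demonstrated directly: two different words produce the same Pinyin key, so a Pinyin-keyed index cannot distinguish them. The table below is a toy illustration, not real data:

```typescript
// Minimal table showing the collision: 例 and 粒 share the
// reading 'li', so 例子 and 粒子 both become 'lizi'.
const TOY_PINYIN: Record<string, string> = {
  '例': 'li', '粒': 'li', '子': 'zi'
}

const toPinyin = (text: string): string =>
  Array.from(text).map((ch) => TOY_PINYIN[ch] ?? ch).join('')
```

Since `toPinyin('例子')` and `toPinyin('粒子')` are identical, a search for either term retrieves both documents.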

Thanks again!

micheleriva commented 1 year ago

Closing this as we're pursuing other solutions for custom stemmers.