leeoniya / uFuzzy

A tiny, efficient fuzzy search that doesn't suck
MIT License

How well does uFuzzy support CJK, stopwords, stemmers? #68

Closed (Jieiku closed 1 month ago)

Jieiku commented 1 month ago

Recently, the Zola static site generator gained the ability to output its search index in a JSON format that is compatible with Fuse.js. Because it is plain JSON, I was thinking it would likely also be compatible with uFuzzy.

https://github.com/getzola/zola/pull/2507

https://www.getzola.org/documentation/content/search/#fuse

There is a discussion about adding additional searches to Zola here: https://github.com/getzola/zola/issues/1849

I am planning to try out any promising search libraries that accept a JSON-based index as input. Ones that support CJK, stopwords, and stemmers are a plus!

Currently, the Abridge theme for Zola uses elasticlunr as the default search. It handles other languages by loading additional JS files for those languages as needed. You can see them all in this directory; their names start with lunr.languagecode:

https://github.com/Jieiku/abridge/tree/master/static/js

leeoniya commented 1 month ago

uFuzzy is a clever regexp compiler, not a full-text search engine. It does not do any kind of processing of the haystack or needle, so any stopword removal or stemming has to be done outside of uFuzzy.
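A minimal sketch of what that external preprocessing could look like, assuming a tiny hand-picked English stopword list and a deliberately naive suffix-stripping stemmer. `STOPWORDS`, `naiveStem`, and `preprocess` are hypothetical names made up for this example and are not part of uFuzzy. The idea is to run the same function over every haystack entry and over the needle before handing them to the search:

```javascript
// Hypothetical stopword list; a real one is language-specific and much longer.
const STOPWORDS = new Set(["a", "an", "and", "the", "of", "in", "to", "is"]);

// Deliberately naive stemmer: strips a few common English suffixes.
// A real stemmer (e.g. a Porter/Snowball implementation) covers far more cases.
function naiveStem(word) {
  for (const suffix of ["ing", "ed", "s"]) {
    // Keep at least three characters of stem so short words are not mangled.
    if (word.endsWith(suffix) && word.length - suffix.length >= 3) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

// Lowercase, split on whitespace, drop stopwords, stem, and rejoin.
function preprocess(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => w.length > 0 && !STOPWORDS.has(w))
    .map(naiveStem)
    .join(" ");
}

// Apply the same normalization to both sides before searching.
const haystack = ["Searching the indexed pages", "The list of themes"].map(preprocess);
const needle = preprocess("the searched page");
```

Because both sides go through the same function, "searched" and "Searching" reduce to the same stem before any matching happens. Punctuation handling and non-English languages are left out of this sketch.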

uFuzzy supports CJK by using Unicode regexps, and supports diacritics by providing a util function to strip them (uFuzzy.latinize()).
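To illustrate both mechanisms in plain JavaScript (a sketch of the general techniques, not uFuzzy's internal code): Unicode property escapes under the `u` flag let a regexp match CJK characters, and NFD normalization followed by removal of combining marks is one common way to strip diacritics. The `stripDiacritics` helper and the `cjkRun` pattern are hypothetical stand-ins written for this example; uFuzzy.latinize() may be implemented differently.

```javascript
// Matches a run of Han, Hiragana, Katakana, or Hangul characters.
// The `u` flag is required for \p{...} property escapes.
const cjkRun = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]+/u;

// Hypothetical stand-in for a diacritics-stripping helper.
function stripDiacritics(s) {
  // NFD splits "é" into "e" + U+0301; \p{M} matches the combining marks.
  return s.normalize("NFD").replace(/\p{M}/gu, "");
}

const hasCjk = cjkRun.test("uFuzzy 検索");       // true
const plain = stripDiacritics("crème brûlée"); // "creme brulee"
```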

Jieiku commented 1 month ago

Thank you, that very clearly explains what I wanted to know! Nevertheless, your benchmarks and README are still very useful, and I may well find a use for uFuzzy somewhere else in the future. Thanks!