ispras / lingvodoc-react

Apache License 2.0

Implement search options "with" and "without" diacritics #415

Closed by nadezhday 3 years ago

nadezhday commented 3 years ago

е = ё. Search for: елка. Expected: it would be nice to see "ёлка" among the found entries.

An addition from Lena K.: maybe we should add a search mode in which diacritics are not taken into account at all? An option "with" and "without" diacritics? I support the idea of searching regardless of diacritics! In FLEx, this is really very convenient! (V. Lemskaya)

nadezhday commented 3 years ago

Comment from Oleg: for Unicode characters, look at the lists of diacritics and multicomponent characters in the FieldWorks code. https://github.com/ispras/lingvodoc-react/blob/4e86fd787e6a512523eace566174e67a05b8b219/src/components/Search/AdditionalFilter/GrammarFilter/grammaticalSigns.json

myrix commented 3 years ago

I believe it can be done with an additional gin_trgm_ops index on a transformed Entity.content, where the transformation detaches diacritics via Unicode NFKD normalization and then strips them off with a regular expression.

We can't do that natively in our PostgreSQL version (I think Unicode normal forms are available from PostgreSQL 13?), so it would have to use a PL/Python function, with an index on it like CREATE INDEX entity_content_xform_trgm_idx ON public.entity USING GIN (xform(content) gin_trgm_ops);
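For illustration, here is a minimal Python sketch of the kind of transformation such a function could perform. It is not the code that was actually implemented: the name strip_diacritics is hypothetical, and instead of a regular expression it drops characters of Unicode category Mn (nonspacing marks) after NFKD decomposition, since Python's standard re module has no \p{M} class.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with diacritics removed (hypothetical helper, a sketch only).

    Step 1: NFKD normalization splits precomposed characters into a base
            character followed by combining marks, e.g. "ё" -> "е" + U+0308.
    Step 2: every nonspacing mark (Unicode category "Mn") is dropped.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    # The example from the issue: searching for "елка" should also match "ёлка".
    print(strip_diacritics("ёлка") == "елка")
    # A substring search for "cek" should also match content spelled with "ç".
    print("cek" in strip_diacritics("çek"))
```

Applying the same function to both the search string and the stored content makes the comparison diacritics-agnostic. One side effect worth keeping in mind: NFKD also decomposes letters such as "й" into "и" plus a combining breve, so those get folded together as well.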

Of course, it would slow down entity insertion, but I think 1) entity insertion is not a bottleneck, since we access and search entities much more often than we create them, and 2) it would be possible to make the PL/Python function efficient enough that the slowdown is small enough to make the new search functionality worth it.

EDIT: alternatively, if the slowdown would still be too much, we could create an auxiliary table with the transformed entity contents and create an index on it, updating it and using it any time a user requests a diacritics-agnostic search.

myrix commented 3 years ago

Implemented.

Open /map_search (Tools -> Search), search for "cek" in field 'Word' as 'Sub string': 0 results found.

Enable option 'Ignore diacritics', search again: 17 results found.

yesandv commented 3 years ago

Verified.

myrix commented 1 year ago

Realized that the 'Ignore diacritics' option was not working in AND mode; implemented it for AND mode too.

vmonakhov commented 1 year ago

Covered in tests/test_tools_search_03.py