Closed by nadezhday 3 years ago
Comment from Oleg: for Unicode characters, look at the FieldWorks code lists of diacritics and multi-component characters. https://github.com/ispras/lingvodoc-react/blob/4e86fd787e6a512523eace566174e67a05b8b219/src/components/Search/AdditionalFilter/GrammarFilter/grammaticalSigns.json
I believe it can be done with an additional gin_trgm_ops index on a transformed Entity.content, where the transformation detaches diacritics via Unicode NFKD normalization and then strips them off with a regular expression.
We can't do that natively in our PostgreSQL version (I think Unicode normal forms are available from PostgreSQL 13?), so it would have to use a PL/Python function, with an index on it like
CREATE INDEX entity_content_xform_trgm_idx ON public.entity USING GIN (xform(content) gin_trgm_ops);
Of course, it would slow down entity insertion, but I think 1) entity insertion is not a bottleneck, since we access and search entities much more often than we create them, and 2) it should be possible to make the PL/Python function efficient enough that the slowdown is small enough to make the new search functionality worth it.
EDIT: alternatively, if the slowdown is still too much, we could create an auxiliary table with the transformed entity contents and an index on it, updating and using it whenever a user requests a diacritics-agnostic search.
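A minimal Python sketch of the transformation described above, assuming it amounts to NFKD normalization followed by a regular expression that removes combining marks. The name `xform` is taken from the index definition; the exact set of code point ranges in the regular expression is an assumption and would likely need to be checked against the diacritics list Oleg linked.

```python
import re
import unicodedata

# Unicode blocks of combining marks (assumed, non-exhaustive selection):
# Combining Diacritical Marks, its Extended and Supplement blocks,
# Combining Marks for Symbols, and Combining Half Marks.
_COMBINING_RE = re.compile(
    "[\u0300-\u036f\u1ab0-\u1aff\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]"
)


def xform(content):
    """Return content with diacritics detached (NFKD) and stripped."""
    if content is None:
        # Entity content may be NULL; pass it through unchanged.
        return None
    # NFKD splits precomposed characters into base character + combining marks.
    decomposed = unicodedata.normalize("NFKD", content)
    # Drop the combining marks, keeping only the base characters.
    return _COMBINING_RE.sub("", decomposed)
```

With this sketch, "ček" becomes "cek" and "ёлка" becomes "елка". The same decomposition also turns "й" into "и", which may or may not be desirable for Cyrillic search. To be usable in an index expression, the PL/Python wrapper around such a function would have to be declared IMMUTABLE, and search queries would need to apply the same function to the search string.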
Implemented.
Open /map_search (Tools -> Search) and search for "cek" in field 'Word' as 'Sub string': 0 results found.
Enable the 'Ignore diacritics' option and search again: 17 results found.
Verified.
Realized that the 'Ignore diacritics' option was not working in AND mode; implemented it for AND mode too.
Covered in tests/test_tools_search_03.py
е = ё. Search for: елка. Expected: it would be nice to see "ёлка" among the found entries.
An addition from Lena K.: maybe we should add a search mode where diacritics are not taken into account at all? An option to search "with" and "without" diacritics? I support the idea of a search that ignores diacritics! In FLEx, this is really very convenient! (V. Lemskaya)