sillsdev / languageforge-lexbox

Lexbox, SIL linguistic data hub
MIT License

Optimize entry filtering based on search script #1054

Open hahn-kev opened 2 months ago

hahn-kev commented 2 months ago

Right now, when filtering entries, we just match against the fields we care about, e.g. lexeme form, citation form, and gloss.

However, if the vernacular WS is Thai and the search text contains only Latin characters, then we could skip searching fields that contain Thai, based on their writing system. The reverse is also true: if the search text is in the Thai script, then we can skip searching any English text fields.
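A minimal sketch of the idea, not the Lexbox implementation: every name here (`Field`, `wsScript`, `scriptsIn`, `fieldsToSearch`) is hypothetical, the script detection only knows the Thai and Latin Unicode ranges, and the mapping from writing system to script is assumed to be available.

```typescript
// Hypothetical types and names, for illustration only.
type Script = "thai" | "latin" | "other";

interface Field {
  name: string; // e.g. "lexemeForm", "gloss"
  ws: string;   // writing system tag, e.g. "th" or "en"
  text: string;
}

// Assumed mapping from a writing system to its dominant script.
const wsScript: { [ws: string]: Script | undefined } = { th: "thai", en: "latin" };

// Collect the scripts that occur in a string.
// ASCII digits, spaces and punctuation are treated as script-neutral.
function scriptsIn(text: string): Script[] {
  const found: Script[] = [];
  const add = (s: Script) => {
    if (found.indexOf(s) === -1) found.push(s);
  };
  for (let i = 0; i < text.length; i++) {
    const c = text.charCodeAt(i);
    if (c >= 0x0e00 && c <= 0x0e7f) add("thai"); // Thai block
    else if ((c >= 0x41 && c <= 0x5a) || (c >= 0x61 && c <= 0x7a) || (c >= 0xc0 && c <= 0x24f)) add("latin");
    else if (c >= 0x80) add("other");
  }
  return found;
}

// Keep a field only if the script of its writing system appears in the query.
// A query with no recognizable script (e.g. "123") searches every field,
// and so does a field whose writing system is not in the mapping.
function fieldsToSearch(fields: Field[], query: string): Field[] {
  const queryScripts = scriptsIn(query);
  if (queryScripts.length === 0) return fields;
  return fields.filter(f => {
    const s = wsScript[f.ws];
    return s === undefined || queryScripts.indexOf(s) !== -1;
  });
}

const fields: Field[] = [
  { name: "lexemeForm", ws: "th", text: "ไก่" },
  { name: "gloss", ws: "en", text: "chicken" },
];
const latinQueryFields = fieldsToSearch(fields, "chick").map(f => f.name);
const thaiQueryFields = fieldsToSearch(fields, "ไก่").map(f => f.name);
```

With a Latin query only the English gloss is searched, and with a Thai query only the Thai lexeme form is.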

megahirt commented 2 months ago

This sounds like a good idea. We need to be careful though that we don't accidentally leave out results that people would expect to find. It is possible to have other writing system/language "runs" inside another multitext writing system. This is the complexity of the LCM multitext model.

In general, runs of another language inside a field with a different writing system are uncommon, but a few projects do contain them. We had to deal with that in Language Forge as well.
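To make the risk concrete, here is a rough sketch of how such a field could be detected. It is only an illustration under assumptions: the names (`Field`, `wsScript`, `hasForeignRun`) are hypothetical, the real LCM multitext model tracks runs explicitly rather than guessing from characters, and only the Thai and Latin ranges are recognized.

```typescript
// Hypothetical types and names, for illustration only.
type Script = "thai" | "latin" | "other";

interface Field {
  ws: string;   // writing system tag of the field
  text: string; // flattened text, including any embedded runs
}

// Assumed mapping from a writing system to its dominant script.
const wsScript: { [ws: string]: Script | undefined } = { th: "thai", en: "latin" };

// Collect the scripts that occur in a string (ASCII non-letters are neutral).
function scriptsIn(text: string): Script[] {
  const found: Script[] = [];
  const add = (s: Script) => {
    if (found.indexOf(s) === -1) found.push(s);
  };
  for (let i = 0; i < text.length; i++) {
    const c = text.charCodeAt(i);
    if (c >= 0x0e00 && c <= 0x0e7f) add("thai");
    else if ((c >= 0x41 && c <= 0x5a) || (c >= 0x61 && c <= 0x7a) || (c >= 0xc0 && c <= 0x24f)) add("latin");
    else if (c >= 0x80) add("other");
  }
  return found;
}

// True when the text contains a script other than the one implied by the
// field's writing system. Such a field would be wrongly skipped by a filter
// that looks at the writing system alone.
function hasForeignRun(field: Field): boolean {
  const expected = wsScript[field.ws];
  if (expected === undefined) return true; // unknown writing system: be conservative
  return scriptsIn(field.text).some(s => s !== expected);
}

const plainGloss: Field = { ws: "en", text: "chicken" };
const mixedNote: Field = { ws: "en", text: "see ไก่ (chicken)" }; // made-up example
const plainIsMixed = hasForeignRun(plainGloss);
const noteIsMixed = hasForeignRun(mixedNote);
```

A Thai query that skips every `en` field would miss `mixedNote`, which is exactly the kind of result people would expect to find.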

rmunn commented 2 months ago

If we have a way to display results progressively, e.g. with Svelte stores or Svelte 5 $state variables, then it might be possible to get the best of both worlds: fast results that only look at fields that match the writing system searched for, plus a second, slower query that searches all fields in case there are runs of other-language text, such as an English-language note containing the text "the word ไก่ is one of the first words that Thai kids learn in school."
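A rough sketch of that two-phase flow, under stated assumptions: all names (`Entry`, `Match`, `searchFields`, `progressiveSearch`) are hypothetical, the writing system of the query is assumed to be already known, and both phases run synchronously here for brevity. In a Svelte app the callback would presumably write to a store or a `$state` variable, and the second phase would run asynchronously.

```typescript
// Hypothetical types and names, for illustration only.
interface Field { name: string; ws: string; text: string; }
interface Entry { id: string; fields: Field[]; }
interface Match { entryId: string; field: string; }
type Phase = "fast" | "full";

// Substring match over every field accepted by `include`.
function searchFields(entries: Entry[], query: string, include: (f: Field) => boolean): Match[] {
  const matches: Match[] = [];
  entries.forEach(e =>
    e.fields.forEach(f => {
      if (include(f) && f.text.indexOf(query) !== -1) {
        matches.push({ entryId: e.id, field: f.name });
      }
    })
  );
  return matches;
}

// Phase 1 looks only at fields in the query's writing system.
// Phase 2 looks at every field and is meant to replace the phase 1 results.
function progressiveSearch(
  entries: Entry[],
  query: string,
  queryWs: string,
  onResults: (phase: Phase, matches: Match[]) => void
): void {
  onResults("fast", searchFields(entries, query, f => f.ws === queryWs));
  onResults("full", searchFields(entries, query, () => true));
}

const entries: Entry[] = [
  {
    id: "e1",
    fields: [
      { name: "lexemeForm", ws: "th", text: "ไก่" },
      { name: "gloss", ws: "en", text: "chicken" },
      { name: "note", ws: "en", text: "the word ไก่ is one of the first words that Thai kids learn in school" },
    ],
  },
  {
    id: "e2",
    fields: [
      { name: "lexemeForm", ws: "th", text: "ไข่" },
      { name: "gloss", ws: "en", text: "egg" },
    ],
  },
];

// Record what each phase reports; a UI would render the matches instead.
const seen: { phase: Phase; count: number }[] = [];
progressiveSearch(entries, "ไก่", "th", (phase, matches) => {
  seen.push({ phase: phase, count: matches.length });
});
```

Here the fast phase reports the lexeme form match, and the full phase reports that match again plus the one in the English note.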

The second search results would need to be merged in a way that removes duplicate results, of course. Or, wait, if the second search is set up to search only fields whose writing systems do NOT match the writing system of the input, then it would be guaranteed not to return duplicate results. (There could be multiple matches within one lexical entry, but the user probably wants to see that the text appears in multiple fields within that entry, as that's useful information in the search results.)
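The no-duplicates argument can be sketched as follows. This is an illustration only: the names (`Entry`, `Match`, `searchFields`, `twoPassSearch`) are hypothetical and the writing system of the query is assumed to be known. Because the two passes use complementary predicates, every field is visited by exactly one pass, so the results can simply be concatenated and then grouped by entry.

```typescript
// Hypothetical types and names, for illustration only.
interface Field { name: string; ws: string; text: string; }
interface Entry { id: string; fields: Field[]; }
interface Match { entryId: string; field: string; }

// Substring match over every field accepted by `include`.
function searchFields(entries: Entry[], query: string, include: (f: Field) => boolean): Match[] {
  const matches: Match[] = [];
  entries.forEach(e =>
    e.fields.forEach(f => {
      if (include(f) && f.text.indexOf(query) !== -1) {
        matches.push({ entryId: e.id, field: f.name });
      }
    })
  );
  return matches;
}

// Pass 1: fields whose writing system matches the query's.
// Pass 2: fields whose writing system does NOT match it.
// The predicates are complementary, so no (entry, field) pair can appear twice.
function twoPassSearch(entries: Entry[], query: string, queryWs: string): { [entryId: string]: Match[] } {
  const fast = searchFields(entries, query, f => f.ws === queryWs);
  const slow = searchFields(entries, query, f => f.ws !== queryWs);
  const grouped: { [entryId: string]: Match[] } = {};
  fast.concat(slow).forEach(m => {
    (grouped[m.entryId] = grouped[m.entryId] || []).push(m);
  });
  return grouped;
}

const entries: Entry[] = [
  {
    id: "e1",
    fields: [
      { name: "lexemeForm", ws: "th", text: "ไก่" },
      { name: "note", ws: "en", text: "the word ไก่ is one of the first words that Thai kids learn in school" },
    ],
  },
  {
    id: "e2",
    fields: [
      { name: "lexemeForm", ws: "th", text: "ไข่" },
      { name: "gloss", ws: "en", text: "egg" },
    ],
  },
];

const grouped = twoPassSearch(entries, "ไก่", "th");
const e1Fields = (grouped["e1"] || []).map(m => m.field);
```

Entry `e1` ends up with two matches, one per field, which keeps the information that the text appears in more than one place within the entry.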