keymanapp / api.keyman.com

https://api.keyman.com/ source

TODO list for keyboard search #87

Closed mcdurdin closed 4 years ago

mcdurdin commented 4 years ago

FUTURE:

TODO:

===========================================================================================================================

DONE:

  1. Additional Notes: Notes on api.keyman.com changes for langtag consumption

    • example keyboard: burushaski_girminas, khw-latn "Khowar (Latin)"

    • ខ្មែរ finds zero results but ខ្មែ finds 7...

    • Show a 'popular keyboards' list for the empty search -- this can also be the search engine jumping-off point.

    • "Show obsolete keyboards" needs an indication of the change of status ("Hide obsolete keyboards") and needs to be outdented. Also needs some thought with paginated results.

    • Too many pages leads to overwhelming number of page links at bottom (e.g. s:latin)

    • http://api.keyman.com.local/search?q=l gives 500

    • el_dinka appears to have non-canonical bcp47 codes -- search finds it no trouble.

    • Show list of associated languages+scripts+countries in keyboard details (and related keyboards?)

    • For in-app download links, include information on searched language code (if available), for default language install (#1456)

    • Match fields in json should be integer or float where possible, not string! (and update schema accordingly)

    • schema for match type should be restrictive to actual types used

    • Search "spa" vs "spanish" -- the weighting could be better. Similar "ger" vs "german". (probably need length-based match weight override)

    • REFACTOR: region vs country

    • REFACTOR: code vs id vs tag

    • Pagination

    • Need to give more detail on failed links (and make them easier to find in logs, so tweak the broken link search to use a node wrapper)

    • Searches for keyboard ids should work

    • Phrases are not working yet (need to split into either a phrase search or separate words)

    • Searches for bcp47 tags, scripts, regions should work

      • need to highlight these on keyman.com (incl. keyboard_id)

FAIL: http://api.keyman.com.local/search/2.0?f=1&q=l:%, c:%, etc.

PHP Fatal error:  Uncaught PDOException: SQLSTATE[IMSSP]: The active result for the query contains no fields. in C:\Projects\keyman\sites\api.keyman.com\script\search\2.0\search.inc.php:236
Stack trace:
#0 C:\Projects\keyman\sites\api.keyman.com\script\search\2.0\search.inc.php(236): PDOStatement->fetchAll()
#1 C:\Projects\keyman\sites\api.keyman.com\script\search\2.0\search.inc.php(77): KeyboardSearch->GetSearchQueries()
#2 C:\Projects\keyman\sites\api.keyman.com\script\search\2.0\search.inc.php(73): KeyboardSearch->WriteSearchResults()
#3 C:\Projects\keyman\sites\api.keyman.com\script\search\2.0\search.php(50): KeyboardSearch->GetSearchMatches()
#4 {main}
  thrown in C:\Projects\keyman\sites\api.keyman.com\script\search\2.0\search.inc.php on line 236
  1. Default search should return a FLAT LIST of KEYBOARDS ONLY with highlights. e.g. 'Thai' should return keyboards with 'Thai' in the name, in a language name, or in the country associated with the language.

  2. Search results must be weighted (summed?):

     a) match of primary language name 1.0
     b) match of alternate language name 0.3
     c) match of keyboard name or id 1.0
     d) match of script name 1.0
     e) match of country name 0.5
     f) match on term in description 0.5
     g) match quality (whole word match = 1.0, down to 0.1 for further distance? as a multiplicand)

         select * from t_langtag_name
         inner join containstable(t_langtag_name, name, 'isabout (thai weight (1.0), "thai*" weight (0.5))') as KEY_TBL
           ON t_langtag_name._id = KEY_TBL.[KEY]
         order by [RaNK] desc

         5 / 5 = 1.0
         4 / 5 = 0.8
         1 / 5 = 0.2

     NOTE: final weighting is different but ... let's see how it goes

  3. Can also specify a search:

         ?q=l:<term>      search for keyboards that support a language, by name (does not check id)
         ?q=l:id:<id>     search for keyboards that support a language, by BCP 47 id
         ?q=c:<term>      search for keyboards that support languages within a country
         ?q=c:id:<id>     search for keyboards that support languages within a country, by ISO 3166 id
         ?q=s:<term>      search for keyboards that support a script
         ?q=s:id:<id>     search for keyboards that support a script, by script id
         ?q=id:<id>       search for keyboards that match the id
         ?q=legacy:<id>   search for keyboards that match the legacy id, only one returned!

  4. Should be able to specify alternate names? Searches should match on NFKD with diacritics stripped.
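
A rough sketch of how items 3 and 4 could fit together: split the query into a scope prefix and a term, then fold the term with NFKD and drop combining marks. This is Python for illustration only, not the actual search.php code; the function names, the returned dictionary shape, and the choice to case-fold are all assumptions.

```python
import unicodedata

# Prefix letters come from the list above; the parsing rules are an assumption.
SCOPES = {"l": "language", "c": "country", "s": "script"}


def fold(text: str) -> str:
    """Decompose to NFKD, drop combining marks (diacritics), and case-fold."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def parse_query(q: str) -> dict:
    """Split a query such as 'l:id:nqo' into scope, by-id flag, and folded term."""
    q = q.strip()
    head, sep, rest = q.partition(":")
    head = head.lower()
    if sep and head in SCOPES:
        by_id = rest.lower().startswith("id:")
        term = rest[3:] if by_id else rest
        return {"scope": SCOPES[head], "by_id": by_id, "term": fold(term)}
    if sep and head in ("id", "legacy"):
        scope = "keyboard" if head == "id" else "legacy"
        return {"scope": scope, "by_id": True, "term": fold(rest)}
    # No recognised prefix: treat the whole string as a general search term.
    return {"scope": "any", "by_id": False, "term": fold(q)}


print(parse_query("l:id:nqo"))
print(parse_query("c:United States"))
print(fold("Café"))
```

Note that this drops every character with a non-zero combining class, which may be too aggressive for some non-Latin scripts, so a real implementation would probably need per-script care.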

Process:

  1. Add langtags.json to website database so it is available for rewrite. Don't touch APIs at this stage. Deploy this.
  2. Rewrite search.php to support the details above, basing off langtags.json.
mcdurdin commented 4 years ago

Also: review feedback from @ermshiperete at https://community.software.sil.org/t/looking-for-testers-keyboard-search-refresh/3570/8

mcdurdin commented 4 years ago

TODO:

darcywong00 commented 4 years ago
mcdurdin commented 4 years ago
mcdurdin commented 4 years ago

@ermshiperete comments:

Further comments from @ermshiperete:

Just found a bug:

Further comments from @ermshiperete:

mcdurdin commented 4 years ago

I was trying to query for a recently added sil_nko keyboard for the N’Ko language

My query l:n’ko gives a list of 7 Keyboards for languages matching ‘n’ko’ but sil_nko isn’t one of them.

The keyboard page https://keyman-staging.com/keyboards/sil_nko

does list N’Ko (l:id:nqo) as one of the supported languages

mcdurdin commented 4 years ago
  • The search currently only finds languages that start with the search term. Previously it also listed languages that contained that term. Searching for "German" now shows all keyboards for the German language, but not "German, Pennsylvania" that it showed previously. Searching for "Pennsylvania Dutch" shows the expected results, but searching just for "Dutch" shows only keyboards for Dutch, but not Pennsylvania Dutch.

This is by design. There is only one keyboard currently listed that supports those languages: sil_euro_latin. However, because it also supports Dutch and German, searching for those terms finds the shorter matching language names first. Because we don't do a nested search now, just a keyboard search, these types of changes in results are to be expected.

  • Searching for "Amish" (that previously showed language "German, Pennsylvania (Amish Pennsylvania German)") has 0 results.

langtags.json does not list Amish Pennsylvania German as an alternate language name for Pennsylvania German. If this is a problem, it should be fixed in langtags.json.

  • It's not possible to search for language code (unless you use l:id:_code_)

This is an advanced feature and is by design. There is now a hint to help you search by language code on the search page.

  • Searching for usa shows keyboards for Usakade language, but not keyboards used in the USA.

  • Searching for c:id:usa has 0 results (don't they use keyboards anymore? :-) ). Ah, I see. It expects the two-letter codes from ISO 3166-1 alpha-2, not the three-letter codes: "c:id:us"

  • Searching for c:usa has 0 results. Has to use c:united states (and if I'm slower to type than the search to find results then I can't type that because it strips the space)

Correct. We don't currently support synonyms or abbreviations for country names. This would be a low priority feature I think; I don't want to maintain a database of synonyms for countries and the ISO 3166-1 list does not include them. We use the ISO 3166-1 alpha-2 list, which is the most common format.

  • Searching for "swa" shows keyboards for several languages that start with "swa". However, it doesn't show "EuroLatin (SIL)" for Swabian; Swabian keyboards appear under obsolete keyboards. Searching for "swab" shows the EuroLatin one as well.

This is by design. "EuroLatin (SIL)" matches on "Swati" language rather than "Swabian", and the keyboard won't be shown twice. There are 13 different language names starting with "swa" in langtags.json and we don't want to show duplicates. Just keep typing if it hasn't found the name you are looking for 😉.
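
To illustrate the "won't be shown twice" behaviour, here is a minimal Python sketch that keeps only the best-weighted match per keyboard. It is not the real query: the tuple layout, the weights, and the `example_swahili` keyboard id are hypothetical, and how the real search ranks "Swati" above "Swabian" is not spelled out in this thread.

```python
def best_match_per_keyboard(matches):
    """Keep the highest-weighted match for each keyboard id.

    matches: iterable of (keyboard_id, matched_name, weight) tuples.
    Returns a list of the surviving tuples, highest weight first.
    """
    best = {}
    for keyboard_id, name, weight in matches:
        if keyboard_id not in best or weight > best[keyboard_id][1]:
            best[keyboard_id] = (name, weight)
    rows = [(kb, name, weight) for kb, (name, weight) in best.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical weights for a search on "swa".
matches = [
    ("sil_euro_latin", "Swabian", 0.6),
    ("sil_euro_latin", "Swati", 0.8),
    ("example_swahili", "Swahili", 0.7),
]
for row in best_match_per_keyboard(matches):
    print(row)
```

With these made-up weights, sil_euro_latin appears once, labelled with "Swati", and the "Swabian" match is dropped rather than listed as a second row.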

  • The display of results when searching for localized language name is awkward: "EuroLatin (SIL)(Deutsch language)". Putting the language first would be better: "EuroLatin (SIL)(language: Deutsch)"

I think this is mostly personal preference 😁.

  • Searching for l:id:ydd shows results for BCP 47 tag yi - which might be correct but is a bit surprising. l:id:yi shows same results (old search didn't find anything for l:id:yi). Searching for l:id:yih shows results for BCP tag yih.

This is correct. We normalise the BCP 47 language subtag from ISO 639-3 to ISO 639-1 (which gives us ydd->yi). yih does not have an ISO 639-1 code, so it is left unchanged.
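
A small Python sketch of that normalisation, assuming a lookup table from three-letter to two-letter subtags. The table here is a hypothetical two-entry subset (only ydd->yi comes from this thread), and the function name is made up for illustration.

```python
# Hypothetical subset of a lookup table; only ydd -> yi is taken from the thread.
TO_SHORT_SUBTAG = {"ydd": "yi", "deu": "de"}


def normalize_language_subtag(tag: str) -> str:
    """Replace the primary language subtag with its short form when one is known."""
    parts = tag.split("-")
    primary = parts[0].lower()
    parts[0] = TO_SHORT_SUBTAG.get(primary, primary)
    return "-".join(parts)


for tag in ("ydd", "yi", "yih", "ydd-Hebr"):
    print(tag, "->", normalize_language_subtag(tag))
```

Under this scheme l:id:ydd and l:id:yi resolve to the same tag, while l:id:yih passes through untouched because the table has no short form for it.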

  • Searching for yiddis shows "Yiddish Pasekh". Searching for yiddish shows "Yiddish Pasekh (Yiddish language)". Searching for yiddish p shows "Yiddish Pasekh" again.

This is a side-effect of the precise match signal, which pushes the exact string match of Yiddish language name into a higher weight. I don't think I'll try and improve it 😄.

  • The old search listed languages and countries related to the search term which I find helpful.

I also, in some ways, prefer the nested search results... But this was the trade-off I made at the start of the design. The old search had too much complexity due to the multiple search result lists and I think that this simpler flat search result matches what most users are going to expect (as they will be familiar with the flat Google-style searches).

This has been resolved in an earlier PR.

  • When searching for ipa, why does IPATotal show up first (with 82 monthly downloads) and IPA (SIL) only second? Especially since I did the search on Linux where IPATotal is not supported. I would have expected IPA (SIL) to show up first.

Okay, so this is actually a bit of a tricky one.

For the embedded search, IPATotal would not show up. For the basic web search, we don't currently use the user's platform as a signal. The unexpected ordering here comes about because we multiply the match weight by the ln() of the download count (+2 for reasons).

IPATotal currently wins out because its name starts with IPA as well as having IPA in the description, giving it a basic weight of 60 vs SIL IPA of 35.

The final weights are 286.24 and 225.48 respectively. We just need to download sil_ipa another 3000 times a month and it'll sort itself out 🙈. Perhaps that indicates that ln() is a little too strong. Maybe sqrt() is a better curve, making popularity a stronger signal?

And with sqrt(), we end up with final weights of 652 and 877 approx, respectively, so SIL IPA would win. But does this hurt other searches? What are our other options?

Changing this formula will break all my tests because all the weights change so I am really not very keen 🤣... but will do if this is a good solution. Thoughts appreciated.
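
For anyone who wants to play with the curves, here is a minimal Python sketch of the comparison. The base weights (60 and 35) are the ones quoted above; the download counts are hypothetical values back-solved from the quoted final weights, and applying the +2 inside the curve is an assumption about the real formula.

```python
import math


def final_weight(match_weight, monthly_downloads, curve=math.log):
    """Scale the match weight by a popularity curve of the download count.

    The "+2" is applied inside the curve here; that placement is an assumption.
    """
    return match_weight * curve(monthly_downloads + 2)


# (base match weight, hypothetical monthly downloads)
candidates = {"IPATotal": (60, 116), "IPA (SIL)": (35, 626)}

for curve in (math.log, math.sqrt):
    ranked = sorted(
        candidates,
        key=lambda name: final_weight(*candidates[name], curve),
        reverse=True,
    )
    scores = [(name, round(final_weight(*candidates[name], curve), 2)) for name in ranked]
    print(curve.__name__, scores)
```

With these inputs the log curve ranks IPATotal first and the sqrt curve ranks IPA (SIL) first, which matches the ordering described above. Swapping in other curves is a one-line change to the tuple in the loop.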

mcdurdin commented 4 years ago

Finish keyboard install page (aka universal link infrastructure) for:

mcdurdin commented 4 years ago
darcywong00 commented 4 years ago

I think the staging site is using the BCP 47 tag und-fonipa for the sil_ipa keyboard, but the keyboard package metadata is using und-latn.

On Keyman for Android alpha, I do a keyboard search for "sil_ipa" and install the keyboard. The sil_ipa keyboard shows up with the tag und-Latn. From the app, I then do a keyboard search for l:id:und-latn and I get 0 results. (Shouldn't it have found sil_ipa?)

mcdurdin commented 4 years ago

Re und-fonipa and und-latn: this arises from a disconnect between the sil_ipa.keyboard_info and sil_ipa.kps language data:

sil_ipa.keyboard_info

    "languages": ["und-fonipa"],

sil_ipa.kps

      <Languages>
        <Language ID="und-Latn">und-Latn</Language>
      </Languages>

This was deliberate at the time, because we had trouble installing und-fonipa on some platforms. This will be resolved when we go to 14.0 release, so we should plan to update the SIL IPA keyboard to use und-fonipa in sil_ipa.kps as well.
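
A quick way to spot this kind of disconnect is to compare the two language lists directly. The Python sketch below wraps the two fragments quoted above in minimal stand-in documents; the `Package` root element and the function name are assumptions, and real .keyboard_info and .kps files contain much more than this.

```python
import json
import xml.etree.ElementTree as ET

# Minimal stand-ins built from the fragments quoted above.
keyboard_info = json.loads('{"languages": ["und-fonipa"]}')
kps = ET.fromstring(
    '<Package><Languages>'
    '<Language ID="und-Latn">und-Latn</Language>'
    '</Languages></Package>'
)


def language_mismatch(info, package):
    """Return the tags that appear in only one of the two sources (case-insensitive)."""
    info_tags = {tag.lower() for tag in info.get("languages", [])}
    kps_tags = {el.get("ID", "").lower() for el in package.iter("Language")}
    return {
        "only_in_keyboard_info": sorted(info_tags - kps_tags),
        "only_in_kps": sorted(kps_tags - info_tags),
    }


print(language_mismatch(keyboard_info, kps))
```

An empty result on both sides would mean the search index and the installed package agree on the keyboard's languages.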

mcdurdin commented 4 years ago

All remaining items extracted into separate issues, so closing this mega checklist