pulibrary / allsearch_api


Test diacritics and other non-Basic-Latin character search for all services #289

Closed maxkadel closed 4 weeks ago

maxkadel commented 1 month ago

What maintenance needs to be done?

Level of urgency

Why is this maintenance needed?

Acceptance criteria

Implementation notes, if any

maxkadel commented 1 month ago

Examples from @tventimi:

- https://allsearch.princeton.edu/?q=K%C5%8Dbuns%C5%8D%20Taika%20Koshomoku (works)
- https://allsearch.princeton.edu/?q=Kobunso+Taika+Koshomoku (doesn't work)
- https://allsearch.princeton.edu/?q=Ko%CC%84bunso%CC%84+taika+koshomoku (doesn't work)

Also:

- https://allsearch.princeton.edu/?q=Chos%C5%8Fn+wangjo+sillok (works)
- https://allsearch.princeton.edu/?q=Choson+wangjo+sillok (doesn't work)
- https://allsearch.princeton.edu/?q=Choso%CC%86n+wangjo+sillok (doesn't work)

The DB entries themselves use the precomposed form of the accented characters. Matching appears to fail when the search string uses the decomposed form or no diacritics at all. Blacklight does this kind of normalization, which is why these searches have hits in the catalog but not in the A-Z list.
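To illustrate the three forms involved, here is a minimal Python sketch of one way to fold them to a common comparison key: decompose to NFD, drop combining marks, and casefold. It is not the allsearch or Blacklight implementation; the function name `fold_diacritics` is hypothetical, and the sample strings are decoded from the example URLs above.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Hypothetical helper: build a diacritic-insensitive comparison key."""
    # NFD splits precomposed characters (e.g. U+014D) into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (non-zero canonical combining class), keep base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Casefold so that "Taika" and "taika" compare equal.
    return stripped.casefold()


# The three query forms from the examples above.
precomposed = "K\u014dbuns\u014d Taika Koshomoku"   # o-macron as one code point (U+014D)
decomposed = "Ko\u0304bunso\u0304 taika koshomoku"  # "o" followed by combining macron (U+0304)
unaccented = "Kobunso Taika Koshomoku"

# All three forms collapse to the same key once folded.
keys = {fold_diacritics(s) for s in (precomposed, decomposed, unaccented)}
print(keys)
```

If both the stored value and the incoming query were folded this way before comparison, the three spellings would match each other. Whether to fold at index time, at query time, or both is a design choice this sketch does not settle.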

kevinreiss commented 1 month ago

@sandbergja and I were talking about this issue, and Jane referenced this commit in Orangelight (OL) that made matching work for the precomposed, decomposed, and unaccented forms in the Browse lists: https://github.com/pulibrary/orangelight/pull/3894/files. We could potentially apply the same approach to our local indexes in allsearch so that all three forms are searchable.