Closed onnotasler closed 1 year ago
This was (finally) fixed last year. Has there been a regression? Can you give examples which don't work as you expect them to? For the example that you gave, both https://openlibrary.org/search/authors?q=Cura%C3%A7ao and https://openlibrary.org/search/authors?q=Curacao return the same results.
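For anyone comparing the two URLs above: the only difference is the percent-encoded ç. A minimal sketch using Python's standard library (the variable names are just for illustration) shows where the %C3%A7 comes from:

```python
from urllib.parse import quote, unquote

# "\u00e7" is the precomposed c-with-cedilla used in "Curaçao".
accented = "Cura\u00e7ao"
plain = "Curacao"

# quote() percent-encodes the UTF-8 bytes of every non-ASCII character.
encoded_accented = quote(accented)
encoded_plain = quote(plain)

print(encoded_accented)
print(encoded_plain)

# Decoding the query parameter gives back the accented spelling.
print(unquote("Cura%C3%A7ao") == accented)
```

So the two requests really do send different strings to the search backend, and identical results mean the folding happens on the server side.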
BTW, the string of issues that reported this actually begins with #11 and #178, so it's a long-known problem (but one that's fixed now, as far as I'm aware).
It does work for "authors" and "inside". It does not work for "books" and "subjects".
That's an important detail. I'd suggest splitting this into two separate issues, one for each problem that needs to be fixed. Unfortunately, it looks like fixing the titles is going to require a reindex because they were given the wrong type in the schema (text_en_splitting vs text_international). https://github.com/internetarchive/openlibrary/blob/9c522cbe9814bbbc356cfa7e74134951cd661108/conf/solr/conf/managed-schema#L143-L147
Note that subjects are a complete mess and doing diacritic folding during search won't fix the bulk of the problem. Users who search for Geschichte or Histoire aren't going to find the bulk of the History books that OpenLibrary has cataloged. Subjects need to become first class objects (#2819) for that to improve.
Actually, neither text_en_splitting (which does English-only stemming, but no diacritic folding) nor text_international (which does diacritic folding, but no stemming) is likely to be appropriate for work titles, where you want both diacritic folding and language-specific stemming/lemmatization.
I have created a draft pull request that handles this. I'm going to kick off a full reindex for our testing.openlibrary.org, which should be ready for testing tomorrow.
It makes all text_en_splitting fields perform ICUFolding. This will affect a number of fields: title/subtitle, but also subject/etc. This might have some funny side effects, but I think it might be better than what we currently have. And we can always make changes in the future!
Ok, diacritic-insensitive search is now ready for full testing on testing.openlibrary.org! You can use https://codepen.io/cdrini/full/wvJqzaK to see results side-by-side, or just use testing.openlibrary.org directly. Would love some help testing the various types of search, title search, and search in different languages!
OpenLibrary is meant to be multilingual, which means it has books in many different languages. Many of these languages have symbols that look identical (homoglyphs) or at least very similar to a human, but are stored very differently.
Example: The island of Curaçao or the Turkish word Hayır.
For a (non-linguist) human, the ç is just a c with some kind of tail, and the ı is just an i that lost its dot. For a computer, ç / c or ı / i are completely separate entities. Other examples are the German umlauts ä, ö and ü (which for non-German speakers are simply a, o and u with silly dots, and for German speakers are shorthand for ae, oe, and ue) or the French accents aigu (é), grave (è, ù, à), and circonflexe (â, ê, î, ô). Basically, this affects anyone who uses glyphs that do not belong to the basic 128 US-ASCII glyphs.
I will thus tend to enter the perceived "base sign" instead of the correct one. You also find many imported entries where the diacritics have been replaced by similar-looking Latin signs (take OL35438090M for example: it was imported as Batida Yeni Bir Sey Yok, while the correct form would be Batıda Yeni Bir Şey Yok).
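To make the "completely separate entities" point concrete, here is a small sketch using Python's standard unicodedata module. The describe helper is a name made up for this illustration. It prints the code points behind a few of the letters mentioned above, and what canonical decomposition (NFD) does to them:

```python
import unicodedata

def describe(text):
    """Return (code point, official Unicode name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# c, c-with-cedilla, i, dotless i, capital S-with-cedilla
for ch in ("c", "\u00e7", "i", "\u0131", "\u015e"):
    decomposed = unicodedata.normalize("NFD", ch)
    print(repr(ch), describe(ch), "->", describe(decomposed))
```

Note that ç and Ş decompose into a base letter plus a combining cedilla, while the Turkish dotless ı has no decomposition at all. Stripping combining marks alone is therefore not enough to map ı to i.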
Describe the problem that you'd like solved
Ideally, the search should treat letters the same way as a human would: A search for Curacao should also show results including Curaçao, unless I specifically tell the search not to do this. That would make it easier for humans to find the book they are looking for (and for librarians to find duplicate entries).
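As a rough illustration of the behaviour I have in mind (not of how Open Library's Solr index does or should implement it), here is a minimal Python sketch. The fold and matches functions and the EXTRA_FOLDS table are hypothetical names for this example only, and the table covers just the dotless ı from the examples above:

```python
import unicodedata

# Hypothetical extra mappings for letters without a Unicode decomposition.
EXTRA_FOLDS = str.maketrans({"\u0131": "i"})  # dotless i -> i

def fold(text):
    """Lowercase the text, then strip diacritics so accented and
    unaccented spellings compare equal."""
    text = text.casefold()
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (cedilla, accents, umlaut dots, ...).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(EXTRA_FOLDS)

def matches(query, title):
    """Naive substring match on the folded forms."""
    return fold(query) in fold(title)

print(matches("Curacao", "Cura\u00e7ao"))
print(fold("Bat\u0131da Yeni Bir \u015eey Yok") == fold("Batida Yeni Bir Sey Yok"))
```

A real search engine would apply the same folding to both the indexed text and the query, and an exact-match mode would simply skip the fold step.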
Proposal & Constraints
As far as I am aware, the Unicode collation algorithm offers a base for this kind of treatment.
I know that it can be solved though, as Firefox uses such a system for its internal search on pages: If I click CTRL-F in Firefox and then enter Curacao, it will highlight both Curaçao and Curacao.
Additional context
There have been similar search related issues before:
#5310
#6974
#6059
#714
Stakeholders
@cdrini