internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Search should be aware of typical "diacritic replacement characters" #7040

Closed. onnotasler closed this issue 1 year ago

onnotasler commented 2 years ago

OpenLibrary is meant to be multilingual, which means it has books in a great many languages. Many of these languages have symbols that look identical (homoglyphs) or at least very similar to a human, but are stored very differently.

Example: The island of Curaçao or the Turkish word Hayır.

For a (non-linguist) human, the ç is just a c with some kind of tail, and the ı is just an i that lost its dot. For a computer, ç / c and ı / i are completely separate entities. Other examples are the German umlauts ä, ö and ü (which for non-German speakers are simply a, o and u with silly dots, and for German speakers are shorthand for ae, oe and ue) or the French accent aigu (é), accent grave (è, ù, à) and accent circonflexe (â, ê, î, ô). Basically, this affects anyone who uses glyphs outside the basic 128 US-ASCII characters.
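To illustrate the "separate entities" point, here is a small sketch using only Python's standard `unicodedata` module. The helper name `describe` is made up for this example; it just lists the code point and Unicode name of each character.

```python
import unicodedata

def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# Compare each "base sign" with its look-alike.
for pair in ("cç", "iı"):
    for ch, code_point, name in describe(pair):
        print(ch, code_point, name)

# A plain string comparison sees no relationship between the two forms.
print("Curaçao" == "Curacao")
```

Each look-alike has its own code point and name, so an exact-match search treats Curaçao and Curacao as unrelated strings.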

As a result, I tend to enter the perceived "base sign" instead of the correct one. There are also many imported entries where the diacritics have been replaced by similar-looking basic Latin letters (take OL35438090M for example: it was imported as Batida Yeni Bir Sey Yok, while the correct form would be Batıda Yeni Bir Şey Yok).

Describe the problem that you'd like solved

Ideally, the search should treat letters the same way as a human would: A search for Curacao should also show results including Curaçao, unless I specifically tell the search not to do this. That would make it easier for humans to find the book they are looking for (and for librarians to find duplicate entries).

Proposal & Constraints

As far as I am aware, the Unicode Collation Algorithm offers a basis for this kind of treatment.

I know that it can be solved, though, as Firefox uses such a system for its in-page search: if I press Ctrl+F in Firefox and enter Curacao, it highlights both Curaçao and Curacao.
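A rough sketch of the kind of folding meant here, using Python's standard library. This is not the Unicode Collation Algorithm itself, and it is not how Firefox or Open Library implement it. It uses a simpler technique: decompose to NFD, drop combining marks, and case-fold. The `fold` function and the `EXTRA_FOLDS` table are hypothetical names for this example, and the table is deliberately incomplete. It is needed because some letters, such as the Turkish ı, have no canonical decomposition.

```python
import unicodedata

# Letters without a canonical decomposition need an explicit mapping.
# Illustrative only: a real table would be much longer.
EXTRA_FOLDS = str.maketrans({"ı": "i"})

def fold(text: str) -> str:
    """Return an accent- and case-insensitive key for matching."""
    # NFD splits e.g. "ç" into "c" followed by a combining cedilla.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Handle the leftovers, then ignore case.
    return stripped.translate(EXTRA_FOLDS).casefold()

print(fold("Curaçao") == fold("Curacao"))
print(fold("Batıda Yeni Bir Şey Yok") == fold("Batida Yeni Bir Sey Yok"))
```

A search that compares folded keys on both the query and the indexed text would match either spelling. Language-specific conventions, such as treating the German ä as ae, are not covered by this sketch.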

Additional context

There have been similar search related issues before:

#5310

#6974

#6059

#714

Stakeholders

@cdrini

tfmorris commented 2 years ago

This was (finally) fixed last year. Has there been a regression? Can you give examples which don't work as you expect them to? For the example that you gave, both https://openlibrary.org/search/authors?q=Cura%C3%A7ao and https://openlibrary.org/search/authors?q=Curacao return the same results.

BTW, the string of issues that reported this actually begins with #11 and #178, so it's a long-known problem (but one that's fixed now, as far as I'm aware).
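For anyone comparing the two query URLs above: `%C3%A7` is simply the percent-encoded UTF-8 form of ç. A quick check with Python's standard `urllib.parse` module:

```python
from urllib.parse import quote, unquote

# quote() percent-encodes the UTF-8 bytes of non-ASCII characters.
print(quote("Curaçao"))

# unquote() reverses the encoding.
print(unquote("Cura%C3%A7ao"))
```

So the first URL searches for the spelling with the cedilla and the second for the plain ASCII spelling.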

onnotasler commented 2 years ago

It does work for "authors" and "inside". It does not work for "books" and "subjects".

[Screenshot: Screenshot_20221006_183639]

[Screenshot: Screenshot_20221006_183614]

tfmorris commented 2 years ago

That's an important detail. I'd suggest splitting this into two separate issues, one for each problem that needs to be fixed. Unfortunately, it looks like fixing the titles is going to require a reindex because they were given the wrong type in the schema (text_en_splitting vs text_international). https://github.com/internetarchive/openlibrary/blob/9c522cbe9814bbbc356cfa7e74134951cd661108/conf/solr/conf/managed-schema#L143-L147

Note that subjects are a complete mess, and doing diacritic folding during search won't fix the bulk of the problem. Users who search for Geschichte or Histoire aren't going to find the bulk of the History books that OpenLibrary has cataloged. Subjects need to become first-class objects (#2819) for that to improve.

tfmorris commented 2 years ago

Actually, neither text_en_splitting (which does English-only stemming, but no diacritic folding) nor text_international (which does diacritic folding, but no stemming) is likely to be appropriate for work titles, where you want both diacritic folding and language-specific stemming/lemmatization.

cdrini commented 1 year ago

I have created a draft pull request that handles this. I'm going to kick off a full reindex for our testing.openlibrary.org, which should be ready for testing tomorrow.

It makes all text_en_splitting fields perform ICUFolding. This will affect a number of fields: title/subtitle, but also subject/etc. This might have some funny side effects, but I think it might be better than what we currently have. And we can always make changes in the future!

cdrini commented 1 year ago

OK, diacritic-insensitive search is now ready for full testing on testing.openlibrary.org! You can use https://codepen.io/cdrini/full/wvJqzaK to see results side by side, or just use testing.openlibrary.org directly. Would love some help testing the various types of search, title search, and search in different languages!