internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Search finds transliteration variations easily #2752

Open LeadSongDog opened 4 years ago

LeadSongDog commented 4 years ago

#11 was closed in favour of #178. Consider that

https://openlibrary.org/search?q=%22tiananmen%22&mode=everything finds 447 hits, but https://openlibrary.org/search?q=%22tian%27anmen%22&mode=everything finds only 45, and https://openlibrary.org/search?q=%22tiananmin%22&mode=everything finds zero. Some sort of Soundex or Metaphone normalization of titles and author names needs to be indexed to handle this.

tfmorris commented 4 years ago

There are also:

75 hits - https://openlibrary.org/search?q=tian%27+an+men&mode=everything
4 hits - https://openlibrary.org/search?q=Tienanmen&mode=everything

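but this isn't something Soundex will help with. Soundex, Metaphone, Double Metaphone, Daitch-Mokotoff, Beider-Morse, etc. are all designed for matching personal names, mostly as pronounced in English (Daitch-Mokotoff and Beider-Morse target Eastern European and Jewish surnames), not for reconciling romanizations of words from other scripts.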

The original language version of this word is: 天安门 and some other transliterations include:

Ideally, a query using the original or any of its transliterations would find the original plus all of the transliterations. An N-gram analyzer, which would also help with misspellings in other languages, might be one strategy.
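To make the N-gram idea concrete, here is a minimal sketch in Python, not a description of how Open Library's search index actually works. It assumes character trigrams compared with Jaccard similarity, and the function names (`normalize`, `ngrams`, `similarity`) are hypothetical. A real search engine would apply an equivalent analyzer at index and query time rather than comparing strings pairwise.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip diacritics, and drop punctuation and whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Combining marks, apostrophes, and spaces are not alphanumeric, so they are dropped.
    return "".join(ch for ch in decomposed if ch.isalnum())


def ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams of the normalized text."""
    norm = normalize(text)
    if len(norm) < n:
        return {norm} if norm else set()
    return {norm[i:i + n] for i in range(len(norm) - n + 1)}


def similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two n-gram sets (0.0 = disjoint, 1.0 = identical)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    # Variants taken from the queries quoted in this thread.
    for variant in ["tian'anmen", "tian' an men", "Tienanmen", "tiananmin"]:
        print(variant, round(similarity("tiananmen", variant), 2))
```

With this sketch, the apostrophe and spacing variants normalize to the same string as "tiananmen", while "Tienanmen" and "tiananmin" still share several trigrams with it, so a ranking threshold could surface them. It does nothing to connect 天安门 with its romanizations: N-grams cannot bridge scripts, so that mapping would still need a transliteration step or stored alternate titles.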

The transliteration issue isn't specific to Chinese. It affects Japanese, Tibetan, Greek, Russian, etc., with varying degrees of complexity.

[Also, to be clear: even though they're mentioned here, #11 and #178 (which are duplicates of each other) have nothing to do with this.]

BrittanyBunk commented 4 years ago

I think it's good that the search handles transliterations on top of the actual translation, as I have no doubt that people will use that. For me, if I saw a title in another language, I would feed it through Google Translate and it'd give me the transliteration, which I'd copy and paste into the search. Another person, though, might actually know the title and type that in. So I'm really glad to see both in the search. I think the actual titles will need to be added manually, as I don't expect a computer to know them without doing some deep, neural-network-like research (which I don't think it'll do).