osm-search / Nominatim

Open Source search based on OpenStreetMap data
https://nominatim.org
GNU General Public License v3.0

Hebrew Equivalent of "the" is not handled #765

Open ghost opened 7 years ago

ghost commented 7 years ago

There are serious search issues for Hebrew (and Arabic, which I may discuss in a different issue). Here's one of them.

The letter "ה" (U+05D4) often (but not always) means "The" when appearing as the first letter of a word. Currently, Nominatim does not handle this.

I'll write "ה" as "h" below for simplicity. Also, I'll use imperfect letter-for-letter transliteration. (edit: this is NOT a transliteration issue. I used transliteration below purely for convenience, and the issue is related to regular Hebrew)

If something is mapped as "x" and searched for as "hx" (English: "the x"), it will not be found, and vice versa. Such queries are very common.

Possible solution:

Strip the "h" letter from the start of words during indexing and during search queries. (Is this easily done with the current Nominatim codebase?)

Pros: Simple. Cons: Sometimes "h" doesn't mean "the". If "x" is a word and "hx" is a different word, then "hx" will be stored as "x", and searching for one may return results for the other. But I think this is a rare edge case, and the overall situation would be much better than it is now.

Two-letter words shouldn't be stripped, as that would leave a single letter.
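
To make the proposal concrete, here is a minimal sketch in Python. It is for illustration only and is not Nominatim code: the function names and the in-memory `index` are hypothetical, and it assumes unpointed text (no niqqud) split on whitespace. The same normalization is applied when a name is indexed and when a query is run, so "x" and "hx" meet at the same key.

```python
HE = "\u05D4"  # Hebrew letter he ("h" in the transliteration above)


def strip_he_prefix(word: str) -> str:
    """Drop a leading he, but leave words of two letters or fewer untouched."""
    if len(word) > 2 and word.startswith(HE):
        return word[1:]
    return word


def normalize(text: str) -> str:
    """Apply the prefix stripping to every whitespace-separated word."""
    return " ".join(strip_he_prefix(w) for w in text.split())


# Hypothetical stand-in for a search index: normalized name -> mapped names.
index: dict[str, list[str]] = {}


def add_name(name: str) -> None:
    """Index a mapped name under its normalized form."""
    index.setdefault(normalize(name), []).append(name)


def search(query: str) -> list[str]:
    """Normalize the query the same way and look it up."""
    return index.get(normalize(query), [])
```

With this sketch, a name indexed as "\u05D2\u05DF" (gimel, final nun) is returned for the query "\u05D4\u05D2\u05DF" (the same word with a leading he), while the two-letter word "\u05D4\u05E8" is left as it is. The known downside remains: two distinct words that differ only by a leading he end up under the same key.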

I can offer alternative solutions, but I'd appreciate some feedback first. I am also willing to contribute the required code, if it's not too hard to solve this.

lonvia commented 7 years ago

Thank you for the detailed explanation. It won't be possible to change anything on the current stop word handling until we have come up with a more flexible normalisation algorithm.

ghost commented 7 years ago

Thank you for your response. I'd like to point out that this is not a transliteration issue. I used transliteration above for convenience for those who don't know Hebrew.

lonvia commented 7 years ago

Yes, the 'transliteration' label is for all things related to normalization of queries. Stop word issues belong there, too, because Nominatim does all this in the same step.

ghost commented 7 years ago

> Thank you for the detailed explanation. It won't be possible to change anything on the current stop word handling until we have come up with a more flexible normalisation algorithm.

Would you mind explaining the current state a bit more, or directing me to some docs or relevant code? I might be able to help.

Would handling only simple cases (such as normalization that just removes specific words or characters) also require waiting for a more flexible normalization algorithm?