pelias / parser

natural language classification engine for geocoding
https://parser.demo.geocode.earth
MIT License

Right single quotation mark in node name causes it to be unsearchable in autocomplete #169

Open BrindusaN opened 1 year ago

BrindusaN commented 1 year ago

Describe the bug

When searching via autocomplete for this place (В’ячеслава Чорновола вулиця 8), no results are returned. However, reverse geocoding does return the place.

In the autocomplete request I can see that the parser disregards everything before the right single quotation mark (’), so the subject becomes 8 ячеслава Чорновола вулиця (street: ячеслава Чорновола вулиця, housenumber: 8).

I have tested with a different place that has an apostrophe in its name instead, and it works as expected.

Is this a parser issue, or is the schema also affected? I see in this file that this character is not included.

Steps to Reproduce

Search for a place that has a right single quotation mark in its name using autocomplete.

Expected behavior

Expected the place to be returned since it exists in the database.

Environment (please complete the following information)

NA

Pastebin/Screenshots

NA

Additional context

NA

References

NA

missinglink commented 1 year ago

Is it one of these quotes? https://github.com/pelias/parser/blob/master/tokenization/split_funcs.js#L10

The Pelias parser treats those quotes as word boundaries, although there is a code comment below noting that this should only be for quote pairs.
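To illustrate the effect described above, here is a minimal sketch of splitting text on quote characters. It is not the parser's actual code: the `QUOTES` list is a hypothetical, abbreviated stand-in for the one in `split_funcs.js`, and `splitOnBoundary` is a name invented for this example.

```javascript
// Hypothetical, abbreviated list of quote characters: " « » ‘ ’ “ ”
// The real list in tokenization/split_funcs.js may differ.
const QUOTES = '"\u00AB\u00BB\u2018\u2019\u201C\u201D';

// Split text wherever the predicate reports a boundary character.
function splitOnBoundary(text, isBoundary) {
  const parts = [];
  let current = '';
  for (const char of text) {
    if (isBoundary(char)) {
      if (current.length > 0) parts.push(current);
      current = '';
    } else {
      current += char;
    }
  }
  if (current.length > 0) parts.push(current);
  return parts;
}

const isQuote = (char) => QUOTES.includes(char);

// U+2019 is in the list, so the street name is cut after the first letter.
const withQuoteMark = splitOnBoundary('В\u2019ячеслава Чорновола вулиця 8', isQuote);
console.log(withQuoteMark); // [ 'В', 'ячеслава Чорновола вулиця 8' ]

// The ASCII apostrophe (U+0027) is not in the list, so nothing is split.
const withApostrophe = splitOnBoundary("В'ячеслава Чорновола вулиця 8", isQuote);
console.log(withApostrophe); // [ "В'ячеслава Чорновола вулиця 8" ]
```

The sketch only shows the split itself; how the parser then classifies or discards the leading fragment is not modelled here.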

missinglink commented 1 year ago

I'm not sure if this is a data error or a code error. Surely 'apostrophe' is the correct character to use?

a mark ' used to indicate the omission of letters or figures

The same dictionary describes a quotation mark as:

used chiefly to indicate the beginning and the end of a quotation in which the exact phraseology of another or of a text is directly cited

BrindusaN commented 1 year ago

Hi,

Yes, it is one of the characters in the split_funcs.

AFAIK the right single quotation mark can be used in some languages to alter the sound of a letter (a diacritical mark). Wikipedia describes a right single quotation mark as:

The Unicode character ’ (U+2019 right single quotation mark) is used both for a typographic apostrophe and a single right (closing) quotation mark.

Both the apostrophe and the right single quotation mark can function as modifier letters, and this usage occurs in the Ukrainian language.
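Whatever role these characters play inside a word, Unicode assigns them distinct code points and general categories, which is what software actually sees. The following sketch prints both for the two characters discussed here. U+02BC (modifier letter apostrophe) is not mentioned in this thread and is included only for comparison; the helper names are made up for this example.

```javascript
const characters = [
  { name: 'apostrophe', char: '\u0027' },
  { name: 'right single quotation mark', char: '\u2019' },
  { name: 'modifier letter apostrophe', char: '\u02BC' } // comparison only
];

// Format a character's code point as U+XXXX.
function codePointLabel(char) {
  return 'U+' + char.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
}

// Report the Unicode general category, limited to the three relevant here.
function generalCategory(char) {
  if (/\p{Lm}/u.test(char)) return 'Lm (modifier letter)';
  if (/\p{Pf}/u.test(char)) return 'Pf (final punctuation)';
  if (/\p{Po}/u.test(char)) return 'Po (other punctuation)';
  return 'other';
}

for (const { name, char } of characters) {
  console.log(`${codePointLabel(char)} ${name}: ${generalCategory(char)}`);
}
```

Because U+0027 and U+2019 are different code points, a quote list that contains one but not the other will treat otherwise identical place names differently.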

missinglink commented 1 year ago

Agh ok, thanks for posting that link. We're definitely in this situation of "difficulty of software distinguishing which character is intended by a user's typing".

I don't have the time to work on this right now, but I'd be fine with removing it from the quotes array. The question is, will that break anything?

A more robust solution would involve splitting these quotes into opening/closing pairs and only considering them as word boundaries when both exist in the text, although this may cause issues with autocomplete.
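A rough sketch of that pair-based idea follows, assuming a small hypothetical table of opening and closing quotes. None of the names (`QUOTE_PAIRS`, `pairedQuoteIndexes`, `splitOnPairedQuotes`) come from the parser; this is an illustration of the approach, not a proposed patch.

```javascript
// Hypothetical opening -> closing pairs; not taken from the parser source.
const QUOTE_PAIRS = new Map([
  ['\u2018', '\u2019'], // ‘ ’
  ['\u201C', '\u201D'], // “ ”
  ['\u00AB', '\u00BB']  // « »
]);

// Collect the positions of quotes that belong to a matched open/close pair.
function pairedQuoteIndexes(text) {
  const paired = new Set();
  const open = []; // stack of { index, closer } for unmatched openers
  Array.from(text).forEach((char, index) => {
    if (QUOTE_PAIRS.has(char)) {
      open.push({ index, closer: QUOTE_PAIRS.get(char) });
    } else if (open.length > 0 && open[open.length - 1].closer === char) {
      paired.add(open.pop().index);
      paired.add(index);
    }
  });
  return paired;
}

// Split only at quotes that are part of a matched pair.
function splitOnPairedQuotes(text) {
  const paired = pairedQuoteIndexes(text);
  const parts = [];
  let current = '';
  Array.from(text).forEach((char, index) => {
    if (paired.has(index)) {
      if (current.trim().length > 0) parts.push(current.trim());
      current = '';
    } else {
      current += char;
    }
  });
  if (current.trim().length > 0) parts.push(current.trim());
  return parts;
}

// A lone ’ has no opener, so it stays inside the word.
const loneMark = splitOnPairedQuotes('В\u2019ячеслава Чорновола вулиця 8');
console.log(loneMark); // [ 'В’ячеслава Чорновола вулиця 8' ]

// A complete ‘…’ pair still acts as a word boundary.
const quotedName = splitOnPairedQuotes('\u2018Cafe Roma\u2019 Main Street');
console.log(quotedName); // [ 'Cafe Roma', 'Main Street' ]

// Partially typed input has an unmatched opener, so nothing is split yet.
const partialInput = splitOnPairedQuotes('\u2018Cafe Ro');
console.log(partialInput); // [ '‘Cafe Ro' ]
```

The last case hints at the autocomplete concern: the result changes shape only once the closing quote has been typed, so intermediate queries may be interpreted differently from the final one.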