hyphenated names - Githubissues

missinglink commented 9 years ago

names such as 51 Friedrich-Richter-Straße (address-osmnode-2967205513) should be searchable using the tokens ['friedrich','richter','strasse'] as well as ['friedrichrichterstrasse'] and ['friedrich-richter-strasse']
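
To make the three requested token forms concrete, here is a minimal sketch. `tokenVariants` is a hypothetical helper, not pelias code, and it assumes lowercasing plus folding ß to ss.

```javascript
// Hypothetical helper (not part of pelias): list the token sequences
// that a hyphenated name should be searchable by.
// Assumes lowercasing and folding 'ß' to 'ss'.
function tokenVariants(name) {
  const folded = name.toLowerCase().replace(/ß/g, 'ss');
  const parts = folded.split('-').filter(Boolean);
  return [
    parts,             // one token per hyphen-separated part
    [parts.join('')],  // single compound token
    [parts.join('-')], // single token with hyphens preserved
  ];
}

console.log(tokenVariants('Friedrich-Richter-Straße'));
```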

missinglink commented 9 years ago

see: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-compound-word-tokenfilter.html

missinglink commented 9 years ago

this is how the peliasTwoEdgeGram currently tokenizes that address: [ '51', 'fr', 'fri', 'frie', 'fried', 'friedr', 'friedri', 'friedric', 'friedrich', 'friedrich-' ]
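
A rough sketch of why the output stops at 'friedrich-': if the hyphen is not a token separator, only prefixes of the whole hyphenated word are indexed. The whitespace-only split, the ß folding, and the gram lengths (2 to 10) are assumptions chosen to reproduce the list above, not values read from the pelias settings.

```javascript
// Sketch: whitespace tokenization followed by edge n-grams.
// minGram = 2 and maxGram = 10 are assumptions, not the actual pelias config.
function edgeGrams(text, minGram = 2, maxGram = 10) {
  const tokens = text
    .toLowerCase()
    .replace(/ß/g, 'ss')
    .split(/\s+/)
    .filter(Boolean);
  const grams = [];
  for (const token of tokens) {
    const longest = Math.min(maxGram, token.length);
    // Emit every prefix of the token from minGram up to longest characters.
    for (let len = minGram; len <= longest; len++) {
      grams.push(token.slice(0, len));
    }
  }
  return grams;
}

console.log(edgeGrams('51 Friedrich-Richter-Straße'));
```

Under these assumptions 'richter' and 'strasse' never start a gram of their own, so a query for either word alone has nothing to match.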

missinglink commented 9 years ago

Leonardo da Vinci–Fiumicino Airport should be searchable by Fiumicino Airport http://pelias.mapzen.com/doc?id=geoname:6299619

dianashk commented 9 years ago

Add acceptance-tests in order to gauge impact.

orangejulius commented 8 years ago

Just checked, this is still an area we could improve. Something to think about for the near-ish future

missinglink commented 8 years ago

This feature will require alt-names as the street name above can have 3 forms:

Friedrich-Richter-Straße
Friedrich Richter Straße
FriedrichRichterStraße

Moving this to the alt-names milestone: until alt-names are supported, at most 2 of the 3 forms can be handled.

amatissart commented 6 years ago

I am facing a similar (maybe simpler?) issue with French names. A search for stade roland-garros should return similar results to stade roland garros.

Would it help to add a hyphen (-) to the tokenizer's pattern (see https://github.com/pelias/schema/blob/master/settings.js#L18)? Or would that cause serious regressions with other languages?
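
A small sketch of what that change would do to the tokens. Both separator patterns here are hypothetical stand-ins; the real pattern in settings.js may differ.

```javascript
// Hypothetical separator patterns, for illustration only.
const WITHOUT_HYPHEN = /[\s,]+/;
const WITH_HYPHEN = /[\s,-]+/; // trailing '-' in a character class is a literal hyphen

function tokenize(text, separators) {
  return text.toLowerCase().split(separators).filter(Boolean);
}

// Without the hyphen, 'roland-garros' stays a single token.
console.log(tokenize('stade roland-garros', WITHOUT_HYPHEN));
// With the hyphen, both spellings produce the same tokens.
console.log(tokenize('stade roland-garros', WITH_HYPHEN));
console.log(tokenize('stade roland garros', WITH_HYPHEN));
```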

orangejulius commented 6 years ago

As of the last time we checked in, we were waiting for good alt-names support before tackling this feature. We now have that functionality, and it's worth looking at this again.

My guess is that we would want to parse any street names coming in with formats like "Friedrich-Richter-Straße" or "Friedrich Richter Straße" and store an alt-name of "FriedrichRichterStraße". This, combined with proper hyphen handling, would allow us to handle all 3 cases.

Some questions: 1.) would we want to tokenize on hyphens, or handle them in some different way? 2.) Where would we put the code to always take, say, street names and convert them to compound word form? My guess is pelias/model, so that it can be used by all importers. We probably want to start building up a common core of importer functionality anyway.
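
A minimal sketch of the conversion to compound word form, wherever it ends up living. `compoundAltName` is a hypothetical helper, and it assumes the input is a street name only (no house number).

```javascript
// Hypothetical helper: derive the compound-word alt-name from a
// hyphenated or space-separated street name. Not an existing pelias API.
function compoundAltName(streetName) {
  const parts = streetName.trim().split(/[\s-]+/).filter(Boolean);
  // A single part has nothing to join, so there is no alt-name to add.
  return parts.length > 1 ? parts.join('') : null;
}

console.log(compoundAltName('Friedrich-Richter-Straße'));
console.log(compoundAltName('Friedrich Richter Straße'));
```

Applied blindly, this would also turn "Main Street" into "MainStreet", so some guard on language or data source is probably needed.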

missinglink commented 6 years ago

> My guess is that we would want to parse any street names coming in with formats like "Friedrich-Richter-Straße" or "Friedrich Richter Straße" and store an alt-name of "FriedrichRichterStraße". This, combined with proper hyphen handling, would allow us to handle all 3 cases.

Yes, that sounds correct

> 1.) would we want to tokenize on hyphens, or handle them in some different way?

I think tokenizing on hyphens would work, so long as we can handle the issues that tokenizing brings with it (such as not matching main st with main ave but at the same time matching E main st with W main st).

> 2.) Where would we put the code to always take, say, street names and convert them to compound word form? My guess is pelias/model, so that it can be used by all importers. We probably want to start building up a common core of importer functionality anyway.

I would be hesitant to put this logic in pelias/model. It's clearly super convenient, but it might be better to have the code closer to the data (in the importer), so the importer could make data-specific decisions about its data conventions and optionally apply locale-aware logic that is specific to certain languages or geographies.

The other option would be to pass the locale information down to the pelias/model code so that it was able to work with that metadata.
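
A sketch of what passing locale down could look like. The function name, the shape of its arguments, and the locale set are all illustrative assumptions rather than anything in pelias/model.

```javascript
// Hypothetical: locales where compound street names are conventional.
// This set is an illustrative assumption, not a vetted list.
const COMPOUND_LOCALES = new Set(['de']);

// Return extra alt-names for a street name, gated on the record's locale.
function altNamesFor(name, locale) {
  if (!COMPOUND_LOCALES.has(locale)) {
    return [];
  }
  const parts = name.trim().split(/[\s-]+/).filter(Boolean);
  return parts.length > 1 ? [parts.join('')] : [];
}

console.log(altNamesFor('Friedrich-Richter-Straße', 'de'));
console.log(altNamesFor('stade roland-garros', 'fr'));
```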

orangejulius commented 5 years ago

Hi @Joxit, Yes, it's long past time we merge this change or something like it. Let us run a quick full planet build with this branch and take a look. Pretty sure it will be something we can merge right away.

I'll let you know tomorrow :)

edit: oops, this was supposed to be a comment on #375

pelias / schema

hyphenated names #65