Open missinglink opened 9 years ago
This is how peliasTwoEdgeGram currently tokenizes that address:
[ '51', 'fr', 'fri', 'frie', 'fried', 'friedr', 'friedri', 'friedric', 'friedrich', 'friedrich-' ]
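To make the token list above easier to reason about, here is a minimal sketch of edge n-gram expansion that reproduces it. It is not the actual peliasTwoEdgeGram definition: the whitespace split and the gram lengths (minimum 2, maximum 10) are assumptions inferred only from the output shown, and `edgeGrams` and `twoEdgeGramSketch` are hypothetical names.

```javascript
// Assumed gram lengths, inferred only from the token list above.
const MIN_GRAM = 2;
const MAX_GRAM = 10;

// Return every prefix of `token` whose length lies in [MIN_GRAM, MAX_GRAM].
function edgeGrams(token) {
  const grams = [];
  const longest = Math.min(token.length, MAX_GRAM);
  for (let len = MIN_GRAM; len <= longest; len++) {
    grams.push(token.slice(0, len));
  }
  return grams;
}

// Lowercase, split on whitespace only (hyphens stay inside the token),
// then expand each token into its edge grams.
function twoEdgeGramSketch(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean).flatMap(edgeGrams);
}

console.log(twoEdgeGramSketch('51 Friedrich-Richter-Straße'));
```

Because the hyphen is not a separator in this sketch, the whole street name is one token and the grams stop at 'friedrich-', so 'richter' and 'straße' are never indexed on their own.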
"Leonardo da Vinci–Fiumicino Airport" should be searchable by "Fiumicino Airport":
http://pelias.mapzen.com/doc?id=geoname:6299619
Add acceptance-tests in order to gauge impact.
Just checked: this is still an area we could improve. Something to think about for the near-ish future.
This feature will require alt-names, as the street name above can have 3 forms:
Friedrich-Richter-Straße
Friedrich Richter Straße
FriedrichRichterStraße
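The three forms can be derived mechanically from either of the first two. A minimal sketch, assuming the parts are separated by hyphens or whitespace; `nameForms` is a hypothetical helper, not an existing Pelias function:

```javascript
// Hypothetical helper: split a street name on hyphens/whitespace and
// re-join the parts with each of the three separators.
function nameForms(name) {
  const parts = name.split(/[\s-]+/).filter(Boolean);
  return {
    hyphenated: parts.join('-'),
    spaced: parts.join(' '),
    compound: parts.join(''),
  };
}

console.log(nameForms('Friedrich-Richter-Straße'));
```

The reverse direction does not work this way: a name that arrives already in compound form has no separators to split on, so the other two forms cannot be recovered from it without a dictionary-style decompounder.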
Moving to the alt-names milestone, as it can only be solved for a maximum of 2 cases before then.
I am facing a similar (maybe simpler?) issue with French names.
A search for "stade roland-garros" should return similar results to "stade roland garros".
Would it help to add a hyphen (-) to the tokenizer's pattern (see https://github.com/pelias/schema/blob/master/settings.js#L18)?
Or would that cause serious regressions with other languages?
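The effect of adding a hyphen to a separator pattern can be sketched outside Elasticsearch. The two patterns below are illustrative assumptions and are not copied from pelias/schema; `tokenize` is a hypothetical helper.

```javascript
// Hypothetical separator patterns; neither is taken from settings.js.
const WITHOUT_HYPHEN = /[\s,]+/;
const WITH_HYPHEN = /[\s,-]+/;

// Lowercase the text and split it on the given separator pattern.
function tokenize(text, pattern) {
  return text.toLowerCase().split(pattern).filter(Boolean);
}

console.log(tokenize('stade roland-garros', WITHOUT_HYPHEN));
console.log(tokenize('stade roland-garros', WITH_HYPHEN));
console.log(tokenize('stade roland garros', WITH_HYPHEN));
```

With the hyphen treated as a separator, both spellings produce the same tokens. This sketch says nothing about regressions in other languages, which would need acceptance tests to measure.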
As of the last time we checked in, we were waiting for good alt-names support before tackling this feature. We now have that functionality, and it's worth looking at this again.
My guess is that we would want to parse any street names coming in with formats like "Friedrich-Richter-Straße" or "Friedrich Richter Straße" and store an alt-name of "FriedrichRichterStraße". This, combined with proper hyphen handling, would allow us to handle all 3 cases.
Some questions: 1.) would we want to tokenize on hyphens, or handle them in some different way? 2.) Where would we put the code to always take, say, street names and convert them to compound word form? My guess is pelias/model, so that it can be used by all importers. We probably want to start building up a common core of importer functionality anyway.
My guess is that we would want to parse any street names coming in with formats like "Friedrich-Richter-Straße" or "Friedrich Richter Straße" and store an alt-name of "FriedrichRichterStraße". This, combined with proper hyphen handling, would allow us to handle all 3 cases.
Yes, that sounds correct
1.) would we want to tokenize on hyphens, or handle them in some different way?
I think tokenizing on hyphens would work, so long as we can handle the issues that tokenizing brings with it (such as not matching "main st" with "main ave" but at the same time matching "E main st" with "W main st").
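The tension in that parenthetical can be shown with two naive matching rules. This is only a sketch of the trade-off, not how Elasticsearch scores or matches documents; `tokens`, `matchesAny`, and `matchesAll` are hypothetical helpers.

```javascript
// Lowercase and split on whitespace.
const tokens = (s) => s.toLowerCase().split(/\s+/).filter(Boolean);

// Loose rule: match when the query shares at least one token with the candidate.
function matchesAny(query, candidate) {
  const c = new Set(tokens(candidate));
  return tokens(query).some((t) => c.has(t));
}

// Strict rule: match only when every query token appears in the candidate.
function matchesAll(query, candidate) {
  const c = new Set(tokens(candidate));
  return tokens(query).every((t) => c.has(t));
}

console.log(matchesAny('main st', 'main ave'));   // loose rule on the first pair
console.log(matchesAll('E main st', 'W main st')); // strict rule on the second pair
```

The loose rule matches "main st" with "main ave", which is unwanted, while the strict rule rejects "E main st" against "W main st", which is also unwanted. Neither rule alone gives the behaviour described, so splitting names into more tokens makes the choice of matching rule matter more.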
2.) Where would we put the code to always take, say, street names and convert them to compound word form? My guess is pelias/model, so that it can be used by all importers. We probably want to start building up a common core of importer functionality anyway.
I would be hesitant to put this logic in pelias/model. It's clearly super convenient, but it might be better to have the code closer to the data (in the importer), so the importer could make data-specific decisions about its data conventions and optionally apply locale-aware logic which is specific only to certain languages or geographies.
The other option would be to pass the locale information down to the pelias/model code so that it is able to work with that metadata.
Hi @Joxit, Yes, it's long past time we merge this change or something like it. Let us run a quick full planet build with this branch and take a look. Pretty sure it will be something we can merge right away.
I'll let you know tomorrow :)
edit: oops, this was supposed to be a comment on #375
Names such as "51 Friedrich-Richter-Straße" (address-osmnode-2967205513) should be searchable using the tokens ['friedrich','richter','strasse'] as well as ['friedrichrichterstrasse'] and ['friedrich-richter-strasse'].
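The three token lists above correspond to the three surface forms of the street name after lowercasing and folding "ß" to "ss". A minimal sketch of that normalization, assuming a plain whitespace split; `normalize` is a hypothetical helper and does not reflect the actual Pelias analyzer chain:

```javascript
// Hypothetical normalizer: lowercase, fold the German sharp s to "ss",
// then split on whitespace only.
function normalize(text) {
  return text.toLowerCase().replace(/ß/g, 'ss').split(/\s+/).filter(Boolean);
}

console.log(normalize('Friedrich Richter Straße'));
console.log(normalize('FriedrichRichterStraße'));
console.log(normalize('Friedrich-Richter-Straße'));
```

Under these assumptions each surface form yields exactly one of the requested token lists, which is why indexing all three forms would make the address reachable from any of the three spellings.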