pelias / model

Pelias data models

trim tokens #16

Closed missinglink closed 8 years ago

missinglink commented 9 years ago

migrating issue https://github.com/pelias/suggester-pipeline/issues/5 to this repo

missinglink commented 9 years ago

http://pelias.mapzen.com/doc?id=osmnode:2235377518

has an opening quote for its outlandish claim, but no closing quote.

"text": "Nobis \"the best pizza, Monte di Procida, NAPOLI"

missinglink commented 8 years ago

had a think about this. we could easily add a transform.trim() function to all the string getter/setter functions, but my feeling is that the importers should take responsibility for performing string operations on their data; it's not really the responsibility of this repo to do so. although we do lowercase or uppercase in some cases for consistency (eg. for alpha3).

if we did decide to do it here, would it only be whitespace? trimming additional chars like " sounds dangerous; as per the example above, foo "bar" would become foo "bar.

thoughts?
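For illustration, the transform.trim() idea above could look something like the following. This is a minimal sketch, not the actual pelias/model API: the Document, trim, and setName names here are assumptions for the example.

```javascript
// Hypothetical sketch of a trim transform applied in a string setter.
// Names (trim, Document, setName) are illustrative, not pelias/model's API.

// transform: strip leading/trailing whitespace from string values only
function trim (value) {
  return typeof value === 'string' ? value.trim() : value;
}

function Document () {
  this.name = {};
}

// setter applies the transform before storing the value
Document.prototype.setName = function (prop, value) {
  this.name[prop] = trim(value);
  return this;
};

var doc = new Document();
doc.setName('default', '  Nobis "the best pizza  ');
console.log(doc.name.default); // 'Nobis "the best pizza'
```

Note the transform only touches leading/trailing whitespace; interior characters, including the stray quote from the example above, are left alone.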

trescube commented 8 years ago

Don't the ES analyzers we employ trim the tokens?

orangejulius commented 8 years ago

@trescube yes, but then the documents returned by ES would still have whitespace in them.

The openaddresses importer will, as of pelias/openaddresses#31, have whitespace trimming fully taken care of. This is mostly so that it can guarantee all records inserted into Elasticsearch are valid and don't have fields that consist entirely of whitespace. So it seems like whitespace trimming is important for importers to do. That said, just trimming whitespace is pretty easy, so doing it everywhere isn't the worst. I definitely don't think we should trim any other characters.
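An importer-side check along the lines described above might look like this. It is a sketch of the general approach (trim, then reject whitespace-only fields), not the actual pelias/openaddresses code; the cleanField name is an assumption.

```javascript
// Hypothetical importer-side validation: trim a field and reject records
// whose fields are empty or consist entirely of whitespace.
function cleanField (value) {
  if (typeof value !== 'string') { return null; }
  var trimmed = value.trim();
  return trimmed.length > 0 ? trimmed : null;
}

console.log(cleanField('  Monte di Procida  ')); // 'Monte di Procida'
console.log(cleanField('   '));                  // null (record should be skipped)
```

Returning null for whitespace-only input gives the importer a single signal to skip the record before it ever reaches Elasticsearch.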

missinglink commented 8 years ago

this issue is old and seems to be a low priority.

as discussed above, it would be best to fix data in the source importer rather than in the model itself.