pelias / polylines

Pelias import pipeline for polyline (road network) data.
MIT License
17 stars 24 forks source link

invalid regex test #216

Closed jeremy-rutman closed 5 years ago

jeremy-rutman commented 5 years ago

This happens only a few times and thus I suppose isn't a big deal, but there's some check for http:// and https:// which conceivably could be relaxed:

npm start 
....
2019-04-15T15:01:21.459Z - error: [polyline] polyline document error message=invalid regex test, https://www.oregonhikers.org/field_guide/Niagara_Park_Loop_Hike should not match /https?:\/\//, stack=PeliasModelError: invalid regex test, https://www.oregonhikers.org/field_guide/Niagara_Park_Loop_Hike should not match /https?:\/\//
    at Object.nomatch (/home/jeremyr/pelias_metal/polylines/node_modules/pelias-model/util/valid.js:117:13)
    at Document.setName (/home/jeremyr/pelias_metal/polylines/node_modules/pelias-model/Document.js:258:18)
    at DestroyableTransform._transform (/home/jeremyr/pelias_metal/polylines/stream/document.js:30:11)
    at DestroyableTransform.Transform._read (/home/jeremyr/pelias_metal/polylines/node_modules/readable-stream/lib/_stream_transform.js:177:10)
    at DestroyableTransform.Readable.read (/home/jeremyr/pelias_metal/polylines/node_modules/readable-stream/lib/_stream_readable.js:440:10)
    at flow (/home/jeremyr/pelias_metal/polylines/node_modules/readable-stream/lib/_stream_readable.js:920:34)
    at ParallelTransform.pipeOnDrainFunctionResult (/home/jeremyr/pelias_metal/polylines/node_modules/readable-stream/lib/_stream_readable.js:730:7)
    at ParallelTransform.emit (events.js:189:13)
    at onwriteDrain (/home/jeremyr/pelias_metal/polylines/node_modules/parallel-transform/node_modules/readable-stream/lib/_stream_writable.js:501:12)
    at afterWrite (/home/jeremyr/pelias_metal/polylines/node_modules/parallel-transform/node_modules/readable-stream/lib/_stream_writable.js:489:18), name=PeliasModelError
2019-04-15T15:01:21.606Z - info: [dbclient-polylines]  paused=true, transient=10, current_length=0, indexed=8556000, batch_ok=17112, batch_retries=0, failed_records=0, street=8556000, persec=1950
2019-04-15T15:01:29.545Z - debug: [wof-admin-lookup] no country lon=38.052706, lat=44.552078, 
missinglink commented 5 years ago

How do you think it could be relaxed?

jeremy-rutman commented 5 years ago

I am not clear on why those records are being filtered; if these are tags on polylines it appears to me the tag in this case is actually ok and if this is always the case then that regex test could be dropped. Not knowing more abt. the reason for this filter, I defer to your greater knowledge, experience, and geo-wisdom and apologize if I've brought an irrelevant issue.

missinglink commented 5 years ago

I think it's ok in this case, the regex in question is being enforced on Document.setName.

The stack trace is pretty verbose and so it's kind of hard to see it unfortunately.

The intention of that test in the context of this repo is to remove streets who's name has erroneously been set as a URL in openstreetmap, which is actually reasonably common.

I think the test itself is fine as-is because I'd prefer not to import streets with a URL as their name but I wonder if we can improve the error logging to be clearer of what's going on.

jeremy-rutman commented 5 years ago

It makes more sense as a debug message than an error message; I am not familiar with js logging but do know that in python you can log to stdout and also different files for different levels - if thats poss. in .js then it would be useful to eg send all the filtered records (like the above one, neighborhoods without cities, etc) to a single file which can then be checked

mohammedayub44 commented 5 years ago

@missinglink I ran into these errors as well, so we just safely ignore them for now ?

missinglink commented 5 years ago

Yea you can ignore them.

They are caused by people entering urls in the name field of a dataset, so in this case openstreetmap.

I think I need to change the error reporting a bit, I'm personally a big fan of emitting verbose errors in the logs so that developers are aware of bugs and can fix them, rather than silencing them and forgetting about them.

What's probably best is to log these under a 'debug' log level for developers only because most people don't really need to see them and they look scary.

missinglink commented 5 years ago

Same goes for the 'denormalization failed' messages during the OSM import, they look scary but are innocuous.

orangejulius commented 5 years ago

Last I checked, on a full planet build there are only a hundred or so of these errors. With even a few people it wouldn't take long to fix all the OSM nodes/ways/relations with incorrect name fields that contain a URL. I know missinglink and I have both done a few, so we are already on our way there!

orangejulius commented 5 years ago

Okay, and just to make it easier, here's an Overpass Turbo query that will return any OSM records with any name field containing http:// or https://: https://overpass-turbo.eu/s/JJN

[out:csv(::id,::type,"name")];
nwr[~"^name(:.*)?$"~"^https?://"]
  ({{bbox}});
out;