openstreetmap / openstreetmap-website

The Rails application that powers OpenStreetMap
https://www.openstreetmap.org/
GNU General Public License v2.0

GPX upload: zip file handling #2137

Closed mmd-osm closed 4 years ago

mmd-osm commented 5 years ago

Follow-up to #2131: GPX upload uses external scripts to decompress zip/bzip/gzip files. To be on the safe side, some more input sanitization is required here.

We also need to improve zip file handling in general here, so people can't kill the server by uploading funny zip bombs. https://github.com/openstreetmap/openstreetmap-website/blob/268a8cb06e0a4734b9cb226ecebcc8445be4a9de/app/models/trace.rb#L256-L268
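
The kind of pre-check being asked for could look something like the sketch below. This is only an illustration using the rubyzip gem, not what trace.rb currently does (it shells out to external tools), and the limits are arbitrary placeholders rather than values from the codebase:

    require "zip"

    MAX_ENTRIES            = 100
    MAX_UNCOMPRESSED_BYTES = 512 * 1024 * 1024

    def sane_zip?(path)
      Zip::File.open(path) do |zip|
        entries = zip.entries
        return false if entries.length > MAX_ENTRIES
        # Zip::Entry#size is the declared uncompressed size, so an obvious zip
        # bomb can be rejected before anything is inflated. A hostile archive
        # can still lie about sizes, so actual extraction should be capped too.
        return false if entries.sum(&:size) > MAX_UNCOMPRESSED_BYTES
      end
      true
    end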

mmd-osm commented 4 years ago

Also, this user managed to block all gpx processing for 3 hours by uploading a 250MB trace as a zip file: https://www.openstreetmap.org/user/Arne%20Schwarck/traces/3402427

I sent a private message to the user asking them to stop further uploads until this has been resolved.

tomhughes commented 4 years ago

Your claim might be true if there was only one machine processing trace uploads but that hasn't been true for some time.

tomhughes commented 4 years ago

I also don't see what any of that has to do with this bug.

tomhughes commented 4 years ago

In fact this issue was resolved in 6c159b96734f81efc24f2c1410cd979b5c272819.

tomhughes commented 4 years ago

It was never actually a problem by the way as trace_name is entirely controlled by us.

mmd-osm commented 4 years ago

How many workers are currently processing jobs in parallel? There seems to be a pretty large backlog right now.

(Yes, this should be in another issue, really)

tomhughes commented 4 years ago

Three, but there were a bunch of large jobs uploaded by that user. That's the point of the queue though, to be able to handle spikes, so an occasional backlog means it's working as designed.

arne182 commented 4 years ago

My points should be finished within the next 12 hours. The server was usually a bit faster in the past, and my files a bit smaller. It does take time to process 3 million points; next time I will try to limit the uploads to 1 million points. I do see the server running and confirming my successful uploads, so everything should be back to normal within the next 24 hours, until next month, when another 20 million points are expected.

mmd-osm commented 4 years ago

Maybe it isn't ideal that one user can basically take over the GPX import feature, so that everyone else has to wait 12+ hours.

Can we lower the job priority for large imports, or if a user already has x pending jobs?
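
For illustration, the first of those could be done roughly as below, assuming the import is enqueued through ActiveJob on a backend that honours numeric priorities (lower number runs first for delayed_job-style backends). The job name, helper and threshold are made up for the example, and trace.trace_name is assumed to point at the stored file; none of this is taken verbatim from the codebase:

    LARGE_TRACE_BYTES = 50 * 1024 * 1024 # arbitrary 50 MB threshold

    def enqueue_trace_import(trace)
      # Push big uploads to a lower priority so they don't starve small jobs.
      big = File.size(trace.trace_name) > LARGE_TRACE_BYTES
      TraceImporterJob.set(priority: big ? 10 : 0).perform_later(trace)
    end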

arne182 commented 4 years ago

I don't need the files to be available immediately, so that does sound like a good idea. As long as the 20 million points are done processing within a month, so that my queue is finished when I upload the next set of files, I am fine with that.

tomhughes commented 4 years ago

Equally, are these traces really useful? I just looked at one and it has data spanning several weeks, although not continuous, but when it is recording it's doing about 8 points/second, which is kind of overkill unless you're going ridiculously fast.

arne182 commented 4 years ago

If you guys have some code that could help me reduce the points, that would really help. The 50 cm-accuracy sensor polls 10 times a second, so it is useful for high-curvature roads and high speeds. All uploaded points are accurate to within 2 m, so if you are only drawing straight lines anyway, each straight run could be reduced to its start and end points. This data is coming from the autonomous driving fleet of ArnePilot.

mmd-osm commented 4 years ago

GPSBabel + Douglas-Peucker come to mind. Maybe ask on the OSM forum; there are some existing threads on the topic: https://forum.openstreetmap.org/viewtopic.php?id=7801
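
For reference, Douglas-Peucker itself is simple enough to sketch in a few lines. This is a generic, self-contained version working on [x, y] pairs; a real GPX pipeline would project lat/lon and parse/write GPX around it, which GPSBabel's simplify filter handles for you:

    # Perpendicular distance from point pt to the line through a and b.
    def perpendicular_distance(pt, a, b)
      (x, y), (x1, y1), (x2, y2) = pt, a, b
      dx, dy = x2 - x1, y2 - y1
      length = Math.hypot(dx, dy)
      return Math.hypot(x - x1, y - y1) if length.zero?
      (dy * x - dx * y + x2 * y1 - y2 * x1).abs / length
    end

    # Drop points that deviate from the simplified line by less than epsilon.
    def douglas_peucker(points, epsilon)
      return points if points.length < 3
      # Find the point farthest from the chord between the two endpoints.
      index, dmax = 0, 0.0
      (1...points.length - 1).each do |i|
        d = perpendicular_distance(points[i], points.first, points.last)
        index, dmax = i, d if d > dmax
      end
      if dmax > epsilon
        left  = douglas_peucker(points[0..index], epsilon)
        right = douglas_peucker(points[index..-1], epsilon)
        left[0..-2] + right
      else
        [points.first, points.last]
      end
    end

Epsilon is in the same units as the coordinates, so for raw lat/lon input it has to be converted from metres, or the points projected first.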

tomhughes commented 4 years ago

Yes, specifically the simplify filter.

arne182 commented 4 years ago

Thanks. I will add the Python implementation to my processing scripts. I was wondering anyway whether this data couldn't be used to auto-fit the ways. At the moment I am moving the points by hand, but if the data is of such high quality, why not have the ways snapped to the averaged points?

tomhughes commented 4 years ago

Because that is not what OSM is, but in any case how do we determine which traces are high enough quality, and which ways they correspond to?

It's nothing to do with this ticket anyway.

tomhughes commented 4 years ago

It took about seven hours, but running gpsbabel with the simplify filter and an error limit of 1 m reduced the 230 MB trace referenced in this ticket to 3 MB - that's a reduction from 1.6 million points to 22 thousand points. The command line was:

    gpsbabel -r -i gpx -f gps-data.44284.gpx -x simplify,error=0.001k -o gpx -F out.gpx

arne182 commented 4 years ago

Would this not be a good optimisation option to run on the server? I will see how long the python script takes on the same file and report the results.

mmd-osm commented 4 years ago

That's like blocking the import queue for 7 hours instead of 3 hours per GPX archive, which doesn't exactly help import times. I think you should really run this yourself on your local machine.

arne182 commented 4 years ago

Might it not be useful to do this on the whole dataset, if you are only drawing straight lines between points anyway? A 76x reduction in the space required is quite a lot, but maybe your database has good compression and it wouldn't bring much. Have you visually compared the two files to check that there is no loss of resolution?

tomhughes commented 4 years ago

The actual issues in this ticket have long since been addressed in full.