roelderickx / ogr2osm

A tool for converting ogr-readable files like shapefiles into .pbf or .osm data

https://pypi.org/project/ogr2osm/

MIT License


The exporting of resulting OSM files can potentially be sped up #37

Open Vectorial1024 opened 1 year ago

Vectorial1024 commented 1 year ago

This still needs confirmation, but I noticed this StackOverflow discussion:

https://stackoverflow.com/questions/44560655/python-writelines-and-write-huge-time-difference

Why is there such a large difference in the file writing time for write() and writelines() even though it is the same data?

I have used ogr2osm for a while, and I have noticed that it can be quite slow on larger files. Like, unusually slow.

It seems the exporting can be sped up. Will investigate later.

Vectorial1024 commented 1 year ago

Benchmarking the existing method

First, I must admit my current PC is mid-to-high tier, so things might be faster than average. But the point should still stand for slower computers. Also, extra care must be taken because the files to be processed can be very large.

Some details:

Command run: python -m ogr2osm -t test_translate -o target.osm source.geojson
Size of data source: about 430 MB
Measuring the duration: adding some basic measurement at DataWriterContextManager.output using time.time()
All I/O is on an SSD

I ran the command 5 times.

Measured time (average): 19.455 seconds
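For reference, the kind of measurement described above can be sketched roughly like this. It is only an illustration: time_call, average_runtime and fake_export are hypothetical names, and fake_export stands in for the real export step rather than calling anything in ogr2osm.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds) using time.time()."""
    start = time.time()
    result = func(*args, **kwargs)
    return result, time.time() - start


def average_runtime(func, runs=5):
    """Call func `runs` times and return the mean wall-clock duration in seconds."""
    durations = [time_call(func)[1] for _ in range(runs)]
    return sum(durations) / len(durations)


def fake_export():
    # Hypothetical stand-in for the export step; it only burns a little CPU.
    return sum(i * i for i in range(100_000))
```

Calling average_runtime(fake_export) gives one averaged number of the same kind as the figure reported above.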

Vectorial1024 commented 1 year ago

One thing that sticks out when doing some detailed profiling:

Beginning to time the export
Writing file header
Writing nodes
Writing took (to_xml, write): 7.521965265274048, 0.6387271881103516
Writing ways
Writing took (to_xml, write): 10.0064537525177, 0.5809998512268066
Writing relations
Writing took (to_xml, write): 0, 0
Writing file footer
Time elapsed was 18.999 seconds

It is actually the to_xml part which is slow, not the IO.
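A split like the one in the log can be obtained by accumulating the two phases separately inside the write loop. The sketch below is an assumption about how one might do this, not the code that produced the numbers above: node_to_xml and write_nodes_timed are hypothetical helpers, and the standard library's xml.etree.ElementTree is used as a stand-in for lxml so that the example has no dependencies.

```python
import io
import time
import xml.etree.ElementTree as ET


def node_to_xml(node_id, lat, lon):
    """Serialize one node to an XML string (stand-in for the real to_xml)."""
    elem = ET.Element("node", id=str(node_id), lat=str(lat), lon=str(lon))
    return ET.tostring(elem, encoding="unicode")


def write_nodes_timed(nodes, out):
    """Write all nodes to `out` and return (to_xml_seconds, write_seconds)."""
    to_xml_time = 0.0
    write_time = 0.0
    for node_id, lat, lon in nodes:
        t0 = time.perf_counter()
        xml = node_to_xml(node_id, lat, lon)   # serialization phase
        t1 = time.perf_counter()
        out.write(xml + "\n")                  # I/O phase
        t2 = time.perf_counter()
        to_xml_time += t1 - t0
        write_time += t2 - t1
    return to_xml_time, write_time
```

Passing an io.StringIO() as `out` measures serialization almost in isolation, while passing a real file object includes the disk.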

It seems the next step is to try some sort of multi-threading.

Vectorial1024 commented 1 year ago

Hmmm. We are already using lxml for fast export.

Spawning new threads does not help here because of Python's GIL: only one thread can execute Python bytecode at a time, so CPU-bound work like this effectively stays single-threaded.

Playing around with multiprocessing did not bring immediate results, because extra work is needed to pass values into the subprocesses. This might be viable in the long term, but it is not something that can be done in a single day.

If we are able to somehow utilize multi-processing effectively, then perhaps there will be a significant speedup.
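To make the "extra work" concrete, here is a minimal sketch of what a multiprocessing approach might look like. It assumes the geometries have first been reduced to plain picklable tuples (id, lat, lon), which is exactly the conversion cost mentioned above. chunked, serialize_chunk and serialize_parallel are hypothetical names, and xml.etree.ElementTree again stands in for lxml; none of this is ogr2osm's actual code.

```python
import xml.etree.ElementTree as ET
from multiprocessing import Pool


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def serialize_chunk(chunk):
    """Worker: turn a chunk of (id, lat, lon) tuples into one XML string.

    Defined at module level so it can be pickled and sent to worker processes.
    """
    lines = []
    for node_id, lat, lon in chunk:
        elem = ET.Element("node", id=str(node_id), lat=str(lat), lon=str(lon))
        lines.append(ET.tostring(elem, encoding="unicode"))
    return "\n".join(lines)


def serialize_parallel(nodes, processes=4, chunk_size=10_000):
    """Serialize nodes in worker processes; Pool.map keeps the chunk order."""
    with Pool(processes=processes) as pool:
        return pool.map(serialize_chunk, chunked(nodes, chunk_size))
```

serialize_parallel should be called from under an `if __name__ == "__main__":` guard so that it also works with the spawn start method. Whether this ends up faster depends on how the pickling overhead compares with the serialization time saved, which would need to be measured.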

Vectorial1024 commented 1 year ago

This just dropped a few days ago:

https://www.bitecode.dev/p/whats-up-python-the-gil-removed-a

The removal of the GIL in Python can be very useful for this speedup: instead of spawning difficult-to-control subprocesses to parallelize XML-to-string, we may finally have an easy-to-control multi-threaded XML-to-string process to speed up exporting.
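A threaded version could look roughly like the sketch below. Threads share memory, so the objects do not need to be pickled and passed between processes. This is an assumption-laden illustration: way_to_xml and serialize_threaded are hypothetical helpers, xml.etree.ElementTree stands in for lxml, and on a regular GIL build this is not expected to run any faster than a plain loop. Any benefit would only show up on a free-threaded interpreter.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def way_to_xml(way_id, node_refs):
    """Serialize one way and its node references to an XML string."""
    elem = ET.Element("way", id=str(way_id))
    for ref in node_refs:
        ET.SubElement(elem, "nd", ref=str(ref))
    return ET.tostring(elem, encoding="unicode")


def serialize_threaded(ways, workers=4):
    """Serialize (way_id, node_refs) pairs on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(lambda way: way_to_xml(*way), ways))
```

Because the results come back in input order, they can be written to the output file sequentially afterwards.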