roelderickx / ogr2osm

A tool for converting ogr-readable files like shapefiles into .pbf or .osm data
https://pypi.org/project/ogr2osm/
MIT License

OOM on large shapefile. #44

Status: Closed (p4r4xor closed 11 months ago)

p4r4xor commented 1 year ago

I'm trying to convert a 3.3 GB shapefile to OSM and have run into an out-of-memory (OOM) error. I've looked into previous issues and found one very similar to the one I'm facing. I'm using an m5a.4xlarge. The process works fine on my local machine (2019 MacBook Pro), although it takes quite a while to complete (~5 hours). It eats up some swap memory, but I haven't seen it go beyond 32 GB in this case. SIGINT returns the trace below. What could be the reason here?

  File "/home/.local/lib/python3.8/site-packages/ogr2osm/ogr2osm.py", line 281, in main
    osmdata.process(datasource)
  File "/home/.local/lib/python3.8/site-packages/ogr2osm/osm_data.py", line 427, in process
    self.add_feature(ogrfeature, layer_fields, datasource.source_encoding, reproject)
  File "/home/.local/lib/python3.8/site-packages/ogr2osm/osm_data.py", line 375, in add_feature
    osmgeometries = self.__parse_geometry(ogrgeometry, feature_tags)
  File "/home/.local/lib/python3.8/site-packages/ogr2osm/osm_data.py", line 340, in __parse_geometry
    osmgeometries.append(self.__parse_linestring(ogrgeometry, tags))
  File "/home/.local/lib/python3.8/site-packages/ogr2osm/osm_data.py", line 196, in __parse_linestring
    potential_duplicate_ways = [ p for p in node.get_parents() if type(p) == OsmWay ]
  File "/home/.local/lib/python3.8/site-packages/ogr2osm/osm_data.py", line 196, in <listcomp>
    potential_duplicate_ways = [ p for p in node.get_parents() if type(p) == OsmWay ]

P.S. Wouldn't it be better to write the file footer using sed? I can see it takes quite a while to write just the footer. Something like this may work. I'm using gsed here for GNU compatibility, as macOS's sed is POSIX-compatible only.

gsed -i "$ a </osm>" test.osm

roelderickx commented 1 year ago

Hello,

The OOM exception is thrown whenever memory is completely filled up, which can happen anywhere in the code. In this case it happened at the point where we go through all the parent ways of the first node to check whether they are duplicates of the current way. I would be really surprised if this list is long. Do you have points in your input file where thousands of linestrings cross? That being said, the m5a.4xlarge has only 4 GiB of memory per vCPU, and since ogr2osm is single-threaded that may be the problem here.
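To illustrate the kind of check described above, here is a minimal sketch. The `Node` and `Way` classes and the `find_duplicate_way` function are hypothetical stand-ins, not the actual ogr2osm classes, and the real code may compare ways differently. The point is only that the work per linestring grows with the number of ways already attached to its first node.

```python
class Node:
    """Hypothetical stand-in for an OSM node that remembers its parent ways."""

    def __init__(self, x, y):
        self.x = x
        self.y = y
        self.parents = set()

    def get_parents(self):
        return self.parents


class Way:
    """Hypothetical stand-in for an OSM way; registers itself on its nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        for node in self.nodes:
            node.parents.add(self)


def find_duplicate_way(nodes):
    """Return an existing way with the same node sequence, or None.

    Only the ways that share the first node are inspected, so the cost
    depends on how many ways pass through that node.
    """
    candidates = [p for p in nodes[0].get_parents() if type(p) is Way]
    for way in candidates:
        if way.nodes == list(nodes):
            return way
    return None
```

If thousands of linestrings start at the same point, the `candidates` list becomes long for every new linestring at that point, which is why that situation would be worth ruling out.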

> P.S. Wouldn't it be better to write the file footer using sed? I can see it takes quite a while to write just the footer. Something like this may work. I'm using gsed here for GNU compatibility, as macOS's sed is POSIX-compatible only.

I think the delay is between the message "Writing file footer" and the return of the command prompt. Between those two points, not only is the footer written but the file is also closed, which causes the system to flush all unwritten data. That's probably what is taking the time. If you want to test the theory, disable or remove line 99 in osm_datawriter.py and see if it is significantly faster. You can then append the footer afterwards using your gsed command or just the shell (echo '</osm>' >> test.osm). Running that as a subprocess from the code only works after the file has been closed and all data has been flushed; otherwise the end tag may show up in the beginning of the file.
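As a rough sketch of that ordering in plain Python (not the actual osm_datawriter.py code; `write_body`, `append_footer` and the XML header are assumptions made for illustration): the body is written and the file is closed first, and only then is the footer appended by reopening the file in append mode. The time spent in `close()` is measured separately, so it can be compared with the time spent on the footer itself.

```python
import os
import tempfile
import time


def write_body(path, elements):
    """Write header and elements, then close; return seconds spent in close()."""
    f = open(path, "w", encoding="utf-8")
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n<osm version="0.6">\n')
    for element in elements:
        f.write(element + "\n")
    start = time.perf_counter()
    f.close()  # flushes any buffered data before the handle is released
    return time.perf_counter() - start


def append_footer(path):
    """Append the closing tag; safe only once the writer has closed the file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write("</osm>\n")


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "test.osm")
    close_seconds = write_body(path, ['<node id="-1" lat="0" lon="0"/>'])
    append_footer(path)
    print(f"close() took {close_seconds:.6f} s")
```

If the footer were appended by another process while the first handle still held buffered data, that data would be flushed later and land after the end tag, which is the misplaced-tag problem described above.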