Closed: MichalPP closed this issue 2 years ago
You should never have duplicate nodes in your files; if you do, something is wrong with your processing toolchain. `osmium sort` or `osm2pgsql` cannot fix your data if it is broken in the first place.
If you want to merge several extracts (from Geofabrik or elsewhere), use `osmium merge` to merge the files; no sorting is necessary. If the files you downloaded were extracted at the same point in time, this will always give you a valid data file without duplicate nodes. Using `osmium sort` with several input files, as you do above, will not remove duplicate entries, but `osmium merge` will.
If you tried `osmium merge` and it failed, then most likely the extracts were from different points in time. If you downloaded extract X yesterday and extract Y today, they might contain different versions of the same object, and `osmium merge` will then leave both versions of the object in the file. But if you merge two pieces of data that don't fit together, all sorts of problems can result, so you just have to make sure never to do that; it's not something osmium can fix for you.
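The distinction can be sketched in plain Python (this is an illustration of the behavior described above, not osmium internals): sorting concatenated inputs keeps duplicates, while a merge keyed by object id and version drops exact duplicates but keeps genuinely different versions of the same object.

```python
# Objects are modeled as (id, version) tuples for illustration only.

def cat_then_sort(*files):
    """Mimics `osmium sort` over several inputs: ordering, but no dedup."""
    return sorted(obj for f in files for obj in f)

def merge(*files):
    """Mimics a merge that emits each (id, version) pair only once."""
    seen = set()
    out = []
    for obj in sorted(obj for f in files for obj in f):
        if obj not in seen:
            seen.add(obj)
            out.append(obj)
    return out

# Two extracts taken at the same point in time share node 7918192, version 1.
a = [(1, 1), (7918192, 1)]
b = [(7918192, 1), (9, 1)]
print(cat_then_sort(a, b))  # the duplicate survives
print(merge(a, b))          # the duplicate is removed

# Extracts from different points in time: both versions are kept.
print(merge([(7918192, 1)], [(7918192, 2)]))
```

This is why merging extracts only works cleanly when they were taken at the same point in time.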
Thanks, changing `sort` to `merge` did work. I will make a PR for the `osmium-sort` manual page reflecting that `sort` behaves like `cat` and in fact does not merge input files.
How do I fix bad data that already contains duplicate nodes? I tried `osmium merge bad-data-with-dup-nodes.pbf -o output.pbf`; the duplicate nodes are still there.
```python
import os

import osmium as o


class UniqHandler(o.SimpleHandler):
    """Write each object only on its first occurrence, keyed by id."""

    def __init__(self, writer):
        super().__init__()
        self.nodes = set()
        self.ways = set()
        self.relations = set()
        self.writer = writer

    def node(self, obj):
        if obj.id in self.nodes:
            return
        self.nodes.add(obj.id)
        self.writer.add_node(obj)

    def way(self, obj):
        if obj.id in self.ways:
            return
        self.ways.add(obj.id)
        self.writer.add_way(obj)

    def relation(self, obj):
        if obj.id in self.relations:
            return
        self.relations.add(obj.id)
        self.writer.add_relation(obj)


def osm_uniq(input: str, output: str):
    os.makedirs(os.path.dirname(os.path.abspath(output)), exist_ok=True)
    writer = o.SimpleWriter(output)
    UniqHandler(writer).apply_file(input)
    writer.close()


if __name__ == "__main__":
    import fire

    # Make fire print results instead of paging them.
    fire.core.Display = lambda lines, out: print(*lines, file=out)
    fire.Fire(osm_uniq)
```
I wrote the pyosmium script above; it keeps only the first occurrence of each object id.
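The core idea of the script, stripped of the osmium machinery, is a first-seen-wins filter per object type. A minimal stand-alone sketch (plain ids instead of OSM objects):

```python
def first_seen(ids):
    """Yield each id only on its first occurrence, as UniqHandler does
    per object type with its nodes/ways/relations sets."""
    seen = set()
    for i in ids:
        if i in seen:
            continue
        seen.add(i)
        yield i

print(list(first_seen([7918192, 5, 7918192, 9, 5])))  # [7918192, 5, 9]
```

Note the caveat: if the duplicates are actually different versions of an object, first-seen-wins keeps whichever happens to come first in the file, which may not be the newest one.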
I am trying to create a PBF file for my region of interest (in my case, the country of Slovakia with a 20 km buffer). I download Slovakia and the surrounding countries from Geofabrik, extract each of them by the polygon of my region of interest, and cat/sort them into one file (or first merge, then extract; the result is the same). The country extracts have some overlap, but that is by design.
However, the last command fails with
osm2pgsql also complains about duplicates: `ERROR: Input data is not ordered: node id 7918192 appears more than once.`
The t.pbf and full.pbf files are identical (same md5sum and size).
Since I need the original OSM ids, I cannot use the `renumber` command. Downloading the full europa.pbf and then extracting by polygon seems to me like a wasteful use of resources.
(debian testing, osmium version 1.14.0, libosmium version 2.17.3, Supported PBF compression types: none zlib lz4)
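The workflow described above can be written out as osmium-tool invocations. The sketch below only builds the command lines (it does not run osmium); the country file names and the `.poly` file are placeholders, not files from the report:

```python
def build_pipeline(countries, poly="slovakia-buffer-20km.poly", out="full.pbf"):
    """Return argv lists for clipping each country extract to the region
    polygon and then merging the clipped pieces (merge, not sort/cat)."""
    cmds = []
    clipped = []
    for country in countries:
        clip = country.replace(".osm.pbf", "-clipped.osm.pbf")
        # `osmium extract -p <polyfile>` clips one country to the region.
        cmds.append(["osmium", "extract", "-p", poly, country, "-o", clip])
        clipped.append(clip)
    # `osmium merge` combines the clipped extracts without duplicate nodes,
    # provided all extracts are from the same point in time.
    cmds.append(["osmium", "merge", *clipped, "-o", out])
    return cmds

for cmd in build_pipeline(["slovakia.osm.pbf", "austria.osm.pbf"]):
    print(" ".join(cmd))
```

Each argv list could be passed to `subprocess.run` once osmium-tool is installed.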