osmium sort does not remove duplicate nodes

MichalPP commented 2 years ago

I am trying to create a PBF file for my region of interest (in my case, Slovakia with a 20 km buffer). I download Slovakia and the surrounding countries from Geofabrik, extract each of them by the polygon of my region of interest, then cat/sort them into one file. (Merging first and then extracting gives the same result.) The country extracts overlap somewhat, but this is by design.

osmium sort country-* -o full.pbf
osmium sort full.pbf -o t.pbf
osmium extract -p ../oma.poly t.pbf -o oma.pbf

However, the last command fails with:

Node ID twice in input. Maybe you are using a history or change file?
This command expects the input file to be ordered: First nodes in order of ID,
then ways in order of ID, then relations in order of ID.

osm2pgsql also complains about duplicates (ERROR: Input data is not ordered: node id 7918192 appears more than once.)

The t.pbf and full.pbf files are identical (same md5sum and size).

Since I need the original OSM IDs, I cannot use the renumber command. Downloading the full europa.pbf and then extracting by polygon seems like a wasteful use of resources.

(debian testing, osmium version 1.14.0, libosmium version 2.17.3, Supported PBF compression types: none zlib lz4)

joto commented 2 years ago

You should never have duplicate nodes in your files; if you do, something is wrong with your processing toolchain. Neither osmium sort nor osm2pgsql can fix your data if it is broken in the first place.

If you want to merge several extracts (from Geofabrik or elsewhere), use osmium merge to merge the files; no sorting is necessary. If the files you downloaded were extracted at the same point in time, this will always give you a valid data file without duplicate nodes. Using osmium sort with several input files, as you do above, will not remove duplicate entries, but osmium merge will.

If you tried osmium merge and it failed, then most likely the extracts were from different points in time. If you downloaded extract X yesterday and extract Y today, they might contain different versions of the same object, and osmium merge will then leave both versions of the object in the file. If you merge two pieces of data that don't fit together, all sorts of problems can result, so you have to make sure never to do that; it is not something osmium can fix for you.
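
The difference described above can be illustrated with a small toy model in plain Python. This is only a sketch under simplifying assumptions, not how osmium is implemented: each object is reduced to a (type, id, version) tuple, and the names toy_sort, toy_merge, slovakia, austria, and austria_newer are made up for illustration.

```python
from heapq import merge as heap_merge

# OSM files are ordered by type first (nodes, ways, relations), then by ID.
TYPE_ORDER = {"node": 0, "way": 1, "relation": 2}


def sort_key(obj):
    kind, obj_id, version = obj
    return (TYPE_ORDER[kind], obj_id, version)


def toy_sort(*files):
    """Toy model of sorting several inputs: concatenate, then order.

    Nothing is dropped, so an object present in two inputs appears twice.
    """
    return sorted((obj for f in files for obj in f), key=sort_key)


def toy_merge(*files):
    """Toy model of merging pre-sorted inputs.

    Identical (type, id, version) entries collapse into one; different
    versions of the same object are all kept.
    """
    merged = []
    for obj in heap_merge(*files, key=sort_key):
        if not merged or merged[-1] != obj:
            merged.append(obj)
    return merged


# Two overlapping extracts from the same point in time (hypothetical data).
slovakia = [("node", 1, 1), ("node", 2, 3), ("way", 10, 1)]
austria = [("node", 2, 3), ("node", 5, 1), ("way", 10, 1)]
# The same neighbouring extract downloaded later: node 2 has a newer version.
austria_newer = [("node", 2, 4), ("node", 5, 1), ("way", 10, 1)]

print(toy_sort(slovakia, austria))         # node 2 and way 10 appear twice
print(toy_merge(slovakia, austria))        # each object appears once
print(toy_merge(slovakia, austria_newer))  # node 2 appears as v3 and v4
```

In the last case the node ID still occurs twice in the output, which is the situation that makes a later extract step fail.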

MichalPP commented 2 years ago

Thanks, changing sort to merge worked.

I will make a PR for the osmium-sort manual page to reflect that sort behaves like cat and does not in fact merge the input files.
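
For reference, the corrected pipeline (merge instead of sort, then extract) might look like the following sketch. It only builds and prints the command lines; the function name build_commands and the input file names country-a.pbf and country-b.pbf are hypothetical, while ../oma.poly, full.pbf, and oma.pbf are taken from the original report.

```python
import shlex


def build_commands(extracts, poly_file, merged="full.pbf", output="oma.pbf"):
    """Return the two osmium invocations as argument lists."""
    # Step 1: merge the overlapping country extracts into one file.
    merge_cmd = ["osmium", "merge", *extracts, "-o", merged]
    # Step 2: cut the region of interest out of the merged file.
    extract_cmd = ["osmium", "extract", "-p", poly_file, merged, "-o", output]
    return [merge_cmd, extract_cmd]


# Hypothetical input file names, for illustration only.
for cmd in build_commands(["country-a.pbf", "country-b.pbf"], "../oma.poly"):
    print(shlex.join(cmd))
```

To actually run the steps, each list could be passed to subprocess.run(cmd, check=True), assuming osmium is installed and on the PATH.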

district10 commented 9 months ago

How can I fix bad data with duplicate nodes? I tried osmium merge bad-data-with-dup-nodes.pbf -o output.pbf, but the duplicate nodes are still there.

district10 commented 9 months ago

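import os
import osmium


class UniqHandler(osmium.SimpleHandler):
    """Forward each object to the writer only the first time its ID is seen."""

    def __init__(self, writer):
        super().__init__()
        # IDs are unique only within one object type, so track each type separately.
        self.nodes = set()
        self.ways = set()
        self.relations = set()
        self.writer = writer

    def node(self, n):
        if n.id in self.nodes:
            return  # duplicate node ID: skip it
        self.nodes.add(n.id)
        self.writer.add_node(n)

    def way(self, w):
        if w.id in self.ways:
            return  # duplicate way ID: skip it
        self.ways.add(w.id)
        self.writer.add_way(w)

    def relation(self, r):
        if r.id in self.relations:
            return  # duplicate relation ID: skip it
        self.relations.add(r.id)
        self.writer.add_relation(r)


def osm_uniq(input: str, output: str):
    # Create the output directory if it does not exist yet.
    os.makedirs(os.path.dirname(os.path.abspath(output)), exist_ok=True)
    writer = osmium.SimpleWriter(output)
    try:
        UniqHandler(writer).apply_file(input)
    finally:
        # Always close the writer so buffered data is flushed to disk.
        writer.close()


if __name__ == "__main__":
    import fire

    # Print fire's help text directly instead of sending it through a pager.
    fire.core.Display = lambda lines, out: print(*lines, file=out)
    fire.Fire(osm_uniq)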

I wrote a pyosmium script (above) that keeps only the first occurrence of each node, way, and relation ID.
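
If the input is already ordered by type and ID, duplicates sit next to each other, so they can in principle be dropped without keeping a set of every ID in memory. The following is a minimal sketch of that idea in plain Python, not a pyosmium program: objects are modelled as (type, id, version) tuples, and the function name uniq_sorted is made up for illustration. Unlike the script above, it keeps the highest version of each object instead of the first one seen.

```python
from itertools import groupby


def uniq_sorted(objects):
    """Yield one object per (type, id) from a stream sorted by type and ID.

    Assumes duplicates are adjacent; keeps the entry with the highest version.
    """
    for _, group in groupby(objects, key=lambda obj: (obj[0], obj[1])):
        yield max(group, key=lambda obj: obj[2])


# Hypothetical sorted input with a duplicated node and a duplicated way.
data = [
    ("node", 1, 1),
    ("node", 2, 3),
    ("node", 2, 4),
    ("way", 10, 1),
    ("way", 10, 1),
]
print(list(uniq_sorted(data)))
```

Either approach only removes repeated IDs. As noted earlier in the thread, if the duplicates come from extracts taken at different points in time, the remaining objects may still not fit together.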

osmcode / osmium-tool

osmium sort does not remove duplicate nodes #244