Open mojodna opened 6 years ago
Thanks. I'm looking for a testsuite with PBF data so that we can maybe find problems systematically and add test cases that make sure the fixes keep working in the future. So far I found this: https://github.com/osmcode/osm-testdata which looks promising, but doesn't contain PBF data yet.
I've looked into other osm processing projects and contacted one maintainer, but it seems there is no comprehensive set of pbf testing files out there yet. It's probably a good idea to come up with one. Generating cumbersome PBF files might be a bit involved however, because the libraries, including osm4j-pbf, don't offer support for modifying settings to produce files with specific properties.
Checking in pbf files of dozens of megabytes into a source repository for testing doesn't seem like a good idea (like the peru extract that caused a problem). If you stumble upon smaller files that cause errors, please keep them and send them over. Do you have a copy of the old peru extract that caused the problem?
I've pushed it to S3 since it appears that the Geofabrik download for it has been replaced:
https://mojodna-temp.s3.amazonaws.com/problematic-pbfs/peru-180101.osm.pbf
I have looked into this and come to the conclusion that it is not obvious how to fix this. The handling of partial metadata is inconsistent when looking at osm4j-xml
and osm4j-pbf
. The xml-module assigns default values of -1
to the model class fields when it does not find any attributes, see here.
I'm not sure whether -1
is a good 'magic' value for encoding 'not available', I think some tools use negative values for some purpose. More useful would probably be to store separately in the model whether individual metadata fields are set and provide getters such as hasUid()
.
I checked libosmium
which seems to handle the Peru file properly (and stores a bitmask for availability of individual metadata values). Osmosis
does not do any special handling of partial metadata and fails.
Hmm.
Yeah, -1
seems problematic (since negative ids are used as placeholder values in many contexts), but how else would one encode the int version of Double.NaN
? It was interesting to see Integer.MAX_VALUE
used for null lat/lng values..
One solution would be to have boolean hasX
fields and boolean hasX()
methods for all the metadata fields, i.e. version
, timestamp
, uid
, user
, changeset
and visible
. Assuming that those values are always present (or have reasonable default values like visible = true
) makes consuming code a bit simpler and of course saves some memory. The memory issue could be addressed by combining all boolean flags into a single int
variable and implementing the hasX()
methods accordingly. Although I'm not sure whether that's an optimization worth making since tasks that don't require the metadata can disable it's parsing anyway in which case the whole metadata
field is set to null
.
I'm archiving this repository because this module has been merged into the osm4j master repository. Please re-open the issue there if you want to continue the discussion.
Some PBFs may contain partial metadata without empty values for
uid
,user
, andchangeset
. This prevents an IndexOutOfBoundsException when attempting to read such files.I don't know how to generate such a PBF for a test fixture (and Peru is too large), so swapping in http://download.geofabrik.de/south-america/peru-180101.osm.pbf for
data-with-metadata.pbf
will trigger the bug inTestReadWithMetadata
.