Check for presence of user metadata subset

mojodna commented 6 years ago

Some PBFs may contain partial metadata without empty values for uid, user, and changeset. This prevents an IndexOutOfBoundsException when attempting to read such files.

I don't know how to generate such a PBF for a test fixture (and Peru is too large), so swapping in http://download.geofabrik.de/south-america/peru-180101.osm.pbf for data-with-metadata.pbf will trigger the bug in TestReadWithMetadata.

sebkur commented 6 years ago

Thanks. I'm looking for a testsuite with PBF data so that we can maybe find problems systematically and add test cases that make sure the fixes keep working in the future. So far I found this: https://github.com/osmcode/osm-testdata which looks promising, but doesn't contain PBF data yet.

sebkur commented 6 years ago

I've looked into other osm processing projects and contacted one maintainer, but it seems there is no comprehensive set of pbf testing files out there yet. It's probably a good idea to come up with one. Generating cumbersome PBF files might be a bit involved however, because the libraries, including osm4j-pbf, don't offer support for modifying settings to produce files with specific properties.

Checking in pbf files of dozens of megabytes into a source repository for testing doesn't seem like a good idea (like the peru extract that caused a problem). If you stumble upon smaller files that cause errors, please keep them and send them over. Do you have a copy of the old peru extract that caused the problem?

mojodna commented 6 years ago

I've pushed it to S3 since it appears that the Geofabrik download for it has been replaced:

https://mojodna-temp.s3.amazonaws.com/problematic-pbfs/peru-180101.osm.pbf

sebkur commented 6 years ago

I have looked into this and come to the conclusion that it is not obvious how to fix this. The handling of partial metadata is inconsistent when looking at osm4j-xml and osm4j-pbf. The xml-module assigns default values of -1 to the model class fields when it does not find any attributes, see here.

I'm not sure whether -1 is a good 'magic' value for encoding 'not available', I think some tools use negative values for some purpose. More useful would probably be to store separately in the model whether individual metadata fields are set and provide getters such as hasUid().

I checked libosmium which seems to handle the Peru file properly (and stores a bitmask for availability of individual metadata values). Osmosis does not do any special handling of partial metadata and fails.

mojodna commented 6 years ago

Hmm.

Yeah, -1 seems problematic (since negative ids are used as placeholder values in many contexts), but how else would one encode the int version of Double.NaN? It was interesting to see Integer.MAX_VALUE used for null lat/lng values..

sebkur commented 6 years ago

One solution would be to have boolean hasX fields and boolean hasX() methods for all the metadata fields, i.e. version, timestamp, uid, user, changeset and visible. Assuming that those values are always present (or have reasonable default values like visible = true) makes consuming code a bit simpler and of course saves some memory. The memory issue could be addressed by combining all boolean flags into a single int variable and implementing the hasX() methods accordingly. Although I'm not sure whether that's an optimization worth making since tasks that don't require the metadata can disable it's parsing anyway in which case the whole metadata field is set to null.

sebkur commented 4 years ago

I'm archiving this repository because this module has been merged into the osm4j master repository. Please re-open the issue there if you want to continue the discussion.

topobyte / osm4j-pbf

Check for presence of user metadata subset #6