mapbox / geobuf

A compact binary encoding for geographic data.

Lossy encoding of high-precision coordinates #96

Open Zetten opened 6 years ago

Zetten commented 6 years ago

I think the assertion in the README that the geobuf encoding is "lossless" is a bit ... enthusiastic.

Specifically, I've found three related issues, which I'll walk through below.

So, the first point: if the codec is called "lossless", I would expect a roundtrip GeoJSON -> pbf -> GeoJSON to leave the input unchanged. Adding t.same(geobuf.decode(pbf), geojson); to the high-precision test demonstrates the problem:

    +++ found                                                           
    --- wanted                                                          
           "geometry": {                                                
             "coordinates": [                                           
               [                                                        
                 [                                                      
    -              5425435.73308157                                     
    -              2012689.6354403072                                   
    +              5425435.733082                                       
    +              2012689.63544                                        
                 ]                                                      
                 [                                                      
    -              5425333.066045091                                    
    -              2012658.8061882276                                   
    +              5425333.066045                                       
    +              2012658.806188                                       
                 ]                                                      
                 [                                                      
    -              5425324.357915714                                    
    -              2012693.5183856217                                   
    +              5425324.357916                                       
    +              2012693.518386                                       
                 ]                                                      
                 [                                                      
    -              5425426.519392735                                    
    -              2012720.2386971796                                   
    +              5425426.519393                                       
    +              2012720.238697                                       
                 ]                                                      
                 [                                                      
    -              5425435.73308157                                     
    -              2012689.6354403072                                   
    +              5425435.733082                                       
    +              2012689.63544                                        
                 ]                                                      
               ]                                                        
             ]                                                          
             "type": "Polygon"                                          

This can be worked around by changing encode.js to set maxPrecision = <some bigger value>. Ideally, of course, this would be determined from the precision of the input data, or at least taken as a parameter on the encode function.
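For illustration, here's a hypothetical helper (not part of geobuf's API) sketching how the needed precision could be detected from the input values; something along these lines, capped at maxPrecision, is presumably what the encoder would need to do:

    // Hypothetical: find the number of decimal digits needed to represent
    // every value exactly, up to a cap, so the result could be handed to
    // the encoder instead of a hard-coded maxPrecision.
    function neededDigits(values, cap = 12) {
        let digits = 0;
        for (const v of values) {
            let e = Math.pow(10, digits);
            while (digits < cap && Math.round(v * e) / e !== v) {
                digits++;
                e *= 10;
            }
        }
        return digits;
    }

    neededDigits([5425435.73308157]); // e.g. 8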

However, this exposes the second issue: writing the decoded GeoJSON back to a file does not (necessarily) reproduce the input text if the precision of the input exceeds maxPrecision.

This is not really avoidable, I think - neither JSON nor GeoJSON specifies precision, and it's not incorrect per se. It's reasonable to say that "GeoJSON" means the parsed object representation and not the input string. But I would say it's important to emphasise that point as a caveat to the "lossless" statement.

Finally, and as an artifact of the first and second points, the third issue: around or beyond the precision limit of a double (i.e. 16 significant figures) it is possible to end up with two coordinates, each individually representable at that precision, but differing in magnitude enough that the geometry as a whole is not safely encodable by geobuf. Take a Point:

    [
      0.01541290815946613,
      154.1290815946613
    ]

The result of a round trip test on this with maxPrecision = 1e16 (the actual max precision):

    +++ found                                                           
    --- wanted                                                          
       "features": [                                                    
         {                                                              
           "geometry": {                                                
             "coordinates": [                                           
    -          0.01541290815946613                                      
    +          0.0154129081594661                                       
               154.1290815946613                                        
             ]                                                          
             "type": "Point"                                            
           }                                                            

If we instead take the scale of the numbers, i.e. the maximum number of digits after the decimal place (maxPrecision = 1e17, i.e. 17 decimal digits), the problem becomes that the larger element (154...) now overflows sint64 and pbf complains: Given varint doesn't fit into 10 bytes.
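The squeeze is easy to reproduce without geobuf at all, in plain JavaScript, using the Point above:

    // One scale factor cannot serve both coordinates at once.
    const p = [0.01541290815946613, 154.1290815946613];

    // At 16 decimal digits the smaller coordinate loses its last digit:
    console.log(Math.round(p[0] * 1e16) / 1e16); // 0.0154129081594661

    // At 17 decimal digits it would survive, but the larger coordinate's
    // scaled integer is ~1.54e19, beyond the sint64 maximum of 2^63 - 1
    // (~9.22e18) - hence "Given varint doesn't fit into 10 bytes":
    console.log(p[1] * 1e17); // ~1.54e19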

Unfortunately I can't see any obvious solution to this. If geobuf insists on the scale * sint64 approach for coordinates, high-precision numbers are going to be tricky to encode losslessly.

A possible solution would be a more complex protobuf encoding similar to Java's BigDecimal, which uses BigInteger as the arbitrary-precision integer value - this doesn't have the limitation of sint64, but a more appropriate protobuf type would have to be found (maybe just bytes with a well-defined size? or even string?).
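To sketch that idea in plain JavaScript (names illustrative, not a schema proposal): a coordinate becomes an (unscaled integer, scale) pair, with the unscaled part serialisable as bytes or string:

    // Illustrative BigDecimal-style pair: value = unscaled * 10^-scale.
    // Parsing from the input text avoids double rounding entirely;
    // sign and exponent notation are ignored for brevity.
    function toUnscaled(str) {
        const dot = str.indexOf('.');
        const scale = dot === -1 ? 0 : str.length - dot - 1;
        return { unscaled: BigInt(str.replace('.', '')), scale };
    }

    toUnscaled('154.1290815946613');   // { unscaled: 1541290815946613n, scale: 13 }
    toUnscaled('0.01541290815946613'); // { unscaled: 1541290815946613n, scale: 17 }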

mourner commented 6 years ago

The assumption here is that we're encoding GeoJSON, not just arbitrary data. Per the specification, its coordinates are longitude/latitude pairs defining geographic positions. With 6 digits after the decimal point, that's ~10cm precision, which GeoJSON data pretty much never exceeds. Perhaps we should have upped that to 7 digits, the "practical limit of commercial surveying", but anything more than that is simply pointless.
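A back-of-envelope check of that figure:

    // Metres per degree of latitude ≈ Earth circumference / 360.
    const metersPerDegree = 40075000 / 360; // ≈ 111,319 m
    console.log(metersPerDegree * 1e-6);    // ≈ 0.11 m at the 6th decimal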

Zetten commented 6 years ago

I actually encountered this issue on published data[1]: https://data.gov.uk/dataset/regions-december-2016-ultra-generalised-clipped-boundaries-in-england2

Now it's definitely the case that the stupidly high precision in the GeoJSON is incorrect - in their case the numbers go down to something like subatomic-particle-radius scale. The floating-point fuzziness seems to top out around 9 decimal places, so I'm using that as maxPrecision in my case.

But my point is more that the encoding claims to be lossless, which is evidently not the case.

Also, there's no way to control the encoder's maxPrecision, so this rounding can't be inspected and managed on a per-input basis.

And less importantly, I was confused for a moment by the fact that the tests claim to be 'round trip' when they don't actually perform a full comparison against the input data.

I concede, though, that since the dropped precision is low-impact it's not a practical problem - unless you look at it through the data encoding/decoding lens, where we'd like to prove data integrity or lineage across formats.

[1]: To be more precise, I encountered the issue while implementing a geobuf encoder/decoder of my own, and testing against this data with this reference implementation.

mourner commented 6 years ago

Yeah, I agree we should be clearer about this in the documentation. Regarding the tests: they do perform a "deepEquals" comparison of the input GeoJSON with the roundtripped one: https://github.com/mapbox/geobuf/blob/master/test/validate.test.js#L83
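For reference, a stripped-down version of that check, using geobuf's documented encode/decode API (the fixture path is a placeholder):

    const test = require('tape');
    const fs = require('fs');
    const Pbf = require('pbf');
    const geobuf = require('geobuf');

    test('roundtrip preserves the parsed GeoJSON', function (t) {
        // fixture.json stands in for whatever GeoJSON file is under test.
        const geojson = JSON.parse(fs.readFileSync('fixture.json'));
        const buf = geobuf.encode(geojson, new Pbf());
        t.deepEqual(geobuf.decode(new Pbf(buf)), geojson);
        t.end();
    });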

Zetten commented 6 years ago

Aha, so they do! I was completely focused on the precision.json test, which seems to omit that bit.

mourner commented 6 years ago

@Zetten that particular test was for an earlier issue that got fixed, specifically about the closing points of a polygon ring not matching the first ones.

nbolten commented 6 years ago

I spent some time trying to distribute data in geobuf, but encountered downstream data errors due to a combination of a pygeobuf bug and the default 1e6 precision issue.

While precision greater than 10 cm does seem absurd for most use cases, it's not below the threshold for some common features. For example, say there is a lamp post and a bus stop right next to one another (8 cm apart). We may not actually know their absolute locations to sub-cm (1e7) precision, but we wish to preserve their relative spatial relationship: they aren't perfectly colocated. A hard 1e6 precision has over a 50% chance of deciding that those two separate features are at the exact same location.
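To make that concrete, a sketch with hypothetical latitudes (one degree of latitude is roughly 111 km, so ~7e-7 degrees is about 8 cm):

    // Two features ~8 cm apart collapse to one location at 1e6 precision.
    const e = 1e6;
    const lampPost = 47.6061997; // hypothetical latitude
    const busStop  = 47.6062004; // ~7e-7 degrees (~8 cm) further north
    console.log(Math.round(lampPost * e) / e); // 47.6062
    console.log(Math.round(busStop * e) / e);  // 47.6062 - "colocated"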

I ran into similar issues when attempting to use geobuf for pedestrian infrastructure data - 10 cm starts to matter at that level of detail. I would strongly prefer that the spec / implementations advertise the lossiness of the approach or allow more control over precision.

mourner commented 6 years ago

I've updated the readme — changed the statement from "lossless" to "nearly lossless" and added some clarification around that.

Zetten commented 6 years ago

To be honest though, "nearly lossless" is still "lossy".

IMHO, the fact that any information lost is at high precision on a human scale should be irrelevant when talking about the properties of a data format - by analogy, we don't claim high-bitrate MP3 tracks are "nearly lossless" even though the human ear can't distinguish them from raw WAV or truly losslessly encoded equivalents. (Not to mention @nbolten's observation that geobuf doesn't currently offer a sufficiently useful precision for real-world applications.)

I think I'd prefer a neutral warning statement like "geobuf can currently only represent coordinates to a precision of X", leaving any decision between the value-laden terms lossy/lossless out of it. It might not make geobuf sound quite so great, but it avoids any confusion or misrepresentation.

reyemtm commented 4 years ago

Sorry I'm coming late to this discussion, but is there no way to set the desired precision, for example in options or manually?