microsoft / GlobalMLBuildingFootprints

Worldwide building footprints derived from satellite imagery

File optimization - coordinate decimal places #5

Open rbrundritt opened 2 years ago

rbrundritt commented 2 years ago

A small optimization that I'm sure many would appreciate is to round all coordinates to 6 decimal places. That gives a precision of roughly 10 cm. Currently there appear to be 15 decimal places in the coordinates, which corresponds to about a tenth of a nanometer, roughly the size of an atom.
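For a rough sense of scale, the ground distance covered by one step in the last decimal place can be estimated as follows. This is only a sketch: it assumes a spherical-Earth approximation of about 111,320 m per degree of latitude, and the function name `grid_size_m` is made up for illustration.

```python
import math

# Assumption: approximate length of one degree of latitude, in meters.
METERS_PER_DEGREE_LAT = 111_320.0


def grid_size_m(decimals, latitude_deg=0.0):
    """Approximate ground size of one step in the last decimal place.

    Returns (north_south_m, east_west_m). The east-west size shrinks with
    the cosine of the latitude because meridians converge toward the poles.
    """
    step_deg = 10.0 ** -decimals
    north_south = step_deg * METERS_PER_DEGREE_LAT
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


for d in (5, 6, 7, 15):
    ns, ew = grid_size_m(d)
    print(f"{d} decimal places: about {ns:.3g} m north-south at the equator")
```

Under these assumptions, 6 decimal places comes out at roughly 0.11 m and 15 decimal places at roughly 1e-10 m.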

As a test I ran a simple find and replace using this regular expression on the Maldives file:

Search: ([-0-9]+\.[0-9]{6})[0-9]+\s*,\s*([-0-9]+\.[0-9]{6})[0-9]+

Replace: $1,$2
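For anyone who wants to try the same test in a script rather than an editor, here is a minimal Python sketch of that find and replace. The sample coordinates are made up, not taken from the Maldives file, and `truncate_coordinates` is just an illustrative name. Note that the expression truncates to 6 decimal places rather than rounding, and only touches pairs where both values have more than 6 decimals.

```python
import re

# Same pattern as the search expression above. Python's re module writes
# the replacement as \1,\2 where the editor syntax uses $1,$2.
PATTERN = re.compile(
    r"([-0-9]+\.[0-9]{6})[0-9]+\s*,\s*([-0-9]+\.[0-9]{6})[0-9]+"
)


def truncate_coordinates(text):
    """Cut every matching coordinate pair down to 6 decimal places."""
    return PATTERN.sub(r"\1,\2", text)


# Hypothetical sample data, for illustration only.
sample = (
    "[[73.509912345678901, 4.175123456789012], "
    "[73.509987654321098, 4.175198765432109]]"
)
print(truncate_coordinates(sample))
# prints [[73.509912,4.175123], [73.509987,4.175198]]
```

The second longitude shows the truncation: 73.5099876... would round to 73.509988, but the expression keeps 73.509987.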

And saw the following savings:

Making the files smaller makes them easier and faster to work with, while also lowering costs. Multiplied across everyone who uses this data, it would also reduce the carbon footprint of the related data processing and better align with Microsoft's sustainability initiatives.

gilles-morain commented 2 years ago

Please note, however, that the test you ran (rounding coordinates) may break the geometries' validity, so this has to be done in a less hacky way, preferably upstream (or else everybody will spend computing power doing the exact same complex task multiple times).

rbrundritt commented 2 years ago

That was a test to show the potential savings if this were done upstream (which is why I reported this as an issue).

That said, the likelihood of this making any of these geometries invalid is extremely low, since the precision at 6 decimal places is roughly 11 cm at the equator and the resolution of most of the satellite imagery used to create these footprints is at best 15 cm per pixel. So, for this regular expression to make one of these geometries invalid, there would need to be two coordinates in a geometry that fall within the same pixel of an image. If the regular expression were modified to allow 7 decimal places, we could prove mathematically that it would be impossible to make any of these geometries invalid (unless they were already invalid). Of course, we don't want everyone doing this, as it is a lot of processing that should be done once upstream.
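The condition described above, two distinct vertices ending up at the same coordinates after rounding, can be checked with a short sketch like the one below. The ring is hypothetical data and the function names are made up for illustration. It only looks for consecutive vertices that collapse into one; it is not a full validity check (it would not detect self-intersections, for example), which would need a proper geometry library.

```python
def round_ring(ring, decimals=6):
    """Round every (x, y) vertex of a polygon ring to the given decimals."""
    return [(round(x, decimals), round(y, decimals)) for x, y in ring]


def collapsed_vertices(ring, decimals=6):
    """Indices i where vertices i and i+1 differ before rounding
    but coincide after rounding."""
    rounded = round_ring(ring, decimals)
    return [
        i
        for i in range(len(ring) - 1)
        if ring[i] != ring[i + 1] and rounded[i] == rounded[i + 1]
    ]


# Hypothetical closed ring whose first two vertices are only a few
# centimeters apart, i.e. closer together than the 6-decimal grid.
ring = [
    (73.5099121, 4.1751231),
    (73.5099124, 4.1751233),
    (73.5100500, 4.1751200),
    (73.5100500, 4.1752500),
    (73.5099121, 4.1751231),
]

print(collapsed_vertices(ring, decimals=6))  # prints [0]
print(collapsed_vertices(ring, decimals=7))  # prints []
```

A check like this could be run once upstream to count how many footprints, if any, are affected at a given number of decimal places.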