onthegomap / planetiler

Flexible tool to build planet-scale vector tilesets from OpenStreetMap data fast
Apache License 2.0

Initial geoparquet support #888

Closed msbarry closed 1 month ago

msbarry commented 1 month ago

Add initial GeoParquet support to planetiler for reading datasets like Overture Maps.

Planetiler will attempt to read GeoParquet metadata from the "geo" file metadata field to determine which field contains the default geometry for each feature and how to deserialize it (including GeoArrow geometries). If that metadata is missing, it falls back to a geometry, wkb_geometry, or wkt_geometry field (similar to GDAL).
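The lookup order described above could be sketched roughly as follows. This is an illustration, not planetiler's actual code: the class `GeometryColumnResolver` and its `resolve` method are hypothetical, and the sketch assumes the "geo" JSON has already been parsed so that its `primary_column` value (the key the GeoParquet spec uses for the default geometry column) is available as a plain string. Checking that the named column exists in the schema is also an assumption.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class GeometryColumnResolver {
  // Fallback names listed in the PR description, tried in this order.
  static final List<String> FALLBACKS = List.of("geometry", "wkb_geometry", "wkt_geometry");

  /**
   * @param primaryColumn "primary_column" from the parsed "geo" metadata, or null when the metadata is absent
   * @param columns       column names present in the file schema
   * @return the column to read geometries from, or empty if none can be found
   */
  static Optional<String> resolve(String primaryColumn, Set<String> columns) {
    // Prefer the column named by the metadata (assumption: only if it really exists).
    if (primaryColumn != null && columns.contains(primaryColumn)) {
      return Optional.of(primaryColumn);
    }
    // Otherwise fall back to the first conventional name found in the schema.
    return FALLBACKS.stream().filter(columns::contains).findFirst();
  }

  public static void main(String[] args) {
    System.out.println(resolve("geom", Set.of("id", "geom")));
    System.out.println(resolve(null, Set.of("id", "wkb_geometry")));
    System.out.println(resolve(null, Set.of("id")));
  }
}
```

A real reader would also need the per-column encoding from the same metadata to decide how to deserialize the value; that part is left out here.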

Parquet supports structured attributes like maps and lists. For now the SourceFeature API is unchanged, so you may get back a Map<String, List<Object>> from feature.getTag(name). A future PR will add a more convenient API for working with structured tags.
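Until that API exists, a profile has to unpack nested values by hand. The helper below is a minimal sketch of one way to do that, assuming nested tags arrive as plain `java.util.Map` and `java.util.List` instances. The class `StructuredTags`, the method `dig`, and the shape of the sample value are all hypothetical and not part of planetiler.

```java
import java.util.List;
import java.util.Map;

public class StructuredTags {
  /**
   * Walks a nested value: String path elements index into maps, Integer path elements
   * index into lists. Returns null as soon as the path does not resolve.
   */
  static Object dig(Object value, Object... path) {
    Object current = value;
    for (Object key : path) {
      if (current instanceof Map<?, ?> map && key instanceof String) {
        current = map.get(key);
      } else if (current instanceof List<?> list && key instanceof Integer i && i >= 0 && i < list.size()) {
        current = list.get(i);
      } else {
        return null;
      }
    }
    return current;
  }

  public static void main(String[] args) {
    // Stand-in for the Object returned by feature.getTag("names"); the shape is made up.
    Object names = Map.of("common", List.of(Map.of("language", "en", "value", "Boston")));
    System.out.println(dig(names, "common", 0, "value"));
    System.out.println(dig(names, "common", 5, "value"));
  }
}
```

Returning null for any unresolved step keeps call sites short, at the cost of not distinguishing a missing key from a type mismatch.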

The --bounds bbox argument gets converted to a push-down predicate that lets planetiler avoid reading entire files, row groups, and records that fall outside the bounding box. For example, since Overture data is sorted roughly geographically, if you specify a bounding box for a city like Boston, it can select and process all the features in less than 5 seconds.
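The idea behind such a predicate can be illustrated with a small sketch. It assumes each row group exposes a bounding extent (for example derived from min/max statistics on bounding-box columns) and skips any group whose extent does not overlap the query. The types `Box` and `RowGroup`, the method `select`, and the coordinates are hypothetical and are not planetiler's or parquet's API.

```java
import java.util.List;

public class BoundsFilter {
  /** Axis-aligned lon/lat box. */
  record Box(double minLon, double minLat, double maxLon, double maxLat) {
    boolean intersects(Box o) {
      return minLon <= o.maxLon && maxLon >= o.minLon
        && minLat <= o.maxLat && maxLat >= o.minLat;
    }
  }

  /** Hypothetical stand-in for a row group and the extent derived from its column statistics. */
  record RowGroup(String name, Box extent) {}

  /** Keeps only the row groups whose extent overlaps the query box; the rest are never read. */
  static List<RowGroup> select(List<RowGroup> groups, Box query) {
    return groups.stream().filter(g -> g.extent().intersects(query)).toList();
  }

  public static void main(String[] args) {
    Box boston = new Box(-71.2, 42.2, -70.9, 42.4); // approximate, for illustration only
    List<RowGroup> groups = List.of(
      new RowGroup("new-england", new Box(-73.5, 41.0, -69.9, 45.0)),
      new RowGroup("california", new Box(-124.5, 32.5, -114.1, 42.0)));
    System.out.println(select(groups, boston).stream().map(RowGroup::name).toList());
  }
}
```

The same overlap test can be applied at each level (file, row group, record), which is why geographically sorted data benefits most: few groups overlap a small query box.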

The Apache Java Parquet reader is tightly coupled to the rest of Hadoop and cannot easily be used on its own (see https://issues.apache.org/jira/browse/PARQUET-1126). To avoid pulling in many MB of dependencies, this PR uses the parquet-floor project, which keeps a minimal set of dependencies and stubs out the rest, so the jar size only goes up from 70 MB to 84 MB.

Planned for followup PRs:

github-actions[bot] commented 1 month ago
This Branch cd1bff2591b5fad5ac5ee8e98fa36a4118899de7

```
0:01:09 DEB [archive] - Tile stats:
0:01:09 DEB [archive] - Biggest tiles (gzipped)
1. 14/4942/6092 (154k) https://onthegomap.github.io/planetiler-demo/#14.5/41.82864/-71.40015 (poi:83k)
2. 9/154/190 (149k) https://onthegomap.github.io/planetiler-demo/#9.5/41.77078/-71.36719 (landcover:85k)
3. 10/308/380 (138k) https://onthegomap.github.io/planetiler-demo/#10.5/41.90214/-71.54297 (landcover:66k)
4. 10/308/381 (136k) https://onthegomap.github.io/planetiler-demo/#10.5/41.63994/-71.54297 (landcover:72k)
5. 14/4941/6092 (111k) https://onthegomap.github.io/planetiler-demo/#14.5/41.82864/-71.42212 (poi:64k)
6. 14/4941/6093 (110k) https://onthegomap.github.io/planetiler-demo/#14.5/41.81227/-71.42212 (building:62k)
7. 14/4940/6092 (99k) https://onthegomap.github.io/planetiler-demo/#14.5/41.82864/-71.44409 (building:92k)
8. 11/616/762 (98k) https://onthegomap.github.io/planetiler-demo/#11.5/41.7057/-71.63086 (landcover:71k)
9. 14/4942/6091 (96k) https://onthegomap.github.io/planetiler-demo/#14.5/41.84501/-71.40015 (building:79k)
10. 11/616/761 (96k) https://onthegomap.github.io/planetiler-demo/#11.5/41.83679/-71.63086 (landcover:72k)
0:01:09 DEB [archive] - Max tile sizes
                      z0   z1   z2   z3   z4   z5   z6   z7   z8   z9  z10  z11  z12  z13  z14  all
boundary             154  374  443  583  938  339  433  548  773 1.6k 2.1k 7.2k 6.4k 5.8k 4.5k 7.2k
water               7.7k 3.7k 8.6k 5.5k 2.6k 5.1k  15k  18k  16k  26k  15k  13k  17k  15k  12k  26k
place                  0    0  441  441  441  639  712   1k 1.5k 3.1k 5.6k 3.3k 1.7k  795  936 5.6k
landuse                0    0    0    0  548  694 1.6k 6.8k  17k  44k  59k  50k  38k  19k  12k  59k
transportation         0    0    0    0  243  782 1.2k 5.9k   8k  24k  17k  19k  65k  48k  33k  65k
waterway               0    0    0    0  111  118    0    0    0 3.1k 2.4k 2.1k 2.1k 4.9k 2.4k 4.9k
park                   0    0    0    0    0    0   1k 3.7k 9.7k  19k  13k 8.2k 4.3k 3.4k 4.4k  19k
transportation_name    0    0    0    0    0    0  369  464 1.2k 1.8k 5.4k 4.6k 3.9k 3.4k  18k  18k
landcover              0    0    0    0    0    0    0 9.5k  29k  85k  72k  81k  53k  30k  24k  85k
mountain_peak          0    0    0    0    0    0    0 1.1k 1.8k 3.4k 4.3k 2.8k 1.4k 1.4k  869 4.3k
water_name             0    0    0    0    0    0    0    0    0  486  461  433  452 1.2k 1.5k 1.5k
aerodrome_label        0    0    0    0    0    0    0    0    0    0  664  327  273  220  220  664
aeroway                0    0    0    0    0    0    0    0    0    0 1.6k 2.1k   3k 3.4k 2.7k 3.4k
poi                    0    0    0    0    0    0    0    0    0    0    0    0  501  498  83k  83k
building               0    0    0    0    0    0    0    0    0    0    0    0    0  59k  92k  92k
housenumber            0    0    0    0    0    0    0    0    0    0    0    0    0    0  35k  35k
full tile           7.9k   4k 9.5k 6.5k 3.7k   6k  20k  42k  85k 203k 185k 135k 114k 128k 244k 244k
gzipped             6.2k 3.5k 7.1k 5.2k 3.1k 4.8k  14k  29k  60k 149k 138k  98k  83k  91k 154k 154k
0:01:09 DEB [archive] - Max tile: 244k (gzipped: 154k)
0:01:09 DEB [archive] - Avg tile: 5.4k (gzipped: 4k) using weighted average based on OSM traffic
0:01:09 DEB [archive] - # tiles: 4,115,012
0:01:09 DEB [archive] - # features: 5,484,360
0:01:09 INF [archive] - Finished in 19s cpu:1m8s avg:3.7
0:01:09 INF [archive] -   read 1x(3% 0.6s wait:17s done:1s)
0:01:09 INF [archive] -   encode 4x(55% 10s wait:2s done:1s)
0:01:09 INF [archive] -   write 1x(22% 4s wait:12s done:1s)
0:01:09 INF [archive] - Finished in 1m10s cpu:3m30s gc:1s avg:3
0:01:09 INF [archive] - FINISHED!
0:01:09 INF [archive] -
0:01:09 INF [archive] - ----------------------------------------
0:01:09 INF [archive] - data errors:
0:01:09 INF [archive] -   render_snap_fix_input 16,639
0:01:09 INF [archive] -   osm_multipolygon_missing_way 389
0:01:09 INF [archive] -   osm_boundary_missing_way 73
0:01:09 INF [archive] -   merge_snap_fix_input 12
0:01:09 INF [archive] -   osm_boundary_duplicate_member 2
0:01:09 INF [archive] -   feature_centroid_if_convex_osm_invalid_multipolygon_empty_after_fix 2
0:01:09 INF [archive] -   feature_polygon_osm_invalid_multipolygon_empty_after_fix 2
0:01:09 INF [archive] -   omt_park_area_osm_invalid_multipolygon_empty_after_fix 1
0:01:09 INF [archive] -   omt_fix_water_before_ne_intersect 1
0:01:09 INF [archive] -   feature_point_on_surface_osm_invalid_multipolygon_empty_after_fix 1
0:01:09 INF [archive] - ----------------------------------------
0:01:09 INF [archive] - overall 1m10s cpu:3m30s gc:1s avg:3
0:01:09 INF [archive] - lake_centerlines 3s cpu:6s avg:1.9
0:01:09 INF [archive] -   read 1x(14% 0.5s done:3s)
0:01:09 INF [archive] -   process 4x(0% 0s done:3s)
0:01:09 INF [archive] -   write 1x(0% 0s done:3s)
0:01:09 INF [archive] - water_polygons 15s cpu:39s avg:2.7
0:01:09 INF [archive] -   read 1x(42% 6s done:7s)
0:01:09 INF [archive] -   process 4x(25% 4s wait:4s done:5s)
0:01:09 INF [archive] -   write 1x(4% 0.5s wait:9s done:5s)
0:01:09 INF [archive] - natural_earth 12s cpu:18s avg:1.5
0:01:09 INF [archive] -   read 1x(52% 6s done:6s)
0:01:09 INF [archive] -   process 4x(7% 0.8s wait:6s done:6s)
0:01:09 INF [archive] -   write 1x(0% 0s wait:6s done:6s)
0:01:09 INF [archive] - osm_pass1 2s cpu:6s avg:3.2
0:01:09 INF [archive] -   read 1x(2% 0s wait:2s)
0:01:09 INF [archive] -   parse 4x(35% 0.6s)
0:01:09 INF [archive] -   process 1x(67% 1s)
0:01:09 INF [archive] - osm_pass2 17s cpu:1m7s avg:3.9
0:01:09 INF [archive] -   read 1x(0% 0s wait:10s done:7s)
0:01:09 INF [archive] -   process 4x(76% 13s)
0:01:09 INF [archive] -   write 1x(3% 0.4s wait:17s)
0:01:09 INF [archive] - ne_lakes 0s cpu:0s avg:14.6
0:01:09 INF [archive] - boundaries 0s cpu:0s avg:1.3
0:01:09 INF [archive] - agg_stop 0s cpu:0s avg:0
0:01:09 INF [archive] - sort 1s cpu:4s avg:2.7
0:01:09 INF [archive] -   worker 1x(49% 0.7s)
0:01:09 INF [archive] - archive 19s cpu:1m8s avg:3.7
0:01:09 INF [archive] -   read 1x(3% 0.6s wait:17s done:1s)
0:01:09 INF [archive] -   encode 4x(55% 10s wait:2s done:1s)
0:01:09 INF [archive] -   write 1x(22% 4s wait:12s done:1s)
0:01:09 INF [archive] - ----------------------------------------
0:01:09 INF [archive] - archive 108MB
0:01:09 INF [archive] - features 281MB
```

Base bcaee6865d92f6713217e0b67f4cb95d3691d502

```
0:01:03 DEB [archive] - Tile stats:
0:01:03 DEB [archive] - Biggest tiles (gzipped)
1. 14/4942/6092 (154k) https://onthegomap.github.io/planetiler-demo/#14.5/41.82864/-71.40015 (poi:83k)
2. 9/154/190 (149k) https://onthegomap.github.io/planetiler-demo/#9.5/41.77078/-71.36719 (landcover:85k)
3. 10/308/380 (138k) https://onthegomap.github.io/planetiler-demo/#10.5/41.90214/-71.54297 (landcover:66k)
4. 10/308/381 (136k) https://onthegomap.github.io/planetiler-demo/#10.5/41.63994/-71.54297 (landcover:72k)
5. 14/4941/6092 (111k) https://onthegomap.github.io/planetiler-demo/#14.5/41.82864/-71.42212 (poi:64k)
6. 14/4941/6093 (110k) https://onthegomap.github.io/planetiler-demo/#14.5/41.81227/-71.42212 (building:62k)
7. 14/4940/6092 (99k) https://onthegomap.github.io/planetiler-demo/#14.5/41.82864/-71.44409 (building:92k)
8. 11/616/762 (98k) https://onthegomap.github.io/planetiler-demo/#11.5/41.7057/-71.63086 (landcover:71k)
9. 14/4942/6091 (96k) https://onthegomap.github.io/planetiler-demo/#14.5/41.84501/-71.40015 (building:79k)
10. 11/616/761 (96k) https://onthegomap.github.io/planetiler-demo/#11.5/41.83679/-71.63086 (landcover:72k)
0:01:03 DEB [archive] - Max tile sizes
                      z0   z1   z2   z3   z4   z5   z6   z7   z8   z9  z10  z11  z12  z13  z14  all
boundary             154  374  443  583  938  339  433  548  773 1.6k 2.1k 7.2k 6.4k 5.8k 4.5k 7.2k
water               7.7k 3.7k 8.6k 5.5k 2.6k 5.1k  15k  18k  16k  26k  15k  13k  17k  15k  12k  26k
place                  0    0  441  441  441  639  712   1k 1.5k 3.1k 5.6k 3.3k 1.7k  795  936 5.6k
landuse                0    0    0    0  548  694 1.6k 6.8k  17k  44k  59k  50k  38k  19k  12k  59k
transportation         0    0    0    0  243  782 1.2k 5.9k   8k  24k  17k  19k  65k  48k  33k  65k
waterway               0    0    0    0  111  118    0    0    0 3.1k 2.4k 2.1k 2.1k 4.9k 2.4k 4.9k
park                   0    0    0    0    0    0   1k 3.7k 9.7k  19k  13k 8.2k 4.3k 3.4k 4.4k  19k
transportation_name    0    0    0    0    0    0  369  464 1.2k 1.8k 5.4k 4.6k 3.9k 3.4k  18k  18k
landcover              0    0    0    0    0    0    0 9.5k  29k  85k  72k  81k  53k  30k  24k  85k
mountain_peak          0    0    0    0    0    0    0 1.1k 1.8k 3.4k 4.3k 2.8k 1.4k 1.4k  869 4.3k
water_name             0    0    0    0    0    0    0    0    0  486  461  433  452 1.2k 1.5k 1.5k
aerodrome_label        0    0    0    0    0    0    0    0    0    0  664  327  273  220  220  664
aeroway                0    0    0    0    0    0    0    0    0    0 1.6k 2.1k   3k 3.4k 2.7k 3.4k
poi                    0    0    0    0    0    0    0    0    0    0    0    0  501  498  83k  83k
building               0    0    0    0    0    0    0    0    0    0    0    0    0  59k  92k  92k
housenumber            0    0    0    0    0    0    0    0    0    0    0    0    0    0  35k  35k
full tile           7.9k   4k 9.5k 6.5k 3.7k   6k  20k  42k  85k 203k 185k 135k 114k 128k 244k 244k
gzipped             6.2k 3.5k 7.1k 5.2k 3.1k 4.8k  14k  29k  60k 149k 138k  98k  83k  91k 154k 154k
0:01:03 DEB [archive] - Max tile: 244k (gzipped: 154k)
0:01:03 DEB [archive] - Avg tile: 5.4k (gzipped: 4k) using weighted average based on OSM traffic
0:01:03 DEB [archive] - # tiles: 4,115,012
0:01:03 DEB [archive] - # features: 5,484,360
0:01:03 INF [archive] - Finished in 18s cpu:1m7s avg:3.6
0:01:03 INF [archive] -   read 1x(3% 0.6s wait:17s done:1s)
0:01:03 INF [archive] -   encode 4x(55% 10s wait:2s done:1s)
0:01:03 INF [archive] -   write 1x(22% 4s wait:12s done:1s)
0:01:03 INF - Finished in 1m3s cpu:3m23s gc:1s avg:3.2
0:01:03 INF - FINISHED!
0:01:03 INF -
0:01:03 INF - ----------------------------------------
0:01:03 INF - data errors:
0:01:03 INF -   render_snap_fix_input 16,639
0:01:03 INF -   osm_multipolygon_missing_way 389
0:01:03 INF -   osm_boundary_missing_way 73
0:01:03 INF -   merge_snap_fix_input 12
0:01:03 INF -   osm_boundary_duplicate_member 2
0:01:03 INF -   feature_centroid_if_convex_osm_invalid_multipolygon_empty_after_fix 2
0:01:03 INF -   feature_polygon_osm_invalid_multipolygon_empty_after_fix 2
0:01:03 INF -   omt_park_area_osm_invalid_multipolygon_empty_after_fix 1
0:01:03 INF -   omt_fix_water_before_ne_intersect 1
0:01:03 INF -   feature_point_on_surface_osm_invalid_multipolygon_empty_after_fix 1
0:01:03 INF - ----------------------------------------
0:01:03 INF - overall 1m3s cpu:3m23s gc:1s avg:3.2
0:01:03 INF - lake_centerlines 2s cpu:5s avg:2.3
0:01:03 INF -   read 1x(20% 0.5s done:2s)
0:01:03 INF -   process 4x(0% 0s done:2s)
0:01:03 INF -   write 1x(0% 0s done:2s)
0:01:03 INF - water_polygons 15s cpu:39s avg:2.7
0:01:03 INF -   read 1x(43% 6s done:7s)
0:01:03 INF -   process 4x(26% 4s wait:4s done:5s)
0:01:03 INF -   write 1x(4% 0.5s wait:9s done:5s)
0:01:03 INF - natural_earth 6s cpu:12s avg:1.9
0:01:03 INF -   read 1x(95% 6s)
0:01:03 INF -   process 4x(13% 0.8s wait:6s)
0:01:03 INF -   write 1x(0% 0s wait:6s)
0:01:03 INF - osm_pass1 2s cpu:7s avg:3.3
0:01:03 INF -   read 1x(2% 0s wait:2s)
0:01:03 INF -   parse 4x(32% 0.6s wait:1s)
0:01:03 INF -   process 1x(70% 1s)
0:01:03 INF - osm_pass2 17s cpu:1m9s avg:3.9
0:01:03 INF -   read 1x(0% 0s wait:10s done:8s)
0:01:03 INF -   process 4x(74% 13s)
0:01:03 INF -   write 1x(2% 0.4s wait:17s)
0:01:03 INF - ne_lakes 0s cpu:0s avg:0
0:01:03 INF - boundaries 0s cpu:0s avg:2.8
0:01:03 INF - agg_stop 0s cpu:0s avg:0
0:01:03 INF - sort 1s cpu:3s avg:2.5
0:01:03 INF -   worker 1x(54% 0.7s)
0:01:03 INF - archive 18s cpu:1m7s avg:3.6
0:01:03 INF -   read 1x(3% 0.6s wait:17s done:1s)
0:01:03 INF -   encode 4x(55% 10s wait:2s done:1s)
0:01:03 INF -   write 1x(22% 4s wait:12s done:1s)
0:01:03 INF - ----------------------------------------
0:01:03 INF - archive 108MB
0:01:03 INF - features 281MB
```

Full logs: https://github.com/onthegomap/planetiler/actions/runs/9189106518

sonarcloud[bot] commented 1 month ago

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
78.6% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud