kraina-ai / quackosm

QuackOSM: an open-source Python and CLI tool for reading OpenStreetMap PBF files using DuckDB
https://kraina-ai.github.io/quackosm/
Apache License 2.0

Avoid downloading neighboring geometries #110

Closed. shishkin closed this issue 3 months ago.

shishkin commented 6 months ago

When I geocode Monaco to get a geometry and then use that geometry to download and convert OSM data into Parquet, QuackOSM downloads the 346 MB file files/Geofabrik_provence-alpes-cote-d-azur.osm.pbf instead of the 527 KB Monaco PBF.

I also tried the same with Regierungsbezirk Düsseldorf. QuackOSM downloads the neighboring Münster and Köln extracts as well. That is almost 325 MB on top of the 190 MB I asked for.

When downloading Germany, Quackosm also downloads Denmark, Austria, and Czechia.

Is there a way to avoid downloading unneeded OSM files?

RaczeQ commented 6 months ago

Hello @shishkin. For now, this is the expected behaviour, because QuackOSM tries to cover the given geometry fully, and extract geometries do not always line up perfectly with geocoded ones.

Monaco

I've plotted the geocoded geometry for the query Monaco (https://www.openstreetmap.org/relation/1124039) in yellow, and the Geofabrik extract geometry for Monaco (http://download.geofabrik.de/europe/monaco.html) in red.

image

As you can see, Nominatim returns a huge chunk of sea area that isn't covered by the Geofabrik extract.

But changing the PBF source from geofabrik to osmfr will solve the issue for your use case:

image

import quackosm as qosm
import osmnx as ox

qosm.convert_geometry_to_geodataframe(
    geometry_filter=ox.geocode_to_gdf("Monaco").unary_union, osm_extract_source="osmfr"
)

quackosm --geom-filter-geocode Monaco --osm-extract-source osmfr

Düsseldorf

image

Here, switching to the osmfr source can also help.

Germany

image

Again, the osmfr source can also help.

Summary

By default, QuackOSM uses only Geofabrik extracts, because scraping the BBBike and OSMfr indexes takes a long time, but these services may contain better-matching geometries for particular use cases. Geofabrik also has better coverage of the whole world than OpenStreetMap.fr, but its extracts don't have enough buffer around them to fully cover Nominatim-based geometries.

Looking at those examples, I think I can fix the Germany and Düsseldorf cases for the default Geofabrik source by discarding new extracts whose contribution to the overall geometry is insignificant (for example, less than 1% of the queried geometry).
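The idea of discarding extracts with an insignificant contribution can be sketched as follows. This is only an illustration, not QuackOSM's implementation: it assumes axis-aligned boxes instead of real polygons, and the names `Box`, `significant_extracts`, `min_share`, and the two example extracts are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle standing in for a real polygon (assumption)."""

    minx: float
    miny: float
    maxx: float
    maxy: float

    @property
    def area(self) -> float:
        return max(0.0, self.maxx - self.minx) * max(0.0, self.maxy - self.miny)


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two boxes (0.0 when they do not overlap)."""
    width = min(a.maxx, b.maxx) - max(a.minx, b.minx)
    height = min(a.maxy, b.maxy) - max(a.miny, b.miny)
    return max(0.0, width) * max(0.0, height)


def significant_extracts(
    query: Box, extracts: dict[str, Box], min_share: float = 0.01
) -> list[str]:
    """Keep only extracts that cover at least `min_share` of the query area."""
    return [
        name
        for name, box in extracts.items()
        if intersection_area(query, box) / query.area >= min_share
    ]


# Hypothetical layout: one main extract plus a neighbor that only clips the border.
query = Box(0.0, 0.0, 10.0, 10.0)
extracts = {
    "main_region": Box(0.0, 0.0, 9.95, 10.0),
    "neighbor": Box(9.95, 0.0, 20.0, 10.0),
}
kept = significant_extracts(query, extracts)
print(kept)
```

Here the neighbor covers only about 0.5% of the query box, so it falls under the 1% threshold and is dropped. With real geometries the same ratio would be computed on polygon intersections rather than boxes.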

OSM_fr index - better precision in particular areas, but some gaps outside Europe

image

Geofabrik index - more uniform coverage

image

shishkin commented 6 months ago

I see. I'm actually confused by what you call "Nominatim-based geometries". Aren't all geometries coming from OSM unchanged, where Nominatim is a search index and Geofabrik, osmfr and others just repackage the same OSM world.pbf in smaller pieces? I get that the nature of boundaries is very complicated, but so far Geofabrik's slicing seems quite practical. I would actually even prefer to just specify the names of Geofabrik extracts directly (like duesseldorf-regbez) in order to reuse quackosm's caching.

RaczeQ commented 6 months ago

Nominatim can be a source of truth, but each of those services can define its own geometries and names. BBBike, for example, serves rectangular extracts around cities that are detached from administrative geometries.

I've added two issues to tackle the problems mentioned here:

RaczeQ commented 3 months ago

@shishkin I have finally had time to implement the changes suggested in this issue.

I've refactored the whole mechanism for covering the geometry with extracts and I also changed the default extract source from Geofabrik to any (Geofabrik + OpenStreetMap France + BBBike), so more extracts are now available from the start.

Searching for OSM extracts using a text query has also been implemented.

Monaco example - geometry

quackosm --geom-filter-geocode Monaco
files/251dc266735356127b3d8a1a13af7c4472f375263a64ac31b2fabb7ccfb11b3e_nofilter_compact.parquet # only single file downloaded

Monaco example - osm extract

quackosm --osm-extract-query Monaco
Multiple extracts matched by query "Monaco".
Matching extracts full names: "geofabrik_europe_monaco", "osmfr_europe_monaco". # error - multiple extracts found

quackosm --osm-extract-query Monaco --osm-extract-source Geofabrik
files/geofabrik_europe_monaco_nofilter_noclip_compact.parquet

Düsseldorf example - geometry

quackosm --geom-filter-geocode "Regierungsbezirk Düsseldorf"
/mnt/c/Development/Python/quackosm/quackosm/osm_extracts/__init__.py:588: GeometryNotCoveredWarning: Skipping extract because of low IoU value (bbbike_koeln, 0.000366).
  warnings.warn(
/mnt/c/Development/Python/quackosm/quackosm/osm_extracts/__init__.py:588: GeometryNotCoveredWarning: Skipping extract because of low IoU value (bbbike_moenchengladbach, 3.31e-06).
  warnings.warn(
/mnt/c/Development/Python/quackosm/quackosm/osm_extracts/__init__.py:588: GeometryNotCoveredWarning: Skipping extract because of low IoU value (bbbike_bochum, 5.15e-07).
  warnings.warn(
100%|███████████████████████████████████████| 200M/200M [00:00<00:00, 1.03TB/s]
files/5edc8fe03eef110c6f10cc8d2ccde5f6aa48203f60adccbbba5e31c9dc6b25ed_nofilter_compact.parquet
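The IoU (intersection over union) values in the warnings above measure how well an extract overlaps the queried geometry. A minimal sketch of the metric, assuming axis-aligned rectangles rather than real polygons; the coordinates and names below are hypothetical, and the threshold QuackOSM applies is not stated in this thread:

```python
Bounds = tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def area(b: Bounds) -> float:
    """Area of a rectangle; degenerate or inverted bounds count as 0.0."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a: Bounds, b: Bounds) -> float:
    """Intersection over union of two axis-aligned rectangles."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter = area(overlap)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


region = (0.0, 0.0, 100.0, 100.0)        # hypothetical queried region
city_extract = (99.0, 0.0, 109.0, 10.0)  # hypothetical city box touching the edge

score = iou(region, city_extract)
print(f"{score:.6f}")
```

A small city extract that only grazes the border of a large region shares very little area with it relative to their union, which is why extracts such as bbbike_koeln end up with values far below 1 and get skipped.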

Düsseldorf example - osm extract

quackosm --osm-extract-query Düsseldorf
Zero extracts matched by query "Düsseldorf".
Found full names close to query: "osmfr_europe_germany_nordrhein_westfalen_dusseldorf", "bbbike_duesseldorf". # error - no exact match, multiple close names found
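The two outcomes shown in these examples (several exact matches for Monaco, no exact match but close names for Düsseldorf) can be illustrated with a small sketch. It is an assumption about how such matching could work, built on the standard-library `difflib`, and not QuackOSM's actual code; `match_extracts`, `cutoff`, and the name list are hypothetical.

```python
import difflib

# Hypothetical index of full extract names, taken from the outputs in this thread.
EXTRACT_NAMES = [
    "geofabrik_europe_monaco",
    "osmfr_europe_monaco",
    "osmfr_europe_germany_nordrhein_westfalen_dusseldorf",
    "bbbike_duesseldorf",
]


def match_extracts(
    query: str, names: list[str], cutoff: float = 0.6
) -> tuple[list[str], list[str]]:
    """Return (exact matches, close suggestions) for a text query."""
    needle = query.lower()
    # Exact match: the query equals the full name or its last "_"-separated part.
    exact = [n for n in names if needle in (n, n.split("_")[-1])]
    if exact:
        return exact, []
    # Otherwise suggest names whose last part is similar enough to the query.
    close = [
        n
        for n in names
        if difflib.SequenceMatcher(None, needle, n.split("_")[-1]).ratio() >= cutoff
    ]
    return exact, close


print(match_extracts("Monaco", EXTRACT_NAMES))
print(match_extracts("Düsseldorf", EXTRACT_NAMES))
```

When more than one exact match or only close suggestions come back, the caller can report them and ask for a full name or a specific source, as in the commands shown here.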

quackosm --osm-extract-query osmfr_europe_germany_nordrhein_westfalen_dusseldorf # running again with full name
files/osmfr_europe_germany_nordrhein_westfalen_dusseldorf_nofilter_noclip_compact.parquet

shishkin commented 3 months ago

@RaczeQ That's awesome! Thanks a lot. Looking forward to testing the new version.