kraina-ai / quackosm

QuackOSM: an open-source Python and CLI tool for reading OpenStreetMap PBF files using DuckDB
https://kraina-ai.github.io/quackosm/
Apache License 2.0

cli: Add option to export to DuckDB #94

Closed mwip closed 1 month ago

mwip commented 6 months ago

I love your project enabling next-gen OSM analyses with the power of DuckDB. Wouldn't it be cool to be able to specify a DuckDB file (and optionally a table) to store the data in? I know DuckDB has a native OSM reader, but you have abstracted it away into the convenience of Points, Lines and Polygons.

What I think would be a cool possible outcome is, for example:

quackosm monaco-latest.osm.pbf --duckdb monaco_osm.duckdb 'osm'

instead of what's currently possible:

quackosm monaco-latest.osm.pbf
⚫◗ create table osm as
      select feature_id, tags, st_geomfromwkb(geometry) as geometry
      from read_parquet('files/monaco-latest_nofilter_noclip_compact.geoparquet');
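
For reference, the same manual step can be scripted with the DuckDB Python API. This is just a sketch reusing the file and table names from the example above, not something quackosm does for you today:

import duckdb

con = duckdb.connect("monaco_osm.duckdb")
# The spatial extension is needed for ST_GeomFromWKB.
con.install_extension("spatial")
con.load_extension("spatial")
con.execute(
    """
    CREATE TABLE osm AS
    SELECT feature_id, tags, ST_GeomFromWKB(geometry) AS geometry
    FROM read_parquet('files/monaco-latest_nofilter_noclip_compact.geoparquet');
    """
)
con.close()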

If you think this is an interesting idea, I'd love to contribute. Maybe we could discuss possible entry points and where to start.

RaczeQ commented 6 months ago

Hello @mwip 👋🏻 I think that this might be an interesting idea. As you've mentioned, the current and probably the easiest option would be to just run a CREATE TABLE query after the whole process and automate it.

If I were to add such functionality, I'd add new functions internally, next to the existing ones, as well as new DuckDB-oriented counterparts.

For this use case, the DuckDB file could get a generated name if one is not provided, just like the geoparquet file, and a default table name such as osm_features.
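
A possible sketch of that defaulting logic; the naming scheme below is an assumption, and the real generated names in quackosm may differ:

from pathlib import Path

# Default table name when the user does not pass one.
DEFAULT_TABLE_NAME = "osm_features"

def default_duckdb_path(result_geoparquet_path: Path) -> Path:
    # Hypothetical helper: reuse the stem of the generated geoparquet result
    # and only swap the extension, e.g.
    # files/monaco-latest_nofilter_noclip_compact.geoparquet
    #   -> files/monaco-latest_nofilter_noclip_compact.duckdb
    return result_geoparquet_path.with_suffix(".duckdb")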

The easiest way to do this would be to copy the get_features_gdf and get_features_gdf_from_geometry functions: https://github.com/kraina-ai/quackosm/blob/5d19b3c6024a79578d2543db07eadea139ccf81c/quackosm/pbf_file_reader.py#L431 https://github.com/kraina-ai/quackosm/blob/5d19b3c6024a79578d2543db07eadea139ccf81c/quackosm/pbf_file_reader.py#L533

And just replace the GeoParquet-to-GeoDataFrame transformation with a CREATE TABLE query that saves the data in a DuckDB table (and maybe automatically remove the generated geoparquet file after successful table creation).
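
Roughly, such a function could look like the sketch below; convert_geoparquet_to_duckdb is a hypothetical name and not part of the current quackosm API:

import duckdb
from pathlib import Path

def convert_geoparquet_to_duckdb(
    geoparquet_path: Path,
    duckdb_path: Path,
    table_name: str = "osm_features",
    delete_source: bool = True,
) -> Path:
    """Materialize a quackosm geoparquet result as a table in a DuckDB file."""
    con = duckdb.connect(str(duckdb_path))
    try:
        # The spatial extension provides ST_GeomFromWKB.
        con.install_extension("spatial")
        con.load_extension("spatial")
        con.execute(
            f"""
            CREATE TABLE {table_name} AS
            SELECT feature_id, tags, ST_GeomFromWKB(geometry) AS geometry
            FROM read_parquet('{geoparquet_path}')
            """
        )
    finally:
        con.close()
    if delete_source:
        # Remove the intermediate geoparquet file only after the table was created.
        geoparquet_path.unlink()
    return duckdb_path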

Direct streaming of the data to DuckDB could be achieved, but it would require more sophisticated changes to the process (mainly because of the duplicate-row removal when multiple PBF files are parsed at once).

Things to keep in mind:

For the CLI changes, I have a few ideas:

Some additional CLI errors should probably be thrown for illogical combinations of parameters (such as an output file name with the geoparquet extension combined with the --duckdb or --output-table-name flags).
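
A minimal sketch of that validation, assuming the flag names proposed above; the argument names here are placeholders, not the final CLI API:

from pathlib import Path

def validate_duckdb_options(
    output_path: Path | None,
    export_to_duckdb: bool,
    output_table_name: str | None,
) -> None:
    # A table name only makes sense when exporting to DuckDB.
    if output_table_name is not None and not export_to_duckdb:
        raise ValueError("--output-table-name requires --duckdb.")
    # A (geo)parquet output file name contradicts the DuckDB export flags.
    if (
        export_to_duckdb
        and output_path is not None
        and output_path.suffix in (".parquet", ".geoparquet")
    ):
        raise ValueError(
            "Cannot combine a geoparquet output file name with --duckdb / --output-table-name."
        )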

You are welcome to try and implement it; I'll try to give you more hints and help with docs (examples) and tests if needed.

mwip commented 6 months ago

Thanks for considering this and providing the relevant starting points. I am a little busy, but I plan on working on this eventually and providing a PR. If anyone beats me to it, that is fine with me. :smile:

mwip commented 6 months ago

I was finally able to start working on this and hope I'm able to provide a PR soon.