kraina-ai / quackosm

QuackOSM: an open-source Python and CLI tool for reading OpenStreetMap PBF files using DuckDB
https://kraina-ai.github.io/quackosm/
Apache License 2.0

cli: Add option to export to DuckDB #94

Closed mwip closed 1 month ago

mwip commented 6 months ago

I love your project enabling next-gen OSM analyses with the power of DuckDB. Wouldn't it be cool to be able to specify a DuckDB file (and optionally a table) to store the data in? I know DuckDB has a native OSM reader, but you have abstracted it away into the convenience of Points, Lines and Polygons.

What I think would be a cool possible outcome is, for example:

quackosm monaco-latest.osm.pbf --duckdb monaco_osm.duckdb 'osm'

instead of what's currently possible:

quackosm monaco-latest.osm.pbf
⚫◗ create table osm as
      select feature_id, tags, st_geomfromwkb(geometry) as geometry
      from read_parquet('files/monaco-latest_nofilter_noclip_compact.geoparquet');
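
For reference, the same manual step can be scripted with the DuckDB Python API. This is just a sketch reusing the file and table names from the example above, not something quackosm does for you today:

import duckdb

con = duckdb.connect("monaco_osm.duckdb")
# The spatial extension is needed for ST_GeomFromWKB.
con.install_extension("spatial")
con.load_extension("spatial")
con.execute(
    """
    CREATE TABLE osm AS
    SELECT feature_id, tags, ST_GeomFromWKB(geometry) AS geometry
    FROM read_parquet('files/monaco-latest_nofilter_noclip_compact.geoparquet');
    """
)
con.close()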

If you think this is an interesting idea, I'd love to contribute. Maybe we could discuss possible entry points and where to start.

RaczeQ commented 6 months ago

Hello @mwip 👋🏻 I think that this might be an interesting idea. As you've mentioned, the current and probably the easiest option would be to just run a CREATE TABLE query after the whole process and automate it.

If I were to add such functionality, I'd add new functions internally, next to the existing ones, as well as new DuckDB-oriented counterparts.

For this use case, the DuckDB file could get a generated name if one is not provided, just like the geoparquet file, and a default table name such as osm_features.
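
A possible sketch of that defaulting logic; the naming scheme below is an assumption, and the real generated names in quackosm may differ:

from pathlib import Path

# Default table name when the user does not pass one.
DEFAULT_TABLE_NAME = "osm_features"

def default_duckdb_path(result_geoparquet_path: Path) -> Path:
    # Hypothetical helper: reuse the stem of the generated geoparquet result
    # and only swap the extension, e.g.
    # files/monaco-latest_nofilter_noclip_compact.geoparquet
    #   -> files/monaco-latest_nofilter_noclip_compact.duckdb
    return result_geoparquet_path.with_suffix(".duckdb")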

The easiest way to do this would be to copy the get_features_gdf and get_features_gdf_from_geometry functions: https://github.com/kraina-ai/quackosm/blob/5d19b3c6024a79578d2543db07eadea139ccf81c/quackosm/pbf_file_reader.py#L431 https://github.com/kraina-ai/quackosm/blob/5d19b3c6024a79578d2543db07eadea139ccf81c/quackosm/pbf_file_reader.py#L533

And just replace the GeoParquet-to-GeoDataFrame transformation with a CREATE TABLE query that saves the data in a DuckDB table (and maybe automatically remove the generated geoparquet file after successful table creation).
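
Roughly, such a function could look like the sketch below; convert_geoparquet_to_duckdb is a hypothetical name and not part of the current quackosm API:

import duckdb
from pathlib import Path

def convert_geoparquet_to_duckdb(
    geoparquet_path: Path,
    duckdb_path: Path,
    table_name: str = "osm_features",
    delete_source: bool = True,
) -> Path:
    """Materialize a quackosm geoparquet result as a table in a DuckDB file."""
    con = duckdb.connect(str(duckdb_path))
    try:
        # The spatial extension provides ST_GeomFromWKB.
        con.install_extension("spatial")
        con.load_extension("spatial")
        con.execute(
            f"""
            CREATE TABLE {table_name} AS
            SELECT feature_id, tags, ST_GeomFromWKB(geometry) AS geometry
            FROM read_parquet('{geoparquet_path}')
            """
        )
    finally:
        con.close()
    if delete_source:
        # Remove the intermediate geoparquet file only after the table was created.
        geoparquet_path.unlink()
    return duckdb_path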

Direct streaming of the data to DuckDB could be achieved, but it would require more sophisticated changes to the process (mainly because of the duplicate-row removal when multiple PBF files are parsed at once).

Things to keep in mind:

For the CLI changes, I have a few ideas:

Some additional CLI errors should probably be thrown for illogical combinations of parameters (such as an output file name with the geoparquet extension combined with the --duckdb or --output-table-name flags).
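
A minimal sketch of that validation, assuming the flag names proposed above; the argument names here are placeholders, not the final CLI API:

from pathlib import Path

def validate_duckdb_options(
    output_path: Path | None,
    export_to_duckdb: bool,
    output_table_name: str | None,
) -> None:
    # A table name only makes sense when exporting to DuckDB.
    if output_table_name is not None and not export_to_duckdb:
        raise ValueError("--output-table-name requires --duckdb.")
    # A (geo)parquet output file name contradicts the DuckDB export flags.
    if (
        export_to_duckdb
        and output_path is not None
        and output_path.suffix in (".parquet", ".geoparquet")
    ):
        raise ValueError(
            "Cannot combine a geoparquet output file name with --duckdb / --output-table-name."
        )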

You are welcome to try and implement it; I'll try to give you more hints and help with docs (examples) and tests if needed.

mwip commented 6 months ago

Thanks for considering this and providing the relevant starting points. I am a little busy, but I plan on working on this eventually and providing a PR. If anyone beats me to it, that is fine with me. :smile:

mwip commented 6 months ago

I was finally able to start working on this and hope I'm able to provide a PR soon.