Closed mwip closed 1 month ago
Hello @mwip 👋🏻
I think that this might be an interesting idea. As you've mentioned, the current and probably easiest option would be to just run a `CREATE TABLE` query after the whole process and automate it.
If I were to add such functionality, I'd add new functions internally next to the existing ones:
- `convert_pbf_to_gpq` - takes a downloaded pbf file and returns geoparquet (or parquet in wkt form)
- `convert_geometry_to_gpq` - takes a geometry, downloads pbf file(s) and returns geoparquet (or parquet in wkt form)
- `get_features_gdf` - takes a downloaded pbf file and returns a geodataframe
- `get_features_gdf_from_geometry` - takes a geometry, downloads pbf file(s) and returns a geodataframe

And new ones:

- `convert_pbf_to_duckdb` - takes a downloaded pbf file and saves it in a given duckdb file and table name
- `convert_geometry_to_duckdb` - takes a geometry, downloads pbf file(s) and saves it in a given duckdb file and table name

For this use case, the duckdb file could have a generated name if not provided, just like with the geoparquet file, and have a default table name like `osm_features`.
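The default-naming behaviour described above could be sketched roughly like this (a minimal sketch; the helper name and signature are assumptions for illustration, not quackosm's actual API):

```python
from pathlib import Path
from typing import Optional


def default_duckdb_target(
    pbf_path: str,
    duckdb_path: Optional[str] = None,
    table_name: str = "osm_features",  # default table name from the discussion
) -> tuple[str, str]:
    """Hypothetical helper: derive a DuckDB file name from the PBF file
    when none is given, keeping the default table name."""
    if duckdb_path is None:
        # e.g. "monaco-latest.osm.pbf" -> "monaco-latest.duckdb"
        stem = Path(pbf_path).name
        while "." in stem:
            stem = Path(stem).stem
        duckdb_path = str(Path(pbf_path).with_name(stem + ".duckdb"))
    return duckdb_path, table_name
```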
The easiest way to do this would be to copy the `get_features_gdf` and `get_features_gdf_from_geometry` functions:
https://github.com/kraina-ai/quackosm/blob/5d19b3c6024a79578d2543db07eadea139ccf81c/quackosm/pbf_file_reader.py#L431
https://github.com/kraina-ai/quackosm/blob/5d19b3c6024a79578d2543db07eadea139ccf81c/quackosm/pbf_file_reader.py#L533
And just replace the GeoParquet-to-GeoDataFrame transform with a `CREATE TABLE` query to save it in the DuckDB table (and maybe automatically remove the generated geoparquet file after successful table creation).
The direct streaming of data to DuckDB could be achieved but would require more sophisticated changes in the process (mainly because of the duplicated rows removal when multiple PBF files are parsed at once).
Things to keep in mind:

- the `ignore_cache` flag parameter
- required changes in `PbfFileReader`, the `functions.py` file and the `cli.py` file

For the CLI changes, I have a few ideas:
- a `--duckdb` flag - data saved in a file with a generated name, in the default table name `osm_features`:

```
quackosm monaco-latest.osm.pbf --duckdb
Finished operation in 0:00:00
files/monaco-latest_nofilter_noclip_compact.duckdb::osm_features
```

- a `.duckdb` or `.db` extension on the output file (the `--duckdb` flag would be optional, inferred from the output file extension):

```
quackosm monaco-latest.osm.pbf --output monaco.duckdb
Finished operation in 0:00:00
monaco.duckdb::osm_features
```

- an `--output-table-name` parameter to change the table name:

```
quackosm monaco-latest.osm.pbf --output monaco.duckdb --output-table-name osm
Finished operation in 0:00:00
monaco.duckdb::osm
```
Some additional CLI errors should probably be thrown if there are illogical combinations of parameters (such as an output file name with the geoparquet extension combined with the `--duckdb` or `--output-table-name` flags).
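Those combination checks could be sketched like this (hypothetical parameter names mirroring the flags above; not quackosm's actual CLI code):

```python
from typing import Optional


def resolve_duckdb_output(
    output: Optional[str],
    duckdb_flag: bool = False,
    table_name: Optional[str] = None,
) -> bool:
    """Hypothetical sketch: return True when the run should write to DuckDB,
    raising on the illogical flag combinations discussed above."""
    is_duckdb_output = duckdb_flag or (
        output is not None and output.endswith((".duckdb", ".db"))
    )
    if output is not None and output.endswith((".parquet", ".geoparquet")):
        if duckdb_flag or table_name is not None:
            raise ValueError(
                "--duckdb / --output-table-name conflict with a (Geo)Parquet output file"
            )
    if table_name is not None and not is_duckdb_output:
        raise ValueError("--output-table-name requires a DuckDB output")
    return is_duckdb_output
```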
You are welcome to try and implement it, I'll try to give you more hints and help with docs (examples) and tests if needed.
Thanks for considering this and providing the relevant starting points. I am a little busy, but I plan on working on this eventually and providing a PR. If anyone beats me to it, that is fine with me. :smile:
I was finally able to start working on this and hope I'm able to provide a PR soon.
I love your project enabling next-gen OSM analyses with the power of DuckDB. Wouldn't it be cool to enable specifying a DuckDB file (and optionally a table) to store the data in? I know DuckDB has a native OSM reader, but you have abstracted it to the convenience of Points, Lines and Polygons.
What I think would be a cool possible outcome is, e.g.
instead of what's currently possible:
If you think this is an interesting idea, I'd love to contribute. Maybe we could discuss possible entry points for where to start.