ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
4.3k stars, 537 forks

bug(geo-duckdb): ibis.read_parquet() binary cast to geometry #9076

Closed ncclementi closed 2 weeks ago

ncclementi commented 2 weeks ago

What happened?

In [1]: import ibis
   ...: from ibis import _
   ...: ibis.options.interactive = True
   ...: 
   ...: parquet = "https://data.source.coop/cboettig/pad-us-3/pad-us3-combined.parquet"
   ...: table = ibis.read_parquet(parquet)

In [2]: t = table.mutate(geo_col=_.geometry.cast("geometry"))
In [3]: t.geo_col
CatalogException: Catalog Error: Scalar Function with name "st_geomfromwkb" is not in the catalog, but it exists in the spatial extension.

Please try installing and loading the spatial extension:
INSTALL spatial;
LOAD spatial;
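
Until the backend loads the extension on its own, one possible workaround is to load it explicitly on the connection before casting. This is only a sketch: it assumes the DuckDB backend's load_extension method and read_parquet on an explicit connection behave as described in the comments. load_geometry_column is a hypothetical helper name, and calling it needs network access to install the extension and read the file.

```python
def load_geometry_column(parquet_url):
    # ibis is imported here so the sketch only needs it when actually called.
    import ibis

    # Use an explicit connection instead of the implicit default backend,
    # so the extension is loaded on the same connection that runs the query.
    con = ibis.duckdb.connect()

    # Assumption: load_extension installs the extension if needed, then loads it.
    con.load_extension("spatial")

    table = con.read_parquet(parquet_url)
    # Same cast as in the report: WKB binary column -> geometry.
    return table.mutate(geo_col=table.geometry.cast("geometry"))
```

Calling load_geometry_column(parquet) with the URL from the report should then give a table whose geo_col can be evaluated without the catalog error, assuming the extension installs cleanly.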

What version of ibis are you using?

main

What backend(s) are you using, if any?

duckdb

Relevant log output

No response

ncclementi commented 2 weeks ago

@cpcloud you mentioned that the to_sqlglot method in the compiler would be a good place to add the logic that loads the spatial extension, but the duckdb compiler doesn't have that method. In the base compiler, under backends/sql/compiler.py, I only see this:

https://github.com/ibis-project/ibis/blob/1926eb40ae4a8aadb4899645745145247636056c/ibis/backends/sql/compiler.py#L644-L648

Ideally, I think we should be able to load the spatial extension whenever the cast is to a geometry type.

The magic happens here, but at this point there is no connection (con) available to load the extension.

https://github.com/ibis-project/ibis/blob/1926eb40ae4a8aadb4899645745145247636056c/ibis/backends/duckdb/compiler.py#L340-L341

cpcloud commented 2 weeks ago

We definitely can't and shouldn't load extensions in the compiler. That would mean you could never use the compiler without having already connected to DuckDB.

This has to happen in the backend, just before execution.

You can do this in a _to_sqlglot override in the duckdb Backend class:

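def _to_sqlglot(self, expr, **kwargs):
    # Sketch only: the real signature may take more explicit parameters.
    schema = expr.as_table().schema()
    if any(dtype.is_geospatial() for dtype in schema.types):
        # DuckDB's extension is named "spatial", per the error message above.
        self.load_extensions(["spatial"])  # don't remember the exact API

    return super()._to_sqlglot(expr, **kwargs)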
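
To make the division of responsibilities concrete, here is a toy, dependency-free sketch of the pattern described above. None of these classes are ibis classes: ToyCompiler, ToyBaseBackend, and ToyDuckDBBackend are hypothetical stand-ins, the schema is a plain dict, and "loading" an extension just records its name. The point is only that the compiler never touches a connection, while the backend decides to load the extension right before compilation and execution.

```python
class ToyCompiler:
    """Stand-in for a SQL compiler: pure, never touches a connection."""

    def compile(self, schema):
        # Wrap geometry columns in the function that needs the extension.
        cols = ", ".join(
            f"ST_GeomFromWKB({name})" if dtype == "geometry" else name
            for name, dtype in schema.items()
        )
        return f"SELECT {cols} FROM t"


class ToyBaseBackend:
    compiler = ToyCompiler()

    def _to_sqlglot(self, schema):
        return self.compiler.compile(schema)


class ToyDuckDBBackend(ToyBaseBackend):
    def __init__(self):
        # Records extension loads, in place of a real DuckDB connection.
        self.loaded = []

    def load_extension(self, name):
        if name not in self.loaded:
            self.loaded.append(name)

    def _to_sqlglot(self, schema):
        # Load the extension only when the expression involves geometry.
        if any(dtype == "geometry" for dtype in schema.values()):
            self.load_extension("spatial")
        return super()._to_sqlglot(schema)


backend = ToyDuckDBBackend()

plain_sql = backend._to_sqlglot({"id": "int64"})
loaded_after_plain = list(backend.loaded)

geo_sql = backend._to_sqlglot({"id": "int64", "geo_col": "geometry"})
loaded_after_geo = list(backend.loaded)

# The compiler is still usable on its own, with no backend or connection.
compiler_only_sql = ToyCompiler().compile({"geo_col": "geometry"})
```

In this sketch a query without geometry columns leaves backend.loaded empty, and the extension is recorded only once a geometry column shows up. The compiler can still be used with no backend at all, which is the property the compiler-side approach would have broken.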