Closed manzt closed 2 months ago
😢
I'd argue this is a DuckDB bug, since DuckDB says it supports Arrow input, and so it should check for these data types. The Arrow PyCapsule Interface actually has a mechanism to fix this: it allows data consumers to check the input schema and ask the producer to cast the data to a desired output schema.
For now, it shouldn't be too hard to manually work around this using pyarrow APIs
All good!
> For now, it shouldn't be too hard to manually work around this using pyarrow APIs
So once we get back the table we should cast some columns?
Yeah, we should be able to check for any string view (probably also binary view, and maybe run-end-array) and cast those to duckdb-supported types. It would be nice if there were a resource in DuckDB that said which Arrow types it supports.
> Yeah, we should be able to check for any string view (probably also binary view, and maybe run-end-array) and cast those to duckdb-supported types.
Ok working on a PR!
Looks like it's coming from here:
Oh so the latest version (maybe unreleased) of duckdb should support it: https://github.com/duckdb/duckdb/blame/4e3a192ce94a793510f11b598805f104d7531c15/src/function/table/arrow.cpp#L88-L89
Actually, it looks like it's here lol https://github.com/duckdb/duckdb/blob/4e3a192ce94a793510f11b598805f104d7531c15/src/function/table/arrow.cpp#L88-L89
Oh... yup lol you beat me to it
I made a little note in my argument for DuckDB to support the interface: https://github.com/duckdb/duckdb/discussions/10716#discussioncomment-10339017
Made a PR, but the implementation isn't working since the string_view -> string cast doesn't seem to be supported. I'll try to dig into it later... but it would suck if we needed to pull the data into a pylist first.
😱 how is string_view -> string not a valid cast??
I would suggest temporarily using arro3 to do the cast, but arrow-rs won't support view types until the next major release (in ~one month)
The other option to support polars specifically for now is to call polars' to_arrow method, which allows Polars to cast to non-view array types if you set the CompatLevel as such.
> 😱 how is string_view -> string not a valid cast??
Lol i know...
> I would suggest temporarily using arro3 to do the cast, but arrow-rs won't support view types until the next major release (in ~one month)
I would defer to you here as the arrow guru :) what do you think would be the best choice for the time being?
I could definitely do the polars patch today... but i am not familiar enough with the arro3 yet :)
Just checking re https://github.com/duckdb/duckdb/discussions/10716#discussioncomment-10339331 you're on the latest DuckDB version?
It turns out that was a simple repro. This fails:
import polars as pl
import duckdb
import pyarrow as pa
df = pl.DataFrame({"a": ["a", "b", "c"]})
table = pa.table(df)
duckdb.from_arrow(table)
---------------------------------------------------------------------------
NotImplementedException Traceback (most recent call last)
File /Users/kyle/github/kylebarron/arro3/tests/core/test_ffi.py:1
----> 1 duckdb.from_arrow(table)
File ~/github/kylebarron/arro3/.venv/lib/python3.11/site-packages/duckdb/__init__.py:505, in from_arrow(arrow_object, **kwargs)
    503 else:
    504     conn = duckdb.connect(":default:")
--> 505 return conn.from_arrow(arrow_object, **kwargs)
NotImplementedException: Not implemented Error: Unsupported Internal Arrow Type vu
With polars 1.4.1, duckdb 1.0.0, pyarrow 17.0.0.
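For reference, the `vu` in that error is the Arrow C Data Interface format string for a UTF-8 view array. Below is a minimal sketch of flagging the view types mentioned earlier in the thread by format string. The helper name `plan_casts` and the choice of fallback types are assumptions for illustration, not DuckDB's actual support list.

```python
# Arrow C Data Interface format strings for the view types discussed here,
# mapped to a non-view fallback. The fallback choices are an assumption made
# for illustration; they are not taken from DuckDB's source.
VIEW_FALLBACKS = {
    "vu": "u",  # utf8 view   -> utf8 string
    "vz": "z",  # binary view -> binary
}

# Run-end encoded arrays use the format string "+r". Their fallback depends on
# the child values type, so it cannot be derived from the format string alone
# and is deliberately left out of the mapping above.


def plan_casts(column_formats):
    """Return {column name: fallback format} for columns that use a view type.

    `column_formats` maps column names to Arrow format strings.
    """
    return {
        name: VIEW_FALLBACKS[fmt]
        for name, fmt in column_formats.items()
        if fmt in VIEW_FALLBACKS
    }


if __name__ == "__main__":
    # "l" is the format string for int64, which needs no cast.
    print(plan_casts({"a": "vu", "b": "l", "c": "vz"}))
```

Columns that do not appear in the result can be passed through unchanged.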
> Just checking re duckdb/duckdb#10716 (reply in thread) you're on the latest DuckDB version?
Yes, and just tried out with nightly 1.0.1.dev4096
> I would defer to you here as the arrow guru :) what do you think would be the best choice for the time being?
I don't think there's a good way today to handle string view -> string data type casting. I suppose the best workaround right now is to hard-code support for polars, check for a polars.DataFrame object, and call its to_arrow() method.
Even though this goes against what I want with the pycapsule interface, which is for consumers to not have to think about where the data is coming from 😛
arro3 isn't capable of this until the next arrow-rs release
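A minimal sketch of that polars special case, assuming polars 1.x, where `DataFrame.to_arrow` accepts a `compat_level` argument, and assuming pyarrow is installed for the conversion. The helper name `to_duckdb_friendly` is hypothetical, and this is not necessarily how the PR implements it.

```python
try:
    import polars as pl
except ImportError:  # polars is an optional dependency in this sketch
    pl = None


def to_duckdb_friendly(data):
    """Convert polars DataFrames to Arrow without view types.

    Any other input is returned unchanged, so non-polars producers keep
    going through whatever Arrow import path the caller already uses.
    """
    if pl is not None and isinstance(data, pl.DataFrame):
        # CompatLevel.oldest() asks polars to export the most widely
        # supported Arrow types, which avoids string_view columns.
        return data.to_arrow(compat_level=pl.CompatLevel.oldest())
    return data
```

The result can then be handed to `duckdb.from_arrow` in place of the original object.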
Done, see #42
> Yes, and just tried out with nightly 1.0.1.dev4096
Oh it works for me with this same nightly
And the upstream was closed for being fixed on latest main https://github.com/duckdb/duckdb/issues/13424
ok maybe I didn't restart my jupyter kernel :(
Yup, can confirm that 1.0.1.dev4096 is working.
I'll revert #42 after the next python release
#23 led to a regression for string data types: DuckDB does not recognize the string_view Arrow data type. I'm going to add some tests to ensure our Arrow/DuckDB connection code works, as we should have end-to-end tests to catch this.
cc: @kylebarron