manzt / quak

a scalable data profiler
https://manzt.github.io/quak/
MIT License
240 stars 11 forks source link

Arrow `string_view` raises DuckDB `NotImplementedException` #41

Closed manzt closed 2 months ago

manzt commented 2 months ago

23 led to a regression for string data types, specifically with the string_view Arrow data type not being recognized by DuckDB. I'm going to add some tests to ensure our Arrow/DuckDB connection code works, as we should have end-to-end tests to catch this.

cc: @kylebarron

import polars as pl
import pyarrow as pa
import quak

movies = pl.read_json("movies.json")
print(pa.table(movies).schema)
quak.Widget(movies.select("Director"))
Title: string_view
US Gross: int64
Worldwide Gross: int64
US DVD Sales: int64
Production Budget: int64
Release Date: string_view
MPAA Rating: string_view
Running Time min: int64
Distributor: string_view
Source: string_view
Major Genre: string_view
Creative Type: string_view
Director: string_view
Rotten Tomatoes Rating: int64
IMDB Rating: double
IMDB Votes: int64
Director: string_view
---------------------------------------------------------------------------
NotImplementedException                   Traceback (most recent call last)
Cell In[9], line 7
      5 movies = pl.read_json("movies.json")
      6 print(pa.table(movies).schema)
----> 7 quak.Widget(movies.select("Director"))

File ~/github/manzt/quak/src/quak/_widget.py:60, in Widget.__init__(self, data, table)
     55     else:
     56         raise ValueError(
     57             "input must be a DuckDB connection, DataFrame-like, an Arrow IPC "
     58             "table, or an Arrow object exporting the Arrow C Stream interface."
     59         )
---> 60     conn.register(table, arrow_table)
     61 self._conn = conn
     62 super().__init__(
     63     _table_name=table,
     64     _columns=get_columns(conn, table),
     65     temp_indexes=True,
     66     sql=f'SELECT * FROM "{table}"',
     67 )

NotImplementedException: Not implemented Error: Unsupported Internal Arrow Type vu
kylebarron commented 2 months ago

😢

I'd argue this is a DuckDB bug, since DuckDB says it supports Arrow input, and so it should check for these data types. The Arrow PyCapsule Interface actually has a mechanism to fix this: it allows data consumers to check the input schema and ask the producer to cast the data to a desired output schema.

For now, it shouldn't be too hard to manually work around this using pyarrow APIs

manzt commented 2 months ago

All good!

For now, it shouldn't be too hard to manually work around this using pyarrow APIs

So once we get back the table we should cast some columns?

kylebarron commented 2 months ago

Yeah, we should be able to check for any string view (probably also binary view, and maybe run-end-array) and cast those to duckdb-supported types. It would be nice if there were a resource in DuckDB that said which Arrow types it supports.

manzt commented 2 months ago

Yeah, we should be able to check for any string view (probably also binary view, and maybe run-end-array) and cast those to duckdb-supported types.

Ok working on a PR!

manzt commented 2 months ago

Looks like it's coming from here:

https://github.com/duckdb/duckdb/blob/4e3a192ce94a793510f11b598805f104d7531c15/src/function/table/arrow.cpp#L80

kylebarron commented 2 months ago

Oh so the latest version (maybe unreleased) of duckdb should support it: https://github.com/duckdb/duckdb/blame/4e3a192ce94a793510f11b598805f104d7531c15/src/function/table/arrow.cpp#L88-L89

manzt commented 2 months ago

Acually, it looks like it's here lol https://github.com/duckdb/duckdb/blob/4e3a192ce94a793510f11b598805f104d7531c15/src/function/table/arrow.cpp#L88-L89

manzt commented 2 months ago

Oh... yup lol you beat me to it

kylebarron commented 2 months ago

I made a little note in my argument for DuckDB to support the interface: https://github.com/duckdb/duckdb/discussions/10716#discussioncomment-10339017

manzt commented 2 months ago

Made a PR, but the implementation isn't working since the string_view -> string cast doesnt' seem to be supported. I'll try to dig into it later... but it would suck if we needed to pull the data into a pylist first.

kylebarron commented 2 months ago

😱 how is string_view -> string not a valid cast??

I would suggest temporarily using arro3 to do the cast, but arrow-rs won't support view types until the next major release (in ~one month)

The other option to support polars specifically for now is to call polars' to_arrow method, which allows Polars to cast to non-view array types if you set the CompatLevel as such.

manzt commented 2 months ago

😱 how is string_view -> string not a valid cast??

Lol i know...

I would suggest temporarily using arro3 to do the cast, but arrow-rs won't support view types until the next major release (in ~one month)

I would defer to you here as the arrow guru :) what do you think would be the best choice for the time being?

I could definitely do the polars patch today... but i am not familiar enough with the arro3 yet :)

kylebarron commented 2 months ago

Just checking re https://github.com/duckdb/duckdb/discussions/10716#discussioncomment-10339331 you're on the latest DuckDB version?

kylebarron commented 2 months ago

It turns out that was a simple repro. This fails:

import polars as pl
import duckdb
import pyarrow as pa

df = pl.DataFrame({"a": ["a", "b", "c"]})
table = pa.table(df)
duckdb.from_arrow(table)
---------------------------------------------------------------------------
NotImplementedException                   Traceback (most recent call last)
File /Users/kyle/github/kylebarron/arro3/tests/core/test_ffi.py:[1](https://file+.vscode-resource.vscode-cdn.net/Users/kyle/github/kylebarron/arro3/tests/core/test_ffi.py:1)
----> 1 duckdb.from_arrow(table)

File ~/github/kylebarron/arro3/.venv/lib/python3.11/site-packages/duckdb/__init__.py:505, in from_arrow(arrow_object, **kwargs)
    [503](https://file+.vscode-resource.vscode-cdn.net/Users/kyle/github/kylebarron/arro3/~/github/kylebarron/arro3/.venv/lib/python3.11/site-packages/duckdb/__init__.py:503) else:
    [504](https://file+.vscode-resource.vscode-cdn.net/Users/kyle/github/kylebarron/arro3/~/github/kylebarron/arro3/.venv/lib/python3.11/site-packages/duckdb/__init__.py:504)     conn = duckdb.connect(":default:")
--> [505](https://file+.vscode-resource.vscode-cdn.net/Users/kyle/github/kylebarron/arro3/~/github/kylebarron/arro3/.venv/lib/python3.11/site-packages/duckdb/__init__.py:505) return conn.from_arrow(arrow_object, **kwargs)

NotImplementedException: Not implemented Error: Unsupported Internal Arrow Type vu

With polars 1.4.1, duckdb 1.0.0, pyarrow 17.0.0.

manzt commented 2 months ago

Just checking re duckdb/duckdb#10716 (reply in thread) you're on the latest DuckDB version?

Yes, and just tried out with nightly 1.0.1.dev4096

kylebarron commented 2 months ago

I would defer to you here as the arrow guru :) what do you think would be the best choice for the time being?

I don't think there's a good way today to handle string view -> string data type casting. I suppose the best workaround right now is to hard-code support for polars, check for a polars.DataFrame object, and call its to_arrow() method.

Even though this goes against what I want with the pycapsule interface, which is for consumers to not have to think about where the data is coming from 😛

arro3 isn't capable of this until the next arrow-rs release

manzt commented 2 months ago

Done, see #42

kylebarron commented 2 months ago

Yes, and just tried out with nightly 1.0.1.dev4096

Oh it works for me with this same nightly

image image

And the upstream was closed for being fixed on latest main https://github.com/duckdb/duckdb/issues/13424

manzt commented 2 months ago

ok maybe I didn't restart my jupyter kernel :(

manzt commented 2 months ago

Yup, can confirm that 1.0.1.dev4096 is working:

image

I'll revert #42 after the next python release