vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] DataFrame.astype(pyarrow.DictionaryArray(...)) is broken #2187

Open NickCrews opened 2 years ago

NickCrews commented 2 years ago

Description

I was trying to cast a DataFrame to a pyarrow schema and ran across this issue. DataFrame[column].astype(pyarrow_type) works fine for many pyarrow types, such as float, string, and bool, but it doesn't work for pyarrow.lib.DictionaryType. However, if I use vaex's "wrapped" version of the type, it works fine. I didn't explore other dtypes, but perhaps this also reveals a problem with other complex pyarrow types?

See the xfail-ing test PR

(PS: is there hope for a future DataFrame.astype() similar to pandas? I'm writing this myself and it feels like I'm reinventing the wheel.)

Software information

maartenbreddels commented 2 years ago

I like the idea of having a dtype for the dataframe. We already kind of do checks for that in DataFrame.__array__, and we may possibly have some code for this in https://github.com/vaexio/vaex/pull/415 (where we want to know if a dataframe is of a homogeneous type, so it can be treated as a 'matrix').

What should astype(some_dict_type) do actually?