vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.28k stars, 590 forks

[BUG-REPORT] DataFrame describe method fails for List elements #2108

Open ghost opened 2 years ago

ghost commented 2 years ago

Following up on https://github.com/vaexio/vaex/issues/2087#issuecomment-1163799755, the vaex.dataframe.DataFrameLocal.describe() method does not work for list types. It seems those types are not included in map_arrow_to_numpy.

import pyarrow as pa
import vaex

data = {"A": [1], "B": pa.array([["a", "b", "c"]])}
df = vaex.from_dict(data)
df.describe()

Traceback (most recent call last):
  File "Mac/python3.9/site-packages/vaex/array_types.py", line 327, in numpy_dtype_from_arrow_type
    return map_arrow_to_numpy[arrow_type]
KeyError: ListType(list<item: int64>)
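
The KeyError above comes from a plain dictionary lookup on a type that has no entry. As a rough illustration only, the sketch below shows how columns could be partitioned into supported and unsupported ones before statistics are computed, which is what pandas effectively does when it drops column B in a mixed frame. The table `TYPE_TO_DTYPE` and the function `split_columns` are hypothetical stand-ins and are not part of vaex; the keys are plain strings rather than Arrow type objects.

```python
# Hypothetical stand-in for a type -> NumPy dtype lookup table.
# This is NOT vaex's real map_arrow_to_numpy; keys are plain strings for illustration.
TYPE_TO_DTYPE = {"int64": "i8", "float64": "f8", "bool": "?"}


def split_columns(schema, type_map=TYPE_TO_DTYPE):
    """Partition column names into (supported, unsupported) by type lookup.

    `schema` maps column name -> type name. A plain `type_map[type_name]`
    raises KeyError for unmapped types; the membership test avoids that.
    """
    supported, unsupported = [], []
    for name, type_name in schema.items():
        if type_name in type_map:
            supported.append(name)
        else:
            unsupported.append(name)
    return supported, unsupported


schema = {"A": "int64", "B": "list<item: string>"}
supported, unsupported = split_columns(schema)
print(supported)    # ['A']
print(unsupported)  # ['B']
```

With a partition like this, the numeric statistics could be computed for the supported columns only, while list columns are either skipped or given a reduced summary.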

maartenbreddels commented 2 years ago

What do you expect to see as output, by the way? Just the count and missing values, i.e. no statistics?

ghost commented 2 years ago

This is the behavior that I would expect.

>>> import pandas as pd
>>> data = {"A": [1], "B": [["a", "b", "c"]]}
>>> pd.DataFrame(data).describe()
         A
count  1.0
mean   1.0
std    NaN
min    1.0
25%    1.0
50%    1.0
75%    1.0
max    1.0
>>> import pandas as pd
>>> data = {"A": [[1, 2, 3]], "B": [["a", "b", "c"]]}
>>> pd.DataFrame(data).describe()
                A          B
count           1          1
unique          1          1
top     [1, 2, 3]  [a, b, c]
freq            1          1
>>> data = {"B": [["a", "b", "c"]]}
>>> pd.DataFrame(data).describe()
                B
count           1
unique          1
top     [a, b, c]
freq            1

JovanVeljanoski commented 2 years ago

This will add significant overhead to describe. If you look at what describe currently outputs, the "count" field is the only one the two have in common.

I once had the idea of giving describe additional arguments, so a user can specify whether they want the n_unique elements and, as you suggest, the most_frequent and freq values. These would be disabled by default.
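
To make the opt-in idea concrete, here is a minimal pure-Python sketch, not vaex code. The function `describe_column` and its keyword flags are hypothetical names chosen to mirror the n_unique, most_frequent and freq fields mentioned above. It assumes flat lists only, since elements are converted to tuples so they can be hashed.

```python
from collections import Counter


def describe_column(values, n_unique=False, most_frequent=False):
    """Summarize one column; only count/missing are computed by default.

    Hypothetical helper for illustration; the flags are assumptions, not
    an existing vaex API.
    """
    def key(v):
        # Lists are unhashable, so flat lists are compared as tuples.
        return tuple(v) if isinstance(v, list) else v

    present = [v for v in values if v is not None]
    stats = {"count": len(present), "missing": len(values) - len(present)}

    # The extra pass over the data only happens when explicitly requested.
    if n_unique or most_frequent:
        counts = Counter(key(v) for v in present)
        if n_unique:
            stats["n_unique"] = len(counts)
        if most_frequent and counts:
            top, freq = counts.most_common(1)[0]
            stats["most_frequent"] = top  # list elements come back as tuples
            stats["freq"] = freq
    return stats


column = [["a", "b", "c"], ["a", "b", "c"], ["x"], None]
print(describe_column(column))
print(describe_column(column, n_unique=True, most_frequent=True))
```

Keeping the expensive statistics behind flags means the default call costs no more than it does today, which addresses the overhead concern.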

I would be happy with that, even outside the context of lists. It would take some time and effort, and I'm not sure how popular describe is.

In any case, feel free to open a PR on this!