intake / akimbo

For when your data won't fit in your dataframe
https://akimbo.readthedocs.io
BSD 3-Clause "New" or "Revised" License

pandas nested data infeasible operations #17

Open mroeschke opened 1 year ago

mroeschke commented 1 year ago

pandas generally discourages having nested data in a DataFrame or Series. For nested data in pandas, I tend to group the types of nested data as:

  1. Array-like (N-D)
    
    In [4]: nested_array_like = pd.Series([[1, 2], [2, 3]])

    In [5]: nested_array_like
    Out[5]:
    0    [1, 2]
    1    [2, 3]
    dtype: object


The only behavior somewhat defined and tested for array-like data (Python `list` specifically) is addition, which concatenates the lists row by row (acting like an `append`)

    In [14]: nested_array_like + nested_array_like
    Out[14]:
    0    [1, 2, 1, 2]
    1    [2, 3, 2, 3]
    dtype: object


And there is `explode` which encourages users to `unnest` their data

    In [15]: nested_array_like.explode()
    Out[15]:
    0    1
    0    2
    1    2
    1    3
    dtype: object


Some operations I have seen users try with array-like data are:

* `groupby` the array-like values
* element-wise operations (e.g. add 2 to each element in the array)
* reduction-wise operations per array-like value (e.g. `sum` each array)
* indexing/selecting/slicing the array-like values
* containment operations (e.g. 2 in each array -> True/False)
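For illustration, here is a minimal sketch of how several of these operations are typically attempted today in plain pandas with object dtype. It is not an akimbo API and not a recommended approach; every step falls back to a Python-level loop via `Series.map`, which is part of why these operations feel awkward. The variable names (`plus_two`, `sums`, `firsts`, `contains_two`, `sums_via_explode`) are made up for this example.

```python
import pandas as pd

nested_array_like = pd.Series([[1, 2], [2, 3]])

# Element-wise operation: add 2 to each element of each list.
plus_two = nested_array_like.map(lambda xs: [x + 2 for x in xs])

# Reduction per array-like value: sum each list.
sums = nested_array_like.map(sum)

# Indexing: take the first element of each list.
firsts = nested_array_like.map(lambda xs: xs[0])

# Containment: is 2 in each list?
contains_two = nested_array_like.map(lambda xs: 2 in xs)

# The same reduction done by unnesting first: explode, then regroup
# on the original index (explode repeats the index for each element).
exploded = nested_array_like.explode().astype("int64")
sums_via_explode = exploded.groupby(level=0).sum()
```

Grouping by the array-like values themselves is left out of the sketch because Python lists are not hashable, so they cannot be used as group keys without first converting them to something like tuples.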

2. Key-Value-like

    In [6]: nested_kv_like = pd.Series([{1: 2}, {2: 3}])

    In [7]: nested_kv_like
    Out[7]:
    0    {1: 2}
    1    {2: 3}
    dtype: object


The only behavior supported and tested for dict-like is `dict.get` via the `str` accessor (which is somewhat strange IMO)

    In [13]: nested_kv_like.str.get(1)
    Out[13]:
    0    2.0
    1    NaN
    dtype: float64



I've seen users try many of the operations described above with key-value-like data as well, except specifically treating the `keys` or `values` as "arrays".
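As a rough sketch of what treating the `keys` or `values` as "arrays" can look like in plain pandas (again relying on per-row Python calls through `Series.map`, not on any akimbo API), with illustrative variable names that are assumptions of this example:

```python
import pandas as pd

nested_kv_like = pd.Series([{1: 2}, {2: 3}])

# Pull the keys and the values of each dict out as per-row lists.
keys = nested_kv_like.map(lambda d: list(d.keys()))
values = nested_kv_like.map(lambda d: list(d.values()))

# Reduction over the values of each dict.
value_sums = nested_kv_like.map(lambda d: sum(d.values()))

# Containment on the keys: does each dict have the key 1?
has_key_1 = nested_kv_like.map(lambda d: 1 in d)
```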

martindurant commented 1 year ago

These are certainly worthwhile processing models that we will chase. However, I was wondering if you knew of specific datasets or workflows that people were choosing not to process with python/pandas because it was too awkward or slow?

We are finding that nested/ragged data just doesn't show up a lot in Python, precisely because no one knows what to do with it, even though it is ubiquitous in the real world. We could probably do something interesting with the likes of https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs, for instance. We have the following specific cases in mind for examples:

Any other suggestions?

martindurant commented 1 year ago

https://pythonspeed.com/articles/json-memory-streaming/ is a smallish example we can directly compare to; it takes 23MB for ak in memory, but has a very complicated typestring.

mroeschke commented 1 year ago

> However, I was wondering if you knew of specific datasets or workflows that people were choosing not to process with python/pandas because it was too awkward or slow?

Ah, I see. Sorry, I am not too familiar with public-ish datasets/workflows for this case.