mroeschke closed this 3 weeks ago
These are certainly worthwhile processing models that we will chase. However, I was wondering if you knew of specific datasets or workflows that people were choosing not to process with python/pandas because it was too awkward or slow?
We are finding that nested/ragged data just doesn't show up a lot in Python, precisely because no one knows what to do with it, even though it is ubiquitous in the real world. We could probably do something interesting with the likes of https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs , for instance. We have the following specific case in mind as an example:
https://pythonspeed.com/articles/json-memory-streaming/ is a smallish example we can compare against directly; it takes 23MB in memory with ak, but has a very complicated typestring.
Any other suggestions?
> However, I was wondering if you knew of specific datasets or workflows that people were choosing not to process with python/pandas because it was too awkward or slow?
Ah, I see. Sorry, I am not too familiar with public-ish datasets/workflows for this case.
pandas generally discourages having nested data in a DataFrame or Series. For nested data in pandas, I tend to group the types of nested data as:

In [5]: nested_array_like
Out[5]:
0    [1, 2]
1    [2, 3]
dtype: object
In [14]: nested_array_like + nested_array_like
Out[14]:
0    [1, 2, 1, 2]
1    [2, 3, 2, 3]
dtype: object

In [15]: nested_array_like.explode()
Out[15]:
0    1
0    2
1    2
1    3
dtype: object

In [6]: nested_kv_like = pd.Series([{1:2}, {2:3}])

In [7]: nested_kv_like
Out[7]:
0    {1: 2}
1    {2: 3}
dtype: object

In [13]: nested_kv_like.str.get(1)
Out[13]:
0    2.0
1    NaN
dtype: float64
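
For anyone who wants to reproduce the two groups above outside an IPython session, here is a minimal standalone sketch. The construction of nested_array_like is an assumption on my part, since only its repr appears above; the variable names concatenated, exploded and looked_up are made up for illustration.

```python
import pandas as pd

# Assumed construction: the session above only shows the repr of this Series.
nested_array_like = pd.Series([[1, 2], [2, 3]])

# With object dtype each cell is a plain Python list, so "+" falls back to
# element-wise list concatenation rather than numeric addition.
concatenated = nested_array_like + nested_array_like

# explode() emits one row per list element and repeats the original index label.
exploded = nested_array_like.explode()

# Key-value-like nesting: each cell is a dict.
nested_kv_like = pd.Series([{1: 2}, {2: 3}])

# .str.get also accepts dict cells; rows lacking the key come back as missing.
looked_up = nested_kv_like.str.get(1)

print(concatenated)
print(exploded)
print(looked_up)
```

In both groups the Series stays at object dtype, so pandas is dispatching to Python-level operations on each cell rather than working on a typed nested layout.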