intake / akimbo

For when your data won't fit in your dataframe
https://akimbo.readthedocs.io
BSD 3-Clause "New" or "Revised" License

experimental for nested #79

Open martindurant opened 2 months ago

martindurant commented 2 months ago

@jpivarski: on the play data generated by nested-pandas, the query function here runs roughly 8x faster than the typical approach we discussed (183 ms vs. 23.2 ms in the timings below), even with the UnmaskedArray PR.

Generate the play data:

from nested_pandas.datasets import generate_data
import awkward as ak
import akimbo.pandas
import akimbo.exp  # this PR, experimental

nf = generate_data(1000, 10000)  # 1000 rows, 10000 nested rows per row
arr = nf.ak.array
arr2 = akimbo.exp.rec_list_swap(arr, "nested")  # to list-of-records

Times:

%timeit nf_g = nf.query("nested.t > 17.0");
83.8 ms ± 351 µs
%timeit arr["nested"][arr["nested", "t"] > 17]
183 ms ± 1.56 ms
%timeit akimbo.exp.query(arr2, "nested.t > 17")
23.2 ms ± 568 µs

Note that this builds a masked array: it has exactly the same structure as the original (swapped) array, with None wherever the filter fails. Otherwise you would need ak.count, which takes about 50 ms.

It feels like it should be possible to do this really efficiently with ArrayBuilder and numba? You would need to have a way to turn the "query" into something you can execute in the loop.

jpivarski commented 2 months ago

If all of the functors are structured, like map, filter, reduce, then you can do better than ArrayBuilder by making the Numba-compiled function generate an index and then apply that index to the array as a slice.

You could also add an axis argument to this and have it apply at some depth using ak.transform (having all structure above where it's applied stay the same—but the transformation has to be length-preserving). That would solve a whole class of problems in which someone wants to take apart a structure, change something, and then rebuild everything above the changed part the same way.