Open CarstVaartjes opened 9 years ago
Would whereblocks be a good way to handle this?
I'm working on a PR that will add pivot table style aggregation, and I'm pursuing a method that would make extensive use of the existing filtering functionality. If there are speed gains to be had, I'd like to pursue them.
whereblocks isn't Cython-based. @FrancescElies's work on iterblocks will work (and make it a lot more readable), but we still have to apply the filter ourselves :/ We could do an np.ma.getmask on the chunk array, but I'm not sure whether that would be any quicker than just looping through it. It also depends on how much you filter out, of course.
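To make the trade-off above concrete, here is a minimal sketch in plain NumPy of the two options: applying a boolean mask to each block versus looping through the block element by element. It does not use the library's own iterators; `iter_blocks`, `filter_with_mask`, `filter_with_loop` and the sample arrays are hypothetical names made up for illustration, and the block iterator is only a stand-in that slices an in-memory array.

```python
import numpy as np


def iter_blocks(arr, blen):
    """Hypothetical stand-in for a block iterator: yields consecutive slices."""
    for start in range(0, len(arr), blen):
        yield arr[start:start + blen]


def filter_with_mask(values, keys, wanted_key, blen=4):
    """Filter block by block using a vectorised boolean mask."""
    kept = []
    for v_block, k_block in zip(iter_blocks(values, blen), iter_blocks(keys, blen)):
        mask = k_block == wanted_key      # one vectorised comparison per block
        kept.append(v_block[mask])        # boolean indexing keeps matching rows
    return np.concatenate(kept) if kept else np.empty(0, dtype=values.dtype)


def filter_with_loop(values, keys, wanted_key, blen=4):
    """Filter block by block with an explicit Python loop over the elements."""
    kept = []
    for v_block, k_block in zip(iter_blocks(values, blen), iter_blocks(keys, blen)):
        for v, k in zip(v_block, k_block):
            if k == wanted_key:
                kept.append(v)
    return np.array(kept, dtype=values.dtype)


# Made-up sample data: keep the values whose key equals 1.
values = np.arange(10)
keys = np.array([0, 1, 0, 1, 1, 0, 0, 1, 0, 1])

masked = filter_with_mask(values, keys, 1)
looped = filter_with_loop(values, keys, 1)
```

Both functions return the same rows; which one is faster on real chunks would need to be timed, since it depends on the block length and on how much of each block the filter removes.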
I found this excellent investigation of timings from @FrancescElies: Comment in Chunks Class iterator, PR 153
It looks like he's just using the top level iterblocks though: test_v5 in bench_iter_carray
Isn't that defined in python here? iterblocks in toplevel.py
I'm also happy to put together a PR that reproduces his earlier PR, since that will likely be hard to merge. Would that be helpful?
@francescalted also did some work to improve performance here (also with tuples vs. namedtuples). I will check whether we can use that to improve this and save ourselves unneeded chunk decompressions.
It's a slightly ugly method to handle irrelevant results (because of filtering) during a groupby: