
visualfabriq / bquery

A query and aggregation framework for Bcolz (W2013-01)

https://www.visualfabriq.com

BSD 3-Clause "New" or "Revised" License


Improve skip key #47

Open CarstVaartjes opened 9 years ago

CarstVaartjes commented 9 years ago

The skip key is a slightly ugly way to handle results that become irrelevant (because of filtering) during a groupby:

finding it can be slow as it is now (although it is now cythonized)
getmask would be nicer; we also have to look at how bcolz iterblocks handles filters
the way it's removed at the end (manipulating a list of ndarrays) is not efficient
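
To make the trade-off concrete, here is a minimal NumPy sketch of the two approaches. It is not bquery's actual code: it assumes the group keys have already been factorized into integer codes, and the function names, the `keep` array and the choice of `n_groups` as the skip slot are all hypothetical.

```python
import numpy as np

def sum_with_skip_key(codes, values, keep, n_groups):
    """Route filtered-out rows to an extra 'skip' group, then drop it."""
    skip_key = n_groups  # hypothetical: one slot past the real groups
    routed = np.where(keep, codes, skip_key)
    totals = np.bincount(routed, weights=values, minlength=n_groups + 1)
    return totals[:n_groups]  # discard the skip slot at the end

def sum_with_mask(codes, values, keep, n_groups):
    """Filter the rows first, so no skip group is ever created."""
    return np.bincount(codes[keep], weights=values[keep], minlength=n_groups)

# Toy data: three groups, two rows filtered out.
codes = np.array([0, 1, 0, 2, 1, 2])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
keep = np.array([True, False, True, True, True, False])

skip_result = sum_with_skip_key(codes, values, keep, n_groups=3)
mask_result = sum_with_mask(codes, values, keep, n_groups=3)
```

Both functions return the same per-group sums here; the difference is only whether the filtered rows are aggregated into a throwaway slot and removed afterwards, or dropped before aggregation.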

waylonflinn commented 9 years ago

Would whereblocks be a good way to handle this? I'm working on a PR that will add pivot table style aggregation, and I'm pursuing a method that would make extensive use of the existing filtering functionality. If there are speed gains to be had, I'd like to pursue them.

CarstVaartjes commented 9 years ago

whereblocks isn't Cython-based; @FrancescElies's work on iterblocks will work (and make it a lot more readable), but we still have to apply the filter ourselves :/ We could do a np.getmask on the chunk array, but I'm not sure whether that would be quicker than just looping through it. It also depends on how much you filter out, of course.
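
A rough sketch of the two options, applying the filter per block either with a plain loop or with a boolean mask. This is an illustration only: `np.array_split` stands in for bcolz's block iteration, the function names and predicate are hypothetical, and a plain boolean mask is used instead of a masked array (in NumPy, `getmask` lives under `np.ma`). Which variant is faster would have to be measured, and will depend on how much is filtered out.

```python
import numpy as np

def filtered_sum_loop(blocks, predicate):
    """Apply the filter element by element inside each block."""
    total = 0.0
    for block in blocks:
        for x in block:
            if predicate(x):
                total += x
    return total

def filtered_sum_mask(blocks, predicate):
    """Build a boolean mask per block and aggregate only the kept rows."""
    total = 0.0
    for block in blocks:
        mask = predicate(block)  # vectorized comparison over the whole block
        if mask.any():           # nothing to do if the block is filtered out
            total += block[mask].sum()
    return total

# Stand-in for block iteration over a compressed column (assumption).
blocks = np.array_split(np.arange(10, dtype=float), 3)

def keep_large(x):
    return x > 2  # works for scalars and arrays alike

loop_total = filtered_sum_loop(blocks, keep_large)
mask_total = filtered_sum_mask(blocks, keep_large)
```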

waylonflinn commented 9 years ago

I found this excellent investigation of timings from @FrancescElies: Comment in Chunks Class iterator, PR 153

It looks like he's just using the top level iterblocks though: test_v5 in bench_iter_carray

Isn't that defined in python here? iterblocks in toplevel.py

waylonflinn commented 9 years ago

I'm also happy to put together a PR that reproduces his earlier PR, since that will likely be hard to merge. Would that be helpful?

CarstVaartjes commented 8 years ago

@francescalted also did some work to improve performance here (also with tuples vs namedtuples). I will check whether we can use that to improve this and save ourselves unneeded chunk decompressions.