visualfabriq / bquery

A query and aggregation framework for Bcolz (W2013-01)
https://www.visualfabriq.com
BSD 3-Clause "New" or "Revised" License

Filter optimization #72

Closed CarstVaartjes closed 8 years ago

CarstVaartjes commented 8 years ago

Moved all filtering in where_terms to Cython, which should improve the performance of complex "in" filters significantly
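
For reference, a rough pure-Python sketch of what such an "in" filter has to do per column chunk (the actual change is a compiled Cython kernel inside bquery; the function name and signature below are illustrative assumptions, not the real API):

import numpy as np

def in_filter_mask(values, allowed):
    # values: 1-d array holding one column chunk; allowed: accepted values.
    # The Cython version runs this loop at C speed and only materialises
    # the boolean mask, rather than building intermediate arrays per term.
    allowed = set(allowed)
    mask = np.zeros(len(values), dtype=np.bool_)
    for i, v in enumerate(values):
        if v in allowed:
            mask[i] = True
    return mask

# e.g. in_filter_mask(np.array([106, 17, 47, 3]), [106, 47])
# -> array([ True, False,  True, False])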

CarstVaartjes commented 8 years ago

So this works correctly, but it actually makes things slower rather than faster, though it does improve memory usage quite a bit... I'll have to study it a bit more

CarstVaartjes commented 8 years ago

Ok, so it can be significantly slower (2-5x) for non-complex queries, but for larger queries it's faster (there's a break-even point) and it consumes much less memory, so I'll merge it. An example of a non-complex query:

import numpy as np
import bquery

# open the on-disk ctable read-only
ct = bquery.ctable(rootdir='fact_promotion_data_internal_product.bcolz', mode='r')
# a non-complex filter: two short "in" terms plus one equality term
where_terms = [['d_n_33', 'in', [106, 47]], ['d_n_34', '==', 17], ['d_n_40', 'in', [1, 2, 3]]]

# %memit comes from the memory_profiler IPython extension (%load_ext memory_profiler)
%memit
%memit ct.where_terms(where_terms)
%memit
%timeit ct.where_terms(where_terms)

# old situation
peak memory: 230.82 MiB, increment: 0.00 MiB
peak memory: 235.40 MiB, increment: 4.58 MiB
peak memory: 235.41 MiB, increment: 0.00 MiB
10 loops, best of 3: 28 ms per loop

# new situation
peak memory: 230.45 MiB, increment: 0.00 MiB
peak memory: 233.18 MiB, increment: 2.73 MiB
peak memory: 233.18 MiB, increment: 0.00 MiB
10 loops, best of 3: 138 ms per loop

If you make the where_terms much larger, the new situation outperforms the old one.
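
Purely as an illustration of what "much larger" means (column names reused from the example above, but the membership lists are made up), a heavier filter would look something like:

# a heavier filter with long "in" lists; this is where the Cython path pays off
where_terms = [
    ['d_n_33', 'in', list(range(0, 500, 3))],       # long membership list
    ['d_n_34', 'in', list(range(10, 200))],
    ['d_n_40', 'in', [1, 2, 3, 5, 8, 13, 21, 34]],
]
result = ct.where_terms(where_terms)

With this kind of query the Cython filtering comes out ahead, while the simple three-term example above still sits on the slower side of the break-even point.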