visualfabriq / bquery

A query and aggregation framework for Bcolz (W2013-01)
https://www.visualfabriq.com
BSD 3-Clause "New" or "Revised" License
56 stars, 11 forks

Query transformer infrastructure & example query transformer implementations #29

Open ARF1 opened 9 years ago

ARF1 commented 9 years ago

Based on visualfabriq/bquery#27 but can be rebased on master.

This introduces the infrastructure for plug-in query transformers. Included are three sample query transformers: InOperatorTransformer, TrivialBooleanExpressionsOptimizer and CachedFactorOptimizer.

By default this PR does not change the behaviour or dependencies of bquery. Query transformers have to be explicitly enabled by configuring them, e.g.:

from transformers import InOperatorTransformer, TrivialBooleanExpressionsOptimizer
b.transformers = [InOperatorTransformer(), TrivialBooleanExpressionsOptimizer()]
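To give a feel for what a transformer such as InOperatorTransformer might do, here is a minimal sketch. It assumes the transformer rewrites an `in` test against a literal list into a chain of equality tests joined by `|`, which is the only form numexpr can evaluate. The names `InToOrChain` and `expand_in` are hypothetical and are not taken from this PR; the sketch uses the standard-library `ast` module and needs Python 3.9+ for `ast.unparse`.

```python
import ast


class InToOrChain(ast.NodeTransformer):
    """Hypothetical sketch: rewrite `x in [a, b]` as `(x == a) | (x == b)`."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        # Only handle a single `in` against a non-empty literal list or tuple.
        if (len(node.ops) == 1
                and isinstance(node.ops[0], ast.In)
                and isinstance(node.comparators[0], (ast.List, ast.Tuple))
                and node.comparators[0].elts):
            terms = [
                ast.Compare(left=node.left, ops=[ast.Eq()], comparators=[elt])
                for elt in node.comparators[0].elts
            ]
            # Fold the equality tests into a left-associative chain of `|`.
            result = terms[0]
            for term in terms[1:]:
                result = ast.BinOp(left=result, op=ast.BitOr(), right=term)
            return result
        return node


def expand_in(expression):
    """Parse an expression string, expand `in` tests, return the new string."""
    tree = ast.parse(expression, mode="eval")
    tree = ast.fix_missing_locations(InToOrChain().visit(tree))
    return ast.unparse(tree)
```

Calling `expand_in("my_col in ['a', 'b', 'c']")` returns an expression string with one equality test per list element and no `in` operator; expressions without `in` pass through unchanged.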

For convenience, a shortcut is provided for these (currently) most useful transformers with transformers.standard_transformers:

from transformers import standard_transformers
b.transformers = standard_transformers
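How bquery consumes the `b.transformers` list is not shown in this thread. As a rough sketch only, assuming each transformer is an `ast.NodeTransformer` and the list is applied in order to the parsed query, the pipeline could look like the following. `DropNeutralBooleans` is a hypothetical stand-in for one plausible reading of "trivial boolean expressions" (removing `& True` and `| False`), and `apply_transformers` is likewise an invented helper name.

```python
import ast


class DropNeutralBooleans(ast.NodeTransformer):
    """Hypothetical sketch: simplify `e & True` and `e | False` to `e`."""

    def visit_BinOp(self, node):
        self.generic_visit(node)
        # True is neutral for `&`, False is neutral for `|`.
        neutral = {ast.BitAnd: True, ast.BitOr: False}.get(type(node.op))
        if neutral is None:
            return node
        for keep, other in ((node.left, node.right), (node.right, node.left)):
            if isinstance(other, ast.Constant) and other.value is neutral:
                return keep
        return node


def apply_transformers(expression, transformers):
    """Run each transformer over the parsed expression, in list order."""
    tree = ast.parse(expression, mode="eval")
    for transformer in transformers:
        tree = transformer.visit(tree)
    return ast.unparse(ast.fix_missing_locations(tree))
```

For example, `apply_transformers("(a > 1) & True", [DropNeutralBooleans()])` returns `"a > 1"`. Because the transformers run in list order, the output of one is the input of the next, so ordering matters when one rewrite exposes an opportunity for another.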

The overhead for queries is negligible for reasonably sized databases: for the query db["my_col=='AB1234567890'"], bquery requires 362 ms without query transformers and 367 ms with all query transformers configured (including CachedFactorOptimizer).

With a non-compressed database, the CachedFactorOptimizer shows some minor positive effects: 547 ms vs. 296 ms.
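The thread does not spell out how CachedFactorOptimizer works. One plausible reading, offered here purely as an assumption, is that a string comparison is answered from a cached factorization of the column: each distinct value gets a small integer code, and the filter compares codes instead of strings. The pure-Python sketch below illustrates that idea only; `factorize` and `equals_via_factor` are hypothetical names and do not reflect bquery's actual implementation.

```python
def factorize(values):
    """Return (labels, codes): each distinct value gets a small integer code."""
    index = {}
    codes = []
    for value in values:
        # setdefault assigns the next free code the first time a value is seen.
        codes.append(index.setdefault(value, len(index)))
    return list(index), codes


def equals_via_factor(labels, codes, target):
    """Evaluate `column == target` using the integer codes only."""
    try:
        code = labels.index(target)
    except ValueError:
        # The target never occurs in the column, so nothing matches.
        return [False] * len(codes)
    return [c == code for c in codes]


# The factorization is computed once and reused across queries (the "cache").
column = ["AB1", "CD2", "AB1", "EF3"]
labels, codes = factorize(column)
mask = equals_via_factor(labels, codes, "AB1")
```

The potential gain comes from looking up the target string once in the label list and then comparing integers row by row, rather than comparing strings for every row.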

CarstVaartjes commented 9 years ago

Hi @ARF1

Sorry no one ever got back to you before! :( We used to work the way the InOperatorTransformer does, but with larger in statements it broke numexpr (too many ORs), so we had to implement this workaround. In a short mail discussion, Francesc Alted suggested that the best thing to do would be to add in/not in functionality to numexpr. But that needs some heavy C coding (not my personal forte, and my programmers are also quite overloaded at the moment). Still, it's on the to-do list, as it will greatly improve filtering (you would be able to push everything directly to numexpr).

The factorization part is a very good idea; I'll see how to automate that from a filter behaviour.
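To make the scaling problem concrete: expanding an in test produces one equality term per listed value, so the expression handed to the evaluator grows linearly with the list, whereas a direct membership test does not. The sketch below contrasts the two in plain Python. `or_chain` and `membership_mask` are hypothetical helpers, and the set-based mask is only an illustration of what native in support would achieve, not the workaround bquery uses.

```python
def or_chain(column_name, values):
    """Build the expanded expression string an `in` rewrite would produce."""
    return " | ".join(f"({column_name} == {value!r})" for value in values)


def membership_mask(column, values):
    """Evaluate `in` directly with a set lookup, whatever the list size."""
    wanted = set(values)
    return [item in wanted for item in column]


# A 1000-value filter becomes an expression with 1000 equality terms.
big_expression = or_chain("my_col", range(1000))
term_count = big_expression.count("==")
```

Here `or_chain("my_col", ["a", "b"])` gives `"(my_col == 'a') | (my_col == 'b')"`, while `membership_mask` answers the same question with a single pass over the column.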