visualfabriq / bquery

A query and aggregation framework for Bcolz (W2013-01)
https://www.visualfabriq.com
BSD 3-Clause "New" or "Revised" License
56 stars, 11 forks

Correct agg ops #49

CarstVaartjes closed 9 years ago

CarstVaartjes commented 9 years ago

An update that:

CarstVaartjes commented 9 years ago

Hi @waylonflinn, this might affect you a bit (but for the better, I hope!). It should give more control over what the groupby does, and you will now be able to combine multiple types of aggregations in one groupby call. Performance is similar, but the memory footprint should be significantly smaller (I didn't call gc.collect, though).

waylonflinn commented 9 years ago

Great! Excited to have a look.

CarstVaartjes commented 9 years ago

The changes are quite modest ;) but they really should help. And, importantly, this prepares for multi-threading.

waylonflinn commented 9 years ago

I'm already noticing useful things. Like I'd already decided that output_col needed to be added to the agg_ops return value, and you've taken care of that.

I think the extension agg method needs to use both, instead of just the one. Also, I think the output column data type should be dependent on the aggregation operation. Here's a table describing the relationships I've come up with.

data types for each aggregation operation

operation             | datatype
----------------------|--------------
SUM                   | same as input
COUNT                 | int
COUNT_NA              | int
COUNT_DISTINCT        | int
SORTED_COUNT_DISTINCT | int
MEAN                  | float
STDEV                 | float
MEDIAN                | same as input
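
The mapping in the table above could be sketched in plain Python roughly as follows. This is only an illustration, not bquery's actual code: the helper name `output_dtype`, the lowercase operation names, and the choice of `int64` / `float64` for "int" / "float" are all assumptions.

```python
# Hypothetical sketch of the table above; none of these names come from bquery.
# Assumption: "int" means int64 and "float" means float64.
FIXED_OUTPUT_DTYPES = {
    'count': 'int64',
    'count_na': 'int64',
    'count_distinct': 'int64',
    'sorted_count_distinct': 'int64',
    'mean': 'float64',
    'stdev': 'float64',
}

# Operations whose output column keeps the input column's datatype.
SAME_AS_INPUT = {'sum', 'median'}


def output_dtype(op, input_dtype):
    """Return the output column datatype for an aggregation operation."""
    op = op.lower()
    if op in SAME_AS_INPUT:
        return input_dtype
    if op in FIXED_OUTPUT_DTYPES:
        return FIXED_OUTPUT_DTYPES[op]
    raise ValueError('unknown aggregation operation: %r' % op)


sum_type = output_dtype('SUM', 'int32')    # follows the input column
mean_type = output_dtype('MEAN', 'int32')  # always float
```

Keeping the fixed types in a single dictionary means a new operation only needs one extra entry.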

Here's another table describing which column's datatype each relevant variable should use:

sum_type usage

variable name             | column
--------------------------|-------
in_buffer                 | input
out_buffer                | output
last_values               | input
v                         | input
countunique{{ sum_type }} | input
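
To make the split concrete, here is a small pure-Python sketch of a grouped mean in which the values read from `in_buffer` keep the input type while `out_buffer` uses the output type. It is not the templated extension code: the function `grouped_mean`, its arguments, and the use of the standard `array` module with `int64` input and `float64` output are assumptions chosen for illustration.

```python
from array import array


def grouped_mean(in_buffer, group_index, n_groups):
    """Hypothetical grouped mean: integer input, float output."""
    sums = array('d', [0.0] * n_groups)
    counts = array('q', [0] * n_groups)
    for g, v in zip(group_index, in_buffer):
        # v has the input column's type (an int here)
        sums[g] += v
        counts[g] += 1

    # out_buffer has the output column's type (float for MEAN)
    out_buffer = array('d', [0.0] * n_groups)
    for g in range(n_groups):
        if counts[g]:  # groups with no rows keep 0.0 in this sketch
            out_buffer[g] = sums[g] / counts[g]
    return out_buffer


in_buffer = array('q', [1, 2, 4, 7])          # int64 input column
result = grouped_mean(in_buffer, [0, 0, 1, 1], 2)
```

If `out_buffer` were typed like the input column here, the means would not fit in an integer buffer, which is why the two types are tracked separately.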

Let me know if this seems right to you.

CarstVaartjes commented 9 years ago

seems spot on!