Correct agg ops

CarstVaartjes commented 9 years ago

An update that:

should greatly improve memory usage for large (multi-billion) aggregations; it now directly compresses each individual groupby step
massively improves the groupby call so that it actually does what the documentation states (the order of calls was reshuffled to be more logical and SQL-like)
prepares for multi-threading (groupby and aggregation column generation should now be combinable and then run in parallel; planned for the next version)

CarstVaartjes commented 9 years ago

Hi @waylonflinn, this might affect you a bit (but for the better, I hope!). It should give more control over what the groupby does, and you will now be able to combine multiple types of aggregations in one groupby call. Performance is similar (not impacted), but the memory footprint should be significantly smaller (I didn't call gc.collect, though).

waylonflinn commented 9 years ago

Great! Excited to have a look.

CarstVaartjes commented 9 years ago

The changes are quite modest ;) but it really should help. And, importantly, it prepares for multi-threading.

waylonflinn commented 9 years ago

I'm already noticing useful things. Like I'd already decided that output_col needed to be added to the agg_ops return value, and you've taken care of that.

I think the extension agg method needs to use both, instead of just the one. Also, I think the output column data type should be dependent on the aggregation operation. Here's a table describing the relationships I've come up with.

data types for each aggregation operation

operation	datatype
SUM	same as input
COUNT	`int`
COUNT_NA	`int`
COUNT_DISTINCT	`int`
SORTED_COUNT_DISTINCT	`int`
MEAN	`float`
STDEV	`float`
MEDIAN	same as input
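
To make the table concrete, here is a minimal sketch of how the output type could be looked up from the operation. It is an illustration only: the names `FIXED_OUTPUT_TYPES`, `SAME_AS_INPUT` and `output_dtype` are hypothetical and not part of bquery, and the types are plain strings rather than real dtypes.

```python
# Hypothetical lookup transcribed from the table above (not bquery code).
FIXED_OUTPUT_TYPES = {
    "COUNT": "int",
    "COUNT_NA": "int",
    "COUNT_DISTINCT": "int",
    "SORTED_COUNT_DISTINCT": "int",
    "MEAN": "float",
    "STDEV": "float",
}

# Operations whose output column keeps the type of the input column.
SAME_AS_INPUT = {"SUM", "MEDIAN"}


def output_dtype(operation, input_dtype):
    """Return the output column type for an aggregation operation."""
    op = operation.upper()
    if op in SAME_AS_INPUT:
        return input_dtype
    if op in FIXED_OUTPUT_TYPES:
        return FIXED_OUTPUT_TYPES[op]
    raise ValueError("unknown aggregation operation: %r" % (operation,))
```

For example, `output_dtype("mean", "int64")` gives `"float"`, while `output_dtype("sum", "int64")` stays `"int64"`.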

Here's another table describing which column's datatype each relevant variable should use:

sum_type usage

variable name	column
in_buffer	input
out_buffer	output
last_values	input
v	input
countunique{{ sum_type }}	input
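
Read as code, the table says that only the output buffer follows the output column's type. The sketch below shows one way to express that; `VARIABLE_COLUMN` and `variable_dtype` are hypothetical names used for illustration, the types are plain strings, and the templated `countunique{{ sum_type }}` entry is left out for brevity (per the table it follows the input column as well).

```python
# Hypothetical mapping transcribed from the table above (not bquery code).
VARIABLE_COLUMN = {
    "in_buffer": "input",
    "out_buffer": "output",
    "last_values": "input",
    "v": "input",
}


def variable_dtype(variable, input_dtype, output_dtype):
    """Pick the type a variable should use, given both column types."""
    role = VARIABLE_COLUMN[variable]
    return input_dtype if role == "input" else output_dtype
```

With an `int64` input column and a `float` output column (as a MEAN would have under the first table), `in_buffer`, `last_values` and `v` would be `int64`, and only `out_buffer` would be `float`.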

Let me know if this seems right to you.

CarstVaartjes commented 9 years ago

seems spot on!

visualfabriq / bquery

Correct agg ops #49
