CarstVaartjes closed 9 years ago
Hi @waylonflinn, this might affect you a bit (but for the better, I hope!). It should give more control over what the groupby does, and you will now be able to combine multiple types of aggregations in one groupby call. Performance is similar (not impacted), but the memory footprint should be significantly smaller (I didn't call gc.collect, though).
Great! Excited to have a look.
The changes are quite modest ;) but they really should help. Also important: this prepares for multi-threading.
I'm already noticing useful things. For example, I'd already decided that `output_col` needed to be added to the `agg_ops` return value, and you've taken care of that.
I think the extension `agg` method needs to use both, instead of just the one.
Also, I think the output column data type should be dependent on the aggregation operation. Here's a table describing the relationships I've come up with.
operation | datatype |
---|---|
SUM | same as input |
COUNT | int |
COUNT_NA | int |
COUNT_DISTINCT | int |
SORTED_COUNT_DISTINCT | int |
MEAN | float |
STDEV | float |
MEDIAN | same as input |
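
To make the table concrete, here is a minimal Python sketch of the lookup it implies. The function name `output_dtype` and the specific widths `int64` / `float64` are assumptions for illustration only; the table just says "int" and "float", and nothing here is taken from the actual code.

```python
# Hypothetical helper: picks the output column dtype for an aggregation,
# following the table above. Names and widths are illustrative assumptions.
SAME_AS_INPUT_OPS = {"SUM", "MEDIAN"}
INT_OPS = {"COUNT", "COUNT_NA", "COUNT_DISTINCT", "SORTED_COUNT_DISTINCT"}
FLOAT_OPS = {"MEAN", "STDEV"}


def output_dtype(operation, input_dtype):
    """Return the dtype name the output column should use."""
    op = operation.upper()
    if op in SAME_AS_INPUT_OPS:
        return input_dtype   # e.g. SUM of int32 stays int32
    if op in INT_OPS:
        return "int64"       # counts are integers regardless of input
    if op in FLOAT_OPS:
        return "float64"     # MEAN / STDEV are fractional in general
    raise ValueError("unknown aggregation operation: %r" % operation)
```

For example, `output_dtype("mean", "int32")` would give `"float64"`, while `output_dtype("sum", "int32")` keeps `"int32"`.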
Here's another table describing which column's datatype each of the relevant variables should use:
variable name | column |
---|---|
in_buffer | input |
out_buffer | output |
last_values | input |
v | input |
countunique{{ sum_type }} | input |
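
As a rough illustration of the split above, here is a pure-Python sketch of a grouped MEAN where `in_buffer` and `v` carry the input type and `out_buffer` carries the output type. The function `grouped_mean`, the `counts` array, and the choice of a 64-bit integer input are all hypothetical; this is not the extension code, just the idea.

```python
from array import array


def grouped_mean(group_index, values, n_groups):
    """Mean per group; input is assumed to be 64-bit ints (illustrative)."""
    in_buffer = array("q", values)               # input column's type
    out_buffer = array("d", [0.0]) * n_groups    # output column's type (float)
    counts = array("q", [0]) * n_groups          # helper, not in the table

    for g, v in zip(group_index, in_buffer):     # v has the input type
        out_buffer[g] += v
        counts[g] += 1

    for g in range(n_groups):
        if counts[g]:
            out_buffer[g] /= counts[g]
    return out_buffer
```

With integer input such as `grouped_mean([0, 1, 0, 1], [1, 2, 4, 4], 2)`, the result is held in a float array, which is the point of typing `out_buffer` by the output column.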
Let me know if this seems right to you.
seems spot on!
An update that: