influxdata / influxdb-csharp

A .NET library for efficiently sending points to InfluxDB 1.x
Apache License 2.0
199 stars 60 forks source link

Implement Aggregation of Increment and Time measurements #55

Open cypressious opened 6 years ago

cypressious commented 6 years ago

Fixes #9

nblumhardt commented 6 years ago

@cypressious thanks again for the time spent on this. It's a nicely-implemented PR, but I'm not keen to move forward with it because I don't think we'll ultimately be able to build on a buffer-then-aggregate design without an eventual reset to avoid the allocations. Let me know if you're keen to dig into it further :+1: Thanks!

cypressious commented 6 years ago

@nblumhardt I will rework the PR. I'll implement a similar mechanism to the current batching approach where data will be periodically aggregated and then flushed down the pipeline.

cypressious commented 6 years ago

@nblumhardt I've finished the rework.

I've reimplemented the aggregation to run periodically, just like the batching. They also both derrive from the same base class now.

Because the aggregation now runs periodically, it needs to run before the batching in the pipeline.

I have one concern with the current implementation. Because points are allowed to have arbitrary timestamps, points that are buffered during each tick aren't guaranteed to really be from the same interval. That's why I still need to group them into time buckets. However, time buckets aren't necessarily aligned at the times when the aggregation happens. So points from one timer interval will most likely fall into two time buckets, even though the interval and bucket "length" is the same.

In practice, this means that we will (almost) always have twice the number of transmitted points.

I'll see if I can come up with a solution.


~I've squashed the commits and rebased the PR on #59 so you should probably merge that one first.~

cypressious commented 6 years ago

@nblumhardt I think I've found a solution for the problem described above. Points within the timer interval, i.e. inside [now - interval, now] now always land inside the same bucket. Remaining points are grouped into buckets like before.

I think this PR is ready for review.

cypressious commented 6 years ago

@nblumhardt did you have time to look at the PR?

nblumhardt commented 6 years ago

Hi @cypressious - thanks for the follow-up and sorry about the slow response; my time for this project is currently low.

Just to clarify my last comment:

I'm not keen to move forward with it because I don't think we'll ultimately be able to build on a buffer-then-aggregate design without an eventual reset to avoid the allocations. Let me know if you're keen to dig into it further

By buffer-then-aggregate, I mean, collecting up the PointData and then aggregating in memory with collection operations, rather than allocating a bucket per aggregation key and performing the aggregation incrementally in-place. It won't always be able to be done without overhead, but for counters/sums, and gauges, the total memory usage required to hold these can potentially just be a value, not a collection.

By an eventual reset, I mean, if we don't accommodate this goal into the design from the start, we probably won't be able to do it in a future version without starting afresh.

I think to dig in further requires some discussion of the goals/design space etc., and probably needs another contributor or two to join the conversation to provide some perspective.

cypressious commented 6 years ago

@nblumhardt Thanks for the reply. I misunderstood your last comment but I think I get your point now.

For sums, the bucket approach seems ideal.

Other kinds of aggregations can be done too. For averages, we could store the sum and the count to compute sum/count.

I'm still interested in implementing this feature. Do you have suggestions on which contributors to get involved?

nblumhardt commented 6 years ago

👍 I'll put the call out over on the original (#9) ticket. Cheers!