toni-moreno / syncflux

SyncFlux is an Open Source InfluxDB Data synchronization and replication tool for migration purposes or HA clusters
MIT License

[Feature Request] Divide data into chunks based on amount rather than time #43

Open ptoews opened 4 years ago

ptoews commented 4 years ago

When I tried to sync a large database I ran into a few errors, for example `Request Entity Too Large`, which I have not yet been able to fix by increasing the `max-points-on-write` parameter; similar issues with large amounts of data have already been discussed here. But that is not the main point of this issue.

My data consists of ~50k points contained within about one minute, and I tried to sync the last month. To keep the number of points per chunk low, I would have to choose a chunk-interval of a few seconds, which produces a huge number of empty chunks for the rest of the month. So I wondered: why is the data divided based on time rather than on the actual amount? Granted, my example is a bit extreme, but whenever the data distribution is uneven or spiky this approach might not be the best. Instead, it might be better to define a chunk size, for example 1000 points, and have syncflux query the first 1000 points, then the next 1000 points, and so on, resulting in even, adjustable chunks. InfluxQL does support this with the LIMIT and OFFSET clauses.
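
For illustration, a sketch of what such amount-based chunking could look like in InfluxQL (the measurement name and start time here are made up):

```sql
-- Page through the data in fixed-size chunks of 1000 points,
-- regardless of how the points are distributed over time:
SELECT * FROM "my_measurement" WHERE time >= '2020-01-01T00:00:00Z' LIMIT 1000 OFFSET 0
SELECT * FROM "my_measurement" WHERE time >= '2020-01-01T00:00:00Z' LIMIT 1000 OFFSET 1000
SELECT * FROM "my_measurement" WHERE time >= '2020-01-01T00:00:00Z' LIMIT 1000 OFFSET 2000
-- ...and so on, until a query returns fewer than 1000 points.
```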

I cannot even think of a reason why partitioning the data over time would be better than simply by amount as described. Am I missing something? What do you think?

toni-moreno commented 4 years ago

Hello @ptoews, partitioning the data by amount is a great idea, in addition to chunking it by time (it would give complete control over the amount of data transferred). We could add, for example, a `max-series-by-time-chunk` parameter. I will keep it in mind for the next releases (I will also accept PRs with this new feature).
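
As a rough sketch of how such a combined approach could behave (a hypothetical one-hour time chunk capped at 1000 points per query; names and values are made up):

```sql
-- Query one time chunk, capped at 1000 points:
SELECT * FROM "my_measurement"
  WHERE time >= '2020-01-01T00:00:00Z' AND time < '2020-01-01T01:00:00Z'
  LIMIT 1000 OFFSET 0
-- If exactly 1000 points come back, keep paging within the same window
-- (OFFSET 1000, OFFSET 2000, ...); otherwise move on to the next time chunk.
```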

Until then, did you try disabling the `max-body-size` parameter? Perhaps that could help with large reads/writes.

Thank you a lot, @ptoews, for your suggestion.

ptoews commented 3 years ago

Hi @toni-moreno, great to hear that! I'm just wondering what the advantage of time-based partitioning is. A combination is surely possible, but I don't see a use case for it, nor for the time-only case. The current issues with large copies seem much more important to me.

Maybe you can explain this to me; otherwise I would try to implement an amount-only partitioning solution.