We receive quite a few usage reports from Syncthing instances. This data is good and valuable. However, our way of handling it is naive and inefficient, a relic from when there were a few thousand Syncthing users and not upwards of a million.
Currently, reports are stored in JSON format in a PostgreSQL database. There are a few tables holding per-day aggregates, but several queries still run against the latest 24 hours of raw reports. There are several issues with this:
- We accumulate an enormous amount of data that we don't need and don't want (the historical raw reports).
- There is no way to answer many historical questions ("what percentage of users used versioning in July 2019?") without doing an expensive custom query on those historical raw reports.
- It's tricky to scale since it depends on a single large PostgreSQL database.
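To make the second issue concrete, answering a one-off question like the versioning example means scanning a month's worth of raw JSON reports. A rough sketch of what that looks like today follows; the table layout, column names, and JSON field names here are assumptions for illustration, not the actual schema:

```go
// Rough illustration of the kind of expensive one-off query the current
// setup requires. Table, column, and JSON field names are assumptions,
// not the real ursrv schema.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "dbname=ursrv sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Scan a month of raw JSON reports to answer a single question:
	// what share of reports in July 2019 had versioning enabled?
	const query = `
		SELECT (100.0 * count(*) FILTER (WHERE report->>'usesVersioning' = 'true')
		        / count(*))::float8
		FROM reports
		WHERE received >= '2019-07-01' AND received < '2019-08-01'`

	var pct float64
	if err := db.QueryRow(query).Scan(&pct); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("versioning in July 2019: %.1f%%\n", pct)
}
```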
If I were doing it from scratch today, I would probably do something more like this:
- Incoming reports are saved as-is as blobs in a blob store.
- Once every 24 hours, all the reports from the previous day are aggregated and rolled up into a statistics object containing counts, histograms, and min/max/median/percentile/standard-deviation summaries for the numerical values (see the sketch after this list).
- Optionally, further aggregations could produce time series such as users-over-the-last-two-years, but these are essentially a cache over the daily objects.
- All of the previous day's raw reports are then deleted.
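A minimal sketch of what the daily rollup step could look like, assuming a simple blob-store abstraction; all names here (BlobStore, Report, DailyStats, RollUp) are illustrative, not an existing API, and only a couple of report fields are shown:

```go
// Sketch of the proposed daily rollup over raw report blobs.
package rollup

import (
	"encoding/json"
	"sort"
	"time"
)

// BlobStore is a hypothetical minimal interface over something like S3.
type BlobStore interface {
	List(prefix string) ([]string, error)
	Get(key string) ([]byte, error)
	Put(key string, data []byte) error
	Delete(key string) error
}

// Report is a stand-in for the real usage report; only one numeric field
// is shown for brevity.
type Report struct {
	UniqueID   string `json:"uniqueID"`
	NumFolders int    `json:"numFolders"`
}

// DailyStats is the rolled-up object kept after the raw reports are deleted.
type DailyStats struct {
	Date       string `json:"date"`
	Reports    int    `json:"reports"`
	FoldersMin int    `json:"foldersMin"`
	FoldersMax int    `json:"foldersMax"`
	FoldersP50 int    `json:"foldersP50"`
	FoldersP95 int    `json:"foldersP95"`
}

// RollUp aggregates the previous day's raw reports into one stats object,
// writes it, and deletes the raw blobs.
func RollUp(store BlobStore, day time.Time) error {
	prefix := "reports/" + day.Format("2006-01-02") + "/"
	keys, err := store.List(prefix)
	if err != nil {
		return err
	}

	var folders []int
	for _, key := range keys {
		data, err := store.Get(key)
		if err != nil {
			return err
		}
		var r Report
		if err := json.Unmarshal(data, &r); err != nil {
			continue // skip malformed reports
		}
		folders = append(folders, r.NumFolders)
	}

	sort.Ints(folders)
	stats := DailyStats{
		Date:    day.Format("2006-01-02"),
		Reports: len(folders),
	}
	if n := len(folders); n > 0 {
		stats.FoldersMin = folders[0]
		stats.FoldersMax = folders[n-1]
		stats.FoldersP50 = folders[n/2]
		stats.FoldersP95 = folders[n*95/100]
	}

	out, err := json.Marshal(stats)
	if err != nil {
		return err
	}
	if err := store.Put("stats/"+stats.Date+".json", out); err != nil {
		return err
	}

	// Only delete the raw reports once the aggregate is safely stored.
	for _, key := range keys {
		if err := store.Delete(key); err != nil {
			return err
		}
	}
	return nil
}
```

Note that the raw blobs are only deleted after the aggregate has been written, so a failed rollup can simply be retried.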
Thus, we would keep only a tiny fraction of the data, and most queries could be answered by loading a single daily object or time series object. Those only change every 24 hours, so they can be cached in RAM. All the server instances would be essentially stateless, just putting or occasionally getting objects from an object store (e.g., S3).
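Continuing the sketch above, a stateless server instance could keep the latest daily object in RAM and refresh it from the object store at most once per day. Again, the names (statsCache, BlobStore, DailyStats) are illustrative and refer to the hypothetical types in the previous sketch:

```go
// Sketch of a stateless query server caching the daily stats object in RAM.
package rollup

import (
	"encoding/json"
	"sync"
	"time"
)

type statsCache struct {
	store   BlobStore
	mut     sync.Mutex
	cached  DailyStats
	fetched time.Time
}

// Latest returns the most recent daily stats object, hitting the object
// store only when the cached copy is older than 24 hours.
func (c *statsCache) Latest() (DailyStats, error) {
	c.mut.Lock()
	defer c.mut.Unlock()

	if !c.fetched.IsZero() && time.Since(c.fetched) < 24*time.Hour {
		return c.cached, nil
	}

	day := time.Now().AddDate(0, 0, -1).Format("2006-01-02")
	data, err := c.store.Get("stats/" + day + ".json")
	if err != nil {
		return DailyStats{}, err
	}
	var stats DailyStats
	if err := json.Unmarshal(data, &stats); err != nil {
		return DailyStats{}, err
	}
	c.cached, c.fetched = stats, time.Now()
	return stats, nil
}
```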