AMDmi3 opened 4 hours ago
Note: it turned out we already have deduplication of fresh entries in repositories_history_new.
We also don't really want to store historical data at per-update granularity; it also makes sense to keep ~one point per day for history older than a year.
And we can do the same for total statistics; not that it would save much space, but for consistency at the very least.
Repositories history (the one used for graphs) is currently stored as a timestamped JSON object of all per-repository counters, like this:
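The actual snapshot isn't quoted here; a hypothetical example of the shape (repository and counter names are illustrative, not the real schema):

```
ts                     | snapshot (jsonb)
2019-06-01 12:34:56+00 | {"freebsd": {"num_metapackages": 30000, "num_problems": 100},
                       |  "debian_stable": {"num_metapackages": 25000, "num_problems": 50},
                       |  ...}
```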
At first glance, the JSON overhead here looks tremendous, and converting the counters into classic SQL columns, also splitting these into per-repository entries, should be much more efficient. In practice that's not the case: doing so produces a table of more than 2× the size (988 MB jsonb vs. 2316 MB columns), not counting index sizes. I assume the jsonb variant benefits from TOAST compression and from the lack of per-row PostgreSQL overhead (with 381 active repositories, that's around 9 kB of row overhead per history point).
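For reference, a sketch of what the column-based layout could look like (table name taken from the note above, counter columns are my assumption):

```sql
-- hypothetical per-repository layout; counter columns are illustrative
CREATE TABLE repositories_history_new (
    repository_id smallint NOT NULL,
    ts timestamptz NOT NULL,
    num_metapackages integer,
    num_metapackages_newest integer,
    num_metapackages_outdated integer,
    num_problems integer,
    num_maintainers integer,
    PRIMARY KEY (repository_id, ts)
);
```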
However, the conversion allows handling each repository independently, which makes lookups faster (no need to fetch counters for 400 repositories when only one of them is wanted), and also allows dropping history points which did not have any counter changes. The latter shrinks the table to 387 MB (+145 MB index), at the cost of either periodic cleanup or more complex history update logic (if the previous history point is the same, update its timestamp instead of adding a new one), which I'd rather not implement in the current SQL-based updater.
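For illustration, the more complex update logic mentioned above could be sketched roughly like this (table and counter names are assumptions; it relies on now() being fixed within a transaction):

```sql
-- sketch: touch the latest point's timestamp if counters are unchanged...
UPDATE repositories_history_new
SET ts = now()
WHERE repository_id = $1
  AND ts = (SELECT max(ts) FROM repositories_history_new WHERE repository_id = $1)
  AND num_metapackages = $2
  AND num_problems = $3;

-- ...and insert a new point only if nothing was touched above
-- (now() returns the same value for the whole transaction)
INSERT INTO repositories_history_new (repository_id, ts, num_metapackages, num_problems)
SELECT $1, now(), $2, $3
WHERE NOT EXISTS (
    SELECT 1 FROM repositories_history_new
    WHERE repository_id = $1 AND ts = now()
);
```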
Some more tests are needed, but it looks like the columns variant is more viable. At the very least, we should pick one, as currently both histories are stored, which takes up precious disk space.
Migration query:
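The query itself wasn't captured above; a sketch of what the migration could look like, assuming snapshots are jsonb objects keyed by repository name (all names illustrative):

```sql
-- sketch: explode each jsonb snapshot into per-repository rows
INSERT INTO repositories_history_new (repository_id, ts, num_metapackages, num_problems)
SELECT r.id,
       h.ts,
       (h.snapshot -> r.name ->> 'num_metapackages')::integer,
       (h.snapshot -> r.name ->> 'num_problems')::integer
FROM repositories_history h
-- jsonb ? text checks for a top-level key, skipping repositories
-- absent from a given snapshot
JOIN repositories r ON h.snapshot ? r.name;
```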
Cleanup query:
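Likewise not quoted above; the cleanup could be sketched with a window function (counter list illustrative):

```sql
-- sketch: delete points whose counters equal the previous point's
-- for the same repository, keeping the first point of each run
DELETE FROM repositories_history_new
WHERE (repository_id, ts) IN (
    SELECT repository_id, ts
    FROM (
        SELECT repository_id, ts,
               num_metapackages IS NOT DISTINCT FROM lag(num_metapackages) OVER w
               AND num_problems IS NOT DISTINCT FROM lag(num_problems) OVER w
                   AS unchanged
        FROM repositories_history_new
        -- add a repository_id or ts condition here to limit the
        -- cleanup to one repository or to recent history
        WINDOW w AS (PARTITION BY repository_id ORDER BY ts)
    ) t
    WHERE unchanged
);
```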
Migration without deduplication followed by cleanup produces the same result as a deduplicating migration. However, cleanup on the 2 GB table takes a very long time, so it may be faster if limited to a single repository. Cleanup on a mostly deduplicated table is fast, and even faster if limited by age, so it's safe to run periodically or after each update.
Additional thing: for this to work, SQL queries for /graph/repo/* endpoints should be tweaked to fetch one extra row just before the desired 21-day range. Since the history is now sparse, the graph may contain only one recent data point, which is not enough to plot any lines, so we need an extra point (in fact, we need it always, to avoid a gap at the beginning of the graph).
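That extra point could be fetched with something like this (a sketch, names as assumed above):

```sql
-- sketch: the 21-day window plus the last point before it,
-- so a sparse history still produces a line from the left edge
(
    SELECT ts, num_metapackages
    FROM repositories_history_new
    WHERE repository_id = $1
      AND ts < now() - interval '21 days'
    ORDER BY ts DESC
    LIMIT 1
)
UNION ALL
(
    SELECT ts, num_metapackages
    FROM repositories_history_new
    WHERE repository_id = $1
      AND ts >= now() - interval '21 days'
)
ORDER BY ts;
```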