For the database, it seems like we just have a lot of data in there:
```
    table_schema    |         table_name          | row_estimate |   total    |   index    |   toast    |   table    |   total_size_share
--------------------+-----------------------------+--------------+------------+------------+------------+------------+----------------------
 public             | checks_domaintesttls        |  5.87121e+06 | 11 GB      | 920 MB     | 53 MB      | 9811 MB    |     0.19757024975081
 public             | checks_domaintestdnssec     |  6.55164e+06 | 7461 MB    | 445 MB     | 8192 bytes | 7016 MB    |    0.136694670574313
 public             | checks_domaintestipv6       |  3.17311e+06 | 6796 MB    | 73 MB      | 64 kB      | 6724 MB    |    0.124513858155647
 public             | checks_domaintestappsecpriv |  2.92914e+06 | 5258 MB    | 132 MB     | 38 MB      | 5089 MB    |   0.0963380988068091
 public             | checks_webtesttls           |  2.90879e+06 | 3567 MB    | 66 MB      | 4848 kB    | 3497 MB    |   0.0653475402496535
 public             | checks_nsdomain             |  1.31668e+07 | 3371 MB    | 1758 MB    | 8192 bytes | 1612 MB    |   0.0617551144726996
 public             | checks_webtestappsecpriv    |  1.84553e+06 | 3324 MB    | 43 MB      | 11 MB      | 3271 MB    |   0.0609007700979183
 public             | checks_mailtestauth         |  1.36015e+06 | 2776 MB    | 31 MB      | 728 kB     | 2744 MB    |   0.0508495924509679
 public             | checks_mailtestipv6         |  1.49707e+06 | 2603 MB    | 34 MB      | 8192 bytes | 2569 MB    |   0.0476891337242888
 public             | checks_mailtestdnssec       |  1.43883e+06 | 2563 MB    | 32 MB      | 8192 bytes | 2531 MB    |   0.0469626046496716
 public             | checks_connectiontest       |       824699 | 1974 MB    | 226 MB     | 8192 bytes | 1748 MB    |   0.0361587050950356
 public             | checks_mailtesttls          |  1.35696e+06 | 1307 MB    | 30 MB      | 1200 kB    | 1275 MB    |   0.0239416807268268
 public             | checks_webdomain            |  2.99028e+06 | 611 MB     | 253 MB     | 32 kB      | 359 MB     |   0.0111963626599552
```
In checks_domaintesttls, presumably our oldest test, there are roughly 7 million rows (see the year distribution below). About 4.6M of those are from before 2021, so if we want to save disk space, expiring tests older than 2 years to an inactive history could help.
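As a sketch (using the same timestamp column as the year queries further down), the impact of such a term can be previewed before anything is moved or deleted:

```sql
-- Sketch: count how many single-test TLS results fall outside a
-- 2-year retention term and would be expired to inactive history.
SELECT count(*)
FROM checks_domaintesttls
WHERE timestamp < now() - interval '2 years';
```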
Outside of the SQL database, it seems our host makes daily database backups for the last 3 days as gzipped SQL, which accounts for another 21 GB. Perhaps we can save something there. Other things in /var are not significant compared to this.
Discussed: we should bring up archiving terms in the steering committee.
For batch, table sizes are (first query from here):
```
    table_schema    |         table_name          | row_estimate |   total    |   index    |   toast    |  table  |   total_size_share
--------------------+-----------------------------+--------------+------------+------------+------------+---------+----------------------
 public             | checks_domaintestdnssec     |  8.23058e+06 | 12 GB      | 1307 MB    | 8192 bytes | 10 GB   |    0.144176948726993
 public             | checks_domaintesttls        |  6.01754e+06 | 10 GB      | 1131 MB    | 16 MB      | 9158 MB |     0.12387370636111
 public             | checks_mailtestauth         |  3.11716e+06 | 6909 MB    | 244 MB     | 3688 kB    | 6661 MB |   0.0830458788931248
 public             | checks_domaintestappsecpriv |  3.35727e+06 | 6450 MB    | 329 MB     | 24 MB      | 6097 MB |   0.0775309044043542
 public             | checks_domaintestipv6       |  2.72941e+06 | 6402 MB    | 228 MB     | 8192 bytes | 6174 MB |   0.0769519574511083
 public             | checks_mailtestdnssec       |  3.13744e+06 | 5948 MB    | 245 MB     | 8192 bytes | 5703 MB |   0.0715014982189282
 public             | checks_mailtestipv6         |  3.13552e+06 | 5857 MB    | 264 MB     | 8192 bytes | 5593 MB |   0.0704057718522659
 public             | checks_webtestappsecpriv    |  2.83874e+06 | 5155 MB    | 221 MB     | 10 MB      | 4924 MB |   0.0619640397406468
 public             | checks_mailtesttls          |  4.07792e+06 | 4353 MB    | 318 MB     | 2960 kB    | 4033 MB |   0.0523275076150541
 public             | checks_nsdomain             |   1.6678e+07 | 4171 MB    | 1802 MB    | 8192 bytes | 2369 MB |   0.0501343645256617
 public             | checks_webtesttls           |  2.83329e+06 | 3815 MB    | 226 MB     | 1504 kB    | 3587 MB |   0.0458550403225049
 public             | checks_batchmailtest        |  5.13604e+06 | 3496 MB    | 3014 MB    |            | 482 MB  |   0.0420172746877443
 public             | checks_batchdomain          |  8.54135e+06 | 3170 MB    | 2469 MB    |            | 701 MB  |   0.0381079506461104
 public             | checks_batchwebtest         |  3.50514e+06 | 2561 MB    | 2233 MB    |            | 328 MB  |   0.0307784728271506
 public             | checks_mailtestreport       |  3.57264e+06 | 925 MB     | 473 MB     | 8192 bytes | 452 MB  |    0.011123106378615
 public             | checks_domaintestreport     |  2.60565e+06 | 635 MB     | 320 MB     | 8192 bytes | 316 MB  |  0.00763777998400548
 public             | checks_webdomain            |  2.57425e+06 | 415 MB     | 120 MB     | 8192 bytes | 295 MB  |  0.00498495395282254
 public             | checks_mxdomain             |  2.30569e+06 | 414 MB     | 113 MB     | 8192 bytes | 301 MB  |  0.00498176105802767
 public             | checks_batchrequest         |        80158 | 206 MB     | 142 MB     |            | 64 MB   |  0.00247871935619221
```
The distribution by year for single TLS (largest table):
```
internetnl=> select extract(year from timestamp) as year, count(*) from checks_domaintesttls group by year order by year;
 year |  count
------+---------
 2015 |   32311
 2016 |  132845
 2017 |  823911
 2018 | 1287284
 2019 | 1350664
 2020 |  978265
 2021 | 1195942
 2022 | 1119097
```
Therefore, a rough estimate is that by keeping only the last 2 years in the active database, we can cut the records down to about a third of what they are now: (1195942 + 1119097) / 6920319 ≈ 33%.
The distribution by year for batch dnssec (largest table):
```
internetnl=> select extract(year from timestamp) as year, count(*) from checks_domaintestdnssec group by year order by year;
 year |  count
------+---------
 2020 |  994479
 2021 | 3714144
 2022 | 3923974
```
So there is a lot less to be gained here, though we could set the retention limit for batch to one year, cutting it to about 45% (3923974 / 8632597).
In December the steering committee agreed on a retention term of 2 years.
We discussed this several times and came up with a number of possible storage mechanisms for the archived data.
The main issue is that none of these formats is particularly friendly to making multiple exports in the future and merging them. However, a PostgreSQL dump is probably the friendliest to future schema changes (and it records the schema version in the django_migrations table data), so that is probably the best option for now.
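As a sketch of why that helps (assuming a standard Django schema), the schema version travels along in the dumped data and can be read back later:

```sql
-- Sketch: list the latest applied Django migration per app; a full
-- pg_dump carries this django_migrations data along, so a future
-- import knows which schema version the archive was made against.
-- (Relies on Django's zero-padded migration numbering, so max(name)
-- picks the most recent one.)
SELECT app, max(name) AS latest_migration
FROM django_migrations
GROUP BY app
ORDER BY app;
```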
Another inconvenience is that these dumps are hard to filter by date, whereas after dumping we probably want to delete only the records beyond a certain age.
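A minimal sketch of that pruning step, assuming the rows have been archived first and that anything referencing them (e.g. the report tables) is handled so foreign key constraints don't block the delete:

```sql
-- Sketch: prune single-test TLS results past the agreed 2-year
-- retention term, to be run only after the rows are safely archived.
BEGIN;
DELETE FROM checks_domaintesttls
WHERE timestamp < now() - interval '2 years';
COMMIT;
```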
The production instance uses 80 GB in /var and 50 GB for the database. This will continue to grow at an unknown rate, for reasons that might be nonsensical (such as excessive logging that is never pruned, or storing metrics that are not useful anymore).
A quick win could be to index the disk usage with `ncdu` and check if there is useless junk. Next is to inspect the database to see what can be reduced. As an ultimate measure, historic reports could be deleted, but exactly this history is reported as being (very) valuable to some end users.