bajtos opened this issue 4 months ago
Asked on Slack whether the dashboard work will provide us with the total number of jobs performed. I believe this is the only metric that requires us to have infinite retention on the `station` bucket, which is the main cost factor.
The dashboard work will replace that 👍 So we can turn on a bucket retention policy after these metrics land.
In order to implement some quick cost reductions, I propose we perform a manual purge:

- [ ] Export measurements before April 1st from Influx
- [ ] Enable a 30d bucket retention policy

Wdyt @bajtos @patrickwoodhead?

As suggested by @bajtos: before deleting all measurements, store them in cold storage. Try compressing using https://facebook.github.io/zstd/.
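For the retention step, something along these lines should work with the InfluxDB v2 management API (a minimal sketch assuming the JS client; the env vars are placeholders, and the exact `BucketsAPI` signatures should be checked against the client docs):

```js
// Sketch: enable a 30-day expiry rule on the `station` bucket.
import { InfluxDB } from '@influxdata/influxdb-client'
import { BucketsAPI } from '@influxdata/influxdb-client-apis'

const influx = new InfluxDB({
  url: process.env.INFLUXDB_URL,
  token: process.env.INFLUXDB_TOKEN
})
const bucketsAPI = new BucketsAPI(influx)

// Look up the `station` bucket by name to get its ID
const { buckets } = await bucketsAPI.getBuckets({ name: 'station' })

// Add an expiry rule: drop data older than 30 days (expressed in seconds)
await bucketsAPI.patchBucketsID({
  bucketID: buckets[0].id,
  body: {
    retentionRules: [{ type: 'expire', everySeconds: 30 * 24 * 60 * 60 }]
  }
})
```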
The 30d retention will delete all data older than April 13th if enabled today. Your export will contain only data older than April 1st. We will lose measurements recorded between April 1st and 13th.
Is that okay? Did I miss something?
Of course 😅 OK, I will include more data in the export. I've updated the task list to go up to May 1st (to be sure).
Script used for the export, currently running: https://gist.github.com/juliangruber/cd50f1227d08e8b94d6b4b36620b4711
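For readers without access to the gist, the export presumably looks roughly like this (a sketch, not the actual gist contents; the bucket name, org, time range, and output path are assumptions):

```js
// Sketch: stream every row out of the `station` bucket and append it to an
// ndjson file, one JSON object per line.
import { createWriteStream } from 'node:fs'
import { InfluxDB } from '@influxdata/influxdb-client'

const influx = new InfluxDB({
  url: process.env.INFLUXDB_URL,
  token: process.env.INFLUXDB_TOKEN
})
const queryApi = influx.getQueryApi('my-org') // org name is a placeholder
const out = createWriteStream('measurements.ndjson')

const query = `
  from(bucket: "station")
    |> range(start: 2021-01-01T00:00:00Z, stop: 2024-05-01T00:00:00Z)
`

await new Promise((resolve, reject) => {
  queryApi.queryRows(query, {
    next (row, tableMeta) {
      // Convert the raw CSV row into a { _measurement, _field, _value, ... } object
      out.write(JSON.stringify(tableMeta.toObject(row)) + '\n')
    },
    error: reject,
    complete: resolve
  })
})
out.end()
```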
The export finished. The resulting ndjson file is 58GB in size. Something strange is going on, however: it's not including any records with `"_measurement":"jobs-completed"` and `"_field":"value"`. This is what we need to recreate the job count. My suspicion is that the export is incomplete. In comparison, a direct query for those records works.
I'm going to try to export only these records. I'm also going to repeat the export of everything, to see if it creates a different file.
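The per-record export presumably tallies rows and sums the jobs counter, along these lines (a sketch; the Flux filter mirrors the measurement and field named above, the `{ rows, jobs }` output matches the log lines below, and `queryApi` is as in the export sketch):

```js
// Sketch: tally jobs-completed records. Each row's _value is assumed to be
// a jobs-completed counter, so summing it approximates the total job count.
const query = `
  from(bucket: "station")
    |> range(start: 2021-01-01T00:00:00Z, stop: 2024-05-01T00:00:00Z)
    |> filter(fn: (r) => r._measurement == "jobs-completed" and r._field == "value")
`

let rows = 0
let jobs = 0
await new Promise((resolve, reject) => {
  queryApi.queryRows(query, {
    next (row, tableMeta) {
      rows += 1
      jobs += tableMeta.toObject(row)._value
    },
    error: reject,
    complete: resolve
  })
})
console.log(new Date().toISOString(), { rows, jobs })
```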
This export finished:

```
2024-05-14T23:32:57.724Z { rows: 23382470, jobs: 32774485 }
```
It suggests there were 32m jobs completed, while on the website we show 161m. I will repeat this export to see if it is deterministic.
Next run:

```
2024-05-15T13:40:09.512Z { rows: 23483927, jobs: 32840534 }
```
It's in the same ballpark, but not exact. Since no new events are being added to the old timeframe, this export mechanism is flawed. Let's check if we can do something else.
Next run:

```
2024-05-16T12:16:11.297Z { rows: 20583385, jobs: 29386919 }
```
This time I used async iteration instead of the `queryRows` function. It looked at significantly fewer measurements.
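The async-iteration variant looks roughly like this, reusing `queryApi` and `query` from the sketches above (a sketch; `iterateRows` is the client's AsyncIterable query API):

```js
// Sketch: the same tally via the client's async-iterable API instead of the
// queryRows callback consumer.
let rows = 0
let jobs = 0
for await (const { values, tableMeta } of queryApi.iterateRows(query)) {
  rows += 1
  jobs += tableMeta.toObject(values)._value
}
console.log(new Date().toISOString(), { rows, jobs })
```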
I assume we can improve our chances by performing many queries, maybe one for each day. I will try this now.
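A day-chunked loop might look like this (a sketch; the start/stop bounds are illustrative and `queryApi` is as above):

```js
// Sketch: one query per UTC day, so each request stays small and a failure
// only loses (at most) the day in flight.
const DAY = 24 * 60 * 60 * 1000
const stop = new Date('2024-06-01T00:00:00Z')

for (let day = new Date('2022-10-31T00:00:00Z'); day < stop; day = new Date(day.getTime() + DAY)) {
  const next = new Date(Math.min(day.getTime() + DAY, stop.getTime()))
  console.log({ day })
  const query = `
    from(bucket: "station")
      |> range(start: ${day.toISOString()}, stop: ${next.toISOString()})
      |> filter(fn: (r) => r._measurement == "jobs-completed" and r._field == "value")
  `
  for await (const { values, tableMeta } of queryApi.iterateRows(query)) {
    // ...write out / tally rows as in the sketches above...
  }
}
```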
The oldest row it can find is from 2022-11-05. We landed the telemetry commit on Oct 31st (https://github.com/filecoin-station/desktop/commit/6d135e6c4e57f7f0e48be5bdf5be6a8eb62f28a1). I don't know what this means.
Tools for uploading big files to w3s:
The script ran until `{ day: 2023-11-17T00:00:00.000Z }`, when we started receiving `429 Too Many Requests` / `org XYZ has exceeded limited_query plan limit`. I'm going to continue the script tomorrow with that date as the new starting point, and will merge the result with the previous export.
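To make such restarts cheap, the per-day loop can persist a checkpoint and exit cleanly when rate limited (a sketch; the hypothetical `exportDay()` stands in for the per-day query above, and `DAY`/`stop` are reused from that sketch):

```js
// Sketch: persist the last completed day so a rate-limited run can resume
// where it left off instead of restarting the whole export.
import { readFile, writeFile } from 'node:fs/promises'
import { HttpError } from '@influxdata/influxdb-client'

const checkpoint = await readFile('checkpoint.txt', 'utf8').catch(() => null)

try {
  for (let day = new Date(checkpoint ?? '2022-10-31T00:00:00Z'); day < stop; day = new Date(day.getTime() + DAY)) {
    await exportDay(day) // hypothetical wrapper around the per-day query
    await writeFile('checkpoint.txt', new Date(day.getTime() + DAY).toISOString())
  }
} catch (err) {
  if (err instanceof HttpError && err.statusCode === 429) {
    console.error('Rate limited, resume later from the checkpoint')
    process.exit(1)
  }
  throw err
}
```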
The 1TB disk instance ran out of space. It's currently on `2024-01-07`. I'm resizing the machine to 2TB, removing the incomplete day from the export, and then will let it continue.
The script is currently at `2024-01-29` (file size 1.4TB) and has until `2024-06-01` to run.
Up to `2024-03-03T17:14:30.236298505Z`, there were 567,564,171 jobs recorded in InfluxDB. This is how far my script reached before getting rate limited again. I'm now going to destroy the machine that holds this export.
I will now evaluate deleting these old rows; more work needs to be done before we can turn on a retention policy.
I have deleted all rows from the `station` bucket that were recorded before `2024-03-03T17:14:30.236298505Z`.
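For reference, a delete like this can be issued through the InfluxDB v2 delete API (a minimal sketch assuming the JS client; the org name and env vars are placeholders):

```js
// Sketch: delete everything in the `station` bucket recorded before the
// timestamp the export reached. start/stop bound the deletion window.
import { InfluxDB } from '@influxdata/influxdb-client'
import { DeleteAPI } from '@influxdata/influxdb-client-apis'

const influx = new InfluxDB({
  url: process.env.INFLUXDB_URL,
  token: process.env.INFLUXDB_TOKEN
})
const deleteAPI = new DeleteAPI(influx)

await deleteAPI.postDelete({
  org: 'my-org', // placeholder
  bucket: 'station',
  body: {
    start: '1970-01-01T00:00:00Z',
    stop: '2024-03-03T17:14:30.236298505Z'
  }
})
```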
From `2024-03-03T17:14:30.000Z` to `2024-03-03T18:27:50.000Z` there were 1,025,567 more jobs. I suspect we're getting rate limited again.
I have paused the script, as even with a 1s window it was bringing down the Influx cluster.
Tasks:

- [ ] Upload data export to w3s
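Assuming w3s here means web3.storage, the upload could look roughly like this with the w3up client (a sketch; the email, space DID, and file path are placeholders, and a multi-TB export would likely need to be split into chunks first; check the w3up docs for the current API):

```js
// Sketch: upload the (compressed) export to web3.storage with w3up-client.
import { create } from '@web3-storage/w3up-client'
import { filesFromPaths } from 'files-from-path'

const client = await create()
await client.login('me@example.com') // placeholder account
await client.setCurrentSpace('did:key:...') // placeholder space DID

const [file] = await filesFromPaths(['measurements.ndjson.zst'])
const cid = await client.uploadFile(file)
console.log('stored as', cid.toString())
```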