bajtos opened this issue 4 months ago
Asked on Slack whether the dashboard work will provide us with the total number of jobs performed. I believe this is the only metric that requires us to have infinite retention on the `station` bucket, which is the main cost factor.
The dashboard work will replace that 👍 So we can turn on a bucket retention policy after these metrics land.
In order to implement some quick cost reductions, I propose we perform a manual purge:

- [ ] Export measurements before April 1st from Influx
- [ ] Enable a 30d bucket retention policy

Wdyt @bajtos @patrickwoodhead?

As suggested by @bajtos: before deleting all measurements, store them in cold storage. Try compressing using https://facebook.github.io/zstd/.
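For the retention step, something along these lines should work with the InfluxDB v2 management API (a minimal sketch assuming the JS client; the env vars are placeholders, and the exact `BucketsAPI` signatures should be checked against the client docs):

```js
// Sketch: enable a 30-day expiry rule on the `station` bucket.
import { InfluxDB } from '@influxdata/influxdb-client'
import { BucketsAPI } from '@influxdata/influxdb-client-apis'

const influx = new InfluxDB({
  url: process.env.INFLUXDB_URL,
  token: process.env.INFLUXDB_TOKEN
})
const bucketsAPI = new BucketsAPI(influx)

// Look up the `station` bucket by name to get its ID
const { buckets } = await bucketsAPI.getBuckets({ name: 'station' })

// Add an expiry rule: drop data older than 30 days (expressed in seconds)
await bucketsAPI.patchBucketsID({
  bucketID: buckets[0].id,
  body: {
    retentionRules: [{ type: 'expire', everySeconds: 30 * 24 * 60 * 60 }]
  }
})
```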
The 30d retention will delete all data older than April 13th if enabled today. Your export will contain only data older than April 1st. We will lose measurements recorded between April 1st and 13th.
Is that okay? Did I miss something?
Of course 😅 OK, I will include more data in the export. I've updated the task list to go up to May 1st (to be sure).
Script used for the export, currently running: https://gist.github.com/juliangruber/cd50f1227d08e8b94d6b4b36620b4711
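For readers without access to the gist, the export presumably looks roughly like this (a sketch, not the actual gist contents; the bucket name, org, time range, and output path are assumptions):

```js
// Sketch: stream every row out of the `station` bucket and append it to an
// ndjson file, one JSON object per line.
import { createWriteStream } from 'node:fs'
import { InfluxDB } from '@influxdata/influxdb-client'

const influx = new InfluxDB({
  url: process.env.INFLUXDB_URL,
  token: process.env.INFLUXDB_TOKEN
})
const queryApi = influx.getQueryApi('my-org') // org name is a placeholder
const out = createWriteStream('measurements.ndjson')

const query = `
  from(bucket: "station")
    |> range(start: 2021-01-01T00:00:00Z, stop: 2024-05-01T00:00:00Z)
`

await new Promise((resolve, reject) => {
  queryApi.queryRows(query, {
    next (row, tableMeta) {
      // Convert the raw CSV row into a { _measurement, _field, _value, ... } object
      out.write(JSON.stringify(tableMeta.toObject(row)) + '\n')
    },
    error: reject,
    complete: resolve
  })
})
out.end()
```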
The export finished. The resulting ndjson file is 58GB in size. Something strange is going on, however: it's not including any records with `"_measurement":"jobs-completed"` and `"_field":"value"`. This is what we need to recreate the job count. My suspicion is that the export is incomplete. In comparison, a direct query for those records works.
I'm going to try to export only these records. I'm also going to repeat the export of everything, to see if it creates a different file.
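The per-record export presumably tallies rows and sums the jobs counter, along these lines (a sketch; the Flux filter mirrors the measurement and field named above, the `{ rows, jobs }` output matches the log lines below, and `queryApi` is as in the export sketch):

```js
// Sketch: tally jobs-completed records. Each row's _value is assumed to be
// a jobs-completed counter, so summing it approximates the total job count.
const query = `
  from(bucket: "station")
    |> range(start: 2021-01-01T00:00:00Z, stop: 2024-05-01T00:00:00Z)
    |> filter(fn: (r) => r._measurement == "jobs-completed" and r._field == "value")
`

let rows = 0
let jobs = 0
await new Promise((resolve, reject) => {
  queryApi.queryRows(query, {
    next (row, tableMeta) {
      rows += 1
      jobs += tableMeta.toObject(row)._value
    },
    error: reject,
    complete: resolve
  })
})
console.log(new Date().toISOString(), { rows, jobs })
```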
This export finished:

```
2024-05-14T23:32:57.724Z { rows: 23382470, jobs: 32774485 }
```
It suggests there were 32m jobs completed, while on the website we show 161m. I will repeat this export to see if it is deterministic.
Next run:

```
2024-05-15T13:40:09.512Z { rows: 23483927, jobs: 32840534 }
```
It's in the same ballpark, but not exact. Since no new events are being added to the old timeframe, this export mechanism is flawed. Let's check if we can do something else.
Next run:

```
2024-05-16T12:16:11.297Z { rows: 20583385, jobs: 29386919 }
```
This time I used async iteration instead of the `queryRows` function. It looked at significantly fewer measurements.
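The async-iteration variant looks roughly like this, reusing `queryApi` and `query` from the sketches above (a sketch; `iterateRows` is the client's AsyncIterable query API):

```js
// Sketch: the same tally via the client's async-iterable API instead of the
// queryRows callback consumer.
let rows = 0
let jobs = 0
for await (const { values, tableMeta } of queryApi.iterateRows(query)) {
  rows += 1
  jobs += tableMeta.toObject(values)._value
}
console.log(new Date().toISOString(), { rows, jobs })
```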
I assume we can improve our chances by performing many queries, maybe one for each day. I will try this now.
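A day-chunked loop might look like this (a sketch; the start/stop bounds are illustrative and `queryApi` is as above):

```js
// Sketch: one query per UTC day, so each request stays small and a failure
// only loses (at most) the day in flight.
const DAY = 24 * 60 * 60 * 1000
const stop = new Date('2024-06-01T00:00:00Z')

for (let day = new Date('2022-10-31T00:00:00Z'); day < stop; day = new Date(day.getTime() + DAY)) {
  const next = new Date(Math.min(day.getTime() + DAY, stop.getTime()))
  console.log({ day })
  const query = `
    from(bucket: "station")
      |> range(start: ${day.toISOString()}, stop: ${next.toISOString()})
      |> filter(fn: (r) => r._measurement == "jobs-completed" and r._field == "value")
  `
  for await (const { values, tableMeta } of queryApi.iterateRows(query)) {
    // ...write out / tally rows as in the sketches above...
  }
}
```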
The oldest row it can find is from 2022-11-05. We landed the telemetry commit on Oct 31st (https://github.com/filecoin-station/desktop/commit/6d135e6c4e57f7f0e48be5bdf5be6a8eb62f28a1). I don't know what this means.
Tools for uploading big files to w3s:
The script ran until `{ day: 2023-11-17T00:00:00.000Z }`, when we started receiving `429 Too Many Requests` / `org XYZ has exceeded limited_query plan limit`. I'm going to continue the script tomorrow with that date as the new starting point, and will merge the result with the previous export.
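To make such restarts cheap, the per-day loop can persist a checkpoint and exit cleanly when rate limited (a sketch; the hypothetical `exportDay()` stands in for the per-day query above, and `DAY`/`stop` are reused from that sketch):

```js
// Sketch: persist the last completed day so a rate-limited run can resume
// where it left off instead of restarting the whole export.
import { readFile, writeFile } from 'node:fs/promises'
import { HttpError } from '@influxdata/influxdb-client'

const checkpoint = await readFile('checkpoint.txt', 'utf8').catch(() => null)

try {
  for (let day = new Date(checkpoint ?? '2022-10-31T00:00:00Z'); day < stop; day = new Date(day.getTime() + DAY)) {
    await exportDay(day) // hypothetical wrapper around the per-day query
    await writeFile('checkpoint.txt', new Date(day.getTime() + DAY).toISOString())
  }
} catch (err) {
  if (err instanceof HttpError && err.statusCode === 429) {
    console.error('Rate limited, resume later from the checkpoint')
    process.exit(1)
  }
  throw err
}
```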
The 1TB disk instance ran out of space. It's currently on `2024-01-07`. I'm resizing the machine to 2TB, removing the incomplete day from the export, and then will let it continue.
The script is currently at `2024-01-29` (file size 1.4TB) and has until `2024-06-01` to run.
Up to `2024-03-03T17:14:30.236298505Z`, there were 567,564,171 jobs recorded in InfluxDB. This is how far my script reached before getting rate limited again. I'm now going to destroy the machine that holds this export.
I will now evaluate deleting these old rows; more work needs to be done before we can turn on a retention policy.
I have deleted all rows from the `station` bucket that were recorded before `2024-03-03T17:14:30.236298505Z`.
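For reference, a delete like this can be issued through the InfluxDB v2 delete API (a minimal sketch assuming the JS client; the org name and env vars are placeholders):

```js
// Sketch: delete everything in the `station` bucket recorded before the
// timestamp the export reached. start/stop bound the deletion window.
import { InfluxDB } from '@influxdata/influxdb-client'
import { DeleteAPI } from '@influxdata/influxdb-client-apis'

const influx = new InfluxDB({
  url: process.env.INFLUXDB_URL,
  token: process.env.INFLUXDB_TOKEN
})
const deleteAPI = new DeleteAPI(influx)

await deleteAPI.postDelete({
  org: 'my-org', // placeholder
  bucket: 'station',
  body: {
    start: '1970-01-01T00:00:00Z',
    stop: '2024-03-03T17:14:30.236298505Z'
  }
})
```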
From `2024-03-03T17:14:30.000Z` to `2024-03-03T18:27:50.000Z` there were 1,025,567 more jobs. I suspect we're getting rate limited again.
I have paused the script, as even with a 1s window it was bringing down the Influx cluster.
Tasks:

- [ ] Upload data export to w3s
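Assuming w3s here means web3.storage, the upload could look roughly like this with the w3up client (a sketch; the email, space DID, and file path are placeholders, and a multi-TB export would likely need to be split into chunks first; check the w3up docs for the current API):

```js
// Sketch: upload the (compressed) export to web3.storage with w3up-client.
import { create } from '@web3-storage/w3up-client'
import { filesFromPaths } from 'files-from-path'

const client = await create()
await client.login('me@example.com') // placeholder account
await client.setCurrentSpace('did:key:...') // placeholder space DID

const [file] = await filesFromPaths(['measurements.ndjson.zst'])
const cid = await client.uploadFile(file)
console.log('stored as', cid.toString())
```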