mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

migrate more backups to BackBlaze to reduce costs #291

Open rahulbot opened 4 months ago

rahulbot commented 4 months ago

Following up on #270, we want to continue migrating backups from S3 to B2. This should include:

Update 2024-06-26: all production stacks (daily, 2022 csv, and 2022 rss) are now writing to both S3 and B2.

philbudne commented 4 months ago
  1. Do you want to migrate old rss-fetcher PG dumps to B2?
  2. Thoughts about a retention policy? (I wrote a program that can keep N of yearly, monthly, weekly (Sunday), and daily dumps.)
  3. Re: RSS files: make a public (mediacloud-public) bucket? Subdir names: daily-rss (for rss-fetcher), legacy-rss (for the legacy system)?
  4. Web app PG dumps: migrate the old ones? Retention policy? (See 1 and 2 above.)
rahulbot commented 4 months ago
  1. Old rss-fetcher PG dumps: I don't think we need them, though it might be good to grab a handful and transfer them for longevity: perhaps the first of each month in 2024 so far?
  2. Retention: for rss-fetcher and web-app my first thought is: last ∞ yearly (i.e. all), last 6 monthly, last 8 weekly, last 30 daily (see the sketch after this list). Totally open to alternatives.
  3. Synthetic RSS files: your suggestion sounds great. It isn't codified anywhere, but I do feel we have a responsibility to keep our "daily discovered URL" files available publicly in perpetuity. And to be honest, moving the server from S3 to B2 will surface any users we don't know about who are consuming them, which will be good to know about.
  4. Web-app dumps: I'd treat these the same way I suggest for (1) and (2) above, i.e. grab a reasonable set of monthlies to migrate and then apply the same retention policy.
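
A minimal sketch of that retention rule in Python, assuming dumps are identified purely by date and treating the January 1 dump as the yearly, the first-of-month dump as the monthly, and Sunday dumps as the weeklies (this is not the actual program philbudne mentions, just an illustration of the proposed policy):

```python
from datetime import date

def dumps_to_keep(dump_dates: list[date], today: date) -> set[date]:
    """Return the subset of dump dates to retain under the proposed policy:
    all yearlies, last ~6 monthlies, last 8 weeklies (Sundays), last 30 dailies."""
    keep: set[date] = set()
    for d in dump_dates:
        age_days = (today - d).days
        if d.month == 1 and d.day == 1:                # yearly: keep all of them
            keep.add(d)
        elif d.day == 1 and age_days <= 6 * 31:        # monthly: last ~6 months
            keep.add(d)
        elif d.weekday() == 6 and age_days <= 8 * 7:   # weekly (Sunday): last 8 weeks
            keep.add(d)
        elif age_days <= 30:                           # daily: last 30 days
            keep.add(d)
    return keep
```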
philbudne commented 4 months ago

Today I tried creating a public mediacloud-public bucket, but the API call failed with an error that the account email address had not been verified.
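
For reference, creating a public bucket through the B2 native API with b2sdk looks roughly like the following (a sketch; the key ID and application key are placeholders, and this is not necessarily the exact call that failed):

```python
from b2sdk.v2 import B2Api, InMemoryAccountInfo

info = InMemoryAccountInfo()
api = B2Api(info)
# Placeholders: use a real application key ID / key for the account.
api.authorize_account("production", "B2_KEY_ID", "B2_APPLICATION_KEY")
# "allPublic" makes objects readable without authorization.
bucket = api.create_bucket("mediacloud-public", "allPublic")
print(bucket.id_)
```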

philbudne commented 4 months ago

Regarding WARC files:

There are about 10K WARC files, taking up 1.8 TB, on ramos (November 2023 through early March 2024). There are about 75K WARC files in the S3 mediacloud-indexer-archive bucket, taking up about 13 TB.

So we could be talking about $1000 to transfer the WARC files we don't have locally.
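
The ~$1000 figure is presumably dominated by S3 data-transfer-out charges; a back-of-the-envelope check, assuming roughly $0.09/GB egress (the exact rate depends on region and volume tier):

```python
total_tb = 13.0            # WARC data in the S3 mediacloud-indexer-archive bucket
local_tb = 1.8             # WARC data already held locally on ramos
egress_usd_per_gb = 0.09   # assumed S3 data-transfer-out price

to_transfer_gb = (total_tb - local_tb) * 1024
print(f"~${to_transfer_gb * egress_usd_per_gb:,.0f} to pull the remaining WARCs out of S3")
```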

philbudne commented 4 months ago

Now writing new current-day WARC files to both B2 and S3
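
A minimal sketch of what dual-writing a WARC file can look like with boto3, assuming B2 is reached through its S3-compatible endpoint; the bucket names and credentials below are placeholders, not the production configuration:

```python
import boto3

s3 = boto3.client("s3")  # AWS credentials from the usual env/config chain
b2 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-east-005.backblazeb2.com",  # B2 S3-compatible endpoint
    aws_access_key_id="B2_KEY_ID",               # placeholder B2 application key ID
    aws_secret_access_key="B2_APPLICATION_KEY",  # placeholder B2 application key
)

warc_file = "mc-20240626.warc.gz"  # hypothetical current-day WARC
for client, bucket in ((s3, "mediacloud-indexer-archive"), (b2, "mediacloud-indexer-archive-b2")):
    client.upload_file(warc_file, bucket, warc_file)
```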

rahulbot commented 4 months ago

> I tried creating a public mediacloud-public bucket, but the API call failed with an error that the account email address had not been verified.

@philbudne I was able to poke around the settings page and verify my email. Please test again at your convenience and let me know if still fails.

philbudne commented 3 months ago

Did a bit of googling on how to set ES to use a specific S3 API URL for Backblaze:

https://github.com/elastic/elasticsearch/issues/21283#issuecomment-828002399

B2 has an S3-compatible API. It works fine for us. We are using a snapshot repository like this:

{
  "type": "s3",
  "settings": {
    "bucket": "elastic-backup",
    "region": "",
    "endpoint": "s3.us-west-001.backblazeb2.com"
  }
}

In our case the endpoint would be s3.us-east-005.backblazeb2.com
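
Registering a repository against that endpoint would then be an ordinary PUT to the snapshot API, roughly as below (the repository and bucket names are hypothetical, and the B2 keys are assumed to already be in the Elasticsearch keystore as s3.client.default.access_key / s3.client.default.secret_key):

```python
import requests

ES_URL = "http://localhost:9200"  # assumes ES is reachable here

repo = {
    "type": "s3",
    "settings": {
        "bucket": "mediacloud-es-snapshots",  # hypothetical B2 bucket name
        "region": "",
        "endpoint": "s3.us-east-005.backblazeb2.com",
    },
}
resp = requests.put(f"{ES_URL}/_snapshot/b2_backup", json=repo, timeout=30)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}
```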

pgulley commented 2 months ago

I've broken out the task of "closing s3 writes" into a new issue (#316). I'll leave this one as a reference to the longer-term task of extracting data from S3 once we're no longer writing to it.