mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

migrate more backups to BackBlaze to reduce costs #291

Open rahulbot opened 4 months ago

rahulbot commented 4 months ago

Following up on #270, we want to continue migrating backups from S3 to B2. This should include:

Update 2024-06-26: all production stacks (daily, 2022 csv, and 2022 rss) are now writing to both S3 and B2.

philbudne commented 4 months ago
  1. Do you want to migrate old rss-fetcher PG dumps to B2?
  2. Thoughts about a retention policy? (I wrote a program that can keep N of yearly, monthly, weekly (Sunday), and daily dumps.)
  3. Re: RSS files: make a public (mediacloud-public) bucket? Subdir names: daily-rss (for rss-fetcher), legacy-rss (for the legacy system)?
  4. Web app PG dumps: migrate the old ones? Retention policy? (See 1 and 2 above.)
rahulbot commented 4 months ago
  1. Old rss-fetcher PG dumps: I don't think we need them, though it might be good to grab a handful and transfer them for longevity: perhaps the first of each month in 2024 so far?
  2. Retention: for rss-fetcher and web-app my first thought is: last ∞ yearly (i.e. all), last 6 monthly, last 8 weekly, last 30 daily (see the sketch after this list). Totally open to alternatives.
  3. Synthetic RSS files: your suggestion sounds great. It isn't codified anywhere, but I do feel we have a responsibility to keep our "daily discovered URL" files available publicly in perpetuity. And to be honest, moving the server from S3 to B2 will surface any users we don't know about who are consuming them, which will be good to know about.
  4. Web-app dumps: I'd treat these the same way I suggest for (1) and (2) above, i.e. grab a reasonable set of monthlies to migrate and then apply the same retention policy.
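
A minimal sketch of that retention rule in Python, assuming dumps are identified purely by date and treating the January 1 dump as the yearly, the first-of-month dump as the monthly, and Sunday dumps as the weeklies (this is not the actual program philbudne mentions, just an illustration of the proposed policy):

```python
from datetime import date

def dumps_to_keep(dump_dates: list[date], today: date) -> set[date]:
    """Return the subset of dump dates to retain under the proposed policy:
    all yearlies, last ~6 monthlies, last 8 weeklies (Sundays), last 30 dailies."""
    keep: set[date] = set()
    for d in dump_dates:
        age_days = (today - d).days
        if d.month == 1 and d.day == 1:                # yearly: keep all of them
            keep.add(d)
        elif d.day == 1 and age_days <= 6 * 31:        # monthly: last ~6 months
            keep.add(d)
        elif d.weekday() == 6 and age_days <= 8 * 7:   # weekly (Sunday): last 8 weeks
            keep.add(d)
        elif age_days <= 30:                           # daily: last 30 days
            keep.add(d)
    return keep
```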
philbudne commented 4 months ago

Today I tried creating a public mediacloud-public bucket, but the API call failed with an error that the account email address had not been verified.
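
For reference, creating a public bucket through the B2 native API with b2sdk looks roughly like the following (a sketch; the key ID and application key are placeholders, and this is not necessarily the exact call that failed):

```python
from b2sdk.v2 import B2Api, InMemoryAccountInfo

info = InMemoryAccountInfo()
api = B2Api(info)
# Placeholders: use a real application key ID / key for the account.
api.authorize_account("production", "B2_KEY_ID", "B2_APPLICATION_KEY")
# "allPublic" makes objects readable without authorization.
bucket = api.create_bucket("mediacloud-public", "allPublic")
print(bucket.id_)
```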

philbudne commented 4 months ago

Regarding WARC files:

There are about 10K WARC files, taking up 1.8 TB, on ramos (November 2023 through early March 2024). There are about 75K WARC files in the S3 mediacloud-indexer-archive bucket, taking up about 13 TB.

So we could be talking about $1000 to transfer the WARC files we don't have locally.
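
The ~$1000 figure is presumably dominated by S3 data-transfer-out charges; a back-of-the-envelope check, assuming roughly $0.09/GB egress (the exact rate depends on region and volume tier):

```python
total_tb = 13.0            # WARC data in the S3 mediacloud-indexer-archive bucket
local_tb = 1.8             # WARC data already held locally on ramos
egress_usd_per_gb = 0.09   # assumed S3 data-transfer-out price

to_transfer_gb = (total_tb - local_tb) * 1024
print(f"~${to_transfer_gb * egress_usd_per_gb:,.0f} to pull the remaining WARCs out of S3")
```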

philbudne commented 4 months ago

Now writing new current-day WARC files to both B2 and S3
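
A minimal sketch of what dual-writing a WARC file can look like with boto3, assuming B2 is reached through its S3-compatible endpoint; the bucket names and credentials below are placeholders, not the production configuration:

```python
import boto3

s3 = boto3.client("s3")  # AWS credentials from the usual env/config chain
b2 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-east-005.backblazeb2.com",  # B2 S3-compatible endpoint
    aws_access_key_id="B2_KEY_ID",               # placeholder B2 application key ID
    aws_secret_access_key="B2_APPLICATION_KEY",  # placeholder B2 application key
)

warc_file = "mc-20240626.warc.gz"  # hypothetical current-day WARC
for client, bucket in ((s3, "mediacloud-indexer-archive"), (b2, "mediacloud-indexer-archive-b2")):
    client.upload_file(warc_file, bucket, warc_file)
```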

rahulbot commented 4 months ago

> I tried creating a public mediacloud-public bucket, but the API call failed with an error that the account email address had not been verified.

@philbudne I was able to poke around the settings page and verify my email. Please test again at your convenience and let me know if still fails.

philbudne commented 3 months ago

Did a bit of googling on how to set ES to use a specific S3 API URL for Backblaze:

https://github.com/elastic/elasticsearch/issues/21283#issuecomment-828002399

B2 has an S3-compatible API. It works fine for us. We are using a snapshot repository like this:

{
  "type": "s3",
  "settings": {
    "bucket": "elastic-backup",
    "region": "",
    "endpoint": "s3.us-west-001.backblazeb2.com"
  }
}

In our case the endpoint would be s3.us-east-005.backblazeb2.com
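
Registering a repository against that endpoint would then be an ordinary PUT to the snapshot API, roughly as below (the repository and bucket names are hypothetical, and the B2 keys are assumed to already be in the Elasticsearch keystore as s3.client.default.access_key / s3.client.default.secret_key):

```python
import requests

ES_URL = "http://localhost:9200"  # assumes ES is reachable here

repo = {
    "type": "s3",
    "settings": {
        "bucket": "mediacloud-es-snapshots",  # hypothetical B2 bucket name
        "region": "",
        "endpoint": "s3.us-east-005.backblazeb2.com",
    },
}
resp = requests.put(f"{ES_URL}/_snapshot/b2_backup", json=repo, timeout=30)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}
```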

pgulley commented 2 months ago

I've broken out the task of "closing s3 writes" into a new issue (#316). I'll leave this one as a reference to the longer-term task of extracting data from S3 once we're no longer writing to it.