mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

Consider various approaches to reducing AWS cost #260

Closed rahulbot closed 8 months ago

rahulbot commented 8 months ago

We want to reduce the ongoing monthly off-site storage costs as much as possible. Two main tasks here: (1) audit AWS bill/use to reduce cost as much as possible and (2) consider alternative off-site storage services.

Notes on the first task (audit AWS):

Notes on the second task (consider alternatives):

Sharing monthly and/or yearly numbers would be great.

philbudne commented 8 months ago

The "Idle public IPv4 address per hour" $4 item is almost certainly an unused "elastic" (static) IP address.

philbudne commented 8 months ago

500GB for each new month seems to be a good estimate for WARC files:

root@ramos:/srv/data/docker/indexer/worker_data/archiver# du -hs 202?/[01]*
291G    2023/11
473G    2023/12
455G    2024/01
480G    2024/02
98G     2024/03

Multiplying that out: 12 months * 500G/month / 1024 => 5.9T/year. From web-search, historic data looks like it has higher volume, so those numbers may be low, but given that the currently loaded data (and associated WARC files) cover almost 12 months, I think 6TB/year is a reasonable figure for WARC growth.
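As a cross-check on the estimate above, here is a minimal sketch that recomputes the yearly figure from the `du` output. The choice of which months count as complete (`full_months`) is an assumption on my part, as are all the variable and function names; `du -h` reports in powers of 1024, so the sizes are treated as GiB.

```python
# Monthly WARC directory sizes in GiB, copied from the du output above.
monthly_gib = {
    "2023/11": 291,
    "2023/12": 473,
    "2024/01": 455,
    "2024/02": 480,
    "2024/03": 98,
}

# Assumption: only these months cover a full calendar month of archiving;
# 2023/11 and 2024/03 look partial and are left out of the average.
full_months = ["2023/12", "2024/01", "2024/02"]


def yearly_tib(gib_per_month: float) -> float:
    """Project twelve months of growth and convert GiB to TiB."""
    return 12 * gib_per_month / 1024


measured_avg = sum(monthly_gib[m] for m in full_months) / len(full_months)

print(f"measured average: {measured_avg:.0f} GiB/month")
print(f"yearly at measured average: {yearly_tib(measured_avg):.1f} TiB")
print(f"yearly at 500 GiB/month: {yearly_tib(500):.1f} TiB")
```

Under that assumption the measured average comes out a little below 500G/month, so 500G/month (about 5.9T/year) reads as a slightly conservative round number rather than an underestimate for the current data.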

thepsalmist commented 8 months ago
  1. On the pending IPv4 address: I had cleaned up the others on 07/12/2023. Releasing the remaining one raises the error below (it is locked to the AWS account), so we will need to raise a ticket with AWS for this.

3.222.XXX.232: The address with allocation id [eipalloc-01f7f53e7bb256fe3] cannot be released because it is locked to your account. Please contact AWS Support to unlock it

  2. On the EBS snapshots: snapshots that can be deleted include:
     - mediacloud-icinga: was attached to the icinga instance
     - mediacloud-herewegoagain-frontend: was attached to the frontend EC2 instance; the frontend apps were migrated long ago
     - mediacloud-herewegoagain-data-dokku
     - mediacloud-herewegoagain-misc-v1: was the root volume of the misc instance (Docker Swarm manager)

The mediacloud-corrupt-postgresql-XXX snapshots are from the Postgres chunks (Database B, C, D). The required stories from these have been extracted to the respective S3 buckets. Unless there is a foreseeable need to look at these database sections, they can be deleted.


thepsalmist commented 8 months ago

Deleted S3 buckets as per the Excel file.

thepsalmist commented 8 months ago

@thepsalmist to do cost analysis on Backblaze

thepsalmist commented 8 months ago

Backblaze storage cost: $6 per TB/month

Total storage cost for the current S3 data at that rate = 109.2 TB * $6/TB = $655.20 per month

Current AWS cost S3 = $2438.60 (Mar 2024)

Based on the transfer requests (Tier 1: 397,127; Tier 2: 109,969,437), Backblaze would cost $43.69 vs. $45.98 on AWS.
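Since monthly and yearly numbers were requested, here is a rough sketch of the comparison using only the figures quoted in this comment. It assumes the March 2024 bill is representative of a typical month and that the request figures are monthly; it is not clear from the bill summary whether the $2438.60 already includes request charges, so storage and requests are kept separate. All names are illustrative.

```python
from decimal import Decimal

# Figures quoted above; Decimal avoids float rounding on currency.
B2_PRICE_PER_TB = Decimal("6")        # Backblaze, USD per TB per month
STORED_TB = Decimal("109.2")          # data currently held in S3
AWS_S3_MONTHLY = Decimal("2438.60")   # AWS S3 cost, Mar 2024
B2_REQUESTS = Decimal("43.69")        # estimated request cost on Backblaze
AWS_REQUESTS = Decimal("45.98")       # request cost on AWS

# Assumption: March 2024 is a typical month, so x12 gives a yearly figure.
b2_storage_monthly = STORED_TB * B2_PRICE_PER_TB
storage_saving_monthly = AWS_S3_MONTHLY - b2_storage_monthly
storage_saving_yearly = 12 * storage_saving_monthly
request_saving_monthly = AWS_REQUESTS - B2_REQUESTS

print(f"Backblaze storage: ${b2_storage_monthly:.2f}/month")
print(f"Storage saving:    ${storage_saving_monthly:.2f}/month, "
      f"${storage_saving_yearly:.2f}/year")
print(f"Request saving:    ${request_saving_monthly:.2f}/month")
```

On these numbers nearly all of the difference comes from storage pricing; the request costs are close to equal on both services. Egress and one-off migration costs are not included.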

rahulbot commented 8 months ago

FYI: AWS responded on the DTO request saying it is all-or-nothing. To get the free data transfer out, you have to take everything out of AWS, which is confusing. The result is that in the short term we won't get any credits to support re-indexing costs.

Closing this issue as no longer active because we've taken actions or split off to new to-do items.