openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Set up US rendering server on AWS #682

Closed: pnorman closed this 2 years ago

pnorman commented 2 years ago

Ref #637

Outstanding questions

MarkRose commented 2 years ago

EBS, if using GP3, is generally great. If you're looking for inexpensive high IOPS, the cheapest option is to create a bunch of small GP3 volumes, since each has a 3000 IOPS baseline, and combine them with RAID 0, LVM, or similar. ST1 gives consistent hard-drive-like performance, but latency is a bit higher than with locally attached hard drives. If mod_tile uses blocking IO for reads, and I imagine it does, you may find you need fewer Apache threads/processes to get the same request throughput with GP3.

m6a/c6a/r6a isn't always a savings over m6i/c6i/r6i due to performance differences in memory. I've seen the Xeon chips work out significantly cheaper than Epyc in some situations. I'd benchmark both if the Gravitons don't work out.

Be prepared for your instance to fail. It just happens. Most instances will stay up for years; others will have hardware issues. Sometimes you'll get a warning in the Events section of the EC2 console (also sent by email), giving you a few weeks to stop and start the instance. In other cases the recovery process will start the instance on new hardware. Any data on ephemeral storage would of course be gone, so that's a big negative for using locally attached storage for anything beyond a cache.
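
The scheduled-event warnings mentioned above can also be checked programmatically. The following is a minimal sketch, assuming a response shaped like the output of boto3's `ec2.describe_instance_status(IncludeAllInstances=True)`. The instance ID, dates, and the `pending_events` helper are hypothetical, and the AWS call itself is left out so the example runs without credentials.

```python
from datetime import datetime, timezone


def pending_events(status_response):
    """Return (instance_id, event_code, not_before) for each open scheduled event."""
    found = []
    for status in status_response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            # AWS prefixes the description of finished events with "[Completed]".
            if event.get("Description", "").startswith("[Completed]"):
                continue
            found.append((status["InstanceId"], event["Code"], event.get("NotBefore")))
    return found


# Hypothetical sample data; a real response would come from the EC2 API.
sample = {
    "InstanceStatuses": [
        {
            "InstanceId": "i-0123456789abcdef0",
            "Events": [
                {
                    "Code": "instance-retirement",
                    "Description": "The instance is running on degraded hardware",
                    "NotBefore": datetime(2023, 3, 1, tzinfo=timezone.utc),
                },
                {
                    "Code": "system-reboot",
                    "Description": "[Completed] Scheduled reboot",
                    "NotBefore": datetime(2023, 1, 5, tzinfo=timezone.utc),
                },
            ],
        }
    ]
}

for instance_id, code, not_before in pending_events(sample):
    print(instance_id, code, not_before.isoformat())
```

Running something like this from a monitoring job would surface a retirement notice even if the email is missed.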

Just some thoughts from someone who has been using EC2 for over a decade.

Firefishy commented 2 years ago

@MarkRose Thank you for the helpful insights.

pnorman commented 2 years ago

> ST1 gives consistent hard drive like performance but latency is a bit higher than locally attached hard drives.

Our sustained IOPS is 10k-20k, with peaks of 50k, so st1 isn't an option. My inclination is to start with a single maxed-out gp3 volume and, if necessary, split the tiles onto their own volume.

The big unknown to me is latency, not IOPS. I don't know how that's going to impact performance.
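
For comparison with the striping approach suggested earlier in the thread, here is a rough back-of-the-envelope sketch of how many baseline gp3 volumes it would take to cover these IOPS figures. It assumes only the 3000 IOPS per-volume baseline quoted above and perfect scaling across a stripe. It ignores per-instance EBS limits, provisioned IOPS above the baseline, and latency, so treat the numbers as a lower bound rather than a sizing recommendation.

```python
import math

GP3_BASELINE_IOPS = 3000  # per-volume baseline quoted earlier in the thread


def volumes_needed(target_iops, per_volume_iops=GP3_BASELINE_IOPS):
    """Smallest number of striped volumes whose combined baseline meets the target."""
    return math.ceil(target_iops / per_volume_iops)


# IOPS figures from the comment above.
for label, iops in [("sustained (low)", 10_000), ("sustained (high)", 20_000), ("peak", 50_000)]:
    print(f"{label}: {iops} IOPS -> {volumes_needed(iops)} baseline gp3 volumes")
```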

grischard commented 2 years ago

Solving this will also solve #637

grischard commented 2 years ago

Depends on #660?

pnorman commented 2 years ago

> Solving this will also solve https://github.com/openstreetmap/operations/issues/637

We're looking at replacing pyrene independent of this.

> Depends on #660?

No, although they share some common parts around changing our account management.

Firefishy commented 2 years ago

Account has been created. Accessible via assumed role from master account.

Firefishy commented 2 years ago

Design Considerations

  1. AWS Region
  2. Intel, AMD or ARM (Graviton)
  3. CPU cores
  4. RAM size
  5. PostgreSQL storage: Instance Store (local NVMe), EBS (GP2), EBS (GP3)
  6. PostgreSQL storage size
  7. PostgreSQL storage speed: IOPS and MiB/s (EBS GP3 mainly)
  8. Local tile cache storage: Instance Store (local NVMe), EBS (GP2), EBS (GP3)
  9. Local tile cache storage size
  10. Local tile cache storage speed: IOPS and MiB/s (EBS GP3 mainly)
  11. AWS Billing Alerts

Desired but not for initial launch:

Firefishy commented 2 years ago

Decisions:

  1. AWS Region: us-west-2
  2. TBC
  3. CPU cores: 64+
  4. RAM size: 250GB+

iandees commented 2 years ago

> AWS Region: us-west-2

That's in Oregon, probably very near the existing OSUOSL servers. Can I suggest us-east-2 (Ohio) or us-east-1 (Virginia) instead?

Firefishy commented 2 years ago

Instance choice for initial experiment: m6gd.16xlarge with Instance Store (local NVMe).
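
As an illustration of how this choice lines up with the earlier decisions (64+ CPU cores, 250GB+ RAM), here is a small sketch that filters a few m6gd sizes against those minimums. The vCPU and memory figures are recalled from AWS's published instance tables and should be verified before relying on them. Treating one vCPU as one core and converting GiB to GB are assumptions of this sketch.

```python
# Specs as recalled from AWS's published m6gd table; verify before relying on them.
CANDIDATES = {
    "m6gd.8xlarge": {"vcpus": 32, "memory_gib": 128},
    "m6gd.12xlarge": {"vcpus": 48, "memory_gib": 192},
    "m6gd.16xlarge": {"vcpus": 64, "memory_gib": 256},
}

GIB_TO_GB = 2**30 / 10**9  # 1 GiB is about 1.074 GB


def meets_requirements(spec, min_cores=64, min_ram_gb=250):
    """Check a spec against the minimums from the Decisions list."""
    ram_gb = spec["memory_gib"] * GIB_TO_GB
    return spec["vcpus"] >= min_cores and ram_gb >= min_ram_gb


suitable = [name for name, spec in CANDIDATES.items() if meets_requirements(spec)]
print(suitable)
```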

grischard commented 2 years ago

The reasoning for us-west-2 was carbon neutrality, but https://sustainability.aboutamazon.com/environment/the-cloud?energyType=true says us-east-1 and us-east-2 are 95% powered by renewables too. Initial choice for AWS region: us-east-2.

Firefishy commented 2 years ago

AWS Region: us-east-2
Elastic IP: 3.144.0.72
Instance Name: palulukon
Instance Type: m6gd.16xlarge

Firefishy commented 2 years ago

Initial basic AWS billing budget created: $1000/month. It alerts me, ops and @grischard.
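
For reference, a budget like this can also be defined in code. The sketch below only builds the request body in the shape accepted by boto3's `budgets` client `create_budget` call; it does not call AWS. The budget name, account ID, email address, and 80% alert threshold are all hypothetical placeholders, not the values actually configured here.

```python
def monthly_budget_request(account_id, limit_usd, emails, threshold_pct=80.0):
    """Build keyword arguments for a monthly cost budget with email alerts."""
    return {
        "AccountId": account_id,
        "Budget": {
            "BudgetName": "monthly-cost",  # hypothetical name
            "BudgetLimit": {"Amount": str(limit_usd), "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        "NotificationsWithSubscribers": [
            {
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": threshold_pct,  # hypothetical threshold
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [
                    {"SubscriptionType": "EMAIL", "Address": address}
                    for address in emails
                ],
            }
        ],
    }


# Placeholder account ID and address, for illustration only.
request = monthly_budget_request("123456789012", 1000, ["ops@example.org"])
print(request["Budget"]["BudgetLimit"])
```

With credentials in place, the result could be passed as `boto3.client("budgets").create_budget(**request)`.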

grischard commented 2 years ago

DNS records created for palulukon.openstreetmap.org

Firefishy commented 2 years ago

Base Chef setup is done; we're adding arm64 support for the Prometheus exporters.

Firefishy commented 2 years ago

Import is now running... Thank you to @pnorman

pnorman commented 2 years ago

Import completed. Pre-render took 1h47m, putting it at comparable performance to culebre and nidhogg, which have 2x 28-core AMD EPYC 7453. The new server is currently taking 38% of west-coast US load without issue, and as its tile store gets populated, I'll be adding more load to it.

Firefishy commented 2 years ago

AWS credits cannot be used to buy Savings Plans or Reserved Instances (Partial Upfront or All Upfront). It looks like No-Upfront Reserved Instances are allowed, but that would leave OSMF exposed for potentially the last 2 months of the 12-month reserved period (12 months is the minimum period offered for this instance type). Reserved Instance pricing is available here.

EC2 + bandwidth costs are currently around $115 per weekday. This is sufficiently covered by the credits, which expire on 30 September 2023, and leaves some headroom for bandwidth increases.

A remaining cost saving option (to allow more capacity) is to move to Spot Instances (~70% lower instance cost), but this would require additional DevOps investment to turn the "pet server" into "cattle", which is best handled by a separate ticket.
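
To give a feel for what the Spot option might save, here is a rough sketch. The $115 daily total and the ~70% discount come from the figures above. The split between instance and bandwidth cost is not stated in this thread, so the $70 instance share used below is purely an illustrative assumption, and the discount is applied to the instance portion only.

```python
def daily_cost_with_spot(total_daily, instance_daily, spot_discount=0.70):
    """Estimate daily cost if only the instance portion gets the Spot discount."""
    if not 0 <= instance_daily <= total_daily:
        raise ValueError("instance_daily must be between 0 and total_daily")
    return total_daily - instance_daily * spot_discount


TOTAL_DAILY = 115.0     # from the thread: EC2 + bandwidth per weekday
INSTANCE_DAILY = 70.0   # hypothetical share of the total, not a quoted figure

estimate = daily_cost_with_spot(TOTAL_DAILY, INSTANCE_DAILY)
print(f"Estimated daily cost on Spot: ${estimate:.2f}")
```

Bandwidth is unaffected by Spot pricing, so the larger the bandwidth share of the bill, the smaller the overall saving.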