ooni / devops


Move test helpers from digital ocean to AWS #47

Closed hellais closed 5 months ago

hellais commented 5 months ago

The test helper rotation script is broken, and manual changes were made to DNS on 18th March 2024 to unbrick it: https://openobservatory.slack.com/archives/C38EJ0CET/p1710780947922739.

Following this incident, the NS delegation of th.ooni.org has been migrated over to AWS, which currently hosts the following A and AAAA records:

  * 0.th.ooni.org -> 146.190.119.3, 2604:a880:4:1d0::69e:f000
  * 1.th.ooni.org -> 161.35.89.250, 2a03:b0c0:2:d0::1768:9001
  * 2.th.ooni.org -> 161.35.89.250, 2a03:b0c0:2:d0::1768:9001
  * 3.th.ooni.org -> 146.190.119.3, 2604:a880:4:1d0::69e:f000

Note that 0 and 3 point to the same addresses, as do 1 and 2, because only 2 running VPS instances were left unbroken by the auto rotation script.

Plan for migration

We plan to migrate all these test helpers over to the AWS ECS based configuration, see: https://github.com/ooni/devops/blob/main/tf/environments/prod/main.tf#L505.

All of the hostnames above will be configured to point to the ALB entry (see: https://github.com/ooni/devops/blob/main/tf/modules/oonith_service/main.tf#L176) for the oonith_service as alias records (an alias effectively behaves like a CNAME, but costs less: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resource-record-sets-choosing-alias-non-alias.html).
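As an illustration of what such alias records could look like in Terraform, here is a minimal sketch. It is not copied from the ooni/devops repository: the variable names (`th_zone_id`, `alb_dns_name`, `alb_zone_id`) and the resource name `th_alias` are hypothetical, and only the A records are shown. An analogous AAAA alias would only resolve if the load balancer is dual-stack, which is an assumption not confirmed here.

```hcl
# Hypothetical inputs; in a real configuration these would likely come from
# the hosted zone and load balancer resources rather than plain variables.
variable "th_zone_id" {
  type        = string
  description = "Route 53 hosted zone ID for th.ooni.org (assumed to exist)"
}

variable "alb_dns_name" {
  type        = string
  description = "DNS name of the ALB in front of the test helper service"
}

variable "alb_zone_id" {
  type        = string
  description = "Canonical hosted zone ID of the ALB (not the th.ooni.org zone)"
}

# One alias A record per numbered test helper name: 0.th, 1.th, 2.th, 3.th.
resource "aws_route53_record" "th_alias" {
  for_each = toset(["0", "1", "2", "3"])

  zone_id = var.th_zone_id
  name    = "${each.key}.th.ooni.org"
  type    = "A"

  # An alias block replaces the usual records/ttl arguments.
  alias {
    name                   = var.alb_dns_name
    zone_id                = var.alb_zone_id
    evaluate_target_health = true
  }
}
```

Using `for_each` over the set of names keeps the four records identical apart from the hostname, so each one can be flipped individually by adding or removing its key.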

Checklist

hellais commented 5 months ago

The charts for the migration show that it worked well without any major issue.

The jumps in the chart were caused by two incidents during the migration:

  1. When we flipped 3.th, we hadn't first dropped it from the returned addresses, so there were about 15 minutes of unavailability while the flip took effect.
  2. We learned from that and flipped 0.th only after dropping it from the rotation and ensuring its traffic had dropped to near zero (https://github.com/ooni/backend/pull/838). However, a bug in the availability zone mapping led to 2 minutes of downtime: https://github.com/ooni/devops/pull/49.

Below I share the latest charts of the failure rates and measurement counts for the historical record:

[6 chart images: visualization (37) through visualization (42)]

I am now going to move forward with the final step, which is destroying the 2 remaining hosts on DigitalOcean.

hellais commented 5 months ago

The droplets have been deleted on DigitalOcean.