m-lab / siteinfo

M-Lab Public Site Information Automation
Apache License 2.0

Adds placeholder sites for new production MIGs #327

Closed nkinkade closed 6 months ago

nkinkade commented 6 months ago

I'm trying this as a new strategy for deploying virtual sites. The old way was to create the resources in GCP using terraform-support, note the static IPs that GCP allocated, and then create the siteinfo records. The problem with this is that once the GCP resources are created, they join the cluster and start running workloads. Some of these workloads fail (namely uuid-annotator) because the site doesn't yet exist in siteinfo. This causes production alerts that cannot be managed with GMX, since GMX will not put anything into maintenance that doesn't exist in siteinfo.

The idea here is to create the site in siteinfo first, with an RFC1918 IPv4 address and no IPv6 address, and push this through to production. Once this is in production, an operator should place the sites into GMX maintenance mode, and then run the terraform-support deployment to actually create the resources.
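As a rough illustration of what "placeholder" means here, the sketch below checks whether a site's addresses match the pattern described above: an RFC1918 IPv4 address and no IPv6 address. The function names `is_rfc1918` and `is_placeholder` are hypothetical and are not part of siteinfo; this is only a minimal sketch using the Python standard library, assuming addresses are available as plain strings.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """Return True if address is an IPv4 address inside an RFC 1918 range."""
    ip = ipaddress.ip_address(address)
    if ip.version != 4:
        return False
    return any(ip in net for net in RFC1918_NETWORKS)


def is_placeholder(ipv4, ipv6):
    """Hypothetical helper: a placeholder site has a private IPv4 and no IPv6."""
    return ipv6 is None and ipv4 is not None and is_rfc1918(ipv4)


if __name__ == "__main__":
    # Example addresses are made up for illustration.
    print(is_placeholder("10.0.0.1", None))
    print(is_placeholder("8.8.8.8", None))
```

The explicit network list is used instead of `ip.is_private` because `is_private` also covers loopback, link-local, and other reserved ranges that are not RFC1918.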

This is still not perfect because deploying siteinfo triggers a build in prometheus-support, which updates monitoring targets. Monitoring will then start probing machines that don't exist, at non-public IP addresses. This may not cause alerts to fire, but it has the potential to. Additionally, GMX only reloads siteinfo data about every 5h, with an actual window of anywhere from 1 to 24h. This means that even though the machines are in siteinfo, GMX may still refuse to add them to maintenance until it has refreshed its copy of siteinfo. In these cases it may be necessary to manually restart GMX in production so that it has the latest siteinfo data and the new sites can be put into maintenance.

