quintilesims / layer0

Build, Manage, and Deploy Your Applications
Apache License 2.0

Investigate potential subnet/AZ trouble #646

Open tlake opened 4 years ago

tlake commented 4 years ago

We've observed services flapping with L0 v0.11.0. It seems that a service is sometimes brought up in a subnet that isn't one of the load balancer's subnets, which causes the health check to fail and the task to be terminated and restarted. The flapping continues until the service happens to be brought up in a subnet associated with the load balancer, or until a user manually adds the missing subnet to the load balancer.
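To help confirm the symptom, something like the following could compare the load balancer's subnets against the subnets where tasks can land. This is only a sketch: `missingSubnets` and the subnet IDs are made up for illustration, and the two lists would have to be gathered separately (for example from the AWS console or CLI).

```go
package main

import (
	"fmt"
	"sort"
)

// missingSubnets returns the IDs in placementSubnets that are not attached
// to the load balancer. Hypothetical helper, not part of Layer0; both
// arguments are plain lists of subnet IDs collected by other means.
func missingSubnets(lbSubnets, placementSubnets []string) []string {
	attached := make(map[string]bool, len(lbSubnets))
	for _, id := range lbSubnets {
		attached[id] = true
	}

	seen := make(map[string]bool)
	missing := []string{}
	for _, id := range placementSubnets {
		// Report each unattached subnet once, even if it is listed twice.
		if !attached[id] && !seen[id] {
			seen[id] = true
			missing = append(missing, id)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Placeholder IDs, not taken from a real environment.
	lb := []string{"subnet-aaa", "subnet-bbb"}
	placement := []string{"subnet-aaa", "subnet-bbb", "subnet-ccc"}
	fmt.Println(missingSubnets(lb, placement)) // prints [subnet-ccc]
}
```

A non-empty result would name the subnets in which a task can start but never pass the load balancer's health check.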

The list of subnets is generated by l0-setup and then output as environment variables. It's possible that a bug in l0-setup has gone unnoticed until now.

A comment on the getSubnetsAndAvailZones() function in api/backend/ecs/load_balancer_manager.go may be worth investigating:

// this is awkward, strongly assumes that PrivateSubnets will be distributed across AZs,
// using each at most once.  We error out on bad config for now, in the future we'll
// need to do something to calculate which subnets to use based on where the instance
// got provisioned.
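The assumption that comment describes (each availability zone used by at most one subnet, erroring out otherwise) can be sketched as a standalone check. This is not the actual getSubnetsAndAvailZones() implementation; the `subnet` type, `validateOneSubnetPerAZ`, and the sample IDs and zones are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// subnet pairs a subnet ID with its availability zone.
// Hypothetical type for this sketch only.
type subnet struct {
	ID string
	AZ string
}

// validateOneSubnetPerAZ returns an error if any availability zone is used
// by more than one subnet, mirroring the "error out on bad config" behavior
// that the comment describes.
func validateOneSubnetPerAZ(subnets []subnet) error {
	countByAZ := make(map[string]int)
	for _, s := range subnets {
		countByAZ[s.AZ]++
	}

	var repeated []string
	for az, n := range countByAZ {
		if n > 1 {
			repeated = append(repeated, az)
		}
	}
	if len(repeated) == 0 {
		return nil
	}
	sort.Strings(repeated) // stable message regardless of map iteration order
	return fmt.Errorf("availability zones used by more than one subnet: %v", repeated)
}

func main() {
	// Placeholder values, not taken from a real environment.
	subnets := []subnet{
		{ID: "subnet-aaa", AZ: "us-west-2a"},
		{ID: "subnet-bbb", AZ: "us-west-2b"},
		{ID: "subnet-ccc", AZ: "us-west-2a"},
	}
	if err := validateOneSubnetPerAZ(subnets); err != nil {
		fmt.Println("bad config:", err)
	}
}
```

Running a check like this against the subnet list that l0-setup produces would show whether the configuration still satisfies the assumption.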

We're not sure what changed between v0.10.10 and v0.11.0 that would have made this a problem, but we haven't ruled out a change between those versions either.

It might also be worth investigating whether the AWS Terraform provider behaves differently between Terraform v0.11.x and v0.12.x. If the provider's underlying logic has changed, and only the v0.12.x provider has those changes, that could be the source of our troubles. If so, solving the problem would require making Layer0 compatible with Terraform v0.12.x.