Discover RDS deploys version out of sync with deployment updater

GavinFrazar commented 1 month ago

I had an ECS cluster deployed via the integration already, but I had manually scaled the service tasks to 0.

In the background, the AWS OIDC integration updater checks if it should update ECS task definitions to a newer version of Teleport. It checks for a cluster maintenance window every 30 minutes, and if it's in the window then it updates ECS deployments.

In this case, this morning the auto-updater created a new task revision for v16.3.0 and updated my ECS cluster to use the new revision. The prior revision used v16.2.2.

Today, I went through the discover flow again, and when I clicked "Deploy Teleport Service" it updated my ECS cluster service to run 2 tasks again, but it also created a new task definition that again used v16.2.2.

So the ECS task definition revisions look like this:

revision 1: teleport v16.2.2 (from when I created it the first time)
revision 2: teleport v16.3.0 (auto-updater created this morning)
revision 3: teleport v16.2.2 (redeployed teleport service this afternoon)

I found it quite surprising that my ECS service was downgraded.

We need to make sure the service deployment version matches what we use for the auto-updater.

Bug details:

Teleport version: 16.3.0

GavinFrazar commented 1 month ago

cc @marcoandredinis

marcoandredinis commented 1 month ago

The version check has slightly different conditions, but they seem to resolve to the same version 🤔

https://github.com/gravitational/teleport/blob/ca05dd139dd2cf7ed72fe5923f4d90e817efcfd8/lib/service/awsoidc.go#L196

https://github.com/gravitational/teleport/blob/b453967572bb2bf5f882d8553855ebcfdbf24e5d/lib/web/integrations_awsoidc.go#L152

Was this a self-hosted tenant? Maybe that's why

GavinFrazar commented 1 month ago

> The version check has slightly different conditions, but they seem to resolve to the same version 🤔
>
> https://github.com/gravitational/teleport/blob/ca05dd139dd2cf7ed72fe5923f4d90e817efcfd8/lib/service/awsoidc.go#L196
>
> https://github.com/gravitational/teleport/blob/b453967572bb2bf5f882d8553855ebcfdbf24e5d/lib/web/integrations_awsoidc.go#L152
>
> Was this a self-hosted tenant? Maybe that's why

It was a cloud staging tenant.

GavinFrazar commented 3 weeks ago

On cloud staging tenants we set a forwarding URL for the stable/cloud channel. It looks like this in my staging tenant proxy's ConfigMap:

      automatic_upgrades_channels:
        stable/cloud:
          forward_url: https://updates.releases.teleport.dev/v1/stable/cloud/v16

Our upgrader logic uses stable/cloud as the default channel on cloud tenants.
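To make the config above concrete, here is a minimal sketch of looking up a channel by name and deriving the URL its version would be read from. The `channel` struct and `versionEndpoint` function are hypothetical, and the `version` path suffix appended to `forward_url` is an assumption for illustration, not something confirmed in this thread.

```go
package main

import "net/url"

// channel mirrors one entry under automatic_upgrades_channels.
// Only the forward_url field from the snippet above is modelled.
type channel struct {
	ForwardURL string
}

// versionEndpoint returns the URL that would be queried for the target
// version of the named channel. The boolean is false when the channel is
// missing or has no forward URL. The "version" suffix is an assumption.
func versionEndpoint(channels map[string]channel, name string) (string, bool, error) {
	ch, ok := channels[name]
	if !ok || ch.ForwardURL == "" {
		return "", false, nil
	}
	u, err := url.JoinPath(ch.ForwardURL, "version")
	if err != nil {
		return "", false, err
	}
	return u, true, nil
}
```

The point of the sketch is only that the version served by the channel is resolved remotely, so it is independent of whatever binary version the proxy itself is running.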

cc @hugoShaka I see this note from you. Looks like you predicted this inconsistency 😄 could you weigh in on this?

https://github.com/gravitational/teleport/blob/fa859053599524a016acbd95a9b4dd482f679e5a/lib/service/awsoidc.go#L71-L78

GavinFrazar commented 3 weeks ago

I think this is what is happening:

(re-)deployment

Proxy chooses the version. It resolves the default version from its own config, which on cloud staging tenants points at the stable/cloud channel. That channel is configured with a forwarding URL and can lag behind the deployed version of Teleport.

Updater

Proxy chooses the version as well. However, it doesn't respect the proxy config, so it uses the deployed version of the proxy, i.e. api.Version, instead of stable/cloud.

Downgrading scenario

When I ran into this, I think the state of versions looked like this:

stable/cloud: was v16.2.2
api.Version (actual deployed proxy version): was v16.3.0

Timeline (event, version):

Initial manual deployment of ECS service, stable/cloud=v16.2.2 -> ECS task runs on v16.2.2
Auto-updater runs, api.Version=v16.3.0 -> ECS task upgrades to v16.3.0
Manual redeployment of ECS service, stable/cloud=v16.2.2 -> ECS task downgrades to v16.2.2
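The timeline above can be replayed with a small check that flags any revision older than the one before it. This is a sketch only: `parse`, `older`, and `downgrades` are hypothetical helpers, and the comparison handles plain `vMAJOR.MINOR.PATCH` strings rather than full semver.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parse splits a "vMAJOR.MINOR.PATCH" string into its three numeric parts.
func parse(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("unexpected version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("unexpected version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// older reports whether version a is strictly older than version b.
func older(a, b string) (bool, error) {
	pa, err := parse(a)
	if err != nil {
		return false, err
	}
	pb, err := parse(b)
	if err != nil {
		return false, err
	}
	for i := range pa {
		if pa[i] != pb[i] {
			return pa[i] < pb[i], nil
		}
	}
	return false, nil
}

// downgrades returns the 1-based revision numbers whose version is older
// than the version of the revision immediately before them.
func downgrades(revisions []string) ([]int, error) {
	var out []int
	for i := 1; i < len(revisions); i++ {
		isOlder, err := older(revisions[i], revisions[i-1])
		if err != nil {
			return nil, err
		}
		if isOlder {
			out = append(out, i+1)
		}
	}
	return out, nil
}
```

Calling `downgrades` with `[]string{"v16.2.2", "v16.3.0", "v16.2.2"}` flags revision 3, which is the manual redeployment.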

Fix

I think we need to make the auto-updater use the stable/cloud channel as well. It's already running on the proxy, so this is just a matter of reading the config to see whether we have a stable/cloud channel with a forwarding URL.
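One way to picture the fix is to route both code paths through a single resolver, so they cannot disagree. The sketch below is not the actual Teleport code: `versionSource`, `resolveTarget`, `deployVersion`, and `updaterVersion` are hypothetical names, and falling back to the proxy's own version when no channel is configured is an assumption.

```go
package main

import "errors"

// versionSource is a hypothetical stand-in for "ask the configured channel
// for its target version".
type versionSource func() (string, error)

// resolveTarget prefers the configured channel. When no channel is
// configured it falls back to the proxy's own version (an assumption).
func resolveTarget(channel versionSource, proxyVersion string) (string, error) {
	if channel == nil {
		return proxyVersion, nil
	}
	v, err := channel()
	if err != nil {
		return "", err
	}
	if v == "" {
		return "", errors.New("channel returned an empty version")
	}
	return v, nil
}

// deployVersion is what the (re-)deployment path would call.
func deployVersion(channel versionSource, proxyVersion string) (string, error) {
	return resolveTarget(channel, proxyVersion)
}

// updaterVersion is what the background updater would call. It delegates to
// the same resolver, so both paths always agree.
func updaterVersion(channel versionSource, proxyVersion string) (string, error) {
	return resolveTarget(channel, proxyVersion)
}
```

With the channel at v16.2.2 and the proxy at v16.3.0, both functions return v16.2.2, so a redeployment after an updater run no longer changes the version.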
