Closed GavinFrazar closed 3 weeks ago
cc @marcoandredinis
The version check has slightly different conditions, but they seem to resolve to the same version 🤔 https://github.com/gravitational/teleport/blob/ca05dd139dd2cf7ed72fe5923f4d90e817efcfd8/lib/service/awsoidc.go#L196 https://github.com/gravitational/teleport/blob/b453967572bb2bf5f882d8553855ebcfdbf24e5d/lib/web/integrations_awsoidc.go#L152
Was this a self-hosted tenant? Maybe that's why
It was a cloud staging tenant.
On cloud staging tenant we set a forwarding url for the stable/cloud channel. It looks like this in my staging tenant proxy's ConfigMap:
    automatic_upgrades_channels:
      stable/cloud:
        forward_url: https://updates.releases.teleport.dev/v1/stable/cloud/v16
Our upgrader logic uses stable/cloud as the default channel on cloud tenants.
cc @hugoShaka I see this note from you. Looks like you predicted this inconsistency 😄 could you weigh in on this?
I think this is what is happening:
When deploying the service:
Proxy chooses the version.
It gets the default version based on its own config, which on cloud staging tenants is stable/cloud.
The stable/cloud version is configured with a forwarding url and can lag behind the deployed version of teleport.
In the auto-updater:
Proxy chooses the version as well.
However, it doesn't respect proxy config, so it uses the deployed version of the proxy, i.e. api.Version instead of stable/cloud.
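To make the mismatch concrete, here is a minimal Go sketch of the two code paths as I understand them from the description above. It is not the actual Teleport code: `channelVersions`, `deployServiceVersion`, and `autoUpdaterVersion` are hypothetical names, the channel lookup is reduced to a map, and the fallback to the proxy version when no channel is configured is my assumption.

```go
package main

import "fmt"

// channelVersions is a hypothetical stand-in for resolving each configured
// automatic_upgrades_channels entry to the version it currently advertises.
type channelVersions map[string]string

// deployServiceVersion mimics the "Deploy Teleport Service" path: it uses the
// stable/cloud channel when one is configured. Falling back to the proxy's
// own version otherwise is an assumption made for this sketch.
func deployServiceVersion(channels channelVersions, proxyVersion string) string {
	if v, ok := channels["stable/cloud"]; ok {
		return v
	}
	return proxyVersion
}

// autoUpdaterVersion mimics the auto-updater path: it ignores the channel
// configuration and always uses the proxy's own version (api.Version).
func autoUpdaterVersion(_ channelVersions, proxyVersion string) string {
	return proxyVersion
}

func main() {
	channels := channelVersions{"stable/cloud": "v16.2.2"}
	proxyVersion := "v16.3.0" // stands in for api.Version

	fmt.Println("deploy service:", deployServiceVersion(channels, proxyVersion))
	fmt.Println("auto-updater:  ", autoUpdaterVersion(channels, proxyVersion))
}
```

With the channel lagging behind the deployed proxy, the two functions return different versions for the same cluster, which is the inconsistency in question.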
When I ran into this, I think the state of versions looked like this:
stable/cloud: was v16.2.2
api.Version (actual deployed proxy version): was v16.3.0
Timeline (event, version):
stable/cloud=v16.2.2 -> ECS task runs on v16.2.2
api.Version=v16.3.0 -> ECS task upgrades to v16.3.0
stable/cloud=v16.2.2 -> ECS task downgrades to v16.2.2
I think we need to make the auto-updater use the stable/cloud channel as well.
It's already running on the proxy, so this is just a matter of reading the config to see whether we have a stable/cloud channel and forwarding url.
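A rough Go sketch of that proposal, assuming both the deploy path and the auto-updater call one shared resolver. `targetVersion` and `defaultChannel` are hypothetical names for illustration, and the map stands in for whatever the proxy resolves from its channel config and forwarding url.

```go
package main

import "fmt"

// defaultChannel is the channel name from the ConfigMap in this thread.
const defaultChannel = "stable/cloud"

// targetVersion is a hypothetical shared resolver: it prefers the version
// advertised by the default channel and falls back to the proxy's own
// version only when that channel is not configured.
func targetVersion(channels map[string]string, proxyVersion string) string {
	if v, ok := channels[defaultChannel]; ok && v != "" {
		return v
	}
	return proxyVersion
}

func main() {
	channels := map[string]string{defaultChannel: "v16.2.2"}
	proxyVersion := "v16.3.0" // stands in for api.Version

	// Replay the three events from the timeline with one shared resolver.
	for _, event := range []string{"initial deploy", "auto-update", "redeploy"} {
		fmt.Printf("%-14s -> %s\n", event, targetVersion(channels, proxyVersion))
	}
}
```

Because every event goes through the same function, the ECS task would stay on the channel version instead of bouncing between v16.2.2 and v16.3.0.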
I had an ECS cluster deployed via the integration already, but I had manually scaled the service tasks to 0.
In the background, the AWS OIDC integration updater checks if it should update ECS task definitions to a newer version of Teleport. It checks for a cluster maintenance window every 30 minutes, and if it's in the window then it updates ECS deployments.
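For readers unfamiliar with that loop, a simplified Go sketch of a periodic maintenance-window check follows. It is illustrative only and not the real updater: `hourInWindow` and `runUpdater` are hypothetical names, the window is modeled as a UTC start hour plus a duration in hours, and the ticks are fed from a channel so the example runs instantly (a real loop would presumably read from `time.NewTicker(checkInterval).C`).

```go
package main

import (
	"fmt"
	"time"
)

// checkInterval is the polling period mentioned in the issue.
const checkInterval = 30 * time.Minute

// hourInWindow reports whether hour (0-23) falls inside a window that starts
// at startHour and lasts durationHours, including windows that wrap midnight.
func hourInWindow(hour, startHour, durationHours int) bool {
	offset := (hour - startHour + 24) % 24
	return offset < durationHours
}

// runUpdater calls update for every tick that lands inside the window and
// returns when the ticks channel is closed.
func runUpdater(ticks <-chan time.Time, startHour, durationHours int, update func()) {
	for now := range ticks {
		if hourInWindow(now.UTC().Hour(), startHour, durationHours) {
			update()
		}
	}
}

func main() {
	// Two sample ticks: one inside a 02:00-03:00 UTC window, one outside it.
	ticks := make(chan time.Time, 2)
	ticks <- time.Date(2024, 9, 1, 2, 30, 0, 0, time.UTC)
	ticks <- time.Date(2024, 9, 1, 9, 0, 0, 0, time.UTC)
	close(ticks)

	runUpdater(ticks, 2, 1, func() { fmt.Println("updating ECS deployments") })
}
```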
In this case, this morning the auto-updater created a new task revision for v16.3.0 and updated my ECS cluster to use the new revision. The prior revision used v16.2.2.
Today, I went through the discover flow again, and when I clicked "Deploy Teleport Service" it updated my ECS cluster service to run 2 tasks again, but it also created a new task definition that again used v16.2.2.
So the ECS task definition revisions look like this:
revision 1: teleport v16.2.2 (from when I created it the first time)
revision 2: teleport v16.3.0 (auto-updater created this morning)
revision 3: teleport v16.2.2 (redeployed teleport service this afternoon)
I found it quite surprising that my ECS service was downgraded.
We need to make sure the service deployment version matches what we use for the auto-updater.
Bug details: