projectsyn / lieutenant-api

The Project Syn Kubernetes Cluster and Tenants Inventory API
https://docs.syn.tools/lieutenant-api/
BSD 3-Clause "New" or "Revised" License
9 stars 1 forks source link

Race conditions when creating cluster: 409 response or missing installURL #229

Open davidgubler opened 2 weeks ago

davidgubler commented 2 weeks ago

There are at least two race conditions when creating a cluster:

I assume the two problems are related.

Steps to Reproduce the Problem

Create a cluster with valid data, e.g. the problem has been reproduced by POSTing {"id":"c-unicornfarts7","displayName":"Unicorn FARTS 7!","annotations":{},"tenant":"t-sg-test","facts":{"cloud":"c","service_level":"best_effort","access_policy":"regular","release_channel":"a","sales_order":"NONE","distribution":"b"},"gitRepo":{"type":"auto"}} to /clusters.

Being a race condition the problem is hard to reproduce reliably.

Actual Behavior

Sometimes it works, sometimes I get back a cluster that doesn't have an installURL, sometimes I get back a 409 conflict with body {"reason":"Operation cannot be fulfilled on clusters.syn.tools 'c-unicornfarts7': the object has been modified; please apply your changes to the latest version and try again"}

Expected Behavior

I reliably get back a cluster object with installURL set.

Moreover the 409 response code does not appear in https://syn.tools/lieutenant-api/references/index.html#_createcluster . Whether that's an implementation bug or a documentation bug I don't know, but it should be fixed one way or another.

bastjan commented 2 weeks ago

The API does not do an atomic create (two requests): https://github.com/projectsyn/lieutenant-api/blob/7254d6b756f02098c0d4cd77fcc68664ca74326f/pkg/service/cluster.go#L113 https://github.com/projectsyn/lieutenant-api/blob/7254d6b756f02098c0d4cd77fcc68664ca74326f/pkg/service/cluster.go#L117

And it also does not wait for the install token to be set.

The status update can be removed most likely. And the API should either wait for the token to be set or we do the token creation in a mutating webhook. I believe the mutating webhook is the nicer way to use the manifests but the lieutenant code is such a horrible mess we might just wait in the API.

Previously it might have worked ok since the time between the two requests was just enough to create the token.