uselagoon / lagoon

Lagoon, the developer-focused application delivery platform
https://docs.lagoon.sh/
Apache License 2.0

Project deletion is not atomic #3593

Open smlx opened 9 months ago

smlx commented 9 months ago

Describe the bug

Project deletion through the Lagoon API consists of (at least) the following steps:

  1. Remove project in API DB.
  2. Remove project from all groups in Keycloak.
  3. Remove the default project group in Keycloak.
  4. Remove the default user in Keycloak.

Because some of these resources reside in Keycloak rather than in the API DB, the steps cannot share a single database transaction, so the deletion as a whole is not atomic. A failure in one step does not fail the API call. That is, each individual operation is run in a "fire and forget" manner.
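To illustrate the difference, here is a minimal TypeScript sketch. It is not taken from the Lagoon codebase: `removeDefaultGroup`, `deleteProjectFireAndForget` and `deleteProjectAwaited` are hypothetical names, and the Keycloak call is replaced by a stub that always fails.

```typescript
// Hypothetical stand-in for a Keycloak call that fails (e.g. a network error).
async function removeDefaultGroup(projectName: string): Promise<void> {
  throw new Error(`network error removing default group of ${projectName}`);
}

// Collects what would normally go to the API log.
const logged: string[] = [];

// Fire and forget: the promise is not awaited, the failure is only logged,
// and the caller is told the deletion succeeded.
async function deleteProjectFireAndForget(name: string): Promise<string> {
  removeDefaultGroup(name).catch((err: Error) => {
    logged.push(err.message);
  });
  return "success";
}

// Awaited: the failure propagates, so the API call itself fails.
async function deleteProjectAwaited(name: string): Promise<string> {
  await removeDefaultGroup(name);
  return "success";
}
```

In the first variant the only trace of the failure is the log entry. In the second the caller at least learns that the deletion did not complete, although the steps that already ran are not undone.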

The problem is that when one or more operations fail, the system is left in an inconsistent state. This could cause a relatively benign problem where a project with the same name as a previously deleted project cannot be created because some remnant remains in the system. Alternatively, it could become a more serious problem if e.g. a new project is created that retains the attributes of a previously deleted project (e.g. group or organization membership).

To Reproduce

Steps to reproduce the behavior:

  1. Have some network calls between Lagoon API and Keycloak fail (this can happen during normal operations - networks are unreliable).
  2. Delete a project via the GraphQL API.
  3. Inspect the Lagoon API logs and see some errors during the operation.
  4. Inspect Lagoon API DB and possibly see the project still exists in the DB.
  5. Inspect Keycloak and possibly see the project default group still exists.

Expected behavior

Operations such as deleting a project should be atomic. Such operations should either succeed completely, or fail with an error (and roll back state). Otherwise the integrity of Lagoon's data model is compromised.
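One way to approximate "succeed completely, or fail and roll back" across two systems is to pair each step with a compensating action and undo the completed steps when a later one fails. The following is a rough sketch under that assumption, not a proposal for the actual implementation. `Step`, `runAllOrRollBack` and the in-memory `apiDb` and `keycloakGroups` sets are hypothetical stand-ins.

```typescript
// One step of the deletion, paired with an action that undoes it.
interface Step {
  name: string;
  run: () => Promise<void>;
  undo: () => Promise<void>;
}

// Runs steps in order. If one fails, undoes the completed steps in reverse
// order and then throws, so the caller sees the failure.
async function runAllOrRollBack(steps: Step[]): Promise<void> {
  const done: Step[] = [];
  for (const step of steps) {
    try {
      await step.run();
      done.push(step);
    } catch (err) {
      for (const prev of done.reverse()) {
        await prev.undo();
      }
      throw new Error(`step "${step.name}" failed (${String(err)}); rolled back ${done.length} step(s)`);
    }
  }
}

// In-memory stand-ins for the API DB and Keycloak (hypothetical).
const apiDb = new Set<string>(["demo"]);
const keycloakGroups = new Set<string>(["demo-default-group"]);

const steps: Step[] = [
  {
    name: "remove project from API DB",
    run: async () => { apiDb.delete("demo"); },
    undo: async () => { apiDb.add("demo"); },
  },
  {
    name: "remove default group in Keycloak",
    // Simulated failure; the group is never removed.
    run: async () => { throw new Error("network error"); },
    undo: async () => { keycloakGroups.add("demo-default-group"); },
  },
];
```

Calling `runAllOrRollBack(steps)` here rejects with an error, and the project is back in `apiDb` afterwards. A compensating action can itself fail, so this narrows the window for inconsistency rather than closing it entirely.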

Screenshots

Here's a project default group where the organization ID is retained, but there are no projects which are part of the group.

[Screenshot: screenshot_2023-11-14-134837]

Additional context

The specific problem in the screenshot seems to have occurred during project deletion because the deletion process first removes the project ID from all groups, and then removes the project default group. Either one of these operations can fail with only an error logged, and without returning any error to the API caller. In this case, it looks as though the first operation succeeded, while the second operation failed.

smlx commented 9 months ago

I stumbled across this project today, which implements transactional job processing in Go, and it reminded me of this issue. I assume there are similar libraries for Node?
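For context, a common shape for transactional job processing is to enqueue the follow-up work in the same database transaction as the primary change, and have a worker retry each job until it succeeds. The sketch below shows that idea with an in-memory store. It does not describe the linked project or any particular Node library; `Store`, `Job`, `deleteProject` and `runPendingJobs` are hypothetical names.

```typescript
type Job = { kind: string; project: string; attempts: number };

// In-memory stand-in for the API DB holding both projects and a job table.
class Store {
  projects = new Set<string>();
  jobs: Job[] = [];

  // Applies fn to a copy of the state and commits only if fn does not throw.
  transaction(fn: (tx: { projects: Set<string>; jobs: Job[] }) => void): void {
    const tx = {
      projects: new Set(this.projects),
      jobs: this.jobs.map((j) => ({ ...j })),
    };
    fn(tx);
    this.projects = tx.projects;
    this.jobs = tx.jobs;
  }
}

// Deletes the project and enqueues the Keycloak cleanup in one transaction,
// so either both are recorded or neither is.
function deleteProject(store: Store, project: string): void {
  store.transaction((tx) => {
    if (!tx.projects.delete(project)) {
      throw new Error(`no such project: ${project}`);
    }
    for (const kind of ["remove-from-groups", "remove-default-group", "remove-default-user"]) {
      tx.jobs.push({ kind, project, attempts: 0 });
    }
  });
}

// Worker pass: tries each pending job once and keeps failed jobs for retry.
function runPendingJobs(store: Store, handler: (job: Job) => void): void {
  const remaining: Job[] = [];
  for (const job of store.jobs) {
    try {
      handler(job);
    } catch {
      remaining.push({ ...job, attempts: job.attempts + 1 });
    }
  }
  store.jobs = remaining;
}
```

With this design a transient Keycloak failure leaves a pending job behind instead of an orphaned group. Job handlers would need to be idempotent, since a job may run more than once.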