open-telemetry / community

OpenTelemetry community content
https://opentelemetry.io
Apache License 2.0

Incident Report: Teams were deleted during trial deployment of CLOWarden. #2356

Open austinlparker opened 2 days ago

austinlparker commented 2 days ago

This issue briefly covers the incident that took place on 09/18/24 starting at 8:49 AM Pacific/11:49 AM Eastern. All times in the document will be in Eastern time.

Timeline of events

Approx. 11:45 AM - A change was pushed to the OpenTelemetry CLOWarden configuration to address a new configuration file location (open-telemetry/people). This configuration file was deliberately limited -- it defined one team and one repository. Prior to this change, a separate installation of CLOWarden in a different organization had been used to validate the expected outcome, with no issues.

11:48 AM - A PR was opened against the new config.yaml file. This change added a single user. CLOWarden picked up this change and began to validate the current state vs. the desired state. For an unknown reason, the behavior of CLOWarden differed in the live environment, and it began to apply changes to the current state of the open-telemetry organization prior to the PR being approved.

11:49 AM - Teams are deleted via CLOWarden and the GitHub API.

11:49 AM - Infra SIG (@austinlparker and @trask) pull the most recent CLOWarden state backup (from community/config.yaml) and begin to address statefile issues in order to re-create teams from this backup.

~12:20 PM - The statefile passes validation, but CLOWarden fails to process it due to a lack of repository information in the config.yaml. Out of an abundance of caution, we decide that we do not want to add repository information without fully understanding the application code, since the behavior we had just observed was unexpected. A Rust SME (@TommyCpp) is invited into the SIG call and suggests that we pursue an alternative restoration strategy via a Python script, removing CLOWarden from the loop entirely. A brief discussion ensues, and all parties agree on this course of action.

~1:45 PM - A script is developed and tested to re-create the teams based on the backup directly via GitHub API.

~1:49 PM - The team recreation job begins. We begin to work on restoring team/repository permissions.

2:30 PM - A script to repair repository/team permissions has been written and tested. Beginning to roll out updates.

2:39 PM - Permissions have been modified for all repositories based on the backup and we have confirmed this in the GitHub UI.
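
For reference, here is a minimal sketch of the kind of restore scripts described above, assuming a simplified backup format (teams with maintainers/members lists, plus per-repository team permissions) and the PyGithub client. The file name, schema, and field names are assumptions for illustration; the actual scripts are not reproduced in this report.

```python
# Hypothetical restore sketch -- not the actual scripts used during the incident.
# Assumes a simplified backup schema; the real community/config.yaml may differ.
import os
import yaml
from github import Auth, Github

gh = Github(auth=Auth.Token(os.environ["GITHUB_TOKEN"]))
org = gh.get_organization("open-telemetry")

with open("config.yaml.backup") as f:  # assumed local copy of the July backup
    backup = yaml.safe_load(f)

# 1) Re-create missing teams and their memberships.
existing = {t.name for t in org.get_teams()}
for team_def in backup.get("teams", []):
    name = team_def["name"]
    if name in existing:
        print(f"skipping {name}: already exists")
        continue
    team = org.create_team(name, privacy="closed")
    for login in team_def.get("maintainers", []):
        team.add_membership(gh.get_user(login), role="maintainer")
    for login in team_def.get("members", []):
        team.add_membership(gh.get_user(login), role="member")
    print(f"re-created {name}")

# 2) Restore team permissions on each repository.
for repo_def in backup.get("repositories", []):
    repo = org.get_repo(repo_def["name"])
    for team_slug, permission in repo_def.get("teams", {}).items():
        org.get_team_by_slug(team_slug).set_repo_permission(repo, permission)
        print(f"set {permission} for {team_slug} on {repo_def['name']}")
```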

Current Status

This incident is now closed. This thread will be left open for discussion until we have a more comprehensive retrospective and learnings available, which will be posted on opentelemetry.io. Thank you for your patience as we resolved this issue.

trask commented 2 days ago

it began to apply changes to the current state of the open-telemetry organization prior to the PR being approved

this is the part that most baffles me

mtwo commented 2 days ago

Thanks @austinlparker and @trask for working so quickly to fix this! Infra / tooling changes like this are always fragile and are often impossible to test, especially when they do things like that ^

trask commented 2 days ago

TODO

dyladan commented 2 days ago

~Looks to me like the JS maintainer and approver teams were not recreated~


edit: I was wrong, it's javascript not js

marcalff commented 2 days ago

Also missing configuration-approvers, which was created very recently.

TommyCpp commented 2 days ago

Missing rust-approvers and rust-maintainers

Huh, it's there, right?

austinlparker commented 2 days ago

Missing rust-approvers and rust-maintainers

https://github.com/orgs/open-telemetry/teams?query=rust

Looks like they're there?

austinlparker commented 2 days ago

@open-telemetry/rust-approvers @open-telemetry/rust-maintainers

trask commented 2 days ago

TODO

  • review teams that were created since the backup that we used from 2 months ago
  • ping all teams here to ask them to validate their current membership since the backup we used was from 2 months ago

it looks like we can get all of the changes since the backup via the GitHub audit logs.

I'll work on applying these changes...
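
For illustration, below is a hedged sketch of pulling team-related events from the organization audit log via the REST API (the audit-log endpoint requires GitHub Enterprise Cloud and an admin token; the search phrase and payload field names are assumptions, and the review may just as well have been done in the audit log UI):

```python
# Illustrative only -- not the exact query used. Lists team membership events
# recorded in the org audit log since the July 12 backup.
import os
import requests

url = "https://api.github.com/orgs/open-telemetry/audit-log"
headers = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}
# 'action:team' restricts results to team events; 'created' bounds them to the
# window since the backup (phrase follows the audit log search qualifier syntax).
params = {"phrase": "action:team created:>=2024-07-12", "per_page": 100}

events = []
while url:
    resp = requests.get(url, headers=headers, params=params)
    resp.raise_for_status()
    events.extend(resp.json())
    url = resp.links.get("next", {}).get("url")  # follow Link-header pagination
    params = None  # the "next" URL already carries the query string

for event in events:
    if event.get("action") in ("team.add_member", "team.remove_member"):
        # Field names ('team', 'user') are assumptions about the payload shape.
        print(event["action"], event.get("team"), event.get("user"))
```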

austinlparker commented 2 days ago

Thanks, @trask!

florianl commented 2 days ago

ebpf-profiler-approvers and ebpf-profiler-maintainers (please see also recent changes from https://github.com/open-telemetry/opentelemetry-ebpf-profiler/pull/151) seem to be missing for https://github.com/open-telemetry/opentelemetry-ebpf-profiler/.

trask commented 2 days ago

all changes to team membership since the July 12 backup have been applied (from the github audit log)

looking now to see if I can get repo permissions updates from the audit log...

trask commented 2 days ago

everything should be resolved now

please report any issues here, thanks!

yurishkuro commented 2 days ago
  1. Could we link to the PRs mentioned in the timeline, to better understand the changes?
  2. Was CLOWarden run from a workflow or as an external app?
  3. Is there a dry-run option in CLOWarden where it logs what changes it would execute?
austinlparker commented 2 days ago
  1. Could we link to the PRs mentioned in the timeline, to better understand the changes?

  2. Was CLOWarden run from a workflow or as an external app?

  3. Is there a dry-run option in CLOWarden where it logs what changes it would execute?

  1. This PR (now repurposed) shows the changes from CLOWarden's perspective: https://github.com/open-telemetry/gh-manager/pull/2#issuecomment-2358838971 - I linked to the specific comment it gave when modifying the limited state file; you can look at it in the PR history. Subsequent commits/messages are from when we tried to recover using CLOWarden vs. writing our own script.

  2. CLOWarden is run as an external application that watches PRs in specified repositories.

  3. No, there is no dry-run option that I'm aware of or could find. This is why we tried a small test on a separate organization. The setup for that test was an organization with an existing team that was not managed in the state file. A change was made to add a new team. In that instance, it did not automatically delete the existing team. The behavior in today's case was rather surprising given both our prior tests and our expectations for how a desired state engine would function (for example, we expected that it would not make changes without human intervention/approval for a PR, which is what the docs suggest) lol

tegioz commented 1 day ago

Hi 👋

This is Sergio, one of the CLOWarden maintainers.

First of all, I'm sorry you ran into some problems while testing CLOWarden. Please see my response here for more details about this incident: https://github.com/cncf/clowarden/issues/261#issuecomment-2361087358

tegioz commented 1 day ago
  1. No, there is no dry-run option that I'm aware of or could find.

You can simulate a dry-run by using the diff subcommand of the CLI tool.

The behavior in today's case was rather surprising given both our prior tests and our expectations for how a desired state engine would function (for example, we expected that it would not make changes without human intervention/approval for a PR, which is what the docs suggest).

From CLOWarden's README file, in the How it works section:

CLOWarden's main goal is to ensure that the resources' desired state, as defined in the configuration files, matches the actual state in the corresponding services. This is achieved by running reconciliation jobs, that can be triggered on-demand or periodically.

CLOWarden monitors pull requests created in the configuration repository and, when applicable, it validates the proposed changes to access rules and creates reconciliation jobs to apply the necessary changes. This is what we call an on-demand reconciliation job: it's created as a result of a user's submitted pull request, and changes are applied immediately once the pull request is approved and merged.

Sometimes, however, this may not be enough. Changes can be applied manually to the service bypassing the configuration files (i.e. from the GitHub settings UI), and CLOWarden still needs to make sure that the actual state matches the desired state documented in the configuration files. So in addition to on-demand reconciliation jobs, CLOWarden runs periodic ones to ensure everything is all right all the time.

In any case, we'll try to improve the docs to make it clearer how it works (i.e. highlighting some important parts) and avoid potential problems for other users in the future.

austinlparker commented 1 day ago

I didn't see a way to configure the periodic reconciliation, or even an explanation of when it runs. Respectfully, I have some concerns about the design of CLOWarden where it is capable of mass destructive actions periodically without any confirmation from the user, even if there's a command line version. When I ran the diff (which is how we had the backup to begin with), CLOWarden complained about it (you can see the complaints in the linked PR above).

Ultimately, we did not fully understand or appreciate how CLOWarden functioned here, and did not have the requisite expertise in Rust to really audit the code or appreciate its functionality. I believe that better documentation would have helped here, or perhaps the ability to clearly configure/understand the periodic reconciliation loop.

tegioz commented 1 day ago

Respectfully, I have some concerns about the design of CLOWarden where it is capable of mass destructive actions periodically without any confirmation from the user, even if there's a command line version.

I understand, that's fine. We needed to be able to remove teams/permissions/etc. that were manually added from the GitHub UI without maintainer interaction or approval; it was a requirement for us.

When I ran the diff (which is how we had the backup to begin with), CLOWarden complained about it (you can see the complaints in the linked PR above).

Could you point me to those logs please? I saw the clowarden-server logs in the issue you created, but I didn't see any run of the diff subcommand (dry-run). Would love to see what it reported.

I mean the output of this command, which should detail all changes that would be applied when the service is launched (all resources to be deleted in this case):

export GITHUB_TOKEN=...
clowarden-cli diff --org open-telemetry --repo gh-manager --branch main
austinlparker commented 1 day ago

I don't have the logs for the diff command, unfortunately; I ran it several weeks ago.

yurishkuro commented 1 day ago

Respectfully, I have some concerns about the design of CLOWarden where it is capable of mass destructive actions periodically without any confirmation from the user, even if there's a command line version.

Isn't that what you would expect from the automation? The config becomes the source of truth, and if a change to the config is merged the automation should apply it without additional manual confirmation. Are we saying that in this case the config accurately represented the state of the world (SOTW) but CLOWarden went and erased everything?

austinlparker commented 1 day ago

Respectfully, I have some concerns about the design of CLOWarden where it is capable of mass destructive actions periodically without any confirmation from the user, even if there's a command line version.

Isn't that what you would expect from the automation? The config becomes the source of truth, and if a change to the config is merged the automation should apply it without additional manual confirmation. Are we saying that in this case the config accurately represented the state of the world (SOTW) but CLOWarden went and erased everything?

I would expect that if you had a PR open that modified the state, that would be the one that took effect, not the diff between the empty state and the PR, if that makes sense?

yurishkuro commented 1 day ago

My understanding of the explanation provided is that the PR had nothing to do with it. Both periodic and ad hoc syncs are still run off the merged config. So if the config was actually empty in main, then deleting all groups would be the expected behavior once the app is enabled. But if the config actually reflected the SOTW and some internal CLOWarden state was empty and it deleted everything - then yes, it would be a big problem. I still haven't seen the actual PR - the timeline says there was a config change where the config file location was changed, perhaps to some new version of the config that didn't represent the SOTW - in which case the behavior was again expected (though not before the PR was merged).

If CLOWarden has a CLI that can apply the changes, I would avoid using the external app and start by running the CLI in dry-run mode for PRs, and only tell it to apply the changes in manual runs (e.g. a dispatched workflow, after the diff is verified). Once we gain confidence in it, we can enable auto-runs.

austinlparker commented 1 day ago

My understanding of the explanation provided is that the PR had nothing to do with it. Both periodic and ad hoc syncs are still run off the merged config. So if the config was actually empty in main, then deleting all groups would be the expected behavior once the app is enabled. But if the config actually reflected the SOTW and some internal CLOWarden state was empty and it deleted everything - then yes, it would be a big problem. I still haven't seen the actual PR - the timeline says there was a config change where the config file location was changed, perhaps to some new version of the config that didn't represent the SOTW - in which case the behavior was again expected (though not before the PR was merged).

If CLOWarden has a CLI that can apply the changes, I would avoid using the external app and start by running the CLI in dry-run mode for PRs, and only tell it to apply the changes in manual runs (e.g. a dispatched workflow, after the diff is verified).

The PR is the one I linked upthread. I understand from the explanation given why things worked the way they did, even if it wasn't clear to me beforehand that they would happen this way.

  1. We created a new repo and added a new config file (config.yaml, seen here https://github.com/open-telemetry/gh-manager/commit/4bceb978b5de86875eb6332889edd26222a2b553) that had the initial team we wanted to manage through CLOWarden. We did this because prior maintainer feedback was that certain teams did not wish to use CLOWarden, so our initial spike was going to be managing a limited set of teams. In testing, this appeared to work fine, but that is because we got lucky.
  2. Once the new config file was created, the CLOWarden service was restarted in k8s to pick up the new config file location (previously, it was pointing to community/config.yaml). This is the critical difference between the test environment and the live one: we did not restart CLOWarden in that case.
  3. As CLOWarden restarted, it read the new config.yaml file, which contained only one team, and its periodic reconciliation engine deleted all other teams to match the state in that file.
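
To make the failure mode concrete, here is a conceptual sketch of what a desired-state reconciler does (not CLOWarden's actual code; team names are made up): anything present in the live organization but absent from the desired-state file is treated as drift and scheduled for removal, so a config.yaml listing a single team implies deleting every other team.

```python
# Conceptual sketch of desired-state reconciliation -- NOT CLOWarden's actual code.
# Team names below are hypothetical.

def plan_changes(desired_teams: set[str], actual_teams: set[str]) -> dict[str, set[str]]:
    """Compute the changes a reconciler would apply to converge actual -> desired."""
    return {
        "create": desired_teams - actual_teams,
        "delete": actual_teams - desired_teams,  # the destructive half
    }

# Desired state: the new, deliberately limited config.yaml (one team).
desired = {"pilot-team"}
# Actual state: the live open-telemetry org (a tiny subset, for illustration).
actual = {"pilot-team", "javascript-maintainers", "rust-approvers", "collector-approvers"}

plan = plan_changes(desired, actual)
print(sorted(plan["delete"]))  # every team not in the limited config is slated for deletion
```

Under this model, only the on-demand path is gated on PR approval; the periodic path reconciles against whatever is on the default branch, so the only safeguard is that the desired-state file is complete and correct at the moment the reconciler runs.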

This is all relatively academic because we aren't going to be using CLOWarden after this: we don't feel that we can support a tool we don't really understand, and it won't let us roll it out gradually to the entire organization.

yurishkuro commented 1 day ago

Fair enough - if we can achieve the same with ~300 lines of Python, I would also go with that for better control.