protocol / prodeng

Issues, discussions and documentation from the production engineering team

Thunderdome: Automate Tracking of Kubo Releases #20

Open iand opened 2 years ago

iand commented 2 years ago

What Is It?

Automatically run Thunderdome experiments to compare Kubo release candidates with the last release to look for performance regressions.

Deliverables

Why Are We Doing It?

Currently the Kubo release procedure relies on a manual step and human knowledge and judgement. Release candidates are deployed to a subset of live nodes so metrics can be collected every day and problems assessed.

We can speed up this process and save engineer time by automatically deploying release candidates as Kubo experiments that compare version X with version X+1. The quality of Kubo releases can be improved if there is tooling to validate no performance regression between versions.
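As a minimal sketch of what "validate no performance regression" could mean in practice, the check below compares the 95th-percentile latency of a candidate against a baseline. The metric (request latency), the choice of p95, and the 10% tolerance are all assumptions for illustration; none of them come from Thunderdome itself.

```python
from statistics import quantiles


def p95(samples):
    """Return the 95th percentile of a list of numeric samples."""
    # quantiles(..., n=100) yields 99 cut points; index 94 is the 95th percentile.
    return quantiles(samples, n=100, method="inclusive")[94]


def has_regression(baseline, candidate, tolerance=0.10):
    """Report whether the candidate's p95 exceeds the baseline's by more than `tolerance`.

    `baseline` and `candidate` are hypothetical lists of latency samples
    (for example, milliseconds per gateway request) collected from
    version X and version X+1 under the same traffic.
    """
    return p95(candidate) > p95(baseline) * (1 + tolerance)
```

A check like this could gate a release candidate automatically, although a real comparison would likely need several metrics and some allowance for noise between runs.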

Tasks

Work in progress...

Now being tracked as part of probe lab: https://www.notion.so/pl-strflt/Thunderdome-Kubo-Release-Candidate-testing-dc474ec6634d40b69a9b36cd6151c3e6

BigLep commented 2 years ago

This is exciting ProdEng.

Carrying over some things from Slack.

It would be really useful for speeding up, automating, and improving the quality bar of Kubo releases if there were tooling to validate no performance regression between version X and version X+1 RC. Right now this is reliant on a manual step and human knowledge and judgement. I'd love to take the guesswork out, which would also be key for fully automating releases. This would ideally be something that all IPFS implementations do in their release process to ensure no performance regressions.

Being more specific, I want to eradicate this manual line of the Kubo release process: https://github.com/ipfs/kubo/blob/master/docs/RELEASE_ISSUE_TEMPLATE.md?plain=1#L77

"Collect metrics every day. Work with the Infrastructure team to learn of any hiccup"

It relies on humans (and thus takes human attention), it means different things to different people (and thus is inconsistently applied), etc.

We'll see when we get to it, but we're planning to bring @protocol/ipdx (e.g., @galargh) in to help improve (automate/simplify) Kubo's release process (there's a note of it here). This might be a good area for them to join in on to help land.

galargh commented 2 years ago

Thanks for the tag Steve, I'd be very much interested in getting involved!

Just to clarify, when you say without changing, does it mean that the goal is to not rely on committing any special code to the tracked repo nor to change its permissions in any way to enable tracking? Shouldn't be a problem at all with public repos but I just wanted to make sure I understand the goal properly.

thattommyhall commented 2 years ago

That "goal" was more to avoid needing to coordinate with upstream; it's not necessarily a preference. It's sounding like we are going to be able to collaborate on this very soon!

One thing we could do is leave the building of images to you (with our guidance, you need to stop reproviding and configure the gateway)

Then either delegate creating an experiment to you, with guidance, by making an appropriate IAM role for you to add taskdefs and ECS services (as that is all an experiment is), or @iand and I spoke about making a little API so someone can call it and start an experiment / add a target (more work, but it leaves us in control of exactly what running an experiment means).

Edit: it's occurred to me that in place of an API we could have a Lambda to start/stop an experiment and add targets to an existing experiment; that might be simpler than writing an API with auth etc.
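A rough sketch of what such a Lambda could look like, assuming a simple JSON event with hypothetical `action`, `experiment`, `targets` and `target` fields. The action names and the in-memory backend are illustrative only; a real backend would presumably create and remove ECS task definitions and services, which is not shown here.

```python
from typing import Any, Callable, Dict

Backend = Dict[str, Callable[[Dict[str, Any]], Any]]


def in_memory_backend() -> Backend:
    """Stand-in backend that tracks experiments in a dict instead of calling ECS."""
    experiments: Dict[str, list] = {}

    def start(event):
        # Create the experiment with its initial list of targets.
        experiments[event["experiment"]] = list(event.get("targets", []))
        return experiments[event["experiment"]]

    def stop(event):
        # Returns True if the experiment existed and was removed.
        return experiments.pop(event["experiment"], None) is not None

    def add_target(event):
        # Raises KeyError if the experiment does not exist.
        experiments[event["experiment"]].append(event["target"])
        return experiments[event["experiment"]]

    return {"start": start, "stop": stop, "add_target": add_target}


def make_handler(backend: Backend):
    """Build a Lambda-style handler(event, context) that dispatches on event['action']."""

    def handler(event, context=None):
        action = event.get("action")
        experiment = event.get("experiment")
        if action not in backend:
            return {"ok": False, "error": f"unknown action: {action!r}"}
        if not experiment:
            return {"ok": False, "error": "missing experiment name"}
        try:
            result = backend[action](event)
        except KeyError as exc:
            # Covers a missing field or an unknown experiment in the backend.
            return {"ok": False, "error": f"missing or unknown key: {exc}"}
        return {"ok": True, "action": action, "experiment": experiment, "result": result}

    return handler
```

Keeping the backend behind a small dispatch table would leave the definition of "running an experiment" in one place, while callers only ever send an event such as `{"action": "start", "experiment": "kubo-rc", "targets": ["baseline", "candidate"]}`.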