jhchabran commented 8 months ago

### Plan

CI Observability

Foster ownership & accountability for CI build & test performance in the monorepo, in order to improve the reliability and speed of CI.

Problem

sg/sg is a complex product. Test and build times are a key element in controlling quality, yet teams have very little visibility into where they stand or how their code affects other teams in CI, which fosters learned helplessness.

Buildkite’s “Test Analytics” feature covers a lot of the same ground, but it lacks detail on things such as the critical path, offers limited graphing customization and no aggregations besides P50, and would not unlock future capabilities such as introspecting what causes a cache miss.

From a cost-measurement perspective, this has previously been a mostly manual process of extracting data from the Buildkite API and attempting to correlate cost with this data via Google spreadsheets. On top of being manual, this misses the actual cost part of CI, which is the underlying infrastructure.

Success criteria

70% of tests in sg/sg have a clear owner
CI reliability >93%
Average build times <23min

Proposal

Ownership:

Provide a set of (manually updated) Bazel variables to be used as tags in tests to denote ownership (as a stretch goal, we can explore keeping this set automatically updated, depending on the reliability of the sources of truth we have available to us). CI will enforce that the percentage of tests with a tagged owner remains above 70% (possible through simple and fast bazel query commands), so that new tests added have clear ownership defined.
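As a rough illustration of the enforcement described above, the following minimal sketch computes the share of test targets that carry an owner tag and compares it with the 70% target. The `owner_` tag prefix and the function names are assumptions made for this example, not conventions from the plan. The label-to-tags mapping would come from Bazel query output, which is not invoked here.

```python
from typing import Dict, List

OWNER_PREFIX = "owner_"  # hypothetical tag convention, not taken from the plan
MIN_COVERAGE = 0.70      # the 70% target from the success criteria


def owner_coverage(test_tags: Dict[str, List[str]]) -> float:
    """Return the fraction of test targets that carry at least one owner tag."""
    if not test_tags:
        return 1.0  # no tests at all means nothing is unowned
    owned = sum(
        1
        for tags in test_tags.values()
        if any(tag.startswith(OWNER_PREFIX) for tag in tags)
    )
    return owned / len(test_tags)


def check_coverage(test_tags: Dict[str, List[str]]) -> bool:
    """True when coverage meets the target; a CI step would fail on False."""
    return owner_coverage(test_tags) >= MIN_COVERAGE
```

A CI step could call `check_coverage` on the parsed query output and exit non-zero when it returns `False`.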

Observability:

Investigate the different sources of Bazel execution data (build event protocol, compact execution log & profile data) to see which combination we need in order to extract the required information from Bazel. These will be stored as Buildkite artifacts, where they can be queried by the finalization task. Augment our GCP agent images to log a datapoint when they boot and shut down, including relevant data from the metadata API server.
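To make the boot/shutdown datapoint more concrete, here is a minimal sketch of how such a row could be assembled from the GCE metadata server. The selection of fields, the row layout and the function names are assumptions for illustration. Writing the row to its destination is deliberately left out.

```python
import time
import urllib.request
from typing import Callable, Dict

METADATA_BASE = "http://metadata.google.internal/computeMetadata/v1/instance/"
FIELDS = ("id", "name", "zone", "machine-type")  # illustrative selection only


def fetch_metadata(path: str) -> str:
    """Read one value from the GCE metadata server (only reachable on a GCE VM)."""
    request = urllib.request.Request(
        METADATA_BASE + path,
        headers={"Metadata-Flavor": "Google"},  # required by the metadata server
    )
    with urllib.request.urlopen(request, timeout=2) as response:
        return response.read().decode("utf-8")


def build_event_row(
    event: str,
    fetch: Callable[[str], str] = fetch_metadata,
    now: Callable[[], float] = time.time,
) -> Dict[str, object]:
    """Assemble a row describing a boot or shutdown event of an agent."""
    if event not in ("boot", "shutdown"):
        raise ValueError(f"unexpected event: {event!r}")
    row: Dict[str, object] = {"event": event, "timestamp": now()}
    for field in FIELDS:
        # Column names use underscores, e.g. "machine-type" -> "machine_type".
        row[field.replace("-", "_")] = fetch(field)
    return row
```

Passing `fetch` and `now` as parameters keeps the function testable away from a real agent; on an agent, `build_event_row("boot")` would use the defaults.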

Alongside data from Buildkite pipeline executions, test tagging & GCP details, we should be able to get answers to the following questions:

what % of total wall time was spent in tests, grouped by team
what % of the critical path (the lower bound on CI time) consists of tests, grouped by team
what % of other people's CI time is taken up by a given team's tests
how many step retries are attributable to flakiness in tests, grouped by team (the reliability of a group's tests)
looking at a graph, can I show that my changes to tests/builds have produced a measurable improvement
when did CI costs exceed expectations over the last period, and what contributed to that
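The critical path mentioned in the list above is the longest chain of dependent steps, which is why it bounds CI time from below. A minimal sketch, assuming step durations and dependencies are already available as plain dictionaries (the step names and function names are hypothetical):

```python
from functools import lru_cache
from typing import Dict, List, Set, Tuple


def critical_path(
    durations: Dict[str, float], deps: Dict[str, List[str]]
) -> Tuple[float, List[str]]:
    """Return (total seconds, step names) of the longest dependency chain.

    Assumes the dependency graph is acyclic.
    """
    if not durations:
        return 0.0, []

    @lru_cache(maxsize=None)
    def finish(step: str) -> Tuple[float, Tuple[str, ...]]:
        # Longest chain ending at `step`: its slowest prerequisite chain plus itself.
        best: Tuple[float, Tuple[str, ...]] = (0.0, ())
        for dep in deps.get(step, []):
            candidate = finish(dep)
            if candidate[0] > best[0]:
                best = candidate
        return best[0] + durations[step], best[1] + (step,)

    total, path = max((finish(step) for step in durations), key=lambda r: r[0])
    return total, list(path)


def share_from_tests(
    path: List[str], durations: Dict[str, float], test_steps: Set[str]
) -> float:
    """Fraction of the critical path's duration spent in test steps."""
    total = sum(durations[step] for step in path)
    if total == 0:
        return 0.0
    return sum(durations[step] for step in path if step in test_steps) / total
```

For example, with a 300s `build` step followed by a 600s `e2e` test step, the critical path is 900s and two thirds of it is test time, regardless of how many shorter steps run in parallel.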

Together, these will provide the basis for both global and team-specific reports:

global reports for engineering leaders/managers, to help with prioritizing or delegating to teams the work of looking into their impact on CI times.
team-specific reports with figures covering the list above, plus links to dashboards that provide more insight and the ability to dig deeper into the data in order to work on the right things.
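As an illustration of the figures a team-specific report might contain, the sketch below aggregates test runs into a per-team share of wall time and a retry count. The record shape and field names are assumptions for this example and do not reflect the actual data schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class RunRecord:
    """One test execution; a hypothetical shape, not the real schema."""
    team: str
    wall_seconds: float
    retried: bool


def team_report(
    runs: Iterable[RunRecord], total_wall_seconds: float
) -> Dict[str, Dict[str, float]]:
    """Per team: percentage of total wall time spent in its tests, and retry count."""
    if total_wall_seconds <= 0:
        raise ValueError("total_wall_seconds must be positive")
    seconds: Dict[str, float] = defaultdict(float)
    retries: Dict[str, int] = defaultdict(int)
    for run in runs:
        seconds[run.team] += run.wall_seconds
        if run.retried:
            retries[run.team] += 1
    return {
        team: {
            "wall_pct": 100.0 * seconds[team] / total_wall_seconds,
            "retries": retries[team],
        }
        for team in seconds
    }
```

Here `total_wall_seconds` is the wall time of the whole build set being reported on, so the percentages answer "how much of everyone's CI time went to this team's tests".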

Milestones

70% of Bazel test targets have a defined owner (estimated 1 week)
- Most of the time will be from coordinating with all the teams in order to get the ownership details and can be done async alongside further work.
- Value: ownership of flaky tests can be directly traced for accountability.
Data emitted by Bazel is exported to BigQuery (estimated 1 week)
- Will involve a discovery phase on whether the profile export is enough or whether we want the execution log as well, plus some buffer time to gain familiarity with the (often badly documented) data formats.
- Value: what immediate value do we get at this point? Can query the data?
Weekly reports are sent to Slack, providing insight into the effect of a team's tests on CI (estimated 3 days)
- This will be generated by AlfredBot and/or Looker, depending on the capabilities of Looker in generating reports in a format that suits us.
- This can first be posted to a single channel in order to test reliability, quality and usefulness of the reports.
- Refinement of the reports will be a continuous process in order to make them as relevant & useful as possible.
- Value: teams will start to see & feel accountability for the impact of their test quality on CI times.
Dashboards (likely on Looker) will be made available for both global + team-specific overviews to allow management to see everyone's impact on CI
Buildkite agents run a systemd one-shot on boot & shutdown to log a row in BigQuery containing instance metadata from the metadata API server (estimated 2 days)
- Value: dev-infra can start attributing CI cost to developer activity in areas of the code resulting in longer than expected CI times.
CI checks whether new tests have an annotated owner/whether the 70% target is maintained (estimated 1 day)
- Value: ownership remains consistently high instead of developers omitting this information.

Risks

Teams are not motivated to improve test reliability/performance (time/headcount constraints, lack of urgency, other reasons)
The data does not provide a reliable set of numbers and ends up being ignored (e.g. tests unexpectedly not cached by Bazel)

Tracked issues

@unassigned

Completed

[x] (🏁 2024-04-22) https://github.com/sourcegraph/sourcegraph/issues/61246
[x] (🏁 2024-05-12) https://github.com/sourcegraph/sourcegraph/issues/61275 (PRs: ~#62598~)

@Strum355

Completed

[x] (🏁 2024-04-03) https://github.com/sourcegraph/sourcegraph/pull/61510
[x] (🏁 2024-04-03) https://github.com/sourcegraph/sourcegraph/pull/61549
[x] (🏁 2024-04-03) https://github.com/sourcegraph/sourcegraph/pull/61554
[x] (🏁 2024-04-05) https://github.com/sourcegraph/sourcegraph/pull/61512
[x] (🏁 2024-04-05) https://github.com/sourcegraph/sourcegraph/issues/61264
[x] (🏁 2024-04-23) https://github.com/sourcegraph/sourcegraph/pull/62010
[x] (🏁 2024-04-29) https://github.com/sourcegraph/sourcegraph/issues/61717
[x] (🏁 2024-05-01) https://github.com/sourcegraph/sourcegraph/pull/62291
[x] (🏁 2024-05-12) https://github.com/sourcegraph/sourcegraph/pull/62598
[x] (🏁 2024-05-13) https://github.com/sourcegraph/sourcegraph/pull/62627
[x] (🏁 2024-05-13) https://github.com/sourcegraph/sourcegraph/pull/62632
[x] (🏁 2024-05-15) https://github.com/sourcegraph/sourcegraph/pull/62699
[x] (🏁 2024-05-16) https://github.com/sourcegraph/sourcegraph/pull/62664
[x] (🏁 2024-06-04) https://github.com/sourcegraph/sourcegraph/issues/61243

@jamesmcnamara

Completed

[x] (🏁 2024-06-04) https://github.com/sourcegraph/sourcegraph/issues/61716

sourcegraph-bot commented 8 months ago

Status Update

Date: 2024-02-14

Overall Status

🟢 On Track

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: CI Observability for everyone

Created by noah@sourcegraph.com

sourcegraph-bot commented 7 months ago

Status Update

Date: 2024-02-15

Overall Status

🟢 On Track

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: X build reliability / Y AVG build times

Created by jean-hadrien.chabran@sourcegraph.com

sourcegraph-bot commented 7 months ago

Status Update

Date: 2024-02-28

Overall Status

🟢 On Track

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: N% of the tests have a clear owner in sg/sg

Created by jean-hadrien.chabran@sourcegraph.com

sourcegraph-bot commented 7 months ago

Status Update

Date: 2024-02-28

Overall Status

🟢 On Track

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: X build reliability / Y AVG build times

Created by jean-hadrien.chabran@sourcegraph.com

sourcegraph-bot commented 7 months ago

Status Update

Date: 2024-03-15

Overall Status

🟢 On Track

Current: Working on writing up plan

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: N% of the tests have a clear owner in sg/sg

Created by noah@sourcegraph.com

sourcegraph-bot commented 7 months ago

Status Update

Date: 2024-03-15

Overall Status

🟢 On Track

Current: Working on writing up plan

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: X build reliability / Y AVG build times

Created by noah@sourcegraph.com

sourcegraph-bot commented 6 months ago

Status Update

Date: 2024-03-19

Overall Status

🟢 On Track

Current: Plan completed, creating issue tracker tasks & determining their order so that work can be parallelized efficiently across the team

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: N% of the tests have a clear owner in sg/sg

Created by noah@sourcegraph.com

sourcegraph-bot commented 6 months ago

Status Update

Date: 2024-03-19

Overall Status

🟢 On Track

Current: Plan completed, creating issue tracker tasks & determining their order so that work can be parallelized efficiently across the team

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: X build reliability / Y AVG build times

Created by noah@sourcegraph.com

sourcegraph-bot commented 6 months ago

Status Update

Date: 2024-03-29

Overall Status

🟢 On Track

Current: 45% of the spreadsheet has been filled in, soft deadline was set for Wednesday 3rd April

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: N% of the tests have a clear owner in sg/sg

Created by noah@sourcegraph.com

sourcegraph-bot commented 6 months ago

Status Update

Date: 2024-03-29

Overall Status

🟢 On Track

Current: The first planned approach to triggering build metrics collection hit a dead end. An alternative approach that reuses the build-tracker service is in progress. Currently modernizing its deployment to use MSP, bringing it in line with the future direction of deploying hosted services

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: X build reliability / Y AVG build times

Created by noah@sourcegraph.com

sourcegraph-bot commented 6 months ago

Status Update

Date: 2024-04-08

Overall Status

🟢 On Track

Current: Triggering an async pipeline on build completion is working and live on MSP. The original build-tracker is still running while we observe the new MSP-deployed one

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: X build reliability / Y AVG build times

Created by noah@sourcegraph.com

sourcegraph-bot commented 5 months ago

Status Update

Date: 2024-04-15

Overall Status

🟢 On Track

Current: 65% of the sheet is filled in after I added some more directories and ownership. More time was given after some feedback on the deadline being too short. Final ping to EMs planned to go out tomorrow

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: N% of the tests have a clear owner in sg/sg

Created by noah@sourcegraph.com

sourcegraph-bot commented 5 months ago

Status Update

Date: 2024-04-23

Overall Status

🟢 On Track

Current: We are now shipping Buildkite-specific data to BigQuery. Bazel data is in progress

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: X build reliability / Y AVG build times

Created by noah@sourcegraph.com

sourcegraph-bot commented 5 months ago

Status Update

Date: 2024-04-29

Overall Status

🟢 On Track

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: N% of the tests have a clear owner in sg/sg

Created by noah@sourcegraph.com

sourcegraph-bot commented 5 months ago

Status Update

Date: 2024-05-06

Overall Status

🟢 On Track

Current: Have begun experimenting with dashboards in Redash and fixing up data issues that arise

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: X build reliability / Y AVG build times

Created by noah@sourcegraph.com

sourcegraph-bot commented 5 months ago

Status Update

Date: 2024-05-08

Overall Status

🟢 On Track

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: N% of the tests have a clear owner in sg/sg

Created by noah@sourcegraph.com

sourcegraph-bot commented 5 months ago

Status Update

Date: 2024-05-14

Overall Status

🟢 On Track

Current: Reached ~70.5% after examining the remaining tests and excluding irrelevant ones (e.g. diffs for generated files/copies etc). PR is being prepared

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: N% of the tests have a clear owner in sg/sg

Created by noah@sourcegraph.com

sourcegraph-bot commented 4 months ago

Status Update

Date: 2024-05-16

Overall Status

🏁 Completed

Current: PR is merged, so this OKR is technically complete. Follow-up work involves a CI check to maintain that level, as well as splitting out certain mega-packages into more distinct owners to reach a higher level

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: N% of the tests have a clear owner in sg/sg

Created by noah@sourcegraph.com

sourcegraph-bot commented 4 months ago

Status Update

Date: 2024-05-19

Overall Status

🟢 On Track

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: X build reliability / Y AVG build times

Created by noah@sourcegraph.com

sourcegraph-bot commented 4 months ago

Status Update

Date: 2024-06-02

Overall Status

🟢 On Track

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

View Objective: 🟡 CI Observability for everyone: X build reliability / Y AVG build times

Created by noah@sourcegraph.com

sourcegraph / sourcegraph-public-snapshot