sourcegraph / sourcegraph-public-snapshot

🎯 CI Observability for everyone #60455

Open jhchabran opened 8 months ago

jhchabran commented 8 months ago
### Plan

CI Observability

Foster ownership and accountability for CI build and test performance in the monorepo, improving the reliability and speed of CI.

Problem

sg/sg is a complex product. Test and build times are a key element in controlling quality, yet teams have very little visibility into where they stand or how their code affects other teams in CI, which fosters learned helplessness.

Buildkite’s “Test Analytics” feature covers a lot of the same ground, but it lacks detail on things such as the critical path, offers limited graphing customization and no aggregations besides P50, and does not unlock future capabilities such as introspecting what causes a cache miss.

From a cost-measuring perspective, this has previously been a mostly manual process of extracting data from the Buildkite API and attempting to correlate it with cost via Google spreadsheets. On top of being manual, this misses the actual cost of CI, which is the underlying infrastructure.

Success criteria

Proposal

Ownership:

Provide a set of (manually updated) Bazel variables to be used as tags in tests to denote ownership (as a stretch goal, we can explore keeping this set automatically updated, depending on the reliability of the sources of truth we have available to us). CI will enforce that the percentage of tests with a tagged owner remains above 70% (possible through simple and fast bazel query commands), so that new tests added have clear ownership defined.
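The 70% enforcement described above could be sketched roughly as follows. This is a minimal illustration, not the check that was actually implemented: it assumes the two input lists come from something like `bazel query 'tests(//...)'` and `bazel query 'attr(tags, "owner_", tests(//...))'`, and the `owner_` tag prefix and the function names are hypothetical.

```python
from typing import Iterable

# Threshold taken from the proposal: at least 70% of tests need a tagged owner.
OWNERSHIP_THRESHOLD = 70.0


def ownership_coverage(all_tests: Iterable[str], owned_tests: Iterable[str]) -> float:
    """Return the percentage of test targets that carry an owner tag."""
    all_set = set(all_tests)
    if not all_set:
        # No tests at all: nothing is unowned, so treat coverage as complete.
        return 100.0
    # Only count owned labels that are actually part of the full test set.
    owned = all_set & set(owned_tests)
    return 100.0 * len(owned) / len(all_set)


def check_ownership(
    all_tests: Iterable[str],
    owned_tests: Iterable[str],
    threshold: float = OWNERSHIP_THRESHOLD,
) -> bool:
    """Return True when owner coverage meets the threshold (CI would pass)."""
    return ownership_coverage(all_tests, owned_tests) >= threshold


# Hypothetical labels, standing in for the output of the two queries.
ALL_TESTS = ["//a:a_test", "//b:b_test", "//c:c_test", "//d:d_test"]
OWNED_TESTS = ["//a:a_test", "//b:b_test", "//c:c_test"]
```

With the sample labels above, three of four tests are owned, so the coverage is 75% and the check passes. Keeping the computation separate from the query invocation means the threshold logic can be unit-tested without running Bazel.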

Observability:

Investigate the different sources of Bazel execution data (build event protocol, compact execution log and profile data) to see which combination we need in order to extract the required information from Bazel. These will be stored as Buildkite artifacts, where they can be queried by the finalization task. Augment our GCP agent images to log a datapoint when they boot and shut down, including relevant data from the metadata API server.
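As a rough sketch of the boot/shutdown datapoint, the agent could assemble a small record from the GCP metadata server. The record shape, field names and `build_datapoint` helper are assumptions for illustration only. The metadata paths are the standard ones served under `http://metadata.google.internal/computeMetadata/v1/` (requests need the `Metadata-Flavor: Google` header); the fetcher is injected so the sketch runs without network access.

```python
import time
from typing import Callable, Dict

# Standard GCP metadata paths, relative to /computeMetadata/v1/.
# The keys on the left are hypothetical field names for the datapoint.
METADATA_KEYS = {
    "instance_name": "instance/name",
    "machine_type": "instance/machine-type",
    "zone": "instance/zone",
}


def build_datapoint(
    event: str,
    fetch: Callable[[str], str],
    now: Callable[[], float] = time.time,
) -> Dict[str, object]:
    """Build one lifecycle datapoint for an agent boot or shutdown."""
    if event not in ("boot", "shutdown"):
        raise ValueError(f"unknown event: {event!r}")
    point: Dict[str, object] = {"event": event, "timestamp": now()}
    for field, path in METADATA_KEYS.items():
        # machine-type and zone come back as full resource paths
        # (e.g. projects/123/zones/us-central1-a); keep the last segment.
        point[field] = fetch(path).rsplit("/", 1)[-1]
    return point


# Fake metadata responses standing in for the real server.
FAKE_METADATA = {
    "instance/name": "agent-1",
    "instance/machine-type": "projects/123/machineTypes/n2-standard-4",
    "instance/zone": "projects/123/zones/us-central1-a",
}

sample = build_datapoint("boot", FAKE_METADATA.__getitem__, now=lambda: 0.0)
```

In a real image the `fetch` argument would wrap an HTTP call to the metadata server, and the resulting record would be shipped wherever the rest of the CI data lands.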

Combined with data from Buildkite pipeline executions, test tagging and GCP details, this should let us answer the following questions:

Together, they will provide the basis for both global and team-specific reports:

Milestones

Risks

Tracked issues

@unassigned

Completed

@Strum355

Completed

@jamesmcnamara

Completed

sourcegraph-bot commented 8 months ago

Status Update

Date: 2024-02-14

Overall Status

🟢 On Track

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by noah@sourcegraph.com

sourcegraph-bot commented 7 months ago

Status Update

Date: 2024-02-15

Overall Status

🟢 On Track

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by jean-hadrien.chabran@sourcegraph.com

sourcegraph-bot commented 7 months ago

Status Update

Date: 2024-02-28

Overall Status

🟢 On Track

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by jean-hadrien.chabran@sourcegraph.com

sourcegraph-bot commented 7 months ago

Status Update

Date: 2024-03-15

Overall Status

🟢 On Track

Current: Working on writing up plan

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by noah@sourcegraph.com

sourcegraph-bot commented 6 months ago

Status Update

Date: 2024-03-19

Overall Status

🟢 On Track

Current: Plan completed. Creating issue tracker tasks and determining their order so that work can be parallelized efficiently across the team

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by noah@sourcegraph.com

sourcegraph-bot commented 6 months ago

Status Update

Date: 2024-03-29

Overall Status

🟢 On Track

Current: 45% of the spreadsheet has been filled in; a soft deadline was set for Wednesday 3rd April

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by noah@sourcegraph.com

sourcegraph-bot commented 6 months ago

Status Update

Date: 2024-03-29

Overall Status

🟢 On Track

Current: The first planned approach to triggering build metrics collection hit a dead end. An alternative approach reusing the build-tracker service is in progress. Currently modernizing its deployment to use MSP, bringing it in line with the future direction for deploying hosted services

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by noah@sourcegraph.com

sourcegraph-bot commented 6 months ago

Status Update

Date: 2024-04-08

Overall Status

🟢 On Track

Current: Triggering an async pipeline on build completion is working and live on MSP. The original build-tracker is still running while we observe the new MSP-deployed one

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by noah@sourcegraph.com

sourcegraph-bot commented 5 months ago

Status Update

Date: 2024-04-15

Overall Status

🟢 On Track

Current: 65% of the sheet is filled in after I added some more directories and ownership. More time was given after some feedback on the deadline being too short. Final ping to EMs planned to go out tomorrow

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by noah@sourcegraph.com

sourcegraph-bot commented 5 months ago

Status Update

Date: 2024-04-23

Overall Status

🟢 On Track

Current: We are now shipping Buildkite-specific data to BigQuery. Bazel data is in progress

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by noah@sourcegraph.com

sourcegraph-bot commented 5 months ago

Status Update

Date: 2024-04-29

Overall Status

🟢 On Track

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by noah@sourcegraph.com

sourcegraph-bot commented 5 months ago

Status Update

Date: 2024-05-06

Overall Status

🟢 On Track

Current: Have begun experimenting with dashboards in Redash and fixing up data issues that arise

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by noah@sourcegraph.com

sourcegraph-bot commented 5 months ago

Status Update

Date: 2024-05-08

Overall Status

🟢 On Track

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by noah@sourcegraph.com

sourcegraph-bot commented 5 months ago

Status Update

Date: 2024-05-14

Overall Status

🟢 On Track

Current: Reached ~70.5% after examining the remaining tests and excluding irrelevant ones (e.g. diffs for generated files/copies etc). PR is being prepared

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by noah@sourcegraph.com

sourcegraph-bot commented 4 months ago

Status Update

Date: 2024-05-16

Overall Status

🏁 Completed

Current: PR is merged, so this OKR is technically complete. Follow-up work involves a CI check to maintain that level, as well as splitting out certain mega-packages into more distinct owners to reach a higher level

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by noah@sourcegraph.com

sourcegraph-bot commented 4 months ago

Status Update

Date: 2024-05-06

Overall Status

🟢 On Track

Current: Have begun experimenting with dashboards in Redash and fixing up data issues that arise

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by noah@sourcegraph.com

sourcegraph-bot commented 4 months ago

Status Update

Date: 2024-05-19

Overall Status

🟢 On Track

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by noah@sourcegraph.com

sourcegraph-bot commented 4 months ago

Status Update

Date: 2024-06-02

Overall Status

🟢 On Track

Notes

N/A

Blockers/Risks/Concerns

N/A

More Information

Created by noah@sourcegraph.com