This repository hosts code that supports the testing infrastructure for the main PyTorch repo. For example, this repo hosts the logic to track disabled tests and slow tests, as well as our continuous integration jobs HUD/dashboard.
With the recent change to show the list of active SEVs in the Dr. CI message, we have opened up a new way to communicate this critical piece of information to PyTorch contributors beyond the Meta-only Tree Huggers channel. However, once a SEV is resolved, its message disappears from Dr. CI, which leaves several important aspects open:
What steps (if any) do contributors need to take to bring their PRs to green? It could be retrying for infra SEVs, rebasing past master to pick up the fixes, or just verifying and force merging.
Surfacing the SEV resolution for 12 hours after the SEV is closed, focusing on what users can do to mitigate the issue and to prevent it from happening again (if possible).
Solution
Dr. CI already queries all SEVs, including resolved ones. A simple approach would be to keep resolved SEVs a while longer, e.g. 1 day, and to parse the SEV body to gather the above information.
If the mitigation section isn't filled out or is left as the template default, assume no mitigation is needed.
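The approach above can be sketched roughly as follows. This is only an illustration, not the actual Dr. CI code: the `SevIssue` shape, the function names, the `## Mitigation` heading, and the placeholder text are all assumptions made for this example and would need to match the real SEV template and the data Dr. CI already fetches.

```typescript
// Hypothetical shape of a SEV issue; field names are assumptions for this sketch.
interface SevIssue {
  title: string;
  body: string;
  state: "open" | "closed";
  closedAt: string | null; // ISO 8601 timestamp, null while the SEV is open
}

// Assumed retention window for resolved SEVs (1 day, as suggested above).
const RETENTION_MS = 24 * 60 * 60 * 1000;

// Assumed heading and placeholder text; adjust both to the real SEV template.
const MITIGATION_HEADING = "## Mitigation";
const TEMPLATE_DEFAULT = "*How did we mitigate the issue?*";

// Keep active SEVs, plus resolved ones closed within the retention window.
function selectSevsToShow(sevs: SevIssue[], now: Date): SevIssue[] {
  return sevs.filter((sev) => {
    if (sev.state === "open") return true;
    if (sev.closedAt === null) return false;
    const ageMs = now.getTime() - new Date(sev.closedAt).getTime();
    return ageMs >= 0 && ageMs <= RETENTION_MS;
  });
}

// Return the text under the mitigation heading, or null when the section is
// missing, empty, or still the template placeholder (no mitigation needed).
function extractMitigation(body: string): string | null {
  const lines = body.split(/\r?\n/);
  const start = lines.findIndex((l) => l.trim() === MITIGATION_HEADING);
  if (start === -1) return null;

  const section: string[] = [];
  for (const line of lines.slice(start + 1)) {
    if (line.startsWith("## ")) break; // the next section begins here
    section.push(line);
  }

  const text = section.join("\n").trim();
  if (text === "" || text === TEMPLATE_DEFAULT) return null;
  return text;
}
```

With helpers like these, Dr. CI could render each resolved SEV returned by `selectSevsToShow` together with the result of `extractMitigation`, and omit the mitigation line whenever the latter returns `null`.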
References
To get started with development, read these two READMEs:
Dr. CI: When a PR fails any CI checks, a GitHub bot called Dr. CI leaves a comment at the top of the PR calling out the failures. Dr. CI bot entrypoint (triggered by GitHub webhooks): https://github.com/pytorch/test-infra/blob/main/torchci/lib/bot/drciBot.ts
Example Dr. CI output with CI failures:
SEV: SEVs are GitHub issues filed against pytorch/pytorch with the "ci: sev" label. They're created using a template. Example: https://github.com/pytorch/pytorch/issues/92626
cc @clee2000 @ZainRizvi @pytorch/pytorch-dev-infra