pytorch / test-infra

This repository hosts code that supports the testing infrastructure for the main PyTorch repo. For example, this repo hosts the logic to track disabled tests and slow tests, as well as our continuous integration jobs HUD/dashboard.
https://hud.pytorch.org/

Measure p90 latency and failure rate of HUD PR and commit pages #556

Open huydhn opened 2 years ago

huydhn commented 2 years ago

Pitch

After all the recent Rockset and GitHub outages, I think we should start tracking some KPIs for important developer-facing HUD pages, like the PR and commit pages. Here are some observations:

[Screenshot: Screen Shot 2022-08-18 at 14 53 22]

Solutions

TBD. Maybe Vercel already has these metrics somewhere that we can just tap into.
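
To make the two KPIs in the title concrete, here is a minimal sketch of how p90 latency and failure rate could be computed from a batch of page-load samples. Everything in it is an assumption made for illustration: the `PageLoad` record shape, the function names, the nearest-rank percentile method, and the choice to count only 5xx responses as failures. It does not reflect anything HUD or Vercel actually exposes.

```python
from dataclasses import dataclass


@dataclass
class PageLoad:
    """One observed page load (hypothetical record shape)."""
    latency_ms: float
    status: int  # HTTP status code of the response


def p90_latency_ms(loads):
    """Nearest-rank 90th percentile of latency; None if there are no samples."""
    if not loads:
        return None
    ordered = sorted(load.latency_ms for load in loads)
    # 1-based nearest rank, ceil(0.9 * n), done in integer arithmetic
    rank = (9 * len(ordered) + 9) // 10
    return ordered[rank - 1]


def failure_rate(loads):
    """Fraction of loads that returned a 5xx status; None if there are no samples."""
    if not loads:
        return None
    failures = sum(1 for load in loads if load.status >= 500)
    return failures / len(loads)
```

With ten samples, the nearest-rank method returns the ninth-smallest latency, so a single very slow outlier does not move the p90. Whether 4xx responses or client-side timeouts should also count as failures is a design choice left open here.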

cc @janeyx99 @ZainRizvi @pytorch/pytorch-dev-infra

ZainRizvi commented 2 years ago

We def need to get better at tracking metrics for our tools in general, or at the very least storing them somewhere to help with future debugging.

For prioritization: I'd suggest that tracking usage of the tools (like who clicks on HUD links and how often) is the top priority, and evaluating the quality of those experiences comes second.

huydhn commented 2 years ago

True, we have that other task to track tool usage mapped out at https://github.com/pytorch/test-infra/issues/530, so we can address it before this one.