Open wlach opened 6 years ago
Because we merge directly from nightly->beta, nightly stability is pretty important to me, especially toward the end of the cycle.
I think it would be helpful to have an identical way of measuring things in place, so that we can compare rates across channels. But we don't mind if there are then some channel-specific extras built on top of that...
I think the numbers on the main dashboard should be calculated with an average over the last 3 full days, for example:
Linux:
Date | content_crashes | main_crashes | Total |
---|---|---|---|
06/08/18 | 2.19 | 5.16 | 7.35 |
05/08/18 | 3.46 | 5.47 | 8.93 |
04/08/18 | 2.16 | 5.52 | 7.68 |
Average | 2.6 | 5.38 | 7.99 |
The numbers above are the real numbers for Desktop Nightly today, but on the main dashboard, we have this summary for Linux Nightly:
Firefox Linux (63) | -- | change |
---|---|---|
content_crashes | 5.32 | +55% |
main_crashes | 3.1 | -36% |
Total | 8.42 | |
If I look at the main dashboard, it tells me that the main stability problem we have today for shipping Linux is content crashes, which increased significantly over the last cycle. But if I look at the last few days on the graph, I see that our main stability problem on Nightly today is main process crashes. In a nutshell, the summary tells the opposite story from the graph.
If the main dashboard showed the values from the Average row of the first table, it would give release managers a more accurate picture, because we want to know the state of what we will ship relative to what we shipped in the past.
Similarly, the reference data for Nightly in the previous cycle should be the 3 days before the merge to beta because this is what we shipped.
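To make the proposal concrete, here is a minimal sketch of the "average over the last 3 full days" calculation, using only the Linux Nightly values from the first table. The function name `trailing_average`, the `DAILY_RATES` layout, and the reading of the dates as DD/MM/YY are my assumptions for illustration; this is not how the dashboard currently computes its summary.

```python
from datetime import date, timedelta

# Daily rates copied from the Linux Nightly table above.
# Assumption: the dates in the table are written DD/MM/YY.
DAILY_RATES = {
    date(2018, 8, 4): {"content_crashes": 2.16, "main_crashes": 5.52},
    date(2018, 8, 5): {"content_crashes": 3.46, "main_crashes": 5.47},
    date(2018, 8, 6): {"content_crashes": 2.19, "main_crashes": 5.16},
}


def trailing_average(rates, last_full_day, window=3):
    """Average each measure over `window` full days ending at `last_full_day`.

    Hypothetical helper: `rates` maps a date to a dict of measure -> rate.
    """
    days = [last_full_day - timedelta(days=i) for i in range(window)]
    missing = [d for d in days if d not in rates]
    if missing:
        # Refuse to average over a partial window rather than skew the result.
        raise ValueError(f"no data for: {missing}")

    measures = rates[days[0]].keys()
    summary = {m: sum(rates[d][m] for d in days) / window for m in measures}
    # The total is the sum of the per-measure averages.
    summary["total"] = sum(summary.values())
    return summary


current = trailing_average(DAILY_RATES, date(2018, 8, 6))
print({name: round(value, 2) for name, value in current.items()})
```

Rounded to two decimals, this reproduces the Average row (2.6, 5.38, 7.99). The same helper could produce the reference value for the previous cycle by passing the last full day before the merge to beta as `last_full_day`, given data for those days.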
I took Linux as an example but this is also true for other platforms. The problem is even more visible on Android where we had a big spike in crashes due to the SDK 26 migration. This spike is now fixed and on the graph we have a value of 19.06 crashes per 1k hours (better than on the release channel where we have 29) while on the main dashboard we have a value of 58.
See also #318. We should definitely do something here; ideally, I'd like some kind of general solution that applies across the different channel types. The need to know (in general) "what's happening right now" seems like a real requirement.
I see this as being a bit different from what is in #318. I think nightly can sometimes have explosive issues that need to be detected and resolved quickly, especially close to the merge to beta. The population size is also different: when something happens and we don't do a backout or take corrective action, we lose users from a smaller population. So I am on the fence as to whether we can have a general solution that will apply equally to nightly, beta and release.
We may want to consider creating a nightly-only view in mission control, since the use cases and data are so different.
Things to consider: