Closed marco-c closed 6 years ago
Also, main + content - content shutdown (which is the metric we currently use to evaluate the overall stability of a release).
The code to accumulate content_crashes and content_shutdown_crashes numbers is here:
As far as I can tell from reading the source (e.g. https://searchfox.org/mozilla-central/rev/97cb0aa64ae51adcabff76fb3b5eb18368f5f8ab/dom/ipc/ContentParent.cpp#3137 ; https://searchfox.org/mozilla-central/rev/97cb0aa64ae51adcabff76fb3b5eb18368f5f8ab/ipc/glue/CrashReporterHost.cpp#280) the two types are being tracked completely separately. @chutten could probably confirm.
I have mentioned this to a few people already, but I think it's best to look at the different measures in isolation to gain a holistic picture of how a release is doing, rather than trying to come up with a magic number for release health. Some work has already been done on this in the channel/platform summary page; my plan was to surface some of the most salient information on the front page as well.
content_crashes is the total number of SUBPROCESS_CRASHES_WITH_DUMP from the content process. content_shutdown_crashes is the number of SUBPROCESS_KILL_HARD with the reason ShutDownKill.
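The two definitions above can be read as lookups into keyed histograms. A minimal sketch, assuming a ping's keyed histograms have already been collapsed into plain dicts of key to count; the `keyed_counts` shape, the function name, and the "content" key are assumptions for illustration, not the telemetry-streaming code:

```python
def content_crash_counts(keyed_counts):
    """Return (content_crashes, content_shutdown_crashes) for one ping.

    keyed_counts is an assumed shape: {histogram_name: {key: count}}.
    """
    with_dump = keyed_counts.get("SUBPROCESS_CRASHES_WITH_DUMP", {})
    kill_hard = keyed_counts.get("SUBPROCESS_KILL_HARD", {})

    # Assumption: content-process crashes are stored under the key "content".
    content_crashes = with_dump.get("content", 0)
    # Shutdown kills are the hard kills recorded with the reason ShutDownKill.
    content_shutdown_crashes = kill_hard.get("ShutDownKill", 0)
    return content_crashes, content_shutdown_crashes
```

Missing histograms or keys are treated as zero, so a ping with no crash data contributes nothing to either count.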
M+C-S (main plus content minus content shutdown) has been used as a "crashes that users are likely to care about" figure for as long as e10s has been a thing (so, about two years). It is generally the "crash" category of any release health metric.
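As a rough sketch of the arithmetic, M+C-S is a single expression over three counts. The function and parameter names here are hypothetical, and the subtraction is only meaningful if the content figure actually includes the shutdown kills, which is the point under discussion in this thread:

```python
def m_plus_c_minus_s(main_crashes, content_crashes, content_shutdown_crashes):
    """Main plus content minus content shutdown crashes.

    Assumes content_crashes already includes the shutdown crashes;
    if the two are tracked separately, the subtraction under-counts.
    """
    return main_crashes + content_crashes - content_shutdown_crashes
```

For example, 120 main crashes, 300 content crashes and 180 content shutdown crashes give an M+C-S of 240.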
content_crashes on its own isn't a useful measure of release health. Users don't seem to care about shutdown crashes at all (they often happen completely transparently to users), and shutdown crash rates are wildly variable from day to day. content_crashes is only useful for determining user-facing content crashes when the content_shutdown_crashes are removed from them.
I concur broadly with "there is no magic number for release health", but in this case we must remove the shutdown crashes from the content crashes figure if we are to understand the impact on users. Alternatively, perhaps a view with non-shutdown content crashes at the bottom of the area plot and the shutdown crashes on top.
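The stacked view could be fed by splitting the content series into two layers. A minimal sketch, assuming per-day lists of counts in which content_crashes includes the shutdown crashes; the function name and the clamp at zero are illustrative choices, not anything mission control does today:

```python
def area_plot_layers(content_crashes, content_shutdown_crashes):
    """Split per-day series into (bottom, top) layers for a stacked area plot.

    bottom: non-shutdown content crashes; top: shutdown crashes.
    """
    # Clamp at zero (an assumption) so a noisy day cannot yield a negative layer.
    bottom = [max(c - s, 0) for c, s in zip(content_crashes, content_shutdown_crashes)]
    top = list(content_shutdown_crashes)
    return bottom, top
```

With this split, the bottom layer is the user-facing figure and the total height of the stack stays close to the raw content crash count.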
I think we might want to fix this within telemetry-streaming so that the content_crashes excludes content_shutdown_crashes. I'd really rather not have to do a bunch of UI magic inside mission control to handle this distinction (which IMO users shouldn't have to care about).
Update: we are going to start tracking content crashes and content shutdown crashes as separate entities in our underlying dataset. Closing this.
If it does, it would be useful to have an additional graph with content_crashes - content_shutdown_crashes, like https://telemetry.mozilla.org/crashes/.