mozilla / missioncontrol

Real-time monitoring of Firefox release health
Mozilla Public License 2.0

Does content_crashes include content_shutdown_crashes? #202

Closed marco-c closed 6 years ago

marco-c commented 6 years ago

If it does, it would be useful to have an additional graph with content_crashes - content_shutdown_crashes, like https://telemetry.mozilla.org/crashes/.

marco-c commented 6 years ago

Also, main + content - content shutdown (which is the metric we currently use to evaluate the overall stability of a release).

wlach commented 6 years ago

The code to accumulate content_crashes and content_shutdown_crashes numbers is here:

https://github.com/mozilla/telemetry-streaming/blob/380a436c1cee2671cc2251fcfbe5bf1c963ccef7/src/main/scala/com/mozilla/telemetry/streaming/ErrorAggregator.scala#L277

As far as I can tell from reading the source (e.g. https://searchfox.org/mozilla-central/rev/97cb0aa64ae51adcabff76fb3b5eb18368f5f8ab/dom/ipc/ContentParent.cpp#3137 ; https://searchfox.org/mozilla-central/rev/97cb0aa64ae51adcabff76fb3b5eb18368f5f8ab/ipc/glue/CrashReporterHost.cpp#280), the two types are being tracked completely separately. @chutten could probably confirm.

I have mentioned this to a few people already, but I think it's best to look at the different measures in isolation to gain a holistic picture of how a release is doing, rather than trying to come up with a magic number for release health. Some work has already been done on this in the channel/platform summary page; my plan was to surface some of the most salient information on the front page as well.

chutten commented 6 years ago

content_crashes is the total number of SUBPROCESS_CRASHES_WITH_DUMP from the content process. content_shutdown_crashes is the number of SUBPROCESS_KILL_HARD with the reason ShutDownKill.

M+C-S (main plus content minus content shutdown) has been used as a "crashes that users are likely to care about" figure for as long as e10s has been a thing (so, about two years). It is generally the "crash" category of any release health metric.
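As a rough illustration of the M+C-S arithmetic described above, here is a minimal Python sketch. The `CrashCounts` type, its field names, and the `m_plus_c_minus_s` helper are hypothetical and exist only for this example; they are not part of telemetry-streaming or Mission Control. The sketch also assumes all three counts cover the same time window and population.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashCounts:
    """Hypothetical container for crash counts over one time window."""
    main_crashes: int
    content_crashes: int            # assumed to include shutdown kills
    content_shutdown_crashes: int   # content shutdown kills only


def m_plus_c_minus_s(counts: CrashCounts) -> int:
    """Return main + content - content shutdown (M+C-S)."""
    return (
        counts.main_crashes
        + counts.content_crashes
        - counts.content_shutdown_crashes
    )


# Illustrative numbers only, not real telemetry.
example = CrashCounts(main_crashes=120, content_crashes=300,
                      content_shutdown_crashes=180)
print(m_plus_c_minus_s(example))
```

With these made-up inputs, the 180 shutdown crashes are removed from the 300 content crashes before the main crashes are added.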

content_crashes on its own isn't a useful measure of release health. Users don't seem to care about shutdown crashes at all (they often happen completely transparently to users), and shutdown crash rates are wildly variable from day to day. content_crashes is only useful for determining user-facing content crashes when the content_shutdown_crashes are removed from them.

I broadly concur that "there is no magic number for release health", but in this case we must remove the shutdown crashes from the content crashes figure if we are to understand the impact on users. Alternatively, perhaps a view with non-shutdown content crashes on the bottom of the area plot and the shutdown crashes on top.
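One way to picture the stacked area view suggested here is to split the daily content crash series into two layers. The sketch below is only an illustration: `area_plot_layers` is a hypothetical helper, the input lists are made-up daily counts, and clamping negative differences to zero is an assumption of this example rather than anything the thread specifies.

```python
from typing import List, Tuple


def area_plot_layers(
    content_crashes: List[int],
    content_shutdown_crashes: List[int],
) -> Tuple[List[int], List[int]]:
    """Split daily content crashes into (bottom, top) stacked layers.

    bottom: content crashes with shutdown crashes removed
    top:    content shutdown crashes
    """
    if len(content_crashes) != len(content_shutdown_crashes):
        raise ValueError("both series must cover the same days")
    # Assumption: clamp at zero in case a day's shutdown count exceeds
    # its content count (e.g. because of reporting skew).
    bottom = [
        max(total - shutdown, 0)
        for total, shutdown in zip(content_crashes, content_shutdown_crashes)
    ]
    top = list(content_shutdown_crashes)
    return bottom, top


# Illustrative numbers only, not real telemetry.
bottom, top = area_plot_layers([300, 250, 400], [180, 100, 390])
print(bottom, top)
```

The two returned lists could then be handed to any plotting library that supports stacked areas, with `bottom` drawn first.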

wlach commented 6 years ago

I think we might want to fix this within telemetry-streaming so that content_crashes excludes content_shutdown_crashes. I'd really rather not have to do a bunch of UI magic inside Mission Control to handle this distinction (which IMO users shouldn't have to care about).

wlach commented 6 years ago

Update: we are going to start tracking content crashes and content shutdown crashes as separate entities in our underlying dataset. Closing this.

https://bugzilla.mozilla.org/show_bug.cgi?id=1453485