squarewave / bhr.html

Mozilla Public License 2.0
maybe enhancement/fork: support deriving data from crash-stats.mozilla.com #22

Open asutherland opened 7 years ago

asutherland commented 7 years ago

I was just looking at https://bugzilla.mozilla.org/show_bug.cgi?id=1399851, where we have a wide variety of crashes (crash-stat link) that were triggered at shutdown because shutdown took too long. It looks like figuring out whether there are common problems, low-hanging fruit, etc. is a massively manual process.

It strikes me the bhr.html UI is perfect for this problem and crash-stats can already provide the thread backtraces as JSON. It also strikes me that maybe we already have BHR reports for these hangs and so maybe we don't need to look at crash-stats?

I'm filing this issue because I'm not sure where to best raise these questions:

  1. Does BHR already have reports for these hangs, so that the crash reports can be ignored?
  2. If not, would it be cool / a sane idea to try and make a hacky PoC to use this UI and backend to digest that data?
  3. Has some awesome person already started doing this? Or is there an existing awesome person who already views this as part of their day job and can get to it soon? ;)

Thanks much and thanks for the awesome tool! (Aside: is there a better domain name for me to use than squarewave.github.io? Like a glamour arewenothangingyet.com?)

squarewave commented 7 years ago

My guess is that these are not present as hangs, since I think we just rely on sending hangs in profile-before-change but don't persist them anywhere durable before that. So if we hung and crashed, the hang data would be lost (@mystor, is this right?).

I've been thinking this interface would be good to PoC for crashes for a while, and I would be happy to give it a whirl - I should be able to plug it into the existing system fairly easily?

Regarding the domain name, nothing exists yet. Is there someone I can contact for that kind of thing, or are these generally handled as one-offs by whoever wants them?

mystor commented 7 years ago

This is true: if the browser crashes before the hang data runnable is processed on the main thread, then its data is lost forever right now. In the parent process, we might be able to write the hang data to a file on disk as we detect the hang and transmit it on the next startup, but in the content process it might be a bit trickier (I am not sure when during content process shutdown we, for example, disconnect PBackground).

@asutherland It might be interesting to detect that a shutdown is starting and start sampling every 20 or so ms, writing the stacks out to a file. We would clean the file up if the shutdown succeeds, and if it takes too long, transmit the data we collected on the next startup. It'd probably have to be a (slightly) separate mechanism from core BHR, but we might be able to re-use the hangs.html interface.
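The idea above (sample during shutdown, persist to disk, delete on success, pick up leftovers on next startup) can be sketched roughly as follows. This is only an illustration in Python, not how BHR or Gecko does it: `ShutdownSampler`, `pending_samples`, and the JSON-lines file layout are all hypothetical names and choices made for the sketch.

```python
import json
import os
import sys
import threading
import time
import traceback


class ShutdownSampler:
    """Hypothetical helper: samples one thread's stack while shutdown runs."""

    def __init__(self, path, interval=0.02):
        self.path = path          # file where samples are persisted
        self.interval = interval  # roughly 20 ms between samples
        self._stop = threading.Event()
        self._thread = None
        self._target_id = None

    def _run(self):
        with open(self.path, "a") as out:
            # Event.wait returns False on timeout, True once finish() is called.
            while not self._stop.wait(self.interval):
                frame = sys._current_frames().get(self._target_id)
                if frame is None:
                    continue
                stack = [
                    f"{fs.name} ({os.path.basename(fs.filename)}:{fs.lineno})"
                    for fs in traceback.extract_stack(frame)
                ]
                out.write(json.dumps({"t": time.time(), "stack": stack}) + "\n")
                out.flush()  # keep the file useful even if the process is killed

    def start(self):
        # Sample the thread that calls start(), i.e. the one doing the shutdown.
        self._target_id = threading.get_ident()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def finish(self, shutdown_succeeded):
        self._stop.set()
        self._thread.join()
        if shutdown_succeeded and os.path.exists(self.path):
            os.remove(self.path)  # clean shutdown: nothing to report


def pending_samples(path):
    """On next startup: return leftover samples (to transmit), then delete the file."""
    if not os.path.exists(path):
        return []
    samples = []
    with open(path) as f:
        for line in f:
            try:
                samples.append(json.loads(line))
            except ValueError:
                pass  # skip a line truncated by the kill
    os.remove(path)
    return samples
```

On startup, a non-empty result from `pending_samples(path)` would mean the previous shutdown never finished cleanly, and those stacks are what would get transmitted.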

asutherland commented 7 years ago

> I've been thinking this interface would be good to PoC for crashes for a while, and I would be happy to give it a whirl - I should be able to plug it into the existing system fairly easily?

I'm unsure if you're expressing hope here that you can easily reuse this codebase, or if you're asking a question about integrating with crash-stats.mozilla.com. I am but a humble occasional user of crash-stats who noticed that the "raw dump" page has JSON (example) on it that presumably can be pulled out. Also that there's an "API Tokens" link at the bottom (at least when logged in), and that page links to https://crash-stats.mozilla.com/api/ which seems neat.
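For illustration, here is a rough sketch of how per-crash thread backtraces, once pulled out as JSON, might be merged into the kind of call tree a bhr.html-style tree view shows. The input shape (a `threads` list whose entries carry a `frames` list with `function` and `module` keys, innermost frame first) is an assumption about the dump JSON, not a documented contract, and `build_tree` and its helpers are hypothetical names.

```python
from collections import defaultdict


def new_node():
    # Each node counts how many backtraces passed through it.
    return {"count": 0, "children": defaultdict(new_node)}


def add_stack(root, frames):
    """Merge one backtrace (assumed innermost frame first) into the tree."""
    node = root
    node["count"] += 1
    for frame in reversed(frames):  # walk from the outermost caller inward
        name = frame.get("function") or frame.get("module") or "(unknown)"
        node = node["children"][name]
        node["count"] += 1


def build_tree(reports, thread_index=0):
    """Aggregate the chosen thread of every report into a single call tree."""
    root = new_node()
    for report in reports:
        threads = report.get("threads", [])
        if thread_index < len(threads):
            add_stack(root, threads[thread_index].get("frames", []))
    return root


def print_tree(node, name="(root)", depth=0):
    """Print the tree with the most frequent children first."""
    print("  " * depth + f"{node['count']:>4} {name}")
    children = sorted(node["children"].items(), key=lambda kv: -kv[1]["count"])
    for child_name, child in children:
        print_tree(child, child_name, depth + 1)
```

Feeding it a list of already-downloaded dump dicts and calling `print_tree(build_tree(reports))` would give a quick text view of where the shutdown stacks cluster, before wiring anything into the real UI.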

> Regarding the domain name, nothing exists yet. Is there someone I can contact for that kind of thing, or are these generally handled as one-offs by whoever wants them?

This page on Mana is what I found for domain name registration, which is associated with this Bugzilla component. I think a lot of people may end up just setting things up themselves and either expensing it (or eating the cost if they're lazy like me) because it seems easier. In general it looks like the Infrastructure & Operations Bugzilla product is where you'd file a bug if you wanted anything else that involves IT standing up infrastructure.

squarewave commented 7 years ago

Sorry, that was an unthinking question mark - I was expressing hope that I can easily reuse code. And I'm relatively certain that's the case. I think I should just be able to pull most of what I need from this gist: https://gist.github.com/ddurst/4cdad11ac9c30d340bfe4a5f0d6585aa#file-ping-based-top-crashers-py-L1125

squarewave commented 7 years ago

(https://arewesmoothyet.com)

squarewave commented 6 years ago

@asutherland, mind taking a look at this?

https://arewesmoothyet.com/?category=all&durationSpec=crashes

This is a 20% sample of all release channel hang reports submitted on 9/17. The graph in the link will show stats by build date, however. Is this in line with what you were thinking?

squarewave commented 6 years ago

Also, it's just a proof of concept; since it was adapted from the hang logic, the units are garbage (the relative frequencies are all we can see from this).

asutherland commented 6 years ago

@squarewave This is amazing! I'm still digging into things, but if you have any tips on ways to more efficiently expand the tree-view, I'd appreciate them. I feel like the pre-react cleopatra UI at one point supported '*' as a way to fully expand a tree, but that doesn't seem to work anymore. If there is such a shortcut, pretend this was instead an enhancement request to have the '?' key show a list of keyboard shortcuts like github and gmail do.

Is the backing data pre-computed on the server, or is there a way for me to trigger the UI to scrape data from crash-stats for a specific signature/query (or filter down the existing data)? From my limited perspective as a developer concerned about specific subsystems I work on, that's where my greatest interest is. Although I'm confident the high level view will be invaluable for triage too!

mystor commented 6 years ago

@asutherland I think alt-click on the arrow will fully expand a subtree :-).

asutherland commented 6 years ago

Ah, thank you! By default gnome-shell sets alt to be the "window action key", which means alt+click was being intercepted by the window manager to move the window. That was less useful, so I've disabled it now. Alt-click 4eva!