heroku / barnes

Tell StatsD about request time, GC, objects and more. Latest Rails 4 and Ruby 2.1 support, and ancient Rails 2 and Ruby 1.8 support.
MIT License
0 stars 0 forks source link

Crashing #16

Closed freeformz closed 6 years ago

freeformz commented 6 years ago

As reported by a beta user:

May 11 02:27:15 act-api app/web.1: #<Thread:0x0000000005317db0@/app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:39 run> terminated with exception (report_on_exception is true): 
May 11 02:27:15 act-api app/web.1: /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/instruments/puma_backlog.rb:41:in `+': nil can't be coerced into Integer (TypeError) 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/instruments/puma_backlog.rb:41:in `reduce' 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/instruments/puma_backlog.rb:41:in `instrument!' 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/panel.rb:46:in `block in instrument!' 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/panel.rb:45:in `each' 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/panel.rb:45:in `instrument!' 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:58:in `block (3 levels) in initialize' 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:57:in `each' 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:57:in `block (2 levels) in initialize' 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:46:in `loop' 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:46:in `block in initialize' 
May 11 02:27:15 act-api app/web.1: [59] ! Detected parent died, dying 
May 11 02:27:15 act-api app/web.1: Exiting 
May 11 02:27:15 act-api app/web.1: /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/instruments/puma_backlog.rb:41:in `+': nil can't be coerced into Integer (TypeError) 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/instruments/puma_backlog.rb:41:in `reduce' 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/instruments/puma_backlog.rb:41:in `instrument!' 
May 11 02:27:15 act-api app/web.1: [114] ! Detected parent died, dying 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/panel.rb:46:in `block in instrument!'
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/panel.rb:45:in `each' 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/panel.rb:45:in `instrument!' 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:58:in `block (3 levels) in initialize' 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:57:in `each' 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:57:in `block (2 levels) in initialize' 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:46:in `loop' 
May 11 02:27:15 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:46:in `block in initialize'
May 11 02:27:37 act-api app/web.1: #<Thread:0x00000000042a11e8@/app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:39 run> terminated with exception (report_on_exception is true): 
May 11 02:27:37 act-api app/web.1: /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/instruments/puma_backlog.rb:41:in `+': nil can't be coerced into Integer (TypeError) 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/instruments/puma_backlog.rb:41:in `reduce' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/instruments/puma_backlog.rb:41:in `instrument!' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/panel.rb:46:in `block in instrument!' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/panel.rb:45:in `each' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/panel.rb:45:in `instrument!' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:58:in `block (3 levels) in initialize' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:57:in `each' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:57:in `block (2 levels) in initialize' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:46:in `loop' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:46:in `block in initialize' 
May 11 02:27:37 act-api app/web.1: #<Thread:0x0000000001c0eff8@/app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/cluster.rb:276 run> terminated with exception (report_on_exception is true): 
May 11 02:27:37 act-api app/web.1: /app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/cluster.rb:285:in `write': Broken pipe (Errno::EPIPE) 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/cluster.rb:285:in `<<' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/cluster.rb:285:in `block in worker' 
May 11 02:27:37 act-api app/web.1: [60] ! Detected parent died, dying 
May 11 02:27:37 act-api app/web.1: [55] ! Detected parent died, dying 
May 11 02:27:37 act-api app/web.1: Exiting 
May 11 02:27:37 act-api app/web.1: /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/instruments/puma_backlog.rb:41:in `+': nil can't be coerced into Integer (TypeError) 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/instruments/puma_backlog.rb:41:in `reduce' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/instruments/puma_backlog.rb:41:in `instrument!' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/panel.rb:46:in `block in instrument!' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/panel.rb:45:in `each' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/panel.rb:45:in `instrument!' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:58:in `block (3 levels) in initialize' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:57:in `each' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:57:in `block (2 levels) in initialize' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:46:in `loop' 
May 11 02:27:37 act-api app/web.1:  from /app/vendor/bundle/ruby/2.5.0/gems/barnes-0.0.4/lib/barnes/periodic.rb:46:in `block in initialize' 
schneems commented 6 years ago

Should be fixed in https://github.com/heroku/barnes/pull/15/files#diff-4ae3ffa8e6c4911197af16f29bed93c6R40 where I default to a value of 0 if the value is nil.

I'm waiting on verification from opex that the metric is being sent on the issuetriage app before I merge that. I have a PR to https://github.com/heroku/metaas/pull/1371. I don't know what else is needed. I'll be out again next week.

freeformz commented 6 years ago

@schneems What is stopping us from breaking that out to address this issue and not tie it to the other stuff?

mimen commented 6 years ago

@schneems What's the status on this?

@freeformz is this still happening?

freeformz commented 6 years ago

The user rolled back and I assume this is still a potential issue as heroku/barnes#15 is still open. Let's wait for @schneems to respond.

bgentry commented 6 years ago

I haven't tried again yet. Would like some assurance first that the error handling is robust enough that it won't take down my app for hours at a time while I sleep 😉

Please lmk when and I'll give it another shot.

schneems commented 6 years ago

Sorry for the delay here. And sorry for the failure. I've released barnes 0.0.5 which includes https://github.com/heroku/barnes/pull/15.

In the PR I'm explicitly checking for the existence of the metric before we try to add the values. It should fix the issue you were seeing. Give it a shot and let me know.

swrobel commented 6 years ago

@schneems I hate to call you out because of all of the good work you do, but some communication that 0.0.5 was a critical update would've been appreciated. Our app unexpectedly crashed this morning and barnes was at fault. I had looked at outdated gems yesterday, seen that barnes had an update, looked at the changelog, saw it wasn't anything critical and decided to defer. Also, since this is a beta, I'm sure you could've emailed all beta customers to let them know the upgrade was critical. This was quite frustrating.

schneems commented 6 years ago

Sorry about that. Do you have any details on the failure mode? Any logs would be helpful.

In the future I can try to rework the logic so that collection behavior happens locally but does not send to statsd. This way future errors or bugs could be detected locally before they make their way to production.

swrobel commented 6 years ago

Looking over the logs, it seems to have started happening after I promoted a release that, among other things, added PumaWorkerKiller, and the crashes seem to be timed with its rolling restarts. In the final of the 3 crashes, instead of restarting the app like in the previous 2, the app remained in a crashed state. This doesn't seem to be happening anymore after I upgraded to Barnes 0.0.5 (still using PWK).

schneems commented 6 years ago

It looks like the failure mode is with puma worker killer or at least with a race condition involving processes being killed off. I previously didn't understand why @bgentry was getting the results from 0.0.4, but that makes a bit more sense now.

Basically what I think is happening is one of the workers restarts, before it can the parent process calls Puma.stat which then inspects the results of each of the workers. For a brief time that value is not initialized and so it returns nil (I think this is what happens, though I don't have a great way to reproduce or test).

Prior to your post I've only seen this fail twice and didn't know why. I have been running 0.0.4 on CodeTriage for quite some time with no crashes or issues. I believed it to be stable, though containing a potential bug (of which I didn't understand the failure mode).

Sorry for not communicating better, especially in the 0.0.5 changelog. In this case, I don't think that having barnes run in development would have surfaced the issue before being deployed to production.

swrobel commented 6 years ago

Are you using PWK on CodeTriage? I don't see any barnes errors in my logs, but it doesn't seem to be reporting data to metrics. I did see data after the last deploy with 0.0.5, but now it's saying "insufficient data" and graphs aren't rendering for the last 24hrs. Seems maybe the PWK restarts are stopping collection altogether? Not sure...

schneems commented 6 years ago

Do you have Barnes.start in your after boot block? I’m not currently using PWK on CodeTriage.

If you open a ticket help.heroku.com and search for “ruby metrics” and send me the ticket number then I can dig in a bit further with app access. On Tue, Jun 12, 2018 at 3:13 PM Stefan Wrobel notifications@github.com wrote:

Are you using PWK on CodeTriage? I don't see any barnes errors in my logs, but it doesn't seem to be reporting data to metrics. I did see data after the last deploy with 0.0.5, but now it's saying "insufficient data" and graphs aren't rendering for the last 24hrs. Seems maybe the PWK restarts are stopping collection altogether? Not sure...

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/heroku/barnes/issues/16#issuecomment-396718430, or mute the thread https://github.com/notifications/unsubscribe-auth/AADpYE7xgL3Edsr_BCxPcoYJivn-zmlnks5t8CEGgaJpZM4UBfG_ .

schneems commented 6 years ago

Anyone else seeing the currently released 0.0.5 client crash? I think I fixed it.

freeformz commented 6 years ago

I'm going to mark this as closed as we believe it's fixed. We can create a new issue or re-open this one if it comes up again.