Closed freeformz closed 6 years ago
Should be fixed in https://github.com/heroku/barnes/pull/15/files#diff-4ae3ffa8e6c4911197af16f29bed93c6R40 where I default to a value of 0 if the value is nil.
I'm waiting on verification from opex that the metric is being sent on the issuetriage
app before I merge that. I have a PR to https://github.com/heroku/metaas/pull/1371. I don't know what else is needed. I'll be out again next week.
@schneems What is stopping us from breaking that out to address this issue and not tie it to the other stuff?
@schneems What's the status on this?
@freeformz is this still happening?
The user rolled back and I assume this is still a potential issue as heroku/barnes#15 is still open. Let's wait for @schneems to respond.
I haven't tried again yet. Would like some assurance first that the error handling is robust enough that it won't take down my app for hours at a time while I sleep 😉
Please lmk when and I'll give it another shot.
Sorry for the delay here. And sorry for the failure. I've released barnes 0.0.5 which includes https://github.com/heroku/barnes/pull/15.
In the PR I'm explicitly checking for the existence of the metric before we try to add the values. It should fix the issue you were seeing. Give it a shot and let me know.
@schneems I hate to call you out because of all of the good work you do, but some communication that 0.0.5 was a critical update would've been appreciated. Our app unexpectedly crashed this morning and barnes was at fault. I had looked at outdated gems yesterday, seen that barnes had an update, looked at the changelog, saw it wasn't anything critical and decided to defer. Also, since this is a beta, I'm sure you could've emailed all beta customers to let them know the upgrade was critical. This was quite frustrating.
Sorry about that. Do you have any details on the failure mode? Any logs would be helpful.
In the future I can try to rework the logic so that collection behavior happens locally but does not send to statsd. This way future errors or bugs could be detected locally before they make their way to production.
Looking over the logs, it seems to have started happening after I promoted a release that, among other things, added PumaWorkerKiller, and the crashes seem to be timed with its rolling restarts. In the final of the 3 crashes, instead of restarting the app like in the previous 2, the app remained in a crashed state. This doesn't seem to be happening anymore after I upgraded to Barnes 0.0.5 (still using PWK).
It looks like the failure mode is with puma worker killer or at least with a race condition involving processes being killed off. I previously didn't understand why @bgentry was getting the results from 0.0.4, but that makes a bit more sense now.
Basically what I think is happening is one of the workers restarts, before it can the parent process calls Puma.stat
which then inspects the results of each of the workers. For a brief time that value is not initialized and so it returns nil (I think this is what happens, though I don't have a great way to reproduce or test).
Prior to your post I've only seen this fail twice and didn't know why. I have been running 0.0.4 on CodeTriage for quite some time with no crashes or issues. I believed it to be stable, though containing a potential bug (of which I didn't understand the failure mode).
Sorry for not communicating better, especially in the 0.0.5 changelog. In this case, I don't think that having barnes run in development would have surfaced the issue before being deployed to production.
Are you using PWK on CodeTriage? I don't see any barnes errors in my logs, but it doesn't seem to be reporting data to metrics. I did see data after the last deploy with 0.0.5, but now it's saying "insufficient data" and graphs aren't rendering for the last 24hrs. Seems maybe the PWK restarts are stopping collection altogether? Not sure...
Do you have Barnes.start in your after boot block? I’m not currently using PWK on CodeTriage.
If you open a ticket help.heroku.com and search for “ruby metrics” and send me the ticket number then I can dig in a bit further with app access. On Tue, Jun 12, 2018 at 3:13 PM Stefan Wrobel notifications@github.com wrote:
Are you using PWK on CodeTriage? I don't see any barnes errors in my logs, but it doesn't seem to be reporting data to metrics. I did see data after the last deploy with 0.0.5, but now it's saying "insufficient data" and graphs aren't rendering for the last 24hrs. Seems maybe the PWK restarts are stopping collection altogether? Not sure...
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/heroku/barnes/issues/16#issuecomment-396718430, or mute the thread https://github.com/notifications/unsubscribe-auth/AADpYE7xgL3Edsr_BCxPcoYJivn-zmlnks5t8CEGgaJpZM4UBfG_ .
Anyone else seeing the currently released 0.0.5 client crash? I think I fixed it.
I'm going to mark this as closed as we believe it's fixed. We can create a new issue or re-open this one if it comes up again.
As reported by a beta user: