Closed tjtelan closed 4 years ago
I just added a notification to one of my repos (to a Slack DM instead of a channel, which I don't think matters). Only one notification, so far. I'll check back in if this changes, but otherwise I think this is going to be hard to repo.
It seems to be this particular repo causing problems. It is the only one with misfiring, multiple Slack notifications.
I looked at the werker logs during its build, and there are a few NSQ errors
Feb 01 21:01:36 werker-new-doo-1 werker[10712]: {"function":"github.com/shankj3/ocelot/vendor/github.com/shankj3/go-til/nsqpb.NSQLogger.Output","level":"error","msg":"1 [build/werker] (10.111.0.28:4250) protocol error - E_TOUCH_FAILED TOUCH 0af83ac1fcadd000 failed client does not own message","time":"2019-02-01T21:01:36Z"}
And at the very end of the build:
Feb 01 21:22:21 werker-new-doo-1 werker[10712]: {"function":"github.com/shankj3/ocelot/vendor/github.com/shankj3/go-til/nsqpb.NSQLogger.Output","level":"error","msg":"1 [build/werker] (10.111.0.28:4250) protocol error - E_FIN_FAILED FIN 0af83ac1fcadd000 failed ID not in flight","time":"2019-02-01T21:22:21Z"}
And as I'm writing this, Ocelot is sending slack sent 2 FAIL, then 1 PASS message for the same commit over the last 30 min
I found that it was NSQ misbehaving again. This time I restarted nsqd, nsqlookupd, and nsqadmin. I made sure there were no lingering topics, so werker can configure all that.
I'm hoping that fixes, or at least isolates this problem. I don't see any other repos causing trouble, and this one is using a few integrations.
I'm going to bump this up to high priority, because of how often the multiple messages will switch from reporting FAIL to PASS.
Another clue: I'm noticing that start times are being re-written.
As seen from ocelot summary
| 2551 | large-repo | running | Mon Feb 4 13:30:58 | N/A | master | 475dc7a |
And a few minutes later
| 2551 | large-repo | running | Mon Feb 4 13:45:58 | N/A | master | 475dc7a |
The start time has changed by 15 minutes...
Closing since this is part of the legacy codebase.
It is possible this is related to #198, and maybe #158, but in this situation, I only see that a single build was queued on this commit.
For context, I've recently restarted all of my nsqd nodes.
I quickly made a build, and then checked on Slack time. Our time stamps in Ocelot make comparing time a little confusing, but for the most part the skew is within a few minutes.
Ocelot reports this build started at
Thu Jan 31 17:48:08
, build-id2537
, and ran for23 minutes and 25 seconds
Slack notifications fired off for the same build at the following timestamps:
It is curious how the notification start way before the build start time...