orbitalci / orbital

Orbital is a self-hosted CI system for solo/small consulting dev teams. Written in Rust.
GNU General Public License v3.0
29 stars 2 forks source link

Slack notification fires off multiple times for a single build #200

Closed tjtelan closed 4 years ago

tjtelan commented 5 years ago

It is possible this is related to #198, and maybe #158, but in this situation, I only see that a single build was queued on this commit.

For context, I've recently restarted all of my nsqd nodes.

I quickly made a build, and then checked on Slack time. Our time stamps in Ocelot make comparing time a little confusing, but for the most part the skew is within a few minutes.

Ocelot reports this build started at Thu Jan 31 17:48:08, build-id 2537, and ran for 23 minutes and 25 seconds

Slack notifications fired off for the same build at the following timestamps:

It is curious how the notification start way before the build start time...

tjtelan commented 5 years ago

I just added a notification to one of my repos (to a Slack DM instead of a channel, which I don't think matters). Only one notification, so far. I'll check back in if this changes, but otherwise I think this is going to be hard to repo.

tjtelan commented 5 years ago

It seems to be this particular repo causing problems. It is the only one with misfiring, multiple Slack notifications.

I looked at the werker logs during its build, and there are a few NSQ errors

Feb 01 21:01:36 werker-new-doo-1 werker[10712]: {"function":"github.com/shankj3/ocelot/vendor/github.com/shankj3/go-til/nsqpb.NSQLogger.Output","level":"error","msg":"1 [build/werker] (10.111.0.28:4250) protocol error - E_TOUCH_FAILED TOUCH 0af83ac1fcadd000 failed client does not own message","time":"2019-02-01T21:01:36Z"}

And at the very end of the build:

Feb 01 21:22:21 werker-new-doo-1 werker[10712]: {"function":"github.com/shankj3/ocelot/vendor/github.com/shankj3/go-til/nsqpb.NSQLogger.Output","level":"error","msg":"1 [build/werker] (10.111.0.28:4250) protocol error - E_FIN_FAILED FIN 0af83ac1fcadd000 failed ID not in flight","time":"2019-02-01T21:22:21Z"}

And as I'm writing this, Ocelot is sending slack sent 2 FAIL, then 1 PASS message for the same commit over the last 30 min

tjtelan commented 5 years ago

I found that it was NSQ misbehaving again. This time I restarted nsqd, nsqlookupd, and nsqadmin. I made sure there were no lingering topics, so werker can configure all that.

I'm hoping that fixes, or at least isolates this problem. I don't see any other repos causing trouble, and this one is using a few integrations.

tjtelan commented 5 years ago

I'm going to bump this up to high priority, because of how often the multiple messages will switch from reporting FAIL to PASS.

tjtelan commented 5 years ago

Another clue: I'm noticing that start times are being re-written.

As seen from ocelot summary

| 2551     | large-repo | running                     | Mon Feb 4 13:30:58  | N/A    | master | 475dc7a |

And a few minutes later

| 2551     | large-repo | running                     | Mon Feb 4 13:45:58  | N/A    | master | 475dc7a |

The start time has changed by 15 minutes...

tjtelan commented 4 years ago

Closing since this is part of the legacy codebase.