orbitalci / orbital

Orbital is a self-hosted CI system for solo/small consulting dev teams. Written in Rust.
GNU General Public License v3.0

nsqd cluster operational issues may cause `ocelot log` streaming output to stop working during builds #198

Closed tjtelan closed 5 years ago

tjtelan commented 5 years ago

Builds are going pretty slowly at times. My build is in the running state and has been running for 3 minutes, but `ocelot logs` gives me nothing useful:

$ ocelot logs
no -hash flag passed, using detected hash ff84c635f09a21f7626f0632a4cb33994242a525
ff84c635f09a21f7626f0632a4cb33994242a525

And when I ssh into the werker VM and look at the docker logs, I only see this:

werker:~$ docker logs <containerid>
Connecting to 172.17.0.1:9090 (172.17.0.1:9090)
Connecting to s3-us-west-2.amazonaws.com (52.218.212.72:443)
do_things.tar        100% |*******************************|  9728   0:00:00 ETA

Then all of a sudden the container was gone, and I had logs that I could see through `ocelot logs`. Other devs are also complaining about builds behaving unpredictably. Kind of fuzzy evidence...

shankj3 commented 5 years ago

what are the logs you can see once it closes out?

tjtelan commented 5 years ago

Regular container logs. All the stdout/stderr content. It is like the output buffer stopped flushing.

`ocelot logs` pretty much doesn't work for streaming at the moment. It only works after the build has completed.

shankj3 commented 5 years ago

whaaat, does it work locally?

shankj3 commented 5 years ago

Also, what are the admin logs when you try to stream? Does it route to the right werker host? And if it does, is there anything in the werker logs?

tjtelan commented 5 years ago

Based on @shankj3's suggestion offline, I restarted admin and werker with the environment variable DEBUGGIT=1 and kicked off a new build.
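
A rough sketch of that restart, assuming werker runs under systemd with a unit actually named werker (the unit name and the drop-in approach are my assumptions; adjust for however admin/werker are launched in your setup):

$ sudo systemctl edit werker    # add the two lines below to the drop-in, save, exit
  [Service]
  Environment=DEBUGGIT=1
$ sudo systemctl restart werker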

Admin logs printed out the contents of the repo's ocelot.yml.

And this little note, which is weird. It does reference a host running nsqd:

{"function":"github.com/shankj3/ocelot/vendor/github.com/shankj3/go-til/nsqpb.NSQLogger.Output","level":"info","msg":"1 (10.111.x.y:4250) connecting to nsqd","time":"2019-01-31T20:13:04Z"}

Which got me to look at the logs for nsqd.


Now, here's where I could have been more careful about gathering logs, since I lost them.

Since there is still a bit of a preference for the first node, I looked there.

I wish I had logs to copy/paste, but I can say that all of this nsqd's logs referred to not having ownership, or some other NSQ-internal garbage...

Since I'm running NSQ in Kubernetes, I deleted the pod and tried another test build. `ocelot logs` output was working again!
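
For reference, the restart was just deleting the nsqd pod and letting Kubernetes recreate it; the namespace and pod name below are placeholders, not the actual names from my cluster:

$ kubectl -n <namespace> get pods | grep nsqd
$ kubectl -n <namespace> delete pod <nsqd-pod-name>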


It is entirely possible that this is an issue wholly within NSQ. Before closing this issue, we should add some documentation stating that this is a known issue that affects streaming logs, and that the workaround is to restart nsqd.

tjtelan commented 5 years ago

The werker logs had the flood of NSQ errors I was seeing. Add this to the docs:

When these are in the NSQ logs or the Werker logs, restart nsqd.

Feb 01 21:15:36 werker-new-doo-1 werker[10712]: {"connections":"1","function":"github.com/shankj3/ocelot/vendor/github.com/shankj3/go-til/nsqpb.(*ProtoConsume).NSQProtoConsume","level":"debug","messagesFinished":"6","messagesReceived":"7","messagesRequeued":"0","msg":"consumer stats","time":"2019-02-01T21:15:36Z"}
Feb 01 21:15:36 werker-new-doo-1 werker[10712]: {"function":"github.com/shankj3/ocelot/vendor/github.com/shankj3/go-til/nsqpb.NSQLogger.Output","level":"error","msg":"1 [build/werker] (10.111.0.28:4250) protocol error - E_TOUCH_FAILED TOUCH 0af83ac1fcadd000 failed ID not in flight","time":"2019-02-01T21:15:36Z"}
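
A quick way to check whether a werker host is in this state, assuming it runs under systemd with a unit named werker (that unit name is a guess based on the journald prefix above):

$ journalctl -u werker --since "1 hour ago" | grep E_TOUCH_FAILED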

tjtelan commented 5 years ago

This log issue is a problem with Consul. If we are building the same hash over and over (because of failures), it seems like there's a pretty high chance that `ocelot logs` will choose the incorrect build to try to connect to for live logs. Clearing out the `ci` top-level key in Consul is extreme, but it does the job.
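
For anyone else hitting this, a sketch of the manual cleanup I mean, assuming the build-tracking data sits under a Consul KV prefix literally named ci (destructive: this wipes everything under that prefix):

$ consul kv delete -recurse ci/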

I am printing out the werker UUID in a branch. I'm thinking that will help with the manual maintenance, and a strategy for cleaning up Consul will go a long way until we fix this incorrect behavior.

Closing.