tjtelan closed this issue 5 years ago
what are the logs you can see once it closes out?
Regular container logs. All the stdout/stderr content. It is like the output buffer stopped flushing.
ocelot logs
pretty much doesn't work for streaming at the moment; it only works after the build has completed.
whaaat, does it work locally?
also, what are the admin logs when you try to stream? does it route to the right werker host? and if it does, is there anything in the werker logs?
Based on @shankj3's suggestion offline, I restarted admin and werker with the environment variable DEBUGGIT=1 and kicked off a new build.
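For reference, a minimal sketch of that restart step, assuming the services are managed from a shell; the systemd unit names below are assumptions, not from this thread:

```shell
# Enable go-til's verbose debug logging before restarting the services.
export DEBUGGIT=1

# Restart admin and werker however they are managed; for example with
# systemd (unit names here are hypothetical):
#   sudo systemctl restart ocelot-admin ocelot-werker

echo "DEBUGGIT=$DEBUGGIT"  # -> DEBUGGIT=1
```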
The admin logs printed out the contents of the repo's ocelot.yml,
along with this little note, which is weird. It does reference a host running nsqd:
{"function":"github.com/shankj3/ocelot/vendor/github.com/shankj3/go-til/nsqpb.NSQLogger.Output","level":"info","msg":"1 (10.111.x.y:4250) connecting to nsqd","time":"2019-01-31T20:13:04Z"}
Which got me to look at the logs for nsqd.
Now, here's where I should have been more careful about gathering logs, since I lost them.
Since there is still a little bit of preference for the first node, I looked there.
I wish I had logs to copy/paste, but I can say that all of this nsqd's logs referred to not having ownership or some other nsq internal garbage...
Since I'm running NSQ in Kubernetes, I deleted the nsqd pod and tried another test build. ocelot logs
output was working again!
It is entirely possible that this is an issue wholly within NSQ. Before closing this issue, we should add documentation stating that this is a known issue affecting streaming logs, and that restarting nsqd is the workaround.
The werker logs had the flood of NSQ errors I was seeing. Add this to the docs:
When errors like the following appear in the NSQ logs or the werker logs, restart nsqd.
Feb 01 21:15:36 werker-new-doo-1 werker[10712]: {"connections":"1","function":"github.com/shankj3/ocelot/vendor/github.com/shankj3/go-til/nsqpb.(*ProtoConsume).NSQProtoConsume","level":"debug","messagesFinished":"6","messagesReceived":"7","messagesRequeued":"0","msg":"consumer stats","time":"2019-02-01T21:15:36Z"}
Feb 01 21:15:36 werker-new-doo-1 werker[10712]: {"function":"github.com/shankj3/ocelot/vendor/github.com/shankj3/go-til/nsqpb.NSQLogger.Output","level":"error","msg":"1 [build/werker] (10.111.0.28:4250) protocol error - E_TOUCH_FAILED TOUCH 0af83ac1fcadd000 failed ID not in flight","time":"2019-02-01T21:15:36Z"}
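To make the note above actionable, here is a hedged sketch of detecting that failure signature and bouncing nsqd; the Kubernetes namespace and pod label are assumptions, since the thread doesn't show them:

```shell
# One of the failure signatures from the werker logs above. E_TOUCH_FAILED /
# "ID not in flight" mean nsqd no longer tracks the consumer's in-flight message.
logline='protocol error - E_TOUCH_FAILED TOUCH 0af83ac1fcadd000 failed ID not in flight'

# Detect the signature; if present, nsqd needs a restart.
if printf '%s\n' "$logline" | grep -qE 'E_TOUCH_FAILED|ID not in flight'; then
  echo "nsqd restart needed"
  # In Kubernetes, deleting the pod lets its controller recreate it
  # (namespace and label here are assumptions):
  #   kubectl -n ocelot delete pod -l app=nsqd
fi
```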
This log issue is a problem with Consul. If we are building the same hash over and over (because of failures), it seems there's a pretty high chance that ocelot logs
will choose the incorrect build to connect to for live logs. Clearing out the ci
top-level key in Consul is extreme, but it does the job.
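A sketch of that cleanup with the Consul CLI; whether ci/ is the exact key prefix is taken from the thread, but everything else (and the blast radius) is an assumption, so treat this as a last resort:

```shell
# Top-level Consul key ocelot uses for build tracking, per the thread.
prefix="ci/"

# Inspect the subtree before deleting anything (needs a reachable Consul agent):
#   consul kv get -recurse "$prefix"

# Last resort: wipe the subtree. This drops tracking state for ALL builds,
# not just the stuck one.
#   consul kv delete -recurse "$prefix"

echo "target prefix: $prefix"  # -> target prefix: ci/
```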
I am printing out the werker UUID in a branch. I'm thinking that will help with manual maintenance, and a cleanup strategy will go a long way until we fix this incorrect behavior.
Closing.
Builds are going pretty slowly at times. My build is in the running state and has been running for 3 minutes, but
ocelot logs
gives me nothing useful:

And when I ssh into the werker VM and look at the docker logs, I only see this:
Then all of a sudden, the container was gone, and I had logs that I could see through
ocelot logs. Other devs are also complaining about builds behaving unpredictably. Kind of fuzzy evidence...