Closed tjtelan closed 5 years ago
Nsq actually creates a topic when the first message goes through a channel, is that what you mean
That's why there is the waiting behavior
I quickly tried to repo this by removing the nsq setup in the ocelot_vagrant_dev role, and got different behavior after waiting a minute (The expected experience is that this would never leave the queued for build state). I'll leave what I originally wrote below for reference, and lower the priority.
But before closing issue I think I need Evan (or a person with a faster computer than mine) to try it out, since his laptop has been able to hit a bunch of other race conditions. The NSQ setup is what originally offered a "working" installation. Now I'm confused about explaining what we were experiencing. (I remember he waited a good 10min w/o states changing.)
Here's the happy output:
+----------+-------------+----------------+---------------------+--------+--------+------------------------------------------+
| BUILD ID | REPO | BUILD DURATION | START TIME | RESULT | BRANCH | HASH |
+----------+-------------+----------------+---------------------+--------+--------+------------------------------------------+
| 1 | ocelot-test | 0 seconds | Sat Dec 22 17:11:05 | PASS | master | ebbb462798a7ca3ca5653d6529d8abfb93aa6103 |
+----------+-------------+----------------+---------------------+--------+--------+------------------------------------------+
(Original post)
The behavior we were seeing with the Vagrant environments was that channels would never be created. Therefore, builds would never move out of waiting to be queued.
This can be repo'ed now if you comment out this line before running vagrant up
, then try to add a vcs + ocelot build
https://github.com/level11consulting/ocelot/blob/fba713c6dc4d205d4ef2c6f1dcb5acee318d4766/deploy/roles/vagrant_ocelot_dev/tasks/main.yml#L13
and following the rest of instructions for how to start admin/werker/hookhandler + https://github.com/level11consulting/ocelot/pull/173#issue-238874673
Example output:
+----------+-------------+------------------+------------+--------+--------+------------------------------------------+
| BUILD ID | REPO | BUILD DURATION | START TIME | RESULT | BRANCH | HASH |
+----------+-------------+------------------+------------+--------+--------+------------------------------------------+
| 1 | ocelot-test | queued for build | N/A | N/A | master | ebbb462798a7ca3ca5653d6529d8abfb93aa6103 |
+----------+-------------+------------------+------------+--------+--------+------------------------------------------+
There are a lot of things that could lead to a queues build not being picked up, the logs would be able to tell If a topic still hasn't been created, you'd see in the worker every 10s a log saying the topic isn't created Is that the behavior that is occurring?
Yup. I don't have logs handy to copy/paste, but the werker stdout polls every 10s looking for a build
topic. And sometimes even when the topic is created, werker still complains and looking for the topic until the channel is created. The race here seems way less complicated, since it all depends on NSQ being configured.
Was there any particular reason why werker also doesn't call nsqpb.GetInitProducer()
to create the topic? The call seems to be idempotent, but maybe I'm wrong?
Because right now werker doesn't create messages, and also calling that I am pretty sure doesnt actually create the topic/channel. Sending the first message creates the topic, so when the first build is queued the topic is created. Are you saying you have seen different behavior?
Does this lead to builds not queuing / not being picked up? Because if not I don't see what the issue is
This issue is entirely about how builds might never leave queued for build
state on Ocelot's very first build. Since Ocelot is still set up on VMs by hand, this race condition is not tied to vagrant-specific automation.
I'm unable to repo it on my own, but when Evan was onboarding (with his newer and faster than my own macbook, which I believe is an important detail), he would have builds stuck in a werker-waiting-for-topic spinlock unless we created the topic/channels beforehand. We spent a day troubleshooting before manually creating the topic + channels in order to see some build success.
I think we should be able to detect this condition and fix it during runtime, but I'm trying to pick apart the steps to try and write a test to set up the scenario.
Yeah, I've seen behavior where on first start of Ocelot, you queue a build and it sits waiting forever. Restarting the services gets it to finally build and finish. But sometimes, for reasons I've yet to fully understand, there's some deadlock somewhere that keeps builds from running until you restart Ocelot.
Weird, as in prod and Dev environments there has never been a manual creation of a topic, even when all the vms had to be torn down and recreated. If you go to nsq admin view in this condition does it say that the message is still in the queue? did you turn on debugging for the nsqd container? If it's still in the sleep / lookup topic state and the topic exists thats disturbing as that's close to a direct API call from what I remember
I tried to repro this issue and failed. I deleted the line in deploy/roles/vagrant_ocelot_dev.yml
that imports the nsq_setup.yml task.
I think this was my fault. I believe this was observed during my onboarding while TJ and I were refactoring the deployment scripts. I probably saw this in some case where the nsq port wasn't set correctly and ocelot/werker couldn't communicate with nsq. I probably fixed it and didn't realize it, and that's when the build kicked in.
I'm closing this, but if I see this behavior again, I'll reopen with more information.
I was trying to figure out why builds wouldn't work unless I created the topic/channel manually. I even added an ansible task in the vagrant_ocelot_dev role.
nsqpb.GetInitProducer() seems to be responsible for the creation of the topic. Topic creation is getting called from multiple.
https://github.com/level11consulting/ocelot/blob/fba713c6dc4d205d4ef2c6f1dcb5acee318d4766/cmd/hookhandler/main.go#L60
This gets run when main.admin runs admin.GetGrpcServer() https://github.com/level11consulting/ocelot/blob/fba713c6dc4d205d4ef2c6f1dcb5acee318d4766/router/admin/guideocelotstruct.go#L45
https://github.com/level11consulting/ocelot/blob/fba713c6dc4d205d4ef2c6f1dcb5acee318d4766/cmd/changecheck/main.go#L69
Channel creation is not happening with
admin
,werker
orhookhandler
, as far as I am experiencing with the Vagrant environment.This call from werker https://github.com/level11consulting/ocelot/blob/fba713c6dc4d205d4ef2c6f1dcb5acee318d4766/cmd/werker/main.go#L67
Calls this area in go-til that calls the nsq.NewConsumer() https://github.com/shankj3/go-til/blob/01d180352f27ceecf95e442f8f73f113f1980841/nsqpb/consume.go#L148
However, you can only reach that section of code if the topic has been created...
If we're creating the topic from multiple areas, then why are we seeing this behavior?