orbitalci / orbital

Orbital is a self-hosted CI system for solo/small consulting dev teams. Written in Rust.
GNU General Public License v3.0
29 stars 2 forks source link

NSQ topic/channels not always being created on startup #189

Closed tjtelan closed 5 years ago

tjtelan commented 5 years ago

I was trying to figure out why builds wouldn't work unless I created the topic/channel manually. I even added an ansible task in the vagrant_ocelot_dev role.

nsqpb.GetInitProducer() seems to be responsible for the creation of the topic. Topic creation is getting called from multiple.

https://github.com/level11consulting/ocelot/blob/fba713c6dc4d205d4ef2c6f1dcb5acee318d4766/cmd/hookhandler/main.go#L60

This gets run when main.admin runs admin.GetGrpcServer() https://github.com/level11consulting/ocelot/blob/fba713c6dc4d205d4ef2c6f1dcb5acee318d4766/router/admin/guideocelotstruct.go#L45

https://github.com/level11consulting/ocelot/blob/fba713c6dc4d205d4ef2c6f1dcb5acee318d4766/cmd/changecheck/main.go#L69


Channel creation is not happening with admin, werker or hookhandler, as far as I am experiencing with the Vagrant environment.

This call from werker https://github.com/level11consulting/ocelot/blob/fba713c6dc4d205d4ef2c6f1dcb5acee318d4766/cmd/werker/main.go#L67

Calls this area in go-til that calls the nsq.NewConsumer() https://github.com/shankj3/go-til/blob/01d180352f27ceecf95e442f8f73f113f1980841/nsqpb/consume.go#L148


However, you can only reach that section of code if the topic has been created...

If we're creating the topic from multiple areas, then why are we seeing this behavior?

shankj3 commented 5 years ago

Nsq actually creates a topic when the first message goes through a channel, is that what you mean

shankj3 commented 5 years ago

That's why there is the waiting behavior

tjtelan commented 5 years ago

I quickly tried to repo this by removing the nsq setup in the ocelot_vagrant_dev role, and got different behavior after waiting a minute (The expected experience is that this would never leave the queued for build state). I'll leave what I originally wrote below for reference, and lower the priority.

But before closing issue I think I need Evan (or a person with a faster computer than mine) to try it out, since his laptop has been able to hit a bunch of other race conditions. The NSQ setup is what originally offered a "working" installation. Now I'm confused about explaining what we were experiencing. (I remember he waited a good 10min w/o states changing.)

Here's the happy output:

+----------+-------------+----------------+---------------------+--------+--------+------------------------------------------+
| BUILD ID |    REPO     | BUILD DURATION |     START TIME      | RESULT | BRANCH |                   HASH                   |
+----------+-------------+----------------+---------------------+--------+--------+------------------------------------------+
| 1        | ocelot-test | 0 seconds      | Sat Dec 22 17:11:05 | PASS   | master | ebbb462798a7ca3ca5653d6529d8abfb93aa6103 |
+----------+-------------+----------------+---------------------+--------+--------+------------------------------------------+

(Original post)

The behavior we were seeing with the Vagrant environments was that channels would never be created. Therefore, builds would never move out of waiting to be queued.

This can be repo'ed now if you comment out this line before running vagrant up, then try to add a vcs + ocelot build https://github.com/level11consulting/ocelot/blob/fba713c6dc4d205d4ef2c6f1dcb5acee318d4766/deploy/roles/vagrant_ocelot_dev/tasks/main.yml#L13

and following the rest of instructions for how to start admin/werker/hookhandler + https://github.com/level11consulting/ocelot/pull/173#issue-238874673

Example output:

+----------+-------------+------------------+------------+--------+--------+------------------------------------------+
| BUILD ID |    REPO     |  BUILD DURATION  | START TIME | RESULT | BRANCH |                   HASH                   |
+----------+-------------+------------------+------------+--------+--------+------------------------------------------+
| 1        | ocelot-test | queued for build | N/A        | N/A    | master | ebbb462798a7ca3ca5653d6529d8abfb93aa6103 |
+----------+-------------+------------------+------------+--------+--------+------------------------------------------+
shankj3 commented 5 years ago

There are a lot of things that could lead to a queues build not being picked up, the logs would be able to tell If a topic still hasn't been created, you'd see in the worker every 10s a log saying the topic isn't created Is that the behavior that is occurring?

tjtelan commented 5 years ago

Yup. I don't have logs handy to copy/paste, but the werker stdout polls every 10s looking for a build topic. And sometimes even when the topic is created, werker still complains and looking for the topic until the channel is created. The race here seems way less complicated, since it all depends on NSQ being configured.

Was there any particular reason why werker also doesn't call nsqpb.GetInitProducer() to create the topic? The call seems to be idempotent, but maybe I'm wrong?

shankj3 commented 5 years ago

Because right now werker doesn't create messages, and also calling that I am pretty sure doesnt actually create the topic/channel. Sending the first message creates the topic, so when the first build is queued the topic is created. Are you saying you have seen different behavior?

shankj3 commented 5 years ago

Does this lead to builds not queuing / not being picked up? Because if not I don't see what the issue is

tjtelan commented 5 years ago

This issue is entirely about how builds might never leave queued for build state on Ocelot's very first build. Since Ocelot is still set up on VMs by hand, this race condition is not tied to vagrant-specific automation.

I'm unable to repo it on my own, but when Evan was onboarding (with his newer and faster than my own macbook, which I believe is an important detail), he would have builds stuck in a werker-waiting-for-topic spinlock unless we created the topic/channels beforehand. We spent a day troubleshooting before manually creating the topic + channels in order to see some build success.

I think we should be able to detect this condition and fix it during runtime, but I'm trying to pick apart the steps to try and write a test to set up the scenario.

eccramer commented 5 years ago

Yeah, I've seen behavior where on first start of Ocelot, you queue a build and it sits waiting forever. Restarting the services gets it to finally build and finish. But sometimes, for reasons I've yet to fully understand, there's some deadlock somewhere that keeps builds from running until you restart Ocelot.

shankj3 commented 5 years ago

Weird, as in prod and Dev environments there has never been a manual creation of a topic, even when all the vms had to be torn down and recreated. If you go to nsq admin view in this condition does it say that the message is still in the queue? did you turn on debugging for the nsqd container? If it's still in the sleep / lookup topic state and the topic exists thats disturbing as that's close to a direct API call from what I remember

eccramer commented 5 years ago

I tried to repro this issue and failed. I deleted the line in deploy/roles/vagrant_ocelot_dev.yml that imports the nsq_setup.yml task.

I think this was my fault. I believe this was observed during my onboarding while TJ and I were refactoring the deployment scripts. I probably saw this in some case where the nsq port wasn't set correctly and ocelot/werker couldn't communicate with nsq. I probably fixed it and didn't realize it, and that's when the build kicked in.

I'm closing this, but if I see this behavior again, I'll reopen with more information.