rancher / community-catalog

Catalog entries contributed by the community

gitlab-multi-runner config Sidekick "couldn't execute POST" connection refused #810

Open felixschul opened 6 years ago

felixschul commented 6 years ago

Hi all,

First of all many thanks for the awesome work with rancher and the catalog items!

I am experiencing the following problem with the gitlab-multi-runner community catalog item:

When a new instance is started (in my case AWS spot instances), sometimes the "config" sidekick that registers the new runner fails with the following error message:

9.7.2018 17:11:34 Running in system-mode.
9.7.2018 17:11:34
9.7.2018 17:11:39 ERROR: Registering runner... failed  runner=FKPDjL73 status=couldn't execute POST against [URL]: dial tcp: lookup gitlab.ambient-innovation.com on 169.254.169.250:53: read udp 10.42.109.212:55972->169.254.169.250:53: read: connection refused
9.7.2018 17:11:39 PANIC: Failed to register this runner. Perhaps you are having network problems

This happens only sometimes on some of the newly started instances. When I start the config sidekick again, everything works fine.

My assumption is that the sidekick executes the POST request a little too early, before Rancher has fully set up the network for the new instance. In other words, the scheduler starts the new container (and its sidekick) before the network is fully ready. This might be related to https://github.com/rancher/rancher/issues/2621

Does anyone have any idea on how to fix this or work around this? We shut down spot instances and start new ones very frequently, so this is really a problem and a manual solution (starting the failed sidekicks manually) is not an option for me. Any help is greatly appreciated.

Further information:

Rancher Server v1.6.18
Gitlab multi runner v10.4.0
The servers are AWS t2.medium instances and run on Rancher OS v1.2.0

rawmind0 commented 6 years ago

Hi @felixschul ,

Judging by the error message you attached, it seems you have an issue with the Rancher metadata infrastructure service. You are getting "connection refused" from the internal metadata DNS. Is your metadata service running correctly when you launch new gitlab-runner instances? Are you seeing this issue only for this service?

It could be a race condition where new gitlab-runner instances try to start on new spot instances before the Rancher metadata service is completely up and running on them. Could you please check?

felixschul commented 6 years ago

Hi @rawmind0,

Many thanks for the super-fast feedback!

I noticed the issue only for this service. But this service might well be the only one that executes a request relying on the metadata service immediately on start. Other services may simply need a few seconds longer to start (or to pull), so the error never shows up for them. Also, the error does not appear every time the gitlab runner sidekick is started on a new AWS instance. It appears about 30% of the time. I also assume that this is a race condition where the gitlab runner sidekick starts just a little bit before the metadata service is ready. I cannot see any errors in the metadata service logs, so the service works fine once it has started on the new instance.

From my perspective the gitlab runner sidekick container should check if the metadata service is ready and wait for it to get ready (maybe with some kind of loop that simply includes a "sleep").

It might also help if I found a way to make Rancher wait a few seconds before starting this container, but I found no option for this.

My only idea is to contribute to the community item and write an entrypoint script that checks the metadata service and waits for it to be available.
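To make the idea concrete, here is a minimal sketch of such a wait loop in Python. It is not taken from the catalog item; the function name `wait_for_dns` and its parameters are hypothetical. Since the failing step in the log above is the DNS lookup against 169.254.169.250:53, the sketch simply retries name resolution of the GitLab host with a sleep in between, and gives up after a timeout.

```python
import socket
import time


def wait_for_dns(hostname, timeout=60.0, interval=2.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep,
                 clock=time.monotonic):
    """Block until `hostname` resolves or `timeout` seconds have passed.

    Hypothetical helper: returns True as soon as one lookup succeeds,
    False if the deadline passes without a successful lookup.
    `resolve`, `sleep` and `clock` are injectable so the loop can be tested.
    """
    deadline = clock() + timeout
    while True:
        try:
            # Port 443 is an assumption (HTTPS GitLab); only the lookup matters.
            resolve(hostname, 443)
            return True
        except OSError:
            # socket.gaierror (failed lookup) is a subclass of OSError.
            if clock() >= deadline:
                return False
            sleep(interval)
```

An entrypoint could call `wait_for_dns("gitlab.ambient-innovation.com")` first and only run the existing registration command if it returns True, exiting non-zero otherwise so that the failure stays visible in Rancher.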

Any other ideas?

Thanks a lot!

Felix

rawmind0 commented 6 years ago

How are you deploying new gitlab-runner instances? Are you using the rancher cli in a pipeline to do it?

One option could be to use the rancher cli to wait until network-services is healthy again once you deploy new spot instances.

Would that work for you? :)

felixschul commented 6 years ago

Hi @rawmind0,

I am using the "gitlab-ci-multi-runner" community catalog item, which is set to "Always run one instance of this container on every host". So when a new host is connected to Rancher, the scheduler will start a container and its sidekicks automatically on the new host. So no, I am not using the rancher cli for this (or maybe I got your question wrong). Just to be sure I will outline the current process:

I do not think your suggestion above can make a difference: Before I register the new spot instance with Rancher, everything is fine. As soon as I register the new instance with Rancher, the scheduler will start all services on this new host at the same time. I do not know how to tell the scheduler to change the order or to delay one of the services.

Best

Felix

rawmind0 commented 6 years ago

Hi @felixschul ,

I didn't fully understand how you were doing it, but I see your point now.

The best solution would be for the gitlab-ci-multi-runner sidekick to take care of DNS resolution itself. I would be more than happy if you could contribute that.

Alternatively, along the lines of my earlier proposal: if you are using the "gitlab-ci-multi-runner" community catalog item, you could set a host label for running gitlab-ci-multi-runner instances, so that it always runs one instance of this service on every host with that label. You could then add a step to your deployment user script that checks that network-services is healthy on the host before applying the label.

felixschul commented 6 years ago

Hi @rawmind0,

Sorry for my late reply. I think setting a label after checking network services is a good idea. However, I think it would be cleaner to make the gitlab runner sidekick wait for the services. I will check if I can contribute this. Thanks for your support. I suggest leaving this issue open, as I believe that this is really a problem with the "gitlab-multi-runner" catalog item, at least under certain circumstances. But feel free to close it if you judge otherwise.