jenkinsci / docker-plugin

Jenkins cloud plugin that uses Docker
https://plugins.jenkins.io/docker-plugin/
MIT License
487 stars 317 forks source link

Container cannot connect to node because it doesn't exist #757

Closed mat1e closed 3 years ago

mat1e commented 4 years ago

We recently updated our version of Jenkins to 2.176.3. And now a connection error with docker-agent randomly block the queue of jobs :

Refusing headers from remote: Unknown client name: docker-00026wu6nor9w

The docker container is ready and try to connect to the Jenkins master but the node doesn't exist yet.

I saw in the code of docker-plugin that the container is created and started before the Jenkins node. While the connection method is JNLP, the commands to download and run the remoting.jar are executed at the start of the container. But at this moment, the node wasn't added to Jenkins master.

Have you ever encountered this error? Is there a solution?

Is it possible to modify provision methods and create the Jenkins node before instanciate the container to fix this issue?

Jenkins version : 2.176.3 docker-plugin version : 1.1.7 docker engine version : 1.13.1

mat1e commented 4 years ago

We are actually testing "Attach Docker container" wich seems to be a solution. But, in the documentation of the plugin, this functionnality is marked as experimental. Is this still the case ?

pjdarton commented 4 years ago

It's the newest option. When it was originally introduced it had a bug (#628) that caused it to fail sometimes if the network wasn't perfect between Jenkins & docker, but that's now been fixed (in PR #693). I am not aware of there being any other problems with it, but it hasn't had the same number of years of use that the other methods have.

I would suggest that you continue to use it and, if you do find bugs, you raise a bug report (providing awesome levels of clarity to make it easy to fix) ... and ideally also follow that with a PR that fixes the issue too.

pjdarton commented 4 years ago

As for fixing the problem you saw with JNLP, it may be curable by changing the plugin code, but it's a tricky one. It's tricky because, until the container has started, we don't know everything we need to know about the slave node we return to Jenkins ... but, if the container runs very quickly and has no retry logic inside it then we need to pass the slave node to Jenkins before we start the container. i.e. both events have reasons why they need to happen before the other one...

mat1e commented 4 years ago

Thank you for your help. I tried to change code by myself and indeed, it is tricky and have a lot of impacts on others methods of connection. We tested "Attach Docker container" and it seems to solve our problem. So, we decide to gradually migrate our Docker Templates to this method. I will open bug report and suggest PR if we encounter any problem with it. For now I close this issue.

akomakom commented 4 years ago

@pjdarton can you point me in the direction of this fix? I wasn't able to find it in jenkins jira.

When it was originally introduced it had a bug that caused it to fail sometimes if the network wasn't perfect between Jenkins & docker, but that's now been fixed.

pjdarton commented 4 years ago

I've updated my comment (above) to include links.

akomakom commented 4 years ago

I've updated my comment (above) to include links.

I've seen other exceptions while using Attach, so I'm sticking with JNLP. For the record, I am using this workaround for this "node doesn't exist" problem with JNLP, and it's (so far) more reliable than Attach.

pjdarton commented 4 years ago

So, I've been thinking ... and I suspect that there may be a few things we can do to improve this.

  1. Make the container-id field in the slave class settable after it's created (if it isn't already). The setter could throw if it's set twice, but we need to be able to create the slave node instance and then set the container id later.
  2. Ensure that all code that gets the container-id copes with it being null, mark the field as "might be null" etc. That'll affect the garbage-collection code and the UI jelly code, possibly more.
  3. Create the slave node instance without the container-id.
  4. Change the docker plugin's logic so that, for a JNLP-connected slave, the "pre-creation" logic adds the slave node to Jenkins before the container-id is known.
  5. Set the container-id the moment it's known (i.e. immediately after the "docker run" command completes).

In parallel with that, we:

  1. work on a separate PR for the official JNLP docker slave to add support for configurable retries (non-negative integer defaulting to none) and retry-delay (decimal number defaulting to 1 second and permitting both larger and smaller numbers but excluding zero) and, if that gets a favorable reception, we ...
  2. ... enhance this plugin with support for that configuration and have this plugin default to 20 retries half a second apart for newly-configured templates (i.e. existing templates should remain "no retries").
  3. ...and ensure that the new fields have nice help-text explaining what they do ... and ensuring that the "advanced" JNLP options for a custom start command also supports these too.

Disclaimer: When I say "we" I don't mean I'm promising to drive this myself (I've got a ton of not-docker-plugin work to do so I'm going to have to keep my time on this to a minimum) and, as you've got a decent workaround that works for you, I'm not really expecting you to do it either ... but if anyone has the time & inclination to sort this then I am happy to discuss/consult/advise/review changes.

Re: "I've seen other exceptions while using Attach" Please provide details! While I understand that it isn't ideal for some use-cases, there are plenty of users for whom it is the best option (e.g. it's the only one that's guaranteed to work "out of the box" regardless of network configuration), so if there's issues with it I'd like to be aware of them...

mat1e commented 4 years ago

@pjdarton, is there a work in progress on an update of JNLP connection ? I've a workaround on it but I think we don't need to make the container-id field nullable because we get it at the creation. The script for JNLP is called at the start of the container so we already got the container-id. The update may be to add the node to Jenkins before the container start and not before its creation.

createContainer --> addNode --> startContainer.

I would like your opinion on it.

pjdarton commented 4 years ago

Yes, I agree that what we need is to "add node" before the container starts ... but for JNLP only (e.g. for SSH we add the node after the container has started and its SSH port is open for business).

As for "progress", my only "progress" on this was to post my thoughts on the subject (above). Been too busy on other things. I did, at one point, fix the unit-tests though so the main CI build is no longer failing, so if you've got a suggested fix then please submit a PR - it should all stay "green". ...and it should stay green even if you add a SSH-connection test that has no retries and connects just once and connects immediately Jenkins gets the node added. That'll prove that your changes haven't improved JNLP at the expense of SSH. Ideally, any PR for this would also add a unit-test that demonstrates how the existing code doesn't work so that, if anyone undoes the fix, we get a unit-test failure. At present, despite this issue with the code, the unit-tests are green. I'd guess that, on the cloudbees CI Jenkins, it takes sufficiently long for the CI server's docker daemon to start the container that Jenkins has added the node before the container tries to connect to Jenkins. If we could figure out a unit-test that fails if we don't add the node before we start the container then such a test would be able to prove this issue is fixed...

...but, just to set expectations: I'm off soon, just 2 working days left this year (with those already 100% committed to other tasks), so I'm unlikely to do much until mid-Jan at the earliest.

mat1e commented 4 years ago

@pjdarton, I have a workaround at this repository. I would like your opinion on this solution. For the moment, I have not found any other way to do otherwise without impacting attach and jnlp code. I need to do more tests but this solution is simple and seems to be working.