jenkinsci / nomad-plugin

Nomad cloud plugin for Jenkins
https://plugins.jenkins.io/nomad/
MIT License

Add an ASAP provisioning strategy for faster slave bringup #6

Closed antweiss closed 8 years ago

antweiss commented 8 years ago

Resolving issue #4

iverberk commented 8 years ago

I need to test this some more. I've done some preliminary testing with firing up multiple jobs but I saw some errors and connection failures. I also started four jobs simultaneously and Jenkins fired up thirteen nodes. Not sure if this change is related to that but I will do some more testing. Nonetheless, thanks for the effort so far!

adrianlop commented 8 years ago

Thank you both for investigating this! If you need help with testing, please let me know.

antweiss commented 8 years ago

@iverberk sounds scary! :) I didn't see anything like this, but I suppose I need to do more testing of my own. Thanks for looking into this. I'll do more tests tomorrow. @adrianlop - once I'm finished with tests tomorrow, it'll be great if you can verify this too. I'll let you know when I have some findings. Thanks!

iverberk commented 8 years ago

yeah, let's work this out together. In the meantime I have added support for specifying the number of executors per slave. This should help as well. Going to do further testing with this branch. Let me know your findings.

antweiss commented 8 years ago

I see the overprovisioning issue now. It evidently happens if Nomad is slow to react or if slaves fail to connect: Jenkins sees no available executors and provisions another batch. I need to take the number of slaves already scheduled for provisioning into account. Will push a fix when ready.
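
Roughly, the fix amounts to something like the sketch below (class and parameter names are made up for illustration, not the plugin's actual code): only ask Nomad for the capacity that isn't already covered by slaves still being scheduled.

```java
// Hypothetical sketch of the fix described above: only request capacity that is
// not already covered by slaves that have been scheduled but are still starting.
public class ProvisioningMath {

    /** How many new slaves to ask Nomad for. All parameters are illustrative. */
    static int slavesToProvision(int queueLength, int idleExecutors, int pendingSlaves) {
        int excessWorkload = queueLength - idleExecutors;
        return Math.max(0, excessWorkload - pendingSlaves);
    }

    public static void main(String[] args) {
        // 4 queued builds, 0 idle executors, 3 slaves already being scheduled
        System.out.println(slavesToProvision(4, 0, 3)); // -> 1, instead of another batch of 4
    }
}
```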

antweiss commented 8 years ago

I've added a counter for slaves pending connection. It's incremented when calling the provisioning callback and decremented when the provisioning callback exits (whether the slave comes online or gets terminated because it didn't connect within 1 minute). In my tests I was able to successfully start 15 jobs on 15 Nomad slaves in parallel without overprovisioning.
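
For reference, a minimal sketch of that counter lifecycle (class and method names are illustrative, not the plugin's actual code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the pending-connection counter described above.
public class PendingSlaveCounter {
    private final AtomicInteger pendingConnections = new AtomicInteger(0);

    /** Wraps a provisioning callback: count the slave as pending until the callback exits. */
    void runProvisioningCallback(Runnable launchSlave) {
        pendingConnections.incrementAndGet();
        try {
            // launches the Nomad job and waits for the slave to connect,
            // or terminates it if it doesn't connect within the timeout
            launchSlave.run();
        } finally {
            // decremented whether the slave came online or was terminated
            pendingConnections.decrementAndGet();
        }
    }

    /** Slaves already on their way; excluded from new provisioning requests. */
    int pending() {
        return pendingConnections.get();
    }
}
```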

I still see a problem when slaves fail to connect. For some reason the slave.terminate() call in the provisioning callback doesn't remove the offline slave from Jenkins, and it stays there until the idle termination timeout kicks in. @iverberk if you have an idea on how to verify slave removal, it would be a great help.

@adrianlop if you can help with testing, that would also be much appreciated. Can you build your own or would you like a link to a jpi?

adrianlop commented 8 years ago

@antweiss sure, I just built an .hpi from your branch rebased on @iverberk's master with the executor changes.

It's working!! Both of your changes. One minor bug: if you launch a few jobs at the same time, it will overprovision slaves:

[screenshot: overprovisioned slaves]

Not a real problem for me, I just needed true parallelism and you already gave me that :P

Please let me know if you need some more specific tests. Thank you both for looking into this! I really appreciate it.

antweiss commented 8 years ago

@adrianlop thanks for the testing effort! I think the overprovisioning may now be caused by the number of executors. I need to take it into account when counting the pending capacity. Will retest with Ivo's changes.
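
If that turns out to be the cause, the adjustment is probably along these lines (a sketch with made-up names, not the plugin's code): pending capacity has to be counted in executors rather than in slaves once each slave can run several builds.

```java
// Hypothetical sketch: count pending capacity in executors, not slaves.
public class PendingCapacity {

    static int pendingExecutors(int pendingSlaves, int executorsPerSlave) {
        return pendingSlaves * executorsPerSlave;
    }

    static int slavesToProvision(int excessWorkload, int pendingSlaves, int executorsPerSlave) {
        int uncovered = excessWorkload - pendingExecutors(pendingSlaves, executorsPerSlave);
        // round up: one more slave covers up to executorsPerSlave builds
        return Math.max(0, (uncovered + executorsPerSlave - 1) / executorsPerSlave);
    }

    public static void main(String[] args) {
        // 4 queued builds, 1 slave already starting with 2 executors -> 1 more slave, not 4
        System.out.println(slavesToProvision(4, 1, 2));
    }
}
```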

adrianlop commented 8 years ago

@antweiss I kept running jobs this morning with executors=1, and everything is OK for now. Just another minor problem (partly my fault): I built a custom slave Docker image with a faulty CMD, so every time you provision a slave with this image, it fails and the slave status becomes "offline", because java -jar slave.jar never runs. The plugin kept creating slaves but didn't delete the ones stuck in "offline".

Maybe it would be nice to have a separate timeout parameter for this case, so we don't end up with a huge list of offline slaves. What do you think?

thanks!

antweiss commented 8 years ago

@adrianlop yes - that's exactly what I'm seeing. The offline slaves get deleted only once the idle termination timeout expires even though the plugin calls terminate on them... I still need to understand why they don't get deleted immediately.

antweiss commented 8 years ago

@iverberk do you have any ideas regarding how to force the removal of slaves that fail to come online? BTW - thinking about it - this is probably not related to ASAP provisioning at all. It should be occurring in the released version the same way. It just becomes more annoying when the provisioning frequency is higher, but it isn't directly related to it.

iverberk commented 8 years ago

@antweiss not directly, but maybe this will give us some pointers: https://stackoverflow.com/a/34914384. You are right that the cleanup of slaves should happen regardless of the provisioning strategy. I'll try to look into it further when I get some time.
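
For what it's worth, one approach that might be worth trying (a sketch only, using core Jenkins APIs; whether this matches the linked answer or what the plugin should actually do is still open) is to remove the Node from Jenkins explicitly once the launch has been given up on:

```java
import java.io.IOException;

import hudson.model.Computer;
import hudson.model.Node;
import hudson.slaves.OfflineCause;
import jenkins.model.Jenkins;

// Sketch only: force-remove a slave that never connected, instead of waiting for
// the idle-termination timeout. Not the plugin's actual cleanup code.
public class OfflineSlaveCleanup {

    static void removeDeadSlave(Node node) throws IOException {
        Computer computer = node.toComputer();
        if (computer != null) {
            // make sure nothing is still trying to launch or use the computer
            computer.disconnect(new OfflineCause.ByCLI("slave never connected"));
        }
        // drop the node from the Jenkins node list immediately
        Jenkins.getInstance().removeNode(node);
    }
}
```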

antweiss commented 8 years ago

@iverberk I actually tried calling doDelete() after terminate at some point but got an NPE... And the node stayed. Will try to experiment more with this when I have time. Now regarding this PR - do you have any objections to merging it? In my tests it works really well. I didn't see any disconnects. And @adrianlop also confirmed it works for him. Thanks for merging the label restriction PR BTW :) I'll rebase this branch against it and push ASAP.