Enhancement : node restart control options

jvanbouchaute commented 9 years ago

Hi Dima,

the node restart feature is great, but what we see frequently is that during a test campaign all the grid nodes reach the restart threshold at almost the same time, and all start rebooting together (in the middle of the test campaign execution), which is not optimal. It causes requests to queue up at the hub, and leads to timeouts and longer feedback cycles.

It could be improved by allowing a configurable node restart timeout value :

It defines the number of seconds (or minutes) an idle node waits before rebooting once it has reached its restart threshold. If the node is asked to execute a new test during that wait, the timeout counter is reset.

This way you can make sure the node only reboots when the test campaign is over, which has less impact. It still does not prevent the nodes from all rebooting at nearly the same time.
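To make the proposal concrete, here is a minimal sketch of the idle countdown in Python. It is not Grid Extras code: the class name `IdleRestartGate`, the `idle_timeout_s` setting and the method names are all hypothetical, and the clock is injectable only so the behaviour can be checked without waiting.

```python
import time


class IdleRestartGate:
    """Hypothetical helper: decides when a node that has reached its
    restart threshold has been idle long enough to reboot."""

    def __init__(self, idle_timeout_s, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s  # assumed new config value
        self.clock = clock
        self.threshold_reached = False
        self.active_tests = 0
        self.idle_since = clock()  # a fresh node starts out idle

    def mark_threshold_reached(self):
        self.threshold_reached = True

    def test_started(self):
        self.active_tests += 1
        self.idle_since = None  # busy: the idle countdown is cancelled

    def test_finished(self):
        self.active_tests = max(0, self.active_tests - 1)
        if self.active_tests == 0:
            self.idle_since = self.clock()  # countdown starts again from zero

    def should_reboot(self):
        if not self.threshold_reached or self.idle_since is None:
            return False
        return self.clock() - self.idle_since >= self.idle_timeout_s
```

A node would poll `should_reboot()` periodically. Any new test arriving during the wait cancels the countdown, so the reboot only happens after a full quiet period.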

Thanks for your feedback !

jakeyr commented 9 years ago

Staggering reboots would be great. Another issue I see is that the nodes reboot immediately when there are no tests running, instead of polling for some interval to see if no tests come in. We retry failed tests and often find that the retries end up queuing up because nodes in the grid have all restarted.

Also, if you have a hub that is also a node, there is no defense against reboots for that. It takes the whole grid down, potentially during tests as I mentioned above, causing "connection refused" errors in the test runner.
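As a rough illustration of what staggering could look like, the sketch below has nodes ask a shared coordinator for permission before rebooting, so only a limited number are down at once. This is purely an assumption for discussion: `RebootCoordinator`, `request_reboot` and `reboot_finished` are made-up names, not part of Grid or Grid Extras.

```python
class RebootCoordinator:
    """Hypothetical hub-side gate that limits simultaneous node reboots."""

    def __init__(self, max_concurrent=1):
        self.max_concurrent = max_concurrent
        self.rebooting = set()

    def request_reboot(self, node_id):
        """Return True if the node may reboot now, False if it must retry later."""
        if node_id in self.rebooting:
            return True  # permission was already granted to this node
        if len(self.rebooting) >= self.max_concurrent:
            return False  # too many nodes are down already
        self.rebooting.add(node_id)
        return True

    def reboot_finished(self, node_id):
        """Called when the node has come back and re-registered."""
        self.rebooting.discard(node_id)
```

A node that is refused would keep serving tests and ask again later, which keeps capacity available for retried tests.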

dimacus commented 9 years ago

Hey Jacob,

So I've been adding some better logic into rebooting the nodes in 1.10.1. It is still not perfect, but here are my assumptions:

Windows VM will reboot in 10 to 20 seconds
After the current session is complete, check whether the node has reached its limit for test runs. If it has, wait for any tests still running to finish (up to 2 hours) and then reboot the node. In the meantime, the node should not accept any new connections.
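The flow described above can be sketched roughly as follows. This is an illustration only, not the actual 1.10.1 implementation: the `Node` class, its fields and `drain_and_reboot` are hypothetical stand-ins, and the sleep and clock functions are injectable so the logic can be exercised without waiting.

```python
import time
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a grid node's state."""
    sessions_completed: int = 0
    active_sessions: int = 0
    accepting_new_sessions: bool = True
    rebooted: bool = False

    def reboot(self):
        self.rebooted = True  # a real node would restart the OS here


def drain_and_reboot(node, session_limit, max_wait_s=2 * 60 * 60,
                     poll_s=5, sleep=time.sleep, clock=time.monotonic):
    """Run after a session completes. Returns True if a reboot was triggered."""
    if node.sessions_completed < session_limit:
        return False  # limit not reached yet, keep serving tests

    node.accepting_new_sessions = False  # refuse new connections while draining

    # Wait for tests that are still running, but never longer than max_wait_s.
    deadline = clock() + max_wait_s
    while node.active_sessions > 0 and clock() < deadline:
        sleep(poll_s)

    node.reboot()
    return True
```

Note that the reboot happens even if the wait times out with tests still running, which matches the "up to 2 hours" cap described above.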

This, coupled with a PR submitted to Grid as of 2.47.0 (where the load is spread evenly among the nodes, so that 1 node is not running 3 tests while 3 are idle), seems to fix some of the rebooting issues.

I have taken out the check against the current queue... maybe that is a mistake. But I would get reports where a node would run 30 or 40 tests without rebooting, causing the test suite to become unstable. Since the new session timeout is close to 2 minutes, and if my assumption of a 10 to 20 second node reboot is true, I feel like this is an OK compromise, but I'll be more than happy to try to figure out and test this logic more with anyone willing to help.

Finally, running the hub on the same box as the node just feels bad for many reasons. I would recommend having a dedicated hub box that has 512MB of RAM and nothing else running on it rather than putting the hub on the node, because at the moment I don't know of a good, safe way to contain the node and hub on 1 box.

Please try Grid Extras 1.10.1 together with the latest version of Grid and see if this improves your test suite.

Thanks


-Dima Kovalenko

Good judgment comes from experience, and experience comes from bad judgment. --Frederick P. Brooks

groupon / Selenium-Grid-Extras

Enhancement : node restart control options #75
