mhardin / SeleniumGridScaler

Selenium Grid auto scaling plugin utilizing AWS
GNU General Public License v2.0
111 stars, 51 forks

Autoscaled nodes get terminated at exactly 20 minutes even when tests are running #29

Open khoshalarpy opened 5 years ago

khoshalarpy commented 5 years ago

Hi @mhardin,

When we execute our tests, the nodes are terminated at exactly 20 minutes. We have been looking through the configuration for a cause but could not find one. We would appreciate any help with this.

Please let me know if a specific configuration or parameter needs to be passed to resolve this.

mhardin commented 5 years ago

Interesting. Is it the hub or AWS terminating the nodes? The hub shouldn't terminate the node provided tests are running on them (you can check the grid console to see if the nodes have activity on them or not).

khoshalarpy commented 5 years ago

Thanks for the response, @mhardin.

It does seem like the hub is terminating the nodes (not AWS). We have multiple other instances running in the same account, and only the Selenium nodes are terminated after 20 minutes.

We think that even though the UUID is passed in the desired capabilities, it is not recognized, so the hub treats the slave as idle (even though tests are executing, because the UUID is not in sync) and sends a cleanup request to terminate the slaves. We also see the UUID as AD-HOC on the subsequent slaves (2nd, 3rd, and so forth). Is there a way to know for sure that the UUID passed by the tests is being read and received by the nodes as expected, so that the slaves are not treated as idle?

grid.log.txt

mhardin commented 5 years ago

Hi @khoshalarpy, the log you gave me looks like Selenium's log. Can you provide me the grid scaler's log? You can specify the log location (see the readme) at process start time.

khoshalarpy commented 5 years ago

Sure @mhardin, please find the grid scaler logs attached. Let me know if you need more information or if I have missed anything.

Below are the execution commands:

1) Start the hub:

[ec2-user@ip-10-169-205-5 cdaf]$ java -Xms4096m -Xmx4096m -DpropertyFileLocation=/cdaf/aws.properties -DipAddress=10.169.205.5 -DtotalNodeCount=100 -DPOOL_MAX=2048 -DlogLocation=/cdaf/gridScaler.log -cp /cdaf/automation-grid.jar org.openqa.grid.selenium.GridLauncherV3 -servlets com.rmn.qa.servlet.AutomationTestRunServlet,com.rmn.qa.servlet.StatusServlet -role hub -hubConfig /cdaf/hub.static.json -log /cdaf/grid.log

2) Launch the initial node with curl, as specified in the documentation:

curl 'http://localhost:4444/grid/admin/AutomationTestRunServlet?uuid=cdaf201911050588&threadCount=5&browser=chrome'

3) Execute the tests with the Maven command below:

mvn clean package -Plocal -Dgrid.uuid=cdaf201911050588 "-Dextra.cucumber.options=--tags @testRun"
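Since a UUID mismatch between steps 2 and 3 is one of the suspicions in this thread, here is a minimal Python sketch that derives both commands from a single UUID value, so the two cannot drift apart. The function names `node_request_url` and `maven_command` are made up for illustration; the URL path, query parameters, and Maven properties are copied from the commands above, and nothing here is part of the grid scaler itself.

```python
from urllib.parse import urlencode


def node_request_url(hub, uuid, thread_count, browser):
    """Build the AutomationTestRunServlet URL used in step 2 (hypothetical helper)."""
    query = urlencode(
        {"uuid": uuid, "threadCount": thread_count, "browser": browser}
    )
    return f"{hub}/grid/admin/AutomationTestRunServlet?{query}"


def maven_command(uuid, tags):
    """Build the step 3 Maven invocation as an argument list (hypothetical helper)."""
    return [
        "mvn", "clean", "package", "-Plocal",
        f"-Dgrid.uuid={uuid}",
        f"-Dextra.cucumber.options=--tags {tags}",
    ]


# One UUID feeds both commands, so the servlet request and the test run agree.
RUN_UUID = "cdaf201911050588"
url = node_request_url("http://localhost:4444", RUN_UUID, 5, "chrome")
cmd = maven_command(RUN_UUID, "@testRun")
```

The resulting `cmd` list can be handed to `subprocess.run` without shell quoting, and `url` can be passed to curl as in step 2.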

After 20 minutes, the log shows the lines below, saying the node was pending for longer than 20 minutes:

Nov 05, 2019 07:32:58:924 AM - ERROR [pool-5-thread-1] [AutomationRunContext] Node AutomationDynamicNode{uuid='cdaf201911050588', instanceId='i-0d72583b8c95f4a67', startDate=Tue Nov 05 07:12:45 GMT 2019, ipAddress='10.169.205.35'} was pending longer than 20 minutes. Removing from pending set.
Nov 05, 2019 07:32:58:924 AM - WARN [pool-6-thread-1] [AutomationScaleNodeTask] Doing node scale work
Nov 05, 2019 07:32:59:113 AM - INFO [pool-5-thread-1] [AwsVmManager] Node [i-0d72583b8c95f4a67] successfully terminated
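To check how long a node actually sat in the pending set, the ERROR line above can be parsed for its UUID, instance ID, and the gap between `startDate` and the log timestamp. This is a rough Python sketch, not part of the grid scaler. The regular expression is inferred from this one excerpt, and it assumes the log timestamp uses a 12-hour clock and the same time zone (GMT) as `startDate`.

```python
import re
from datetime import datetime, timedelta

# Layout inferred from the single ERROR line quoted in this thread (assumption).
PENDING_LINE = re.compile(
    r"(?P<logged>\w{3} \d{2}, \d{4} \d{2}:\d{2}:\d{2}:\d{3} [AP]M) - ERROR .*?"
    r"uuid='(?P<uuid>[^']*)', instanceId='(?P<instance>[^']*)', "
    r"startDate=(?P<started>[^,]+), .*was pending longer than"
)


def parse_pending_line(line):
    """Return (uuid, instance_id, time_pending), or None if the line does not match."""
    m = PENDING_LINE.search(line)
    if m is None:
        return None
    # Assumes a 12-hour clock with milliseconds after the last colon.
    logged = datetime.strptime(m["logged"], "%b %d, %Y %I:%M:%S:%f %p")
    # "GMT" is matched literally; both timestamps are assumed to share a zone.
    started = datetime.strptime(m["started"], "%a %b %d %H:%M:%S GMT %Y")
    return m["uuid"], m["instance"], logged - started


SAMPLE = (
    "Nov 05, 2019 07:32:58:924 AM - ERROR [pool-5-thread-1] [AutomationRunContext] "
    "Node AutomationDynamicNode{uuid='cdaf201911050588', "
    "instanceId='i-0d72583b8c95f4a67', startDate=Tue Nov 05 07:12:45 GMT 2019, "
    "ipAddress='10.169.205.35'} was pending longer than 20 minutes. "
    "Removing from pending set."
)

uuid, instance_id, pending = parse_pending_line(SAMPLE)
print(uuid, instance_id, pending)
```

For the line above, the gap comes out to 20 minutes 13.924 seconds, which matches the 20-minute pending limit named in the message.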

gridScaler05112019.log.zip

mhardin commented 5 years ago

Ah hah! So we have a task that basically goes and terminates "orphaned" nodes after a set period of time. Orphaned nodes are nodes that never connected to the hub after they started up. Provided your hub IP address is correct, I'm guessing there is a networking issue on your end between your nodes and your hub? I would start troubleshooting there (see if your nodes can access your hub).
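One quick way to see whether a node can access the hub is a plain TCP connection test run from the node. The Python sketch below is only an illustration, and `can_reach` is a made-up helper name. A successful connection shows that the port is reachable but does not prove that the node registered with the hub.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unroutable hosts.
        return False
```

Run on a node, `can_reach("10.169.205.5", 4444)` would test the hub address and port that appear in the commands earlier in this thread. A result of `False` points at security groups, routing, or a wrong hub IP.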

khoshalarpy commented 5 years ago

Hi @mhardin, thanks for the suggestion. I can confirm that all traffic (all ports and protocols) is open between the hub and node security groups. I also want to highlight that the tests execute successfully for the first 19 minutes: we can see from the Selenium logs and the Selenium console that the browser is in use and the test cases are running. Once it reaches the 20th minute, the slave gets terminated.

When I checked from the grid scaler side, I observed that the hub is marking the node as idle even though tests are running, and, based on the internal configuration in the grid scaler, it is marked for termination after 12+7 minutes. It seems to be unable to find the session UUID of the tests, even though we are passing the UUID to the Maven execution as follows: mvn clean package -Plocal -Dgrid.uuid=cdaf201911050588 "-Dextra.cucumber.options=--tags @testrun"

mhardin commented 5 years ago

@khoshalarpy I see 2 log snippets saying:

Nov 05, 2019 07:34:28:923 AM - ERROR [pool-5-thread-1] [AutomationRunContext] Node AutomationDynamicNode{uuid='AD-HOC', instanceId='i-077dc5d8e2db77d50', startDate=Tue Nov 05 07:14:14 GMT 2019, ipAddress='10.169.205.20'} was pending longer than 20 minutes. Removing from pending set.

What this means is that the node never registered with the hub within 20 minutes of starting, so the hub basically removes the association with that node, and it will get terminated after a period of time. Note that in the log you sent me, I see no log output of this hub terminating any nodes.

I would next point you to the grid node log, to see if there's anything revealing in there. The problem here is that the node isn't (correctly) connecting to your hub for whatever reason.

rijoparakkal commented 4 years ago

Hi @mhardin, ami-d216a3ba is no longer available. Can you please tell us the user data or the packages that need to be installed if we create a new AMI?

mhardin commented 4 years ago

@rijoparakkal that's unfortunate. I'm guessing the company I originally worked at, which owned this image, deleted it for whatever reason.

rijoparakkal commented 4 years ago

@mhardin when I use an Ubuntu 18.04 image, I get some weird user data at instance launch. Can you specify the user data so that I can create a new AMI?