Open kierr opened 8 years ago
Hi. Let's clarify the situation a bit.
You have Gridrouter on server A and only one hub on server B. That one hub gets all the requests, do I understand correctly? You have run 75-150 concurrent tests through this hub and now you are trying only 25, right?
Gridrouter goes to the first hub and tries to get a browser. If the hub is full, it puts the new request in a queue. The request can sit in this queue for as long as the hub's newSessionWaitTimeout allows. This is a setting of the hub. If this timeout expires, Gridrouter goes to another hub and tries to get a session there. If Gridrouter fails to get a new session on all hubs from the quota, it logs the SESSION_FAILED message you are getting. So in our installation we have many hubs with a small number of browsers on each, and we have newSessionWaitTimeout set to 10 seconds on the hubs, so requests can find an available hub fast.
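To make the flow above concrete, here is a rough Python sketch of that fallback loop. It is not Gridrouter's actual code: `route_new_session`, `HubBusy`, `fake_request`, and the hub URLs are made-up names for illustration, and the hub is replaced by a stub that either returns a session id or reports that it stayed full for the whole newSessionWaitTimeout.

```python
from typing import Callable, Optional, Sequence, Tuple


class HubBusy(Exception):
    """Hypothetical signal: no node became free within newSessionWaitTimeout."""


def route_new_session(
    hubs: Sequence[str],
    request_session: Callable[[str], str],
) -> Optional[Tuple[str, str]]:
    """Try each hub from the quota once, in order.

    Returns (hub, session_id) for the first hub that answers, or None
    when every hub reported that it was full.
    """
    for hub in hubs:
        try:
            return hub, request_session(hub)
        except HubBusy:
            continue  # this hub timed out its queue, move on to the next one
    print("[SESSION_FAILED] no hub in the quota could provide a session")
    return None


def fake_request(hub: str) -> str:
    """Stub hub call (assumption): hub-a is always full, any other hub answers."""
    if hub == "http://hub-a:4444":
        raise HubBusy(hub)
    return "session-1"


result = route_new_session(["http://hub-a:4444", "http://hub-b:4444"], fake_request)
```

With many small hubs and a short newSessionWaitTimeout, a full hub costs only a few seconds before the loop moves on. With a single hub in the list, there is nothing to fall back to.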
In your case, as I understand it, you have only one hub with a large number of browsers connected. This configuration by itself can run slowly, because the hub has trouble handling many concurrent sessions. And of course there is not much sense in using Gridrouter with one hub. I'm not sure why sessions are not created on the hub in your case. Maybe you ask for more sessions than the hub has, or there is a hub performance issue. You can open your hub console at http://youhub.url:4444/grid/console?config=true&configDebug=true, check your newSessionWaitTimeout, and see how new sessions are created or put in the queue.
I would try one of the following things:
Thanks so much for your detailed reply. To clarify, I do have multiple hubs, but went back to a single hub to figure out the root cause of this issue.
You are right about newSessionWaitTimeout, I have been playing with this value, and most recently I set it from 10s down to just 1s.
My main confusion was why the router would send a request, give up waiting for a response, and then the server would still satisfy that request and create an orphaned session. The router would then request another session, time out again, and another session would be created. This is why I set newSessionWaitTimeout very low, because I thought that would resolve things.
Even when I have either 25 each or 75-150 across 3 or more nodes, only running 25 concurrent tests, this orphan session situation snowballs to the point where all the queues are full and no sessions are available.
If I made newSessionWaitTimeout large, the queue would simply fill and sessions would be started, orphaned, new sessions created and so on, and I have the same problem. With it low, the queue stays empty.
I could just set the timeout on the node very low, but this is kludgy, and all of these orphaned sessions would still be created and left to time out, creating unnecessary overhead.
Hope I have clarified a bit.
I got the point. It's a strange inconsistency in communication between the hub and Gridrouter that you described.
Sometimes we have cases of orphaned sessions when Gridrouter keeps trying to find a new session while the client has already timed out. And if retries are implemented on the test side, it could lead to similar problems.
There are timeout settings on the Gridrouter side that are responsible for establishing the HTTP connection itself, but once the connection with the hub is established, Gridrouter waits until the hub responds. According to the logged error, the hub did respond that no node had become available. It's quite strange that the hub creates a session for that request after that response.
Can you show your hub settings?
Btw, when you tried with one hub, did you have that hub listed several times in the Gridrouter quota? Gridrouter goes through the hubs listed in the quota only once, so it's strange that it came back to the same hub again with the next wave of attempts.
Can you describe how you run tests? How many, how frequently, and do you have retries on failures?
Btw2, we usually don't use a 0 or 1 second timeout on the hub, because sometimes launching the browser takes longer.
I am plagued by this scenario:
At the router on server A, 25 new sessions are attempted at once.
The only configured hub on server B responds to the first request and the test starts running, then the second, then the third, and so on, with things getting slower and slower due to network traffic as the tests start.
The router gives up waiting for the slow portion of these session requests, and:
WARN RouteServlet - [SESSION_FAILED] [global] [172.18.0.1] [firefox] [http://15.32.16.63:4444] - Error forwarding the new session Request timed out waiting for a node to become available.
It then fires up another bunch of new session requests... BUT the original session requests are still processed on the hub after some delay, with the second lot of requests simply queued behind them. The sessions are started, but the client has forgotten about them. Then, due to further network delay, the router sends even more session requests, and needless to say everything snowballs into a nightmare of errors.
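The snowball can be illustrated with a toy model in Python. This is only a sketch under strong assumptions, not a reproduction of the real hub or router: the hub starts one session at a time, each start takes a fixed `start_secs`, the hub never drops a request that the caller has abandoned, and the caller retries every abandoned request after `caller_timeout`. The function name `simulate` and all parameters are hypothetical.

```python
from typing import Tuple


def simulate(
    num_tests: int,
    start_secs: float,
    caller_timeout: float,
    rounds: int,
) -> Tuple[int, int]:
    """Return (satisfied, orphans) after `rounds` waves of requests.

    satisfied: sessions answered before the caller gave up.
    orphans:   sessions the hub still created after the caller gave up.
    """
    hub_free_at = 0.0  # time at which the hub has worked off its backlog
    now = 0.0          # time at which the current wave is sent
    pending = num_tests
    satisfied = 0
    orphans = 0
    for _ in range(rounds):
        retry = 0
        for _ in range(pending):
            # The request queues behind everything already on the hub.
            hub_free_at = max(hub_free_at, now) + start_secs
            if hub_free_at - now <= caller_timeout:
                satisfied += 1
            else:
                orphans += 1  # session gets created, but nobody owns it
                retry += 1    # the caller asks again in the next wave
        pending = retry
        now += caller_timeout
    return satisfied, orphans


# 25 tests, 1 s per session start, caller gives up after 10 s, 3 waves.
print(simulate(num_tests=25, start_secs=1.0, caller_timeout=10.0, rounds=3))
```

In this model every retry lands behind the backlog left by the previous wave, so once the first wave overruns the caller's timeout, no later request can finish in time and the orphan count keeps growing. With a caller timeout longer than the time needed to start all 25 sessions, the same model produces no orphans.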
So, my question is a simple one... can I increase this session request timeout? Is there a wiser solution to prevent this domino effect?
I'm running very powerful enterprise computational hardware, on which I've managed to run between 75 and 150 concurrent tests depending on number of CPUs, before using gridrouter. But these network issues are my main contender. Is the solution to use lighter 2 core machines? Still, though, even with just 25 concurrent sessions, timeouts seem to be a problem.
Thanks again guys.