web100srv creating defunct process and not responding to clients.

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Start a test with 100 clients connecting to the server simultaneously.
2. Check the messages displayed by the 100 clients.
3. Check the server status using the “ps -ef | grep web100srv” command.

What is the expected output? What do you see instead?
Expected Output:
All 100 clients should complete the Middlebox, Firewall, Outbound and
Inbound tests.

Observed Output:

web100srv creates defunct processes, and after that it stops responding to
clients even though web100srv is still running. At this point, restarting
web100srv does not help: it still does not respond to clients. Sometimes
web100srv does respond, but very slowly. After restarting the system,
however, web100srv responds to clients again.

We have observed this issue on our local MLAB server, and quite frequently.
Logs attached.

Original issue reported on code.google.com by Garimell...@gmail.com on 3 Mar 2010 at 10:12

Merged into: #40

Attachments:

[Server defunct1.png](https://storage.googleapis.com/google-code-attachments/ndt/issue-24/comment-0/Server defunct1.png)
report.xls
total_log.xls

GoogleCodeExporter commented 9 years ago

I need to look at the code more, but it appears that the main control loop needs
work.  It should handle running clients, dispatch new clients when an old client
finishes, and accept new connections when there is queue space.  From the logs
and observed behavior, it is having problems when old clients leave before a
test is begun.

One correction.  The Mlab nodes are configured to handle a maximum of 80 NDT
clients, not 100!  The 1st 20 get served, the next 60 get queued.  The last 20
requests should get rejected with a 9988 'server busy' message.
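
To make the 20 / 60 / 20 split concrete, here is a minimal sketch of that admission policy in C. It is not the web100srv code: the names `admit_client`, `simulate`, `MAX_RUNNING`, `MAX_QUEUED` and `SRV_BUSY` are hypothetical, and the limits are simply the numbers quoted in this comment. It assumes all requests arrive before any running test finishes.

```c
#include <stdio.h>

/* Assumed limits, taken from the numbers quoted in this thread. */
#define MAX_RUNNING 20    /* clients tested at once */
#define MAX_QUEUED  60    /* clients allowed to wait */
#define SRV_BUSY    9988  /* 'server busy' code mentioned in this thread */

enum admit { ADMIT_RUN, ADMIT_QUEUE, ADMIT_REJECT };

/* Decide what to do with a new connection, given the current counts.
 * A newcomer may only start immediately if nobody is already waiting. */
enum admit admit_client(int running, int queued)
{
    if (running < MAX_RUNNING && queued == 0)
        return ADMIT_RUN;
    if (queued < MAX_QUEUED)
        return ADMIT_QUEUE;
    return ADMIT_REJECT;  /* caller would answer with SRV_BUSY */
}

/* Simulate n arrivals with no test finishing in between (hypothetical helper). */
void simulate(int n, int *running, int *queued, int *rejected)
{
    *running = *queued = *rejected = 0;
    for (int i = 0; i < n; i++) {
        switch (admit_client(*running, *queued)) {
        case ADMIT_RUN:    (*running)++;  break;
        case ADMIT_QUEUE:  (*queued)++;   break;
        case ADMIT_REJECT: (*rejected)++; break;
        }
    }
}
```

Calling `simulate(100, &r, &q, &x)` models the test scenario from the original report: 100 clients connecting at once against a server configured for 80.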

Also, the server removes the defunct processes after a test completes.  Some
time, up to 30 seconds, may transpire between the time a child process
terminates and when the kernel resources are released.  This is not a bug, but
a design issue.  The problem is that the server's main processing loop is
stalling and not that defunct processes are hanging around forever.  (At least
that's what I think right now.  I'll examine the debug log files to see what's
going on, probably the weekend of the 13th.)
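
For context on why defunct entries show up in `ps` at all: a terminated child stays in the process table until its parent collects the exit status. The sketch below shows one common way a parent can do that without blocking its main loop, using POSIX `waitpid` with `WNOHANG`. It is an illustration only, not how web100srv is written; `reap_children` is a hypothetical helper name.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Collect every child that has already exited, without blocking.
 * Returns the number of children reaped in this call. */
int reap_children(void)
{
    int reaped = 0;
    int status;

    for (;;) {
        pid_t pid = waitpid(-1, &status, WNOHANG);
        if (pid > 0) {          /* one defunct child cleaned up */
            reaped++;
            continue;
        }
        if (pid == -1 && errno == EINTR)
            continue;           /* interrupted by a signal, retry */
        break;                  /* 0: children still running; -1: none left */
    }
    return reaped;
}
```

A server loop could call such a helper once per iteration, or after a `SIGCHLD` notification, so that cleanup never depends on the loop waiting for one particular child.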

While I could change the code to queue 80 clients, I'm not sure it is
reasonable.  It looks like users aren't willing to wait 4-5 minutes before
getting test results, so it may be better to reduce the queue depth and make
them issue another test request.  I wouldn't change the code now, but this is
something we need to discuss.

Rich

Original comment by rcarlson...@gmail.com on 5 Mar 2010 at 1:21

Changed state: Started

GoogleCodeExporter commented 9 years ago

On the 80 vs. 100 correction:

Just to clarify, for M-Lab, my memory is that we all discussed queuing 100
users (the next 80 after allowing the first 20) and that's what I expected Rich
would configure and subsequently QA to test.  If it was set to 80, leave it.  I
think 80 is fine.  I also agree most users won't wait 4-5 minutes.

Original comment by funcho...@gmail.com on 5 Mar 2010 at 2:51

GoogleCodeExporter commented 9 years ago

Original comment by jwzuraw...@gmail.com on 10 Mar 2010 at 8:42

Changed state: Duplicate

GoogleCodeExporter commented 9 years ago

Re-verified with V3.6.3 and this bug is not yet fixed. Defunct processes are
still created, and occasionally some processes do not exit. But the server now
responds to clients normally (previously the server stopped responding to
clients).

Original comment by sekharn...@gmail.com on 10 Jun 2010 at 5:19
