Original issue 734 created by arlake228 on 2013-06-13T12:40:49.000Z:
From a user:
I've noticed varying numbers of iperf processes sitting in a sleeping/interruptible state on my servers.
PS: I'm using pS-PS version 3.2.2.
Thanks a lot!
Regards,
Roderick
bwctl 11267 0.0 0.0 33952 972 ? Sl Mar15 0:05 iperf -B 155.232.40.2 -s -f b -m -p 5086 -w 35651584 -t 20
bwctl 11803 0.0 0.0 33952 964 ? Sl May01 0:00 iperf -B 155.232.40.2 -s -f b -m -p 5098 -w 35651584 -t 20
bwctl 11982 0.0 0.0 33952 972 ? Sl May18 0:03 iperf -B 155.232.40.2 -s -f b -m -p 5066 -w 35651584 -t 20
bwctl 11998 0.0 0.0 33952 964 ? Sl Apr09 0:07 iperf -B 155.232.40.2 -s -f b -m -p 5037 -w 35651584 -t 20
bwctl 12278 0.0 0.0 33952 968 ? Sl Jun03 0:02 iperf -B 155.232.40.2 -s -f b -m -p 5036 -w 35651584 -t 20
bwctl 12749 0.0 0.0 33952 972 ? Sl Jun08 0:00 iperf -B 155.232.40.2 -s -f b -m -p 5089 -w 35651584 -t 20
bwctl 12782 0.0 0.0 33952 972 ? Sl Apr03 0:04 iperf -B 155.232.40.2 -s -f b -m -p 5025 -w 35651584 -t 20
bwctl 13052 0.0 0.0 33952 976 ? Sl Apr11 0:05 iperf -B 155.232.40.2 -s -f b -m -p 5059 -w 35651584 -t 20
bwctl 15450 0.0 0.0 33952 1232 ? Sl May21 0:04 iperf -B 155.232.40.2 -s -f b -m -p 5032 -w 35651584 -t 20
bwctl 15499 0.0 0.0 33952 968 ? Sl Mar10 0:05 iperf -B 155.232.40.2 -s -f b -m -p 5071 -w 35651584 -t 20
bwctl 16516 0.0 0.0 33952 968 ? Sl Jun06 0:06 iperf -B 155.232.40.2 -s -f b -m -p 5020 -w 35651584 -t 20
bwctl 16980 0.0 0.0 33952 968 ? Sl Apr25 0:01 iperf -B 155.232.40.2 -s -f b -m -p 5020 -w 35651584 -t 20
bwctl 17122 0.0 0.0 33952 976 ? Sl May26 0:04 iperf -B 155.232.40.2 -s -f b -m -p 5075 -w 35651584 -t 20
bwctl 17315 0.0 0.0 33952 964 ? Sl Apr22 0:05 iperf -B 155.232.40.2 -s -f b -m -p 5004 -w 35651584 -t 20
bwctl 17413 0.0 0.0 33952 968 ? Sl Mar09 0:06 iperf -B 155.232.40.2 -s -f b -m -p 5024 -w 35651584 -t 20
bwctl 18284 0.0 0.0 33952 1220 ? Sl Apr07 0:03 iperf -B 155.232.40.2 -s -f b -m -p 5067 -w 35651584 -t 20
bwctl 18768 0.0 0.0 33952 968 ? Sl May30 0:05 iperf -B 155.232.40.2 -s -f b -m -p 5015 -w 35651584 -t 20
bwctl 19219 0.0 0.0 33952 1224 ? Sl May24 0:03 iperf -B 155.232.40.2 -s -f b -m -p 5034 -w 35651584 -t 20
bwctl 19253 0.0 0.0 33952 1224 ? Sl Mar18 0:06 iperf -B 155.232.40.2 -s -f b -m -p 5034 -w 35651584 -t 20
bwctl 21006 0.0 0.0 33952 972 ? Sl Jun11 0:05 iperf -B 155.232.40.2 -s -f b -m -p 5065 -w 35651584 -t 20
bwctl 21108 0.0 0.0 33952 972 ? Sl Jun09 0:07 iperf -B 155.232.40.2 -s -f b -m -p 5013 -w 35651584 -t 20
bwctl 22203 0.0 0.0 33952 968 ? Sl Mar27 0:03 iperf -B 155.232.40.2 -s -f b -m -p 5078 -w 35651584 -t 20
bwctl 22826 0.0 0.0 33952 972 ? Sl May07 0:04 iperf -B 155.232.40.2 -s -f b -m -p 5086 -w 35651584 -t 20
bwctl 22850 0.0 0.0 33952 964 ? Sl Apr30 0:04 iperf -B 155.232.40.2 -s -f b -m -p 5033 -w 35651584 -t 20
bwctl 23515 0.0 0.0 33952 1228 ? Sl May10 0:08 iperf -B 155.232.40.2 -s -f b -m -p 5033 -w 35651584 -t 20
bwctl 23757 0.0 0.0 33952 964 ? Sl May23 0:05 iperf -B 155.232.40.2 -s -f b -m -p 5048 -w 35651584 -t 20
bwctl 23791 0.0 0.0 33952 1232 ? Sl Apr05 0:05 iperf -B 155.232.40.2 -s -f b -m -p 5071 -w 35651584 -t 20
bwctl 24848 0.0 0.0 33952 972 ? Sl Apr10 0:05 iperf -B 155.232.40.2 -s -f b -m -p 5086 -w 35651584 -t 20
bwctl 25112 0.0 0.0 33952 1220 ? Sl Apr15 0:05 iperf -B 155.232.40.2 -s -f b -m -p 5015 -w 35651584 -t 20
bwctl 25531 0.0 0.0 33952 968 ? Sl May29 0:03 iperf -B 155.232.40.2 -s -f b -m -p 5057 -w 35651584 -t 20
bwctl 26490 0.0 0.0 33952 968 ? Sl Apr28 0:06 iperf -B 155.232.40.2 -s -f b -m -p 5036 -w 35651584 -t 20
bwctl 27197 0.0 0.0 33952 968 ? Sl May19 0:05 iperf -B 155.232.40.2 -s -f b -m -p 5007 -w 35651584 -t 20
bwctl 27478 0.0 0.0 33952 968 ? Sl Apr24 0:05 iperf -B 155.232.40.2 -s -f b -m -p 5047 -w 35651584 -t 20
bwctl 27848 0.0 0.0 33952 972 ? Sl Mar26 0:06 iperf -B 155.232.40.2 -s -f b -m -p 5013 -w 35651584 -t 20
bwctl 27850 0.0 0.0 33952 968 ? Sl Mar12 0:04 iperf -B 155.232.40.2 -s -f b -m -p 5075 -w 35651584 -t 20
bwctl 28072 0.0 0.0 33952 972 ? Sl May16 0:02 iperf -B 155.232.40.2 -s -f b -m -p 5023 -w 35651584 -t 20
bwctl 28344 0.0 0.0 33952 964 ? Sl May21 0:03 iperf -B 155.232.40.2 -s -f b -m -p 5070 -w 35651584 -t 20
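A quick way to spot stragglers like these is to list iperf processes sorted by start time. This is only a sketch; the exact ps fields (etimes in particular) depend on the procps version installed:

    # Show iperf processes owned by the bwctl user, oldest first
    ps -o pid,user,lstart,etime,args -C iperf --sort=start_time

    # Flag anything that has been alive for more than a day (etimes = elapsed seconds)
    ps -eo pid,user,etimes,args | awk '$2 == "bwctl" && $3 > 86400 && $4 == "iperf"'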
And another:
I've seen leftover iperfs from bwctl on our perfSONARs. It is a cause for concern because the hangers-on seem to block ongoing testing, at least in our case. kill -9 works and appears to be sufficient to clean up and allow testing to resume.
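If you want to script that cleanup, something along these lines should work. Treat it as a sketch and run the pgrep preview first to confirm it only matches the stale bwctl-owned iperf servers:

    # Preview what would be killed (full-command-line match on processes owned by bwctl)
    pgrep -l -u bwctl -f '^iperf '

    # Force-kill the leftovers
    pkill -9 -u bwctl -f '^iperf '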
This is purely anecdotal (perhaps the pS team can provide further insight): if you have a lot of bwctl tests configured on a machine, you may want to check the test frequency percentage reported on the test configuration page. bwctl/iperf tests on one of the XSEDE perfSONARs were hanging much more frequently than on the other seven pSs. Not all of the XSEDE pSs run identical testing, and that machine happened to be running more frequent tests, reported at 13% of the time vs. 6-7% on the other pSs. Reducing the testing frequency to 9% seems to have eliminated the hangs.
The original user noted:
Kill -9 worked. It doesn't seem like the rogue iperfs caused any real problems, though - tests still carried on, perhaps because I had enough free ports. I'm only running tests 1% of the time. What I did think was a bit strange is that the tests are scheduled for every 4 hours, but the results I get on the graphs are every 8 hours. I will monitor the situation now and see what happens.
I talked with Aaron; he looked at BWCTL and noticed that it does issue a TERM and then a KILL, but further investigation is needed.
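For reference, the usual escalation pattern looks roughly like the sketch below. This is not BWCTL's actual code, just an illustration of what "a TERM and then a KILL" normally means for a child iperf process; the grace period is an assumption:

    kill -TERM "$pid"            # ask the child iperf to exit cleanly
    sleep 5                      # grace period (BWCTL's real timeout may differ)
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid"        # still running, force it
    fi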