iperf3 hangs with -R and -Z flags

GoogleCodeExporter commented 9 years ago

When running the new test script (test_commands.sh), the iperf3 client hangs on 
2 of the tests:

./src/iperf3 -c $host -P 2 -t 5 -R
and
./src/iperf3 -c $host -Z -t 5

And when you ^C the client, the server dies.

Original issue reported on code.google.com by bltier...@es.net on 20 Dec 2013 at 10:51

GoogleCodeExporter commented 9 years ago

This happened on OSX, but Linux seems OK.

Original comment by bltier...@es.net on 20 Dec 2013 at 11:20

Added labels: Milestone-3.0-Release

GoogleCodeExporter commented 9 years ago

This seems to reliably reproduce the problem on Linux:

#!/bin/sh
set -x
while [ 1 ]
do
  ./src/iperf3 -P 2 -c localhost -t 5
  ./src/iperf3 -P 2 -c localhost -t 5 -R
done

It works for 3-6 loops and then locks up. (Once, the server crashed.)

Hopefully that will help track it down.
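
A variant of the loop above that stops on its own when a run hangs might look like the sketch below. It is not part of the test scripts: run_until_hang is a made-up helper name, the 15-second limit and 20-loop cap in the usage lines are arbitrary, and it assumes GNU coreutils timeout is installed (it exits with status 124 when the limit is hit).

```sh
#!/bin/sh
# Sketch only: repeat a command and stop at the first run that exceeds a
# time limit. Assumes GNU coreutils `timeout` (exit status 124 = timed out).

# run_until_hang LIMIT_SECONDS MAX_LOOPS COMMAND [ARGS...]
# Prints the loop number of the first timed-out run and returns 1.
# Prints nothing and returns 0 if every run finished within the limit.
run_until_hang() {
    limit=$1
    max_loops=$2
    shift 2
    i=1
    while [ "$i" -le "$max_loops" ]; do
        status=0
        timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
        if [ "$status" -eq 124 ]; then
            echo "$i"
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}

# Possible use with the reverse-mode command from the script above
# (hypothetical limits; 15 s leaves room for a 5 s test):
# run_until_hang 15 20 ./src/iperf3 -P 2 -c localhost -t 5 -R
```

Note that this only detects a client that fails to finish in time. A client that exits with an error, or a server that crashes, is not counted as a hang and would need a separate check.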

Original comment by bltier...@es.net on 22 Dec 2013 at 3:09

Added labels: Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

Running the server in gdb shows that the server is crashing on this line:

Program received signal SIGSEGV, Segmentation fault.
0x000000305784812c in vfprintf () from /lib64/libc.so.6

Which is called from here:

1808                iprintf(test, report_sum_bw_retrans_format, start_time, end_time, 
ubuf, nbuf, retransmits, irp->omitted?report_omitted:"");

Maybe Susant's new patch will fix this?

Original comment by bltier...@es.net on 24 Dec 2013 at 4:15

GoogleCodeExporter commented 9 years ago

I am also able to reproduce this. With the reverse (-R) option, the server crashes:

getsockopt(5, SOL_TCP, TCP_INFO, "\1\0\0\0\0\7w\0(\21\3\0@\234\0\0\270\377\0\0\30\2\0\0\0\0\0\0\0\0\0\0"..., [104]) = 0
getsockopt(7, SOL_TCP, TCP_INFO, "\1\0\0\0\0\7w\0(\21\3\0@\234\0\0\270\377\0\0\30\2\0\0\0\0\0\0\0\0\0\0"..., [104]) = 0
write(1, "- - - - - - - - - - - - - - - - "..., 50- - - - - - - - - - - - - - -

) = 50
write(1, "[ 5] 8.02-9.00 sec 382 MB"..., 67[ 5] 8.02-9.00 sec 382 MBytes 3.27 Gbits/sec 5
) = 67
write(1, "[ 7] 8.02-9.00 sec 381 MB"..., 67[ 7] 8.02-9.00 sec 381 MBytes 3.26 Gbits/sec 0
) = 67
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x5} ---
+++ killed by SIGSEGV (core dumped) +++
Segmentation fault (core dumped)


(gdb) bt
#0  0x000000399144908f in vfprintf () from /lib64/libc.so.6
#1  0x000000000040542a in vprintf (__arg=0x7fffffffda08, 
    __fmt=0x4110e0 <report_sum_bw_retrans_format> "\340SUM] %6.2f-%-6.2f sec  %ss  %ss/sec", ' ' <repeats 14 times>, "%s\n") at /usr/include/bits/stdio.h:38
#2  iprintf (test=test@entry=0x617010, format=0x4110e0 <report_sum_bw_retrans_format> "\340SUM] %6.2f-%-6.2f sec  %ss  %ss/sec", ' ' <repeats 14 times>, "%s\n") at iperf_api.c:2405
#3  0x000000000040618b in iperf_print_intermediate (test=test@entry=0x617010) at iperf_api.c:1808
#4  0x0000000000406468 in iperf_reporter_callback (test=0x617010) at iperf_api.c:2008
#5  0x000000000040c9ac in tmr_run (nowP=nowP@entry=0x7fffffffdd10) at timer.c:189
#6  0x0000000000409f43 in iperf_run_server (test=test@entry=0x617010) at iperf_server_api.c:586
#7  0x0000000000401e92 in run (test=0x617010) at main.c:116
#8  main (argc=<optimized out>, argv=0x7fffffffdf68) at main.c:91

(gdb) f 0
#0  0x000000399144908f in vfprintf () from /lib64/libc.so.6
(gdb) list
43  __STDIO_INLINE int
44  getchar (void)
45  {
46    return _IO_getc (stdin);
47  }
48  
49  
50  # ifdef __USE_MISC
51  /* Faster version when locking is not necessary.  */
52  __STDIO_INLINE int

It looks like the stack is getting corrupted somewhere, which leads to the crash.
I need to dig further to find out what is really causing it.

Original comment by susant%redhat.com@gtempaccount.com on 24 Dec 2013 at 5:26

GoogleCodeExporter commented 9 years ago

I've been doing some digging into this.  The hang and the crash *might* have 
two different causes, or might be two different manifestations of the same 
problem.  Below are notes from a private email on this subject, describing 
what I saw with FreeBSD 10.0 and -R: a hang but no crash.

-----

A slightly lower level symptom of this problem is that at the end of the
test, the client tries to send a TEST_END state change message to the
server over the control connection.  When in -R mode, the server doesn't
seem to get it or read it reliably.  However if I kill the client
(because it seems hung) the server immediately gets the TEST_END and
tries to do the end-of-test processing (it can't do this successfully
because at this point the client has died and closed its side of the
control connection).

In non -R mode this part all works as expected (I see the client send
the TEST_END and the server receives it immediately, as we would expect).

This is all on FreeBSD 10.0, client and server on the same machine (so
far it looks like the configuration where client and server are on the
same machine is particularly vulnerable to this problem).

Original comment by bmah@es.net on 24 Dec 2013 at 7:12

GoogleCodeExporter commented 9 years ago

Partial fix committed in c499d0008f7d.  There was basically a deadlock between 
the client and server in -R mode, see commit log for more details.

Not closing this yet...need to do some more tests to get a warm fuzzy feeling 
about the fix first.  Also note that this doesn't address the server-side 
crashes that have been reported (but which I have not personally witnessed).

Original comment by bmah@es.net on 3 Jan 2014 at 6:09

GoogleCodeExporter commented 9 years ago

Fixed the -P and -R server-side crash reported via Comments 2, 3, and 4, in 
423166a54849.  This only affected Linux; it was a mangled printf format string 
that only got used on that platform (it would have been used on any other 
platform with retransmit statistics, but there aren't currently any).

It's clear to me now that there were multiple issues being reported in this one 
bug.  :-p

Original comment by bmah@es.net on 3 Jan 2014 at 6:38

GoogleCodeExporter commented 9 years ago

If gcc isn't spitting out warnings on format strings as const char variables, 
it'd probably make sense to turn the format strings into typedefs or something 
to ensure that gcc spits out a warning if this kind of mismatch happens.

Original comment by AaronMat...@gmail.com on 3 Jan 2014 at 6:43

GoogleCodeExporter commented 9 years ago

Good point.  I don't see any warning messages for the format string mismatch 
(on a working copy rolled back to before my fix), but gcc isn't compiling with 
any warnings enabled either, as far as I can tell:

gcc -DHAVE_CONFIG_H -I.     -g -O2 -MT iperf_api.o -MD -MP -MF 
.deps/iperf_api.Tpo -c -o iperf_api.o iperf_api.c

I'm not sure why this is...I'm used to living under -Wall and -Werror.  Yet 
another thing to investigate.
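
To illustrate what -Wall and -Werror would buy here, the sketch below compiles a deliberately broken C file twice. Everything in it is hypothetical: demo.c is not iperf3 code, try_compile is a made-up helper, and it assumes a C compiler is installed as cc. It only covers a format string written as a literal at the call site; whether the compiler can check a format kept in a separate variable depends on whether it can see the string's contents where the call is made.

```sh
#!/bin/sh
# Sketch only: build a tiny C file with and without warning flags.
# Assumes a C compiler is available as `cc`. demo.c is made up for
# illustration and is not taken from iperf3.

workdir=$(mktemp -d)
cat > "$workdir/demo.c" <<'EOF'
#include <stdio.h>
int main(void)
{
    int retransmits = 5;
    /* Bug on purpose: %s is handed an int. */
    printf("%s\n", retransmits);
    return 0;
}
EOF

# try_compile FLAGS...  -> succeeds only if the compiler exits with status 0
try_compile() {
    cc "$@" -c -o "$workdir/demo.o" "$workdir/demo.c" 2>/dev/null
}

# Same base flags as the build line quoted above, then with warnings as errors.
plain=fail
try_compile -g -O2 && plain=ok
strict=fail
try_compile -g -O2 -Wall -Werror && strict=ok

rm -rf "$workdir"
echo "no warning flags: $plain"
echo "-Wall -Werror:    $strict"
```

With a typical gcc or clang I would expect the first build to go through and the second to be rejected, but I have not checked every compiler version.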

Original comment by bmah@es.net on 3 Jan 2014 at 7:04

GoogleCodeExporter commented 9 years ago

Update:  Just one sub-issue remaining from this bug report...that's the hang 
with -Z.  I've been able to observe this on Mac OS, as reported in the initial 
bug report.  It doesn't happen every time, at least not on my MacBook; 
sometimes the -Z test works just fine.

So far I have not been able to reproduce this problem on my other two 
development platforms (FreeBSD 10 and CentOS 6).

It's not clear to me if there's something platform-specific lurking about or 
not, although the sendfile(2) call used by the -Z option is slightly different 
on the three platforms I've been using (therefore there are slightly different 
codepaths being used).

Original comment by bmah@es.net on 3 Jan 2014 at 10:52

GoogleCodeExporter commented 9 years ago

In my tests, OSX hangs every time. Linux is now working fine.

Original comment by bltier...@gmail.com on 4 Jan 2014 at 3:21

GoogleCodeExporter commented 9 years ago

Update:  I'm still seeing this issue (but not consistently) on MacOS 10.8 and 
MacOS 10.9.

Original comment by bmah@es.net on 21 Jan 2014 at 9:08

jkorell / iperf

iperf3 hangs with -R and -Z flags #129