jive-vlbi / jive5ab

The JIVE VLBI data recording and transport utility

Possible segfault with multiple m5copy instances using the same jive5ab? #24

Open - varenius opened this issue 1 year ago

varenius commented 1 year ago

We have a jive5ab instance running in a screen, to be used for sending data to correlators. On a number of occasions, this instance has crashed during data transfer with a segfault (ugh). This evening it happened again, and I realised that I had just started a correlation job on the same machine (I never correlate while recording, but usually I don't worry about transfers, as I assume they should just go a bit slower). The jive5ab screen had this to say (with the remote receiver IP masked):

[...]
202304621:52:55 2023-02-15 20:52:55.11: Processing command 'disk2net=connect:REMOTE_IP_MASKED'
202304621:52:55 2023-02-15 20:52:55.11: compute_theoretical_ipd: ntrack=0 @0 dgsize=1472 => n_pkt_p_s=0
202304621:52:55 2023-02-15 20:52:55.11: openfile_type: found 7325175968 bytes in 29 chunks, 100%
202304621:52:55 2023-02-15 20:52:55.11: open_vbs: opened v23046_oe_046-1914_1 as fd=2147483627 [vbs]
202304621:52:55 2023-02-15 20:52:55.11: getsok_udt: trying REMOTE_IP_MASKED{REMOTE_IP_MASKED}:46206 ...
202304621:52:55 2023-02-15 20:52:55.32: Reply: !disk2net=  1 ;
202304621:52:55 2023-02-15 20:52:55.32: udtwriter: writing to fd=1053368054
202304621:52:55 2023-02-15 20:52:55.33: Processing command 'disk2net?'
202304621:52:55 2023-02-15 20:52:55.33: Reply: !disk2net?  0 : connected : REMOTE_IP_MASKED : 0 : 0 : 7325175968 ;
202304621:52:55 2023-02-15 20:52:55.34: Processing command 'disk2net=on::'
202304621:52:55 2023-02-15 20:52:55.34: Reply: !disk2net=  0 ;
202304621:52:55 2023-02-15 20:52:55.34: vbsreader_c: start reading fd#2147483627, 0 => 7325175968
202304621:52:55 2023-02-15 20:52:55.35: Processing command 'disk2net?'
202304621:52:55 2023-02-15 20:52:55.35: Reply: !disk2net?  0 : active : REMOTE_IP_MASKED : 0 : 0 : 7325175968 ;
202304621:52:55 2023-02-15 20:52:55.36: Processing command 'echo=off'
202304621:52:55 2023-02-15 20:52:55.39: udtwriter: switch to ipd=0 [set=0, , theoretical=0]
LLogr error or EOF reading from stdin: Success
Segmentation fault (core dumped)

Any clues as to why this could happen? (We use m5copy to transfer data to this correlator since they do not (yet) accept etd/etc.)

haavee commented 1 year ago

Hmmm, sorry to learn about this! Any idea whether, when it happens, it happens at the same phase of a transfer? That knowledge might help ... Other questions: which version is this? There now is a 3.1.0 release candidate (3.1.0-rc1) that has all the goodies in, including some possible bug fixes. Not saying that this is necessarily fixed there, but I just want to make sure it's not caused by a known-bad thing. If a parallel resource-intensive process seems to be triggering this, then my thoughts go to a race condition, which might make detection a bit more difficult - which is why I asked about it happening "at the same place". Do you think it's feasible to try to make it repeatable, by doing something (data-)intensive whilst a transfer is happening?

varenius commented 1 year ago

The jive5ab we have running here reports itself as "2023-02-16 08:09:55.75: main: jive5a [ jive5ab : 3.1.0 : 64bit : Debug : gyller : 22-Mar-2021 : 21h05m05s : nossapi : ] ready", which I think is a pre-release version of 3.1. We could try the newer version, but it might be worth discussing a few more ideas before we switch, just to see whether the problem is related to any of the possible bug fixes you mention.

To transfer data as fast as possible with m5copy, we had opted in this case for running multiple m5copy instances in parallel, six of them. Each instance used the same sender and receiver jive5ab instances, but with a different data port, each being executed like: m5copy vbs://:2621/files_part_X* file://REMOTE_IP:46221/data0/v23046_oe/ -udt --resume -p 4620X -t 120, with X being different for the six instances (each transferring a different time chunk of data via a different port).
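For concreteness, this is roughly equivalent to the following (a sketch only; the file pattern, REMOTE_IP, the control port 2621 and the data ports are as in the command above):

for X in 1 2 3 4 5 6; do
    # each instance gets its own file pattern and its own data port 4620X
    m5copy "vbs://:2621/files_part_${X}*" \
           "file://REMOTE_IP:46221/data0/v23046_oe/" \
           -udt --resume -p "4620${X}" -t 120 &
done
wait   # wait for all six transfers to finish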

After multiple failures last night, I decided to fall back to a single instance. No issues so far (a couple of hours). I also tried running the correlation job in parallel to stress it; no issues yet. So I now suspect it has to do with the multiple instances. Does this seem reasonable? I tried to re-run with multiple instances, but it did not fail directly (it never did before either; it always took some time), so it is hard to prove. But we could set up some proper multi-instance local tests if this seems like a plausible idea.

(Preferred solution would be to convince correlator to accept etd/etc)

haavee commented 1 year ago

Thanks for the update! Based on this feedback I'm starting to suspect the UDT library of having an issue. On more than one occasion we've found it to be a bit more fragile than one would like. Possibly pushing this too far makes the streams cross accidentally, and we all know that that is a no-go ;-) Another option could be to convince the correlator to accept vbs chunks and use vbs_fs to present them as single files; a vbs -> vbs transfer can be set up with multiple senders/receivers, setting the number of readers and writers with record = nthread : nDiskReaders : nDiskWriters.
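Just as a sketch of what I mean (the control port and the thread counts here are illustrative, not recommendations):

# raise the number of disk readers/writers on a jive5ab instance before starting
# the transfer; assumes the control port is reachable on 2621 as in your m5copy URL
printf 'record=nthread:4:4\n' | nc -w 2 localhost 2621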

varenius commented 1 year ago

no vbs option I'm afraid, only vbs --> file.

After waiting a little longer with 5 streams it crashed again, same error. So I'm fairly sure it's connected to this.

Interestingly, the local jive5ab crashed with the output above; no instance left running. Remotely, however, jive5ab survived partly. I'll explain the "partly". We need to start a receiving jive5ab instance at the remote correlator. The remote server does not have screen installed. Therefore, we start "screen" locally, then ssh to the remote correlator in that screen, and start jive5ab there. After the crash, I checked the screen and found I had been logged out from the remote correlator. However, when logging back in, I found with "ps" that jive5ab was still running and accepting data. So whatever caused this crash, it made the local jive5ab fail totally, and the remote one only partly...
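For clarity, the receiving side is started roughly like this (hostnames and the binary name are illustrative):

screen -S corr-recv               # local screen session on our side
ssh correlator                    # inside the screen: log in to the remote correlator
./jive5ab-issue25-d31d7ce-Debug   # start the receiving jive5ab there (no screen remotely)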

haavee commented 1 year ago

Would it be possible to compile the local version with debug info (-DCMAKE_BUILD_TYPE=Debug), in case that wasn't done? With that version a coredump would be appreciated; alternatively, load the program in the debugger, let 'er rip, try to catch the SIGSEGV, and have the backtrace shed some light on where to start looking ...
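Something along these lines, just as a sketch (the source path and binary name are placeholders):

# configure and build with debug info
cmake -DCMAKE_BUILD_TYPE=Debug /path/to/jive5ab-src && make
# run under gdb; when the SIGSEGV hits, dump a backtrace of every thread
gdb -ex run -ex 'thread apply all bt' -ex quit ./jive5ab-3.1.0-64bit-Debug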

varenius commented 1 year ago

Already built with Debug; now prepared for a core dump next time it happens.

Meanwhile, Simon C here checked dmesg after the failures. Almost all of them had the same message:

[Wed Feb 15 22:33:50 2023] jive5ab-3.1.0-6[16890]: segfault at 7b22488daf60 ip 000056510cfaac74 sp 00007f23c4ff8d40 error 4 in jive5ab-3.1.0-64bit-Debug[56510c882000+90c000]
[Wed Feb 15 22:33:50 2023] Code: 75 e0 48 89 55 d8 48 8b 45 e8 48 8b 50 38 48 8b 45 e8 48 8b 40 40 48 39 c2 75 07 b8 00 00 00 00 eb 48 48 8b 45 e8 48 8b 40 38 <48> 8b 10 48 

This would perhaps suggest it's code-related rather than e.g. bad hardware?

I'll get back with core dump if/when it happens again.

haavee commented 1 year ago

Oh, it's definitely code-related :-) No worries there. This snippet might already be enough to trace it to somewhere in the binary, which would help.
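For example, with the Debug binary still around, the instruction pointer in that dmesg line can be mapped to a source line by hand: the offset is ip minus the mapping base shown in the brackets. A sketch (the binary path is assumed, and the arithmetic may need adjusting for how the executable segment is mapped):

# 0x56510cfaac74 (ip) - 0x56510c882000 (base) = 0x728c74
addr2line -f -C -e /usr/local/bin/jive5ab-3.1.0-64bit-Debug 0x728c74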

haavee commented 1 year ago

Could you talk with @psi-c (your colleague) and use the latest issue-25 branch? That one contains some fixes from issue #25 but I've added a commit there that may hopefully fix this one as well ...

varenius commented 1 year ago

Thanks! We have installed the latest issue-25 code, will use it for upcoming transfers and see if it fails. May take a few weeks to feel confident if fixed or not - will report back once we have logged some significant transfer hours.

varenius commented 1 year ago

We had some time without errors, but yesterday evening we got a segfault again. No core dump (as far as I could see; maybe @psi-c can find one), but in dmesg I found the lines

[Mon Mar 20 22:52:23 2023] jive5ab-issue25[868]: segfault at 7e405441efe8 ip 000055d0b3f08e83 sp 00007f3f9b7fdcc0 error 4 in jive5ab-issue25-d31d7ce-Debug[55d0b3449000+d27000]
[Mon Mar 20 22:52:23 2023] Code: 30 48 89 45 c8 c7 45 c0 00 00 00 00 8b 45 c0 3b 45 9c 7d 12 48 8b 45 c8 48 8b 40 20 48 89 45 c8 83 45 c0 01 eb e6 48 8b 45 c8 <8b> 40 18 85 c0 78 40 e8 bd 2e 00 00 48 89 c2 48 8b 45 c8 48 8b 40

Not sure if it is relevant, but dmesg also shows many lines like this for several hours before the crash:

[Mon Mar 20 22:38:59 2023] net_ratelimit: 33 callbacks suppressed
[Mon Mar 20 22:38:59 2023] RPC: fragment too large: 83888899
[Mon Mar 20 22:38:59 2023] RPC: fragment too large: 1224736768
[Mon Mar 20 22:39:03 2023] RPC: fragment too large: 1936029709
[Mon Mar 20 22:39:03 2023] RPC: fragment too large: 1312236900
[Mon Mar 20 22:39:11 2023] RPC: fragment too large: 83888899
[Mon Mar 20 22:44:30 2023] RPC: fragment too large: 369295619
[Mon Mar 20 22:44:30 2023] RPC: fragment too large: 369295619
[Mon Mar 20 22:44:30 2023] RPC: fragment too large: 369295617
[Mon Mar 20 22:44:30 2023] RPC: fragment too large: 369295617
[Mon Mar 20 22:44:30 2023] RPC: fragment too large: 369295619
[Mon Mar 20 22:44:30 2023] RPC: fragment too large: 369295619
[Mon Mar 20 22:44:30 2023] RPC: fragment too large: 369295619
[Mon Mar 20 22:44:30 2023] RPC: fragment too large: 369295619
[Mon Mar 20 22:44:30 2023] RPC: fragment too large: 369295363
[Mon Mar 20 22:44:30 2023] RPC: fragment too large: 369295363

but maybe that's unrelated.

haavee commented 1 year ago

That is weird - could it be that someone is trying RPC-based attacks by randomly sending UDP data to ports? For the UDT protocol incoming traffic must be allowed; is the machine firewalled to allow incoming traffic only from the remote host where you're transferring to? Otherwise, can we correlate these messages with ongoing transfers (using the jive5ab log)?
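Something like this is what I mean, purely as a sketch (REMOTE_IP and the 46201-46206 data-port range are taken from your m5copy example and would need adapting):

# allow UDP (UDT) data only from the known transfer partner, drop the rest
iptables -A INPUT -p udp -s REMOTE_IP --dport 46201:46206 -j ACCEPT
iptables -A INPUT -p udp --dport 46201:46206 -j DROP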

varenius commented 1 year ago

@haavee The RPC messages are probably indeed due to an unrestrictive firewall; @psi-c should be able to comment on those. I think it's probably unrelated to the error at hand. In any case, we had a few more failures, now with the issue-25 branch; for the most recent one we managed to get another core dump, which you can find at /var/lib/apport/coredump/core._usr_local_bin_jive5ab-issue25-d31d7ce-Debug.1002.c3fe42e6-d416-4853-9577-81fc18c9b8bc.15040.8103. The on-screen messages were

...
202310710:50:55 2023-04-17 08:50:55.58: netreader: waiting for incoming connection
202310710:50:55 2023-04-17 08:50:55.92: netreader: incoming dataconnection from 202.223.161.132:55110
202310710:50:55 2023-04-17 08:50:55.92: udtreader: read fd=851523302 rd:1312 wr:1312 bs:268435200
LLogr error or EOF reading from stdin: Success
Segmentation fault (core dumped)

and dmesg -T said

[Mon Apr 17 10:51:10 2023] jive5ab-issue25[1824668]: segfault at 7e8f80d6f388 ip 0000556b806c05a5 sp 00007f8fc2ffc0c0 error 4 in jive5ab-issue25-d31d7ce-Debug[556b7fc00000+d27000]
[Mon Apr 17 10:51:10 2023] Code: ec 7e 06 8b 45 ec 89 45 f0 8b 45 f0 48 63 d0 48 8b 45 d8 48 8b 00 8b 4d e8 48 63 c9 48 c1 e1 03 48 01 c8 48 8b 00 48 8b 40 20 <48> 8b 08 48 8b 45 d8 8b 40 24 48 98 48 01 c1 48 8b 45 d0 48 89 ce
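In case it helps, the core should load in gdb against the matching binary, something like this (the binary path is inferred from the core file name):

gdb -ex 'thread apply all bt' -ex quit \
    /usr/local/bin/jive5ab-issue25-d31d7ce-Debug \
    /var/lib/apport/coredump/core._usr_local_bin_jive5ab-issue25-d31d7ce-Debug.1002.c3fe42e6-d416-4853-9577-81fc18c9b8bc.15040.8103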

haavee commented 1 year ago

I get an EPERM trying to access the core dump :-(

psi-c commented 1 year ago

See if you can give it a poke now :)

haavee commented 1 year ago

Interesting observation + question following from that. "Of course" it happens inside the UDT library again. I see the code SIGSEGVing somewhere downstream from udtreader( ... ) calling UDT::recv(...), which surprises me. I thought the transfers were initiated outwards, not pulling data in? The udtreader() function should only be entered when downloading data over UDT, as far as I know.

Can you elaborate a bit on the actual transfer command line(s) used leading up to the SIGSEGV?

varenius commented 1 year ago

Well, we have transfers in both directions. We have had segfaults on transfers initiated from both Ishioka and Ny-Ålesund, as well as on our own outgoing transfers (mainly to USNO). In this particular case, it was dealing with incoming data from Ishioka when it aborted. I don't have their exact command, but I can ask if that helps? At least I know the target was to store as VBS on our end. Maybe the fix applied so far addressed an outgoing bug, but there is still one for receiving data?

haavee commented 1 year ago

Thx for the clarification. Yes, it might be that the patch I did fixed outgoing issues. Please try to catch more SIGSEGVs; that would help in confirming whether they're happening in the same bit-o-code (or not, which might also "help").

varenius commented 1 year ago

Another one:

[Mon May  8 07:28:22 2023] jive5ab-issue25[894287]: segfault at 7baff9235208 ip 00007fb00cb5591e sp 00007fafd97f9880 error 4 in libc.so.6[7fb00cadc000+195000]
[Mon May  8 07:28:22 2023] Code: 0f ef c9 0f 11 4b 20 48 89 ea 0f 11 43 10 48 83 ca 01 48 89 53 08 49 89 2e 4d 39 f8 74 8e 4c 89 cb f6 c3 0f 0f 85 e2 00 00 00 <48> 8b 53 08 89 d1 c1 e9 04 83 e9 02 49 8d 4c cd 10 49 39 cc 0f 85

This time also while receiving data:

...
202312807:28:02 2023-05-08 05:28:02.58: Processing command 'mode=VDIF_1280-2048-32-1'
202312807:28:02 2023-05-08 05:28:02.58: Reply: !mode= 0 ;
202312807:28:02 2023-05-08 05:28:02.75: Processing command 'record=on:c23126_is1_0920_2023126092025_3'
LLogr error or EOF reading from stdin: Success
Segmentation fault (core dumped)

core dump at gyller:/tmp/core-jive5ab-issue25.1846251.gyller.1683523682

varenius commented 1 year ago

Note: recent segfaults happen with only ONE incoming transfer stream, so this segfault seems to be unrelated to the "multiple" in the subject of this issue. Still, I think it's best for clarity to continue the list of crashes in this ticket? Most recent one:

jive5ab log tail:

202314210:53:10 2023-05-22 08:53:10.77: Processing command 'mode=VDIF_1280-2048-32-1'
202314210:53:10 2023-05-22 08:53:10.77: Reply: !mode= 0 ;
202314210:53:10 2023-05-22 08:53:10.94: Processing command 'record=on:c23140_is1_0926_2023140092645_2'
LLogr error or EOF reading from stdin: Success
Segmentation fault (core dumped)

dmesg -T:

[Mon May 22 21:47:43 2023] python3[2287149]: segfault at 520241e825c8 ip 0000560241a39302 sp 00007ffc9adcc040 error 4 in python3.10[560241979000+2b3000]
[Mon May 22 21:47:43 2023] Code: 54 55 48 89 fd 53 48 83 ec 18 4c 8b 67 10 4c 89 e3 48 83 eb 01 78 27 4d 39 ee 75 6c 4c 8b 64 dd 18 4d 85 e4 0f 84 94 01 00 00 <49> 8b 44 24 08 f6 80 a9 00 00 00 40 75 20 48 83 eb 01 73 de 31 c0

core dump at gyller:/tmp/core-jive5ab-issue25.15128.gyller.1684745591

EDIT: This happened after we had disabled "memory overcommit", which was previously enabled on this machine. To disable it, we set vm.overcommit_memory = 2 and vm.overcommit_ratio = 97 in sysctl.conf and then rebooted.
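Concretely, the change was along these lines (values as quoted above; a sketch only, since we edited sysctl.conf and rebooted rather than applying it live):

echo 'vm.overcommit_memory = 2' | sudo tee -a /etc/sysctl.conf
echo 'vm.overcommit_ratio = 97' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p   # would apply the settings without a reboot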

haavee commented 1 year ago

Thx! I don't have (read) permission on the coredump :-( But maybe the reason it's now crashing is that malloc() & friends are not succeeding and the lib isn't handling that. I hope the coredump will tell me some more: can you open it up for me?

psi-c commented 1 year ago

Readable now!