jive-vlbi / jive5ab

The JIVE VLBI data recording and transport utility

set_disks? not entirely benign? #33

Open psi-c opened 11 months ago

psi-c commented 11 months ago

It would appear that should one have a fascination with the "set_disks?" query, jive5ab becomes rather annoyed and ultimately retaliates by replying

Reply: !set_disks? 0 : 0 ;

Which for a flexbuff isn't particularly useful.

One buff (30 mount points) got away with 1037 queries before bad things happened, the other (36 mount points) 1092, so that looks to be the order of what jive5ab (linux/ulimit?) tolerates. From the log, could it be that set_disks? is opening the mountpoints again on each call, without freeing them?

202333420:16:02 2023-11-30 19:16:02.65: Processing command 'set_disks?'
202333420:16:02 2023-11-30 19:16:02.65: Reply: !set_disks? 0 : 30 : /mnt/disk0 : /mnt/disk1 : /mnt/disk10 : /mnt/disk11 : /mnt/disk12 : /mnt/disk13 : /mnt/disk14 : /mnt/disk15 : /mnt/disk>
202333420:16:04 2023-11-30 19:16:04.81: Processing command 'record=mk6:0'
202333420:16:04 2023-11-30 19:16:04.81: Reply: !record= 0 ;
202333420:16:04 2023-11-30 19:16:04.82: Processing command 'scan_set=v23334_oe_334-1735_3::'
202333420:16:04 2023-11-30 19:16:04.82: scanMk6RecordingMountpoint(/mnt/disk21)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.82: scanMk6RecordingMountpoint(/mnt/disk22)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.82: scanMk6RecordingMountpoint(/mnt/disk23)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.82: scanMk6RecordingMountpoint(/mnt/disk24)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.82: scanMk6RecordingMountpoint(/mnt/disk25)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.82: scanMk6RecordingMountpoint(/mnt/disk26)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.82: scanMk6RecordingMountpoint(/mnt/disk27)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.82: scanMk6RecordingMountpoint(/mnt/disk15)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.82: scanMk6RecordingMountpoint(/mnt/disk2)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.82: scanMk6RecordingMountpoint(/mnt/disk16)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.82: scanMk6RecordingMountpoint(/mnt/disk20)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.82: scanMk6RecordingMountpoint(/mnt/disk12)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.82: scanMk6RecordingMountpoint(/mnt/disk28)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.82: scanMk6RecordingMountpoint(/mnt/disk13)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.82: scanMk6RecordingMountpoint(/mnt/disk3)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.83: scanMk6RecordingMountpoint(/mnt/disk5)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.83: scanMk6RecordingMountpoint(/mnt/disk7)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.83: scanMk6RecordingMountpoint(/mnt/disk9)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.83: scanMk6RecordingMountpoint(/mnt/disk6)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.83: scanMk6RecordingMountpoint(/mnt/disk29)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.83: scanMk6RecordingMountpoint(/mnt/disk4)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.83: scanMk6RecordingMountpoint(/mnt/disk8)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.83: scanMk6RecordingMountpoint(/mnt/disk18)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.83: scanMk6RecordingMountpoint(/mnt/disk17)/ ::opendir fails - Too many open files
202333420:16:04 2023-11-30 19:16:04.83: scanMk6RecordingMountpoint(/mnt/disk14)/ ::opendir fails - Too many open files

haavee commented 11 months ago

Interesting issue!

All the more puzzling since set_disks? does nothing but return the current list of mountpoints (a list of strings formatted into a single string); no system calls should be involved in that process other than, eventually, the network I/O to your client (and possibly memory allocation for the reply string), no matter how often you call it. Even more interesting: the set_disks= command does not even use ::opendir() directly; behind the scenes it uses nftw(3), which may in turn call ::opendir(). The code even makes sure nftw(3) isn't called from multiple threads at the same time b/c its MT-Safe status is "need not be thread safe".
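For illustration, here's a minimal sketch of that point (assumed shape, not jive5ab's actual code): the mountpoint list already lives in memory after set_disks=, so answering set_disks? is pure string formatting with no file or directory system calls involved.

```cpp
// Sketch: answering a "set_disks?"-style query from in-memory state.
// No opendir()/open() anywhere - only string handling and the eventual
// network write of the reply (not shown).
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical state as previously set by a "set_disks=" command.
static const std::vector<std::string> mountpoints = {
    "/mnt/disk0", "/mnt/disk1", "/mnt/disk2"
};

std::string set_disks_reply() {
    std::ostringstream reply;
    reply << "!set_disks? 0 : " << mountpoints.size();
    for (const auto& mp : mountpoints)
        reply << " : " << mp;
    reply << " ;";
    return reply.str();   // only memory allocation, no system calls
}

int main() {
    std::cout << set_disks_reply() << '\n';
    return 0;
}
```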

All the calls to ::opendir() happen as a result of the scan_set= and as far as I can see now all of those are balanced by an accompanying ::closedir() in the application.
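One common C++ way to guarantee that balance by construction is an RAII wrapper around the DIR* handle, so closedir() runs on every exit path. A sketch of the pattern only (no claim that this is how jive5ab implements it):

```cpp
// Sketch: RAII wrapper keeping opendir()/closedir() balanced even on
// early returns or exceptions.
#include <dirent.h>
#include <cstdio>
#include <iostream>
#include <string>

class DirHandle {
public:
    explicit DirHandle(const std::string& path)
        : dirp_(::opendir(path.c_str())) {}
    ~DirHandle() { if (dirp_) ::closedir(dirp_); }   // always balanced
    DirHandle(const DirHandle&) = delete;
    DirHandle& operator=(const DirHandle&) = delete;
    explicit operator bool() const { return dirp_ != nullptr; }
    DIR* get() const { return dirp_; }
private:
    DIR* dirp_;
};

int main() {
    DirHandle dir("/tmp");                       // any existing directory
    if (!dir) { std::perror("opendir"); return 1; }
    while (const dirent* entry = ::readdir(dir.get()))
        std::cout << entry->d_name << '\n';
    return 0;                                    // closedir() runs here
}
```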

In this log, what I see is what I suspect to be a truncated reply to the set_disks? query. Whether that is an effect of copy/paste into the ticket gone wrong, or a sign of the internal program state getting bolloxed up, could be checked.

psi-c commented 11 months ago

Interesting, your description is indeed how I'd expect it to behave (i.e. it affects nothing). The final (good) reply is just truncated in my copy/paste; it's the full set of disks in the log. The only change I'm aware of around the time this started happening is the insertion of a "set_disks?" query for each scan recorded, hence the finger pointing... I'll check nothing else was fiddled with at the same time!

psi-c commented 11 months ago

So it seems they are running several m5copy instances in parallel (vbs to remote file) and somewhere along the line files are not being closed, though the transfers seem to complete correctly at the destination.

After one jive5ab gave up last night they switched to another on a different port and resumed. On skirner, that new jive5ab has around 440 open vbs files (the transfer finished about 4 hours ago), most of which are open by 2 threads plus the main PID. Can't see anything unusual in the log though. The same was done on gyller, but that jive5ab seems to have a healthy number of files open...

haavee commented 11 months ago

Ah ok. It may be that indeed another transfer in the same process is hogging up file descriptors. I think I heard before that some transfers seem to not release all the file descriptors. This is all vbs2file being done?

psi-c commented 11 months ago

This is using disk2net (vbs to net); odd that both buffs suffered the same fate last night. I've suggested removing the set_disks? query and repeating, to see if it still happens.

haavee commented 11 months ago

My .02 arbitrary currency units say that it won't affect the outcome 🤞 It is far more likely that disk2net has leaky file descriptors - especially having heard such allusions from other sources too.

psi-c commented 11 months ago

I'll likely be offering you .02 zl in a couple of weeks then!

I did count the occurrences of

disk2net=on
disk2net(vbs) finalization done
close_filedescriptor: closed

in the second log file on skirner, each popped up 442 times, which matches rather well with the number of files open - 1 per scan.

haavee commented 11 months ago

Crikey. Well, not such arbitrary currency units after all! But it looks like there's one fd leaked per scan. That's something to work with, I say. Would you also happen to know if the transfers are happening in "transient" runtimes or in a single/dedicated one?

psi-c commented 11 months ago

I count 444 occurrences of transient so I'd say that's a big yes! The first 2 transients died due to no write permissions (I guess a bad path) on the remote.

haavee commented 11 months ago

That's very useful to know, more hints at where to search. Thanks!

haavee commented 11 months ago

I think I have an idea what's happening: we introduced a counted-pointer-to-filedescriptor in the disk2net transfer and I suspect that the mechanism to close the file descriptor if the refcount drops to 0 does not work in this particular instance.

This change was introduced to fix a SIGSEGV: there is a code path that tries to access the disk2net file descriptor after it's been disconnected. Technically that should be almost impossible; my feeling is that it's the UDT library in combination with a non-zero lingering setting: data might still be in flight whilst part of the code thinks it is already done. Specifically, the sender has no means of knowing if or when all the data it has put into the socket has safely arrived at the other side. This is again due to the very old disk2net protocol definition (i.e. there isn't any), which means we cannot change the behaviour of disk2net w/o breaking a lot. For example, in vbs2vbs transfers, which were not held to conform to the old Mark5ABC / MIT Haystack documented standards, there is a tiny protocol - an information exchange between sender and receiver: the sender waits for an explicit ACK message from the receiver that all bytes have been received. For disk2net this cannot be done w/o breaking everything.
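For illustration, a tiny sketch of that kind of end-of-transfer handshake (the single-byte ACK and the framing here are invented for the example, not the real vbs2vbs wire format), run over a socketpair so it is self-contained: the sender half-closes its side and then blocks until the receiver confirms everything arrived, rather than trusting close().

```cpp
// Sketch: sender waits for an explicit ACK that all bytes were received.
#include <sys/socket.h>
#include <unistd.h>
#include <iostream>
#include <thread>

int main() {
    int sv[2];
    if (::socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return 1;

    std::thread receiver([fd = sv[1]] {
        char    buf[1024];
        ssize_t total = 0, n;
        while ((n = ::read(fd, buf, sizeof buf)) > 0) total += n; // drain until EOF
        const char ack = 'A';
        ::write(fd, &ack, 1);                  // explicit "all bytes received"
        std::cout << "receiver got " << total << " bytes\n";
        ::close(fd);
    });

    const char payload[] = "scan data ...";
    ::write(sv[0], payload, sizeof payload);
    ::shutdown(sv[0], SHUT_WR);                // no more data from the sender
    char ack;
    if (::read(sv[0], &ack, 1) == 1 && ack == 'A')
        std::cout << "sender: receiver acknowledged completion\n";
    ::close(sv[0]);
    receiver.join();
    return 0;
}
```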

My expectation was that this mechanism would leak at most one file descriptor: it would keep the last one used open, but as soon as a new transfer was started, the counted pointer would be replaced with a new one, making the refcount of the previously opened file drop to zero and having it closed.
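A minimal sketch of that mechanism, assuming a std::shared_ptr-style counted descriptor (the actual jive5ab type may well differ): the fd is only closed when the last reference disappears, so a holder that never gets replaced keeps exactly one descriptor open.

```cpp
// Sketch: reference-counted file descriptor closed by a custom deleter.
#include <fcntl.h>
#include <unistd.h>
#include <iostream>
#include <memory>

using counted_fd = std::shared_ptr<int>;

counted_fd open_counted(const char* path) {
    int fd = ::open(path, O_RDONLY);
    return counted_fd(new int(fd), [](int* p) {
        // Runs only when the refcount hits zero.
        if (*p >= 0) { ::close(*p); std::cout << "closed fd " << *p << '\n'; }
        delete p;
    });
}

int main() {
    counted_fd current = open_counted("/etc/hostname");  // hypothetical file
    // ... a transfer uses copies of 'current'; those copies going away
    //     does not close the descriptor ...
    current = open_counted("/etc/hostname");  // replacing it closes the old fd
    // If 'current' is simply kept around, its last fd stays open: at most
    // one leaked descriptor per holder.
    return 0;                                 // final fd closed on destruction
}
```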

Now that I'm typing this I realise that this may be per runtime. So we have each runtime leaking one fd, because it never gets reused. Probably the "transient" runtimes with unique names interact badly with this, each runtime leaking one descriptor.

Possible ways forward:

Note: vbs2vbs has more advantages: it is trivially parallelisable - you can specify how many chunks should be in flight at the same time, and the sending and receiving jive5ab will use multiple threads for sending and receiving, so that even TCP could become an option (one TCP stream = trickling; five parallel streams might do a lot). We tested between .FI and .NL back in the day: a single TCP stream went quite slow (a few 100 Mbps on a 10 G link), but throughput scaled almost linearly with the number of parallel chunks in flight, up to 5-6 Gbps. A rough sketch of that chunks-in-flight idea follows below.
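As a rough illustration only (not jive5ab code; the worker and chunk counts are made up, and the actual network send is stubbed out), N sender threads each claim the next chunk index from a shared counter, so several transfers are always in flight at once instead of one serial stream:

```cpp
// Sketch: keeping several chunks in flight across parallel sender threads.
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

constexpr int kChunks  = 32;   // chunks making up one recording (illustrative)
constexpr int kWorkers = 5;    // parallel "streams" in flight (illustrative)

void send_chunk(int chunk) {
    // Stand-in for pushing one vbs chunk over its own network connection.
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    std::cout << "chunk " << chunk << " sent\n";
}

int main() {
    std::atomic<int> next{0};
    std::vector<std::thread> workers;
    for (int w = 0; w < kWorkers; ++w)
        workers.emplace_back([&next] {
            // Each worker repeatedly claims the next unsent chunk.
            for (int c = next++; c < kChunks; c = next++)
                send_chunk(c);
        });
    for (auto& t : workers) t.join();
    return 0;
}
```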