jive-vlbi / jive5ab

The JIVE VLBI data recording and transport utility
9 stars 5 forks source link

Some files exist, some "Waiting for remote end to flush ... None" ? #11

Open varenius opened 3 years ago

varenius commented 3 years ago

I have seen this behaviour multiple times; time to ask about it. When running m5copy (jive5ab v3 on my side, probably also at recipient in Vienna) with --resume option (after transfers stopped for some reason), then I get a message that some files "already exists". However, my impression is that all files checked already exists. But, the message only appears for some of them. For the others, I get a few seconds of looking at "Waiting for remote end to flush ... None" which suggests there are no packets to flush? But then it should show "already exists" also for these files? Here is a bit of m5copy output:

oper@skirner ~ $ m5copy vbs://:2621/vo1074_ow* file://VIEN:2620/gpfs/cdata/incoming/ow/ -udt --resume -e 10 -r 2G
$Id: m5copy,v 1.60 2020/31/03 20:23:00 verkout Exp $
    copy VLBI data from somewhere to elsewhere
               (c) H. Verkouter

SRC::VBS [127.0.0.1:2621] vo1074_ow*  ===>  DST::FILE [VIEN:2620:<controlIP>] /gpfs/cdata/incoming/ow/
127.0.0.1::vo1074_ow_074-1800_0 => VIEN::/gpfs/cdata/incoming/ow/vo1074_ow_074-1800_0.m5a
127.0.0.1::vo1074_ow_074-1800_1 => VIEN::/gpfs/cdata/incoming/ow/vo1074_ow_074-1800_1.m5a
127.0.0.1::vo1074_ow_074-1800_2 => VIEN::/gpfs/cdata/incoming/ow/vo1074_ow_074-1800_2.m5a
127.0.0.1::vo1074_ow_074-1800_3 => VIEN::/gpfs/cdata/incoming/ow/vo1074_ow_074-1800_3.m5a
127.0.0.1::vo1074_ow_074-1800_4 => VIEN::/gpfs/cdata/incoming/ow/vo1074_ow_074-1800_4.m5a
/gpfs/cdata/incoming/ow/vo1074_ow_074-1800_4.m5a already exists, skipping
127.0.0.1::vo1074_ow_074-1800_5 => VIEN::/gpfs/cdata/incoming/ow/vo1074_ow_074-1800_5.m5a
127.0.0.1::vo1074_ow_074-1800_6 => VIEN::/gpfs/cdata/incoming/ow/vo1074_ow_074-1800_6.m5a
127.0.0.1::vo1074_ow_074-1800_7 => VIEN::/gpfs/cdata/incoming/ow/vo1074_ow_074-1800_7.m5a
127.0.0.1::vo1074_ow_074-1801_0 => VIEN::/gpfs/cdata/incoming/ow/vo1074_ow_074-1801_0.m5a
127.0.0.1::vo1074_ow_074-1801_1 => VIEN::/gpfs/cdata/incoming/ow/vo1074_ow_074-1801_1.m5a
/gpfs/cdata/incoming/ow/vo1074_ow_074-1801_1.m5a already exists, skipping
127.0.0.1::vo1074_ow_074-1801_2 => VIEN::/gpfs/cdata/incoming/ow/vo1074_ow_074-1801_2.m5a
Waiting for remote end to flush ... None

Here are my side of the j5ab logs for two represenativ scans, the 1801_0 which doesn't show the "already exists" message, and 1801_1 which does show the message:

202107608:42:31 2021-03-17 07:42:31.73: Processing command 'record=mk6:0'
202107608:42:31 2021-03-17 07:42:31.73: Reply: !record=  0 ;
202107608:42:31 2021-03-17 07:42:31.73: Processing command 'scan_set=vo1074_ow_074-1801_0::'
202107608:42:31 2021-03-17 07:42:31.73: openfile_type: found 3822901728 bytes in 15 chunks, 100%
202107608:42:31 2021-03-17 07:42:31.73: Reply: !scan_set= 0 ;
202107608:42:31 2021-03-17 07:42:31.73: Processing command 'scan_set?'
202107608:42:31 2021-03-17 07:42:31.73: Reply: !scan_set? 0 : -1 : vo1074_ow_074-1801_0 : 0 : 3822901728 ;
202107608:42:31 2021-03-17 07:42:31.73: Processing command 'disk2net=connect:193.170.79.54'
202107608:42:31 2021-03-17 07:42:31.73: compute_theoretical_ipd: ntrack=0 @0 dgsize=1472 => n_pkt_p_s=0
202107608:42:31 2021-03-17 07:42:31.74: openfile_type: found 3822901728 bytes in 15 chunks, 100%
202107608:42:31 2021-03-17 07:42:31.74: open_vbs: opened vo1074_ow_074-1801_0 as fd=2147483647 [vbs]
202107608:42:31 2021-03-17 07:42:31.74: getsok_udt: trying 193.170.79.54{193.170.79.54}:2630 ... 
202107608:42:31 2021-03-17 07:42:31.80: Reply: !disk2net=  1 ;
202107608:42:31 2021-03-17 07:42:31.80: udtwriter: writing to fd=1043591032
202107608:42:31 2021-03-17 07:42:31.80: Processing command 'disk2net?'
202107608:42:31 2021-03-17 07:42:31.80: Reply: !disk2net?  0 : connected : 193.170.79.54 : 0 : 0 : 3822901728 ;
202107608:42:31 2021-03-17 07:42:31.80: Processing command 'disk2net=on:+3822901728:'
202107608:42:31 2021-03-17 07:42:31.80: Reply: !disk2net=  0 ;
202107608:42:31 2021-03-17 07:42:31.80: vbsreader_c: start reading fd#2147483647, 3822901728 => 3822901728
202107608:42:31 2021-03-17 07:42:31.80: vbsreader_c: done 0byte
202107608:42:31 2021-03-17 07:42:31.80: Processing command 'disk2net?'
202107608:42:31 2021-03-17 07:42:31.80: Reply: !disk2net?  0 : active : 193.170.79.54 : 3822901728 : 3822901728 : 3822901728 ;
202107608:42:31 2021-03-17 07:42:31.80: udtwriter: re-enabling lingering on UDT socket for 180s
202107608:42:31 2021-03-17 07:42:31.80: udtwriter: stopping. wrote 0 (0byte)
202107608:42:31 2021-03-17 07:42:31.80: disk2net(vbs) guard function: transfer done
202107608:42:31 2021-03-17 07:42:31.80: close_vbs_c: closing fd#2147483647
202107608:42:31 2021-03-17 07:42:31.80: Processing command 'echo=off'
202107608:42:31 2021-03-17 07:42:31.80: close_filedescriptor: closed fd#1043591032
202107608:42:31 2021-03-17 07:42:31.80: disk2net(vbs) finalization done.
202107608:42:31 2021-03-17 07:42:31.83: Reply: !echo= 0 ;
202107608:42:42 2021-03-17 07:42:42.21: Processing command 'disk2net=disconnect'
202107608:42:42 2021-03-17 07:42:42.21: Reply: !disk2net=  6 : Not doing disk2net ;

202107608:42:52 2021-03-17 07:42:52.28: Processing command 'record=mk6:0'
202107608:42:52 2021-03-17 07:42:52.28: Reply: !record=  0 ;
202107608:42:52 2021-03-17 07:42:52.29: Processing command 'scan_set=vo1074_ow_074-1801_1::'
202107608:42:52 2021-03-17 07:42:52.29: openfile_type: found 3822770144 bytes in 15 chunks, 100%
202107608:42:52 2021-03-17 07:42:52.29: Reply: !scan_set= 0 ;
202107608:42:52 2021-03-17 07:42:52.29: Processing command 'scan_set?'
202107608:42:52 2021-03-17 07:42:52.29: Reply: !scan_set? 0 : -1 : vo1074_ow_074-1801_1 : 0 : 3822770144 ;
202107608:42:52 2021-03-17 07:42:52.29: Processing command 'disk2net=connect:193.170.79.54'
202107608:42:52 2021-03-17 07:42:52.29: compute_theoretical_ipd: ntrack=0 @0 dgsize=1472 => n_pkt_p_s=0
202107608:42:52 2021-03-17 07:42:52.29: openfile_type: found 3822770144 bytes in 15 chunks, 100%
202107608:42:52 2021-03-17 07:42:52.29: open_vbs: opened vo1074_ow_074-1801_1 as fd=2147483647 [vbs]
202107608:42:52 2021-03-17 07:42:52.29: getsok_udt: trying 193.170.79.54{193.170.79.54}:2630 ... 
202107608:42:52 2021-03-17 07:42:52.36: Reply: !disk2net=  1 ;
202107608:42:52 2021-03-17 07:42:52.36: udtwriter: writing to fd=1043591031
202107608:42:52 2021-03-17 07:42:52.36: Processing command 'disk2net?'
202107608:42:52 2021-03-17 07:42:52.36: Reply: !disk2net?  0 : connected : 193.170.79.54 : 0 : 0 : 3822770144 ;
202107608:42:52 2021-03-17 07:42:52.36: Processing command 'disk2net=on:+3822770144:'
202107608:42:52 2021-03-17 07:42:52.36: Reply: !disk2net=  0 ;
202107608:42:52 2021-03-17 07:42:52.36: Processing command 'disk2net?'
202107608:42:52 2021-03-17 07:42:52.36: Reply: !disk2net?  0 : connected : 193.170.79.54 : 3822770144 : 3822770144 : 3822770144 ;
202107608:42:52 2021-03-17 07:42:52.36: vbsreader_c: start reading fd#2147483647, 3822770144 => 3822770144
202107608:42:52 2021-03-17 07:42:52.36: vbsreader_c: done 0byte
202107608:42:52 2021-03-17 07:42:52.36: udtwriter: re-enabling lingering on UDT socket for 180s
202107608:42:52 2021-03-17 07:42:52.36: udtwriter: stopping. wrote 0 (0byte)
202107608:42:52 2021-03-17 07:42:52.36: disk2net(vbs) guard function: transfer done
202107608:42:52 2021-03-17 07:42:52.36: close_vbs_c: closing fd#2147483647
202107608:42:52 2021-03-17 07:42:52.36: close_filedescriptor: closed fd#1043591031
202107608:42:52 2021-03-17 07:42:52.36: disk2net(vbs) finalization done.
202107608:42:53 2021-03-17 07:42:53.36: Processing command 'disk2net?'
202107608:42:53 2021-03-17 07:42:53.36: Reply: !disk2net?  0 : inactive ;
202107608:42:53 2021-03-17 07:42:53.39: Processing command 'disk2net=disconnect'
202107608:42:53 2021-03-17 07:42:53.39: Reply: !disk2net=  6 : Not doing disk2net ;
202107608:43:03 2021-03-17 07:43:03.43: Processing command 'record=mk6:0'
202107608:43:03 2021-03-17 07:43:03.43: Reply: !record=  0 ;

Interestingly, it seem to say "wrote 0 (0byte)" in both cases. So my assumption that both files were actually transferred seem correct. Curious then why the "already exists" message is only shown for one of them?

haavee commented 3 years ago

This probably has to to with (yet another) timing issue. Whether m5copy displays the message ... already exists, skipping depends on the code handling the actual transfer throwing an IgnoreFile exception. However, the condition under which this is thrown is subject to timing, see the following snippet, and also the preceding comment. Sometimes the sender finishes itself before the while loop is entered, and sometimes it doesn't.

It would be interesting to run m5copy -d ... -d is the sekrit hidden command line option that prints all traffic (commands and replies) between m5copy and the sending/receiving jive5ab)

        # HV: 9 dec 2014  GRRRR. Roger H. finds that, sometimes,
        #                 "disk2net=on" takes so long that it remains
        #                 in the connected state by the time we get to the
        #                 .progress(self) method. So we must wait here
        #                 for the status to become active
        #     4 jan 2016  BE finds that status could go to inactive
        #                 if the transfer finishes (very) quickly
        while True:
            r = CTRL.send_query("disk2net?", [0])
            if "connected" not in r:
                break
            progress_print("\r>>>> Waiting for disk2net to start")
            time.sleep(1)
        if "inactive" in r:
            # Sender immediately stopped sending because everything is already there!
            raise IgnoreFile