jive-vlbi / jive5ab

The JIVE VLBI data recording and transport utility
9 stars 5 forks source link

When does "tstat? 1 : Retry - we're initialized now" happen? #6

Closed varenius closed 4 years ago

varenius commented 4 years ago

I usually record 30 seconds of test-data before an experiment to verify that everything works. I have a script, initially developed by Simon Casey I think (with some changes by me) which sends the record=on, off commands and checks that the data rate is correct. The script uses the "tstat?" command for jive5ab to get the data rate. Sometimes this fails. This seems to be when after a tstat? question jive5ab answers "Reply: !tstat? 1 : Retry - we're initialized now : vbsrecord ;" instead of e.g. "Reply: !tstat? 0 : 1.00s : vbsrecord : UdpsNorRead 8.23231Gbps : F 0.0% ;". I haven't been able to understand when the "we're initialized now" reply occurs.

First I thought it was the first time after starting up jive5ab, but that's not true - I have got the reply twice in a row. Then I thought I must send it once to initialise, then wait some time (1 sec) and then send it again. But even so, I still get "we're initialised now" twice sometimes. Example log segment:

202028609:44:38 2020-10-12 07:44:38.93: Processing command 'version?'
202028609:44:38 2020-10-12 07:44:38.93: Reply: !version?  0 :  jive5ab : 3.0.0 : 64bit : Release : skirner : 09-Sep-2020 : 18h06m41s : nossapi :  ;
202028609:44:38 2020-10-12 07:44:38.93: Processing command 'rtime?'
202028609:44:38 2020-10-12 07:44:38.93: Reply: !rtime? 0 : 35602s : 36598.9GB : 9.38571% : VDIF : 128 : 0MHz : 8224Mbps ;
202028609:44:38 2020-10-12 07:44:38.93: Processing command 'mode?'
202028609:44:38 2020-10-12 07:44:38.93: Reply: !mode? 0 : VDIF_8192-8192-64-2 : VDIF : 128 : 64000000.000 : 8192 ;
202028609:44:38 2020-10-12 07:44:38.93: Processing command 'tstat?'
202028609:44:38 2020-10-12 07:44:38.93: Reply: !tstat? 0 : 0.0 : no_transfer ;
202028609:44:38 2020-10-12 07:44:38.93: Processing command 'record=on:testrec_freja_201012_074438'
202028609:44:39 2020-10-12 07:44:39.10: getsok: req. server socket@udpsnor:26411
202028609:44:39 2020-10-12 07:44:39.10: Reply: !record=  0  ;
202028609:44:39 2020-10-12 07:44:39.10: Processing command 'tstat?'
202028609:44:39 2020-10-12 07:44:39.10: Reply: !tstat?  1 : Retry - we're initialized now : vbsrecord ;
202028609:44:39 2020-10-12 07:44:39.10: udpsnorreader: fd=10 data:8224 total:8232 pkts:32640 avbs: 1
202028609:44:39 2020-10-12 07:44:39.10: per_sender_type[10.100.0.201:46227] - first sequencenr# 18840564630
202028609:44:39 2020-10-12 07:44:39.10: handle_seqnr[10.100.0.201:46227] - switch to ACK every 10th packet
202028609:44:39 2020-10-12 07:44:39.10: per_sender_type[10.100.0.202:46227] - first sequencenr# 18840252124
202028609:44:39 2020-10-12 07:44:39.10: handle_seqnr[10.100.0.202:46227] - switch to ACK every 10th packet
202028609:44:39 2020-10-12 07:44:39.10: per_sender_type[10.100.0.205:46227] - first sequencenr# 18839314638
202028609:44:39 2020-10-12 07:44:39.10: handle_seqnr[10.100.0.205:46227] - switch to ACK every 10th packet
202028609:44:39 2020-10-12 07:44:39.10: per_sender_type[10.100.0.206:46227] - first sequencenr# 18839002132
202028609:44:39 2020-10-12 07:44:39.10: handle_seqnr[10.100.0.206:46227] - switch to ACK every 10th packet
202028609:44:39 2020-10-12 07:44:39.10: per_sender_type[10.100.0.204:46227] - first sequencenr# 18839627146
202028609:44:39 2020-10-12 07:44:39.10: handle_seqnr[10.100.0.204:46227] - switch to ACK every 10th packet
202028609:44:39 2020-10-12 07:44:39.10: per_sender_type[10.100.0.207:46227] - first sequencenr# 18838689654
202028609:44:39 2020-10-12 07:44:39.10: handle_seqnr[10.100.0.207:46227] - switch to ACK every 10th packet
202028609:44:39 2020-10-12 07:44:39.10: per_sender_type[10.100.0.200:46227] - first sequencenr# 18840877146
202028609:44:39 2020-10-12 07:44:39.10: handle_seqnr[10.100.0.200:46227] - switch to ACK every 10th packet
202028609:44:39 2020-10-12 07:44:39.10: per_sender_type[10.100.0.203:46227] - first sequencenr# 18839939646
202028609:44:39 2020-10-12 07:44:39.10: handle_seqnr[10.100.0.203:46227] - switch to ACK every 10th packet
202028609:44:40 2020-10-12 07:44:40.10: Processing command 'tstat?'
202028609:44:40 2020-10-12 07:44:40.10: Reply: !tstat?  1 : Retry - we're initialized now : vbsrecord ;
202028609:45:09 2020-10-12 07:45:09.13: Processing command 'record=off'
202028609:45:09 2020-10-12 07:45:09.13: close_filedescriptor: closed fd#10
202028609:45:09 2020-10-12 07:45:09.14: udpsnorreader: stopping
202028609:45:09 2020-10-12 07:45:09.21: net2vbs guard function: transfer done
202028609:45:09 2020-10-12 07:45:09.69: Reply: !record=  1 ;
202028609:45:09 2020-10-12 07:45:09.69: Processing command 'evlbi?'
202028609:45:09 2020-10-12 07:45:09.69: Reply: !evlbi? 0 : total : 3754599 : loss : 0 ( 0.00%) : out-of-order : 0 ( 0.00%) : extent : 0seqnr/pkt ;
202028609:45:09 2020-10-12 07:45:09.69: Processing command 'scan_check?:4000000'
202028609:45:09 2020-10-12 07:45:09.69: openfile_type: found 30877822176 bytes in 116 chunks, 100%
202028609:45:09 2020-10-12 07:45:09.84: Reply: !scan_check? 0 : ? : testrec_freja_201012_074438 : VDIF : 128 : 2020y286d07h44m39.1021s : 30.0369s : 64Mbps : 148032 : 8224 ;

In contrast, here is a log segments taken a bit later where the "tstat?" commands ro return some numbers:

202028609:45:44 2020-10-12 07:45:44.08: Processing command 'version?'
202028609:45:44 2020-10-12 07:45:44.08: Reply: !version?  0 :  jive5ab : 3.0.0 : 64bit : Release : skirner : 09-Sep-2020 : 18h06m41s : nossapi :  ;
202028609:45:44 2020-10-12 07:45:44.08: Processing command 'rtime?'
202028609:45:44 2020-10-12 07:45:44.08: Reply: !rtime? 0 : 35572s : 36568GB : 9.37779% : VDIF : 128 : 0MHz : 8224Mbps ;
202028609:45:44 2020-10-12 07:45:44.08: Processing command 'mode?'
202028609:45:44 2020-10-12 07:45:44.08: Reply: !mode? 0 : VDIF_8192-8192-64-2 : VDIF : 128 : 64000000.000 : 8192 ;
202028609:45:44 2020-10-12 07:45:44.08: Processing command 'tstat?'
202028609:45:44 2020-10-12 07:45:44.08: Reply: !tstat? 0 : 0.0 : no_transfer ;
202028609:45:44 2020-10-12 07:45:44.08: Processing command 'record=on:testrec_freja_201012_074544'
202028609:45:44 2020-10-12 07:45:44.27: getsok: req. server socket@udpsnor:26411
202028609:45:44 2020-10-12 07:45:44.27: Reply: !record=  0  ;
202028609:45:44 2020-10-12 07:45:44.27: udpsnorreader: fd=10 data:8224 total:8232 pkts:32640 avbs: 1
202028609:45:44 2020-10-12 07:45:44.27: per_sender_type[10.100.0.202:46227] - first sequencenr# 18841270355
202028609:45:44 2020-10-12 07:45:44.27: handle_seqnr[10.100.0.202:46227] - switch to ACK every 10th packet
202028609:45:44 2020-10-12 07:45:44.27: per_sender_type[10.100.0.207:46227] - first sequencenr# 18839707894
202028609:45:44 2020-10-12 07:45:44.27: handle_seqnr[10.100.0.207:46227] - switch to ACK every 10th packet
202028609:45:44 2020-10-12 07:45:44.27: per_sender_type[10.100.0.204:46227] - first sequencenr# 18840645386
202028609:45:44 2020-10-12 07:45:44.27: handle_seqnr[10.100.0.204:46227] - switch to ACK every 10th packet
202028609:45:44 2020-10-12 07:45:44.27: per_sender_type[10.100.0.203:46227] - first sequencenr# 18840957886
202028609:45:44 2020-10-12 07:45:44.27: handle_seqnr[10.100.0.203:46227] - switch to ACK every 10th packet
202028609:45:44 2020-10-12 07:45:44.27: per_sender_type[10.100.0.200:46227] - first sequencenr# 18841895386
202028609:45:44 2020-10-12 07:45:44.27: handle_seqnr[10.100.0.200:46227] - switch to ACK every 10th packet
202028609:45:44 2020-10-12 07:45:44.27: per_sender_type[10.100.0.206:46227] - first sequencenr# 18840020363
202028609:45:44 2020-10-12 07:45:44.27: handle_seqnr[10.100.0.206:46227] - switch to ACK every 10th packet
202028609:45:44 2020-10-12 07:45:44.27: Processing command 'tstat?'
202028609:45:44 2020-10-12 07:45:44.27: Reply: !tstat?  0 : 64.17s : vbsrecord : UdpsNorRead -128.571Mbps : F 0.0% ;
202028609:45:44 2020-10-12 07:45:44.27: per_sender_type[10.100.0.201:46227] - first sequencenr# 18841582885
202028609:45:44 2020-10-12 07:45:44.27: handle_seqnr[10.100.0.201:46227] - switch to ACK every 10th packet
202028609:45:44 2020-10-12 07:45:44.27: per_sender_type[10.100.0.205:46227] - first sequencenr# 18840332893
202028609:45:44 2020-10-12 07:45:44.27: handle_seqnr[10.100.0.205:46227] - switch to ACK every 10th packet
202028609:45:45 2020-10-12 07:45:45.27: Processing command 'tstat?'
202028609:45:45 2020-10-12 07:45:45.27: Reply: !tstat?  0 :  1.00s : vbsrecord : UdpsNorRead 8.23231Gbps : F 0.0% ;
202028609:46:14 2020-10-12 07:46:14.29: Processing command 'record=off'
202028609:46:14 2020-10-12 07:46:14.29: close_filedescriptor: closed fd#10
202028609:46:14 2020-10-12 07:46:14.29: udpsnorreader: stopping
202028609:46:14 2020-10-12 07:46:14.39: net2vbs guard function: transfer done
202028609:46:14 2020-10-12 07:46:14.86: Reply: !record=  1 ;
202028609:46:14 2020-10-12 07:46:14.86: Processing command 'evlbi?'
202028609:46:14 2020-10-12 07:46:14.86: Reply: !evlbi? 0 : total : 3753656 : loss : 0 ( 0.00%) : out-of-order : 0 ( 0.00%) : extent : 0seqnr/pkt ;
202028609:46:14 2020-10-12 07:46:14.86: Processing command 'scan_check?:4000000'
202028609:46:14 2020-10-12 07:46:14.87: openfile_type: found 30870066944 bytes in 116 chunks, 100%
202028609:46:15 2020-10-12 07:46:15.01: Reply: !scan_check? 0 : ? : testrec_freja_201012_074544 : VDIF : 128 : 2020y286d07h45m44.2685s : 30.0314s : 64Mbps : 2327392 : 8224 ;

I have read the jiveab documentation, and the "we're initialised" is not mentioned there. Am I missing something here, or is the tstat behaviour unpredictable (although perhaps by design)? It would be good to know what is required to ensure tstat will return the UdpsNorRead rate. I am perfectly happy to run some commands, run them a number of times, wait some specific time etc. I just need to ensure how I get a valid number from tstat, so that my test-script won't fail.

haavee commented 4 years ago

That tstat? internal state is kept per runtime environment and the intent of the code is that that be cleared if a new transfer is started in that runtime (modulo bugs where that doesn't happen). The tstat? query pays no attention to who is querying - e.g. if multiple users call tstat? at random, the readings each will get depend critically on the order in which jive5ab processes the queries.

In the case of twice we're initialised it could be that the tstat? state was empty at the first call (e.g. first transfer started since program startup), causing that call to initialize it. If the transfer starting, due to multithread execution, clears the state after that call, then the 2nd call to tstat? again finds an empty state, initializes it, and informs you about that.

For your purpose, you might look at the tstat= command in stead. That version does not rely on state inside jive5ab to return relative values; it always returns the current values of the unix time stamp and defined byte counters. This allows client code to do the state analysis ("what is jive5ab doing right now") and a differentiation of time/byte counters to estimate the current data rate.

varenius commented 4 years ago

I see. Now I understand that part of the documentation, thanks!

Forgive my stupidity here, but trying to do the monitor-calculation using tstat= (as suggested in the documentation) it seems I am still missing something. To compare, I run both tstat= and tstat? as

202028616:46:12 2020-10-12 14:46:12.94: Processing command 'tstat='
202028616:46:12 2020-10-12 14:46:12.94: Reply: !tstat=  0 : 1602513972.941 : vbsrecord : UdpsNorRead : 1031700096 : FIFOLength : 0;
202028616:46:12 2020-10-12 14:46:12.94: Processing command 'tstat?'
202028616:46:12 2020-10-12 14:46:12.94: Reply: !tstat?  0 : 79.48s : vbsrecord : UdpsNorRead -207.628Mbps : F 0.0% ;
202028616:46:14 2020-10-12 14:46:14.94: Processing command 'tstat='
202028616:46:14 2020-10-12 14:46:14.94: Reply: !tstat=  0 : 1602513974.945 : vbsrecord : UdpsNorRead : 3094375872 : FIFOLength : 0;
202028616:46:14 2020-10-12 14:46:14.94: Processing command 'tstat?'
202028616:46:14 2020-10-12 14:46:14.94: Reply: !tstat?  0 :  2.00s : vbsrecord : UdpsNorRead 8.23161Gbps : F 0.0% ;

The "tstat?" result is 8.2Gbps, as it should be. But trying to use tstat= I put the numbers together as

bytediff = 3094375872-1031700096
timediff = 1602513974.945-1602513972.941
rate = 8*bytediff / (timediff * 1024**3) = 7.67 Gbps

I would expect these numbers to agree, i.e. 8 Gbps both ways. Any idea what I am missing? (It is monday afternoon after all...)

haavee commented 4 years ago
>>> bytediff = 3094375872-1031700096
>>> timediff = 1602513974.945-1602513972.941
>>> rate = 8*bytediff / timediff 
>>> rate
8234234846.332349
>>> rate / 1000**3
8.234234846332349
>>> 

It would seem that the Gbps should really be Gibps; jive5ab may actually predate the common use of "gibibytes and bits" vs "gigabytes and bits". The use of Gbps is misleading here. You could submit an issue report to have this fixed ... ;-)

varenius commented 4 years ago

Aha!

Allright - might as well :). I have now created #7 so I'm closing this ticket.