pflarue / ardop

Source for various version of ARDOP
Other
29 stars 5 forks source link

Spradic [Repeating Last Frame] neverendin loop #66

Open IU1IPB opened 3 months ago

IU1IPB commented 3 months ago

Hello Peter,

I'm sporadically encountering the issue quite well described in the attached logs. The logs comes from the second (in few minutes) connection with a nearby gateway, with signal 559 as usual (on 80m). The first connection, not included in the log, was fast and perfect. I have repeated it in few minutes, but, as you can see in the log, after an "ARQ CONNECTION ESTABLISHED WITH HB9PC, SESSION BW = 2000 HZ", a neverending loop of :

[DecodeFrame] Frame: DataACK [Repeating Last Frame] Sending Frame Type IDLE [DecodeFrame] Frame: DataACK [Repeating Last Frame] Sending Frame Type IDLE [DecodeFrame] Frame: DataACK

....

...starts and stays till timeout or session killing. This strongly sporadically happens not only with this gateway but also with other CMS (e.g. HB9AK) : of course it's not a problem to stop and start again a new connection, that usually succeeds without issue, as happened this morning. It seems to be a small bug, perhaps in the CMS software, but I let it to your analysis. I'm on ardopcf Version 1.0.4.1.2 over PI/OS Debian 1:6.6.31-1+rpt1 (2024-05-29) aarch64 with last version of pat-

Thank-you Best regards IU1IPB Ugo

ko_debug.log ko_log.log

pflarue commented 3 months ago

@IU1IPB, I appreciate your creating an Issue for this problem. I have experienced this myself, most recently a couple of days ago.

It is unclear which program is causing this problem. During this exchange there are four programs interacting with each other. There is ardopcf, Pat, the Ardop implementation at the gateway station (probably not ardopcf), and the host program at the gateway station. I believe that this problem is at the gateway end since it is sending DataACK when BREAK is expected. However, it is possible that the reason for this is that ardopcf and Pat failed to transmit something correctly before sending IDLE. Further examination is required, and it may be slow if I cannot reliably reproduce the problem.

Once the root cause of the problem is found, if it cannot be fixed with changes to ardopcf, I will try to help the maintainers of the other programs to understand what needs to be fixed.

UgoPoddine commented 3 months ago

Thank-you as usual, Peter. I can add few remarks more, probably not useful. A) at least a couple of years ago I had made John aware of the same issue, then, if it was at client side, this is probably long lasting. B) It seems to me that the issue happens when propagation conditions are very good, than when PSK frames are involved. Not sure .. C) I have always found the sysop of HB9AK very kind : if you need my help for testing, perhaps with this CMS, feel free to ask. Of course, being sporadic, it's difficult to replicate...

73, Ugo

Dinsmoor commented 2 months ago

I put the exchange in your log file in a table

ISS = Information Sending Station

IRS = Information Receiving Station

log line calling station cms
97 ConReq2000
109 ConAck2000
120 ConAck2000
133 DataAck (begin session as IRS)
143 IDLE (begin session as ISS)
154 BREAK (try to becomes ISS)
157 DataAck (ok; ISStoIRS)
169 [128 bytes data] 4PSK.500.100.E
181 DataAck (await next frame)
189 [25 bytes data] 4PSK.500.100.O
200 DataAck (await next frame)
210 IDLE (we have no data)
217 BREAK (try to become ISS)
225 DataACK (ok;ISStoIRS)
230 [77 bytes data] 4PSK.1000.100.E
239 DataAck(await next frame)
246 IDLE
255 DataACK
repeats repeats repeats

From looking at this table, it seems that your host application had nothing more to send to the CMS, and the CMS never tried to take control of the information flow with the OpenB2F (with Winlink extensions) protocol here: https://winlink.org/B2F

I was experimenting with ARQ mode with my own host application, and a similar endless IDLE/DataACK transaction occurs whenever the calling station doesn't end up following the B2F protocol (I connected to a winlink station in ARQ mode with hamchat, but I didn't sent the ;PQ/;PR authentication challenge and response, so it just sat idle, waiting for me to issue an authentication challenge.

https://winlink.org/sites/default/files/downloads/winlink_data_flow_and_data_packaging.pdf Screenshot_20240701_153431

The Calling Station (ISS) (your station) never sends the ;PR nnnnn when it's supposed to.

This may indicate a bug in Pat's handling of the B2F protocol, or how it reads and writes to the ardopcf data socket, or how ardopcf actually passes data to the host. Pat seems to work mostly okay whenever I use it with packet with direwolf, but I have also had issues where I could connect to a CMS station, they would talk with ax25, but no data was ever exchanged.

I have documented a few of these issues with passing message data to the host under "FEC" related bugs, (issue #35 is an example of something somewhat similar) so it may not actually be Pat's fault, but that it may have got old/bad data from the data socket. If Pat couldn't read correct 157 bytes that should have been passed to it from ardopcf (see Issue #35) then it would never respond with ;PR nnn authentication response, because it never saw the ';PQ nnnn` challenge.

Just my thoughts.

Dinsmoor commented 2 months ago

Today I got authorized as a sysop with Winlink.org so I could reproduce this locally. (running callsign k7otr for both stations) After setting up my own RMS Trimode/RMS Relay in my pickup truck, I had my home station continuously connect to the station.

for i in $(seq 1 10);
 do
  pat connect ardop:///K7OTR?bw=500MAX
 done

A normal session log for RMS Relay looks like this:

Logfiles

``` 2024/07/02 18:00:30 Operating mode: Network post office. Store messages locally for pickup. 2024/07/02 18:01:12 Received incoming connection 11429296 from 127.0.0.1 Current connection count = 0 2024/07/02 18:01:12 Link type Ardop_2000 on freq 28124500 2024/07/02 18:01:41 Local parse incoming SID: [PAT-0.16.0-B2FHMG$] 2024/07/02 18:01:46 Disconnection of 11429296 2024/07/02 18:01:46 RemoveConnectionInstance called by TCP disconnect 2024/07/02 18:01:46 Session closed by RemoveConnectionInstance: TCP disconnect 2024/07/02 18:01:46 2024/07/02 18:01:12 *** Connected to RMS Relay 2024/07/02 18:01:12 K7OTR K7OTR 2024/07/02 18:01:12 ;SQ: 70554751 2024/07/02 18:01:12 ;SR: 48402701 28124500 43 2024/07/02 18:01:12 Network post office. Messages will be stored locally for pickup by the recipient. 2024/07/02 18:01:12 ;WL2K-Post_Office 2024/07/02 18:01:12 [WL2K-3.2.10.0-B2FWIHM$] 2024/07/02 18:01:12 ;PQ: 84653817 2024/07/02 18:01:12 K7OTR RMS Relay > 2024/07/02 18:01:41 ;FW: K7OTR 2024/07/02 18:01:41 [Pat-0.16.0-B2FHMG$] 2024/07/02 18:01:41 ;PR: 77535409 2024/07/02 18:01:41 ; K7OTR DE K7OTR (CN88Qi) 2024/07/02 18:01:41 FF 2024/07/02 18:01:41 FQ 2024/07/02 18:01:46 *** Disconnected from RMS Relay ``` The failed transaction looks like this: ``` 2024/07/02 18:02:35 Received incoming connection 41622463 from 127.0.0.1 Current connection count = 0 2024/07/02 18:02:35 Link type Ardop_2000 on freq 28124500 2024/07/02 18:06:24 Disconnection of 41622463 2024/07/02 18:06:24 RemoveConnectionInstance called by TCP disconnect 2024/07/02 18:06:24 Session closed by RemoveConnectionInstance: TCP disconnect 2024/07/02 18:06:24 2024/07/02 18:02:35 *** Connected to RMS Relay 2024/07/02 18:02:35 K7OTR K7OTR 2024/07/02 18:02:35 ;SQ: 70554751 2024/07/02 18:02:35 ;SR: 48402701 28124500 43 2024/07/02 18:02:35 Network post office. Messages will be stored locally for pickup by the recipient. 2024/07/02 18:02:35 ;WL2K-Post_Office 2024/07/02 18:02:35 [WL2K-3.2.10.0-B2FWIHM$] 2024/07/02 18:02:35 ;PQ: 84653817 2024/07/02 18:02:35 K7OTR RMS Relay > 2024/07/02 18:06:24 *** Disconnected from RMS Relay ```

So we are missing the ;FW: K7OTR and thereafter which is supposed to come from Pat over the air.

When running Pat with DEBUG_MODE=1 and ARDOP_DEBUG=1 envars, pat sees the following from ardopcf's data buffer:

Pat/ardopcf log

``` Network post office. Messages will be stored locally for pickup by the recipient. ;WL2K-Post_Office [WL2K-3.2.10.0-B2FWIHM$] ;PQ: 84653817 K7OTR RMS Relay > > 2024/07/02 11:36:30 <-- PTT TRUE [ardop.ctrlMsg{cmd:"PTT", value:true}] 2024/07/02 11:36:30 frame PTT FALSE 2024/07/02 11:36:30 <-- PTT FALSE [ardop.ctrlMsg{cmd:"PTT", value:false}] 2024/07/02 11:36:30 frame BUFFER 72 2024/07/02 11:36:30 <-- BUFFER 72 [ardop.ctrlMsg{cmd:"BUFFER", value:72}] >FF 2024/07/02 11:36:30 frame BUFFER 75 2024/07/02 11:36:30 <-- BUFFER 75 [ardop.ctrlMsg{cmd:"BUFFER", value:75}] 2024/07/02 11:36:31 frame NEWSTATE IRStoISS 2024/07/02 11:36:31 <-- NEWSTATE IRStoISS [ardop.ctrlMsg{cmd:"NEWSTATE", value:0x0}] 2024/07/02 11:36:31 frame STATUS QUEUE BREAK new Protocol State IRStoISS 2024/07/02 11:36:31 <-- STATUS QUEUE BREAK new Protocol State IRStoISS [ardop.ctrlMsg{cmd:"STATUS", value:"QUEUE BREAK new Protocol State IRStoISS"}] 2024/07/02 11:36:31 frame PTT TRUE 2024/07/02 11:36:31 <-- PTT TRUE [ardop.ctrlMsg{cmd:"PTT", value:true}] 2024/07/02 11:36:32 frame PTT FALSE 2024/07/02 11:36:32 <-- PTT FALSE [ardop.ctrlMsg{cmd:"PTT", value:false}] ```

We see that ardopcf has 75 bytes in its outgoing buffer, and then when we send it, we go into the proper idle loop:

ardopcf debug log

``` 18:36:33.205 Command Trace TO Host NEWSTATE ISS 18:36:33.205 [ARDOPprotocol.SendData] DataToSend = 75 bytes, In ProtocolState ISS 18:36:33.205 [ARDOPprotocol.GetNextFrameData] No shift, Frame Type: 4PSK.500.100.E 18:36:33.205 Sending Frame Type 4PSK.500.100.E 18:36:33.205 [Main.KeyPTT] PTT-TRUE 18:36:35.776 Tx Time 4460 Time till end = 1888 18:36:37.705 [Main.KeyPTT] PTT-FALSE 18:36:37.737 Soft Clips 1047 18:36:38.256 Prior-Offset= -0.348522 18:36:38.357 EnvelopeCorrelator CorMax:MaxProd= 61.286213 J= 1 18:36:38.357 [Acquire2ToneLeaderSymbolFraming] intIatMinError= 1 18:36:38.550 Frame Decode type e8 e8 f6 Dist 0.01 0.01 0.21 Sess 3e pend 0 conn 1 lastsess 0 18:36:38.550 [Frame Type Decode OK ] Ldr; S:N(3KHz) Early= -4.509895 dB, Full -8.936739 dB, Offset= -43.703991 Hz: MD Decode;8 ID=H3E, Critical Type=HE8: DataACK, D1= 0.01, D2= 0.01 18:36:38.550 [DecodeFrame] Frame: DataACK 18:36:38.550 Time since received = 0 18:36:38.800 Command Trace TO Host BUFFER 0 18:36:38.800 [ARDOPprotocol.ComputeQualityAvg] Initialize AvgQuality= 54 18:36:38.800 Command Trace TO Host NEWSTATE IDLE 18:36:38.800 Command Trace TO Host BUFFER 0 18:36:38.800 Sending Frame Type IDLE 18:36:38.800 [Main.KeyPTT] PTT-TRUE 18:36:38.859 Tx Time 490 Time till end = 432 18:36:39.305 [Main.KeyPTT] PTT-FALSE 18:36:39.337 [ARDOPprotocol.SendData] Send IDLE with Repeat, Set ProtocolState=IDLE ```

The data that pat loads into the ardopcf buffer, below IS transmitted over the air properly. (75 bytes)

;FW: K7OTR
[Pat-0.16.0-B2FHMG$]
;PR: 77535409
; K7OTR DE K7OTR (CN88Qi)
FF

ardopcf is awaiting a BREAK frame from the RMS Trimode/ARDOP_WIN.

The RMS never BREAKs because it never got the 75 bytes from ARDOP_WIN that was sent via ardopcf via Pat.

I think this suggests an issue on the RMS side (with ARDOP_WIN).

The ARDOP_WinTNC logfile

Details

18:36:40.786 [Frame Type OK]9 BREAK, D1=.04, D2=.01 Ldr; S:N(3KHz) Early=-7dB, Full=-4dB, Offset=42.7Hz 18:36:40.786 [DecodeFrame] BREAK PASS 18:36:40.787 [Main.SendFrame] Send DataACK_Q100 delayed 100 ms 18:36:40.787 [ARDOPprotocol.ProcessRcvdARQFrame] BREAK Rcvd from IDLE, Go to IRS, Substate IRSfromISS 18:36:40.889 [Main.MainPoll4] objProtocol.Repeating=True ProtocolState=IRS blnFramePending=True next Play in 3 ms 18:36:40.889 [Main.KeyPTT] True 18:36:40.889 [PlaySoundStream] DataACK_Q100 18:36:40.889 [Main.PlaySoundStream] set blnSoundStreamPlay to True 18:36:41.355 [MainPoll] Play stop. Length = 466 ms 18:36:41.356 [Main.KeyPTT] False 18:36:41.356 GetNextARQFrame intPingRpt=0 18:36:42.336 [Frame Type OK]10 4PSK.500.100.E, D1=.06, D2=.05 Ldr; S:N(3KHz) Early=-9dB, Full=-4dB, Offset=42.6Hz 18:36:46.346 [CorrectPhaseForTuningOffset] 4PSK AvgOffsetBeginning=24 AvgOffsetEnd=26 AccOffsetCnt=362/396 18:36:46.347 [CorrectPhaseForTuningOffset] 4PSK AvgOffsetBeginning=20 AvgOffsetEnd=15 AccOffsetCnt=358/396 18:36:46.347 [DecodeFrame] 4PSK.500.100.E FAIL, Quality= 54 18:36:46.348 [Main.SendFrame] Send DataACK_Q54 delayed 100 ms 18:36:46.348 [ARDOPprotocol.ProcessRcvdARQFrame] Data Decode Failed but Frame Type matched last ACKed. Send ACK, data already passed to host. 18:36:46.485 [Main.MainPoll4] objProtocol.Repeating=True ProtocolState=IRS blnFramePending=True next Play in 38 ms

We have the line "Data Decode Failed but Frame Type matched last ACKed. Send ACK, data already passed to host." (no it wasn't)

ARDOP_WIN sends a DataACK frame despite it not decoding correctly, because the frame type of the last sent was the same. Here's the whole log: ARDOP_WinTNC_Debug_20240702.log

I think this is an issue with ARDOP_WIN and also I have noticed this behavior in ardopcf when testing my application hamChat (sometimes ardopcf simply wouldn't pass a decoded frame to the host, even if it was a different one) because it would think it was a duplicate.

Dinsmoor commented 2 months ago

It looks like the code for this section is likely identical to the WIN_ARDOP implementation, so this bug would affect ardopc and ardopcf as well, since it was last revised in 2016.

ARQ.c:1646-1662

// handles Data frame which did not decode correctly but was previously ACKed to ISS  Rev 0.4.3.1  2/28/2016  RM
// this to handle marginal decoding cases where ISS missed an original ACK from IRS, IRS passed that data to host, and channel has
// deteriorated to where data decode is now now not possible.

if ((!blnFrameDecodedOK) && intFrameType == bytLastACKedDataFrameType)
{
    // Send ACK
    blnEnbARQRpt = FALSE;
    if ((EncLen = EncodeDATAACK(intLastRcvdFrameQuality, bytSessionID, bytEncodedBytes)) <= 0) {
        WriteDebugLog(LOGERROR, "ERROR: In ProcessRcvdARQFrame() for repeat DataACK Invalid EncLen (%d).", EncLen);
    return;
    }
    Mod4FSKDataAndPlay(bytEncodedBytes[0], &bytEncodedBytes[0], EncLen, LeaderLength);  // only returns when all sent
    WriteDebugLog(LOGINFO, "[ARDOPprotocol.ProcessRcvdARQFrame] Data Decode Failed but Frame Type matched last ACKed. Send ACK, data already passed to host. ");

    }

I don't think that this logic works properly for flow control. I don't think that bytLastACKedDataFrameType is always properly reset to 0 after the end of an ARQ session.

If we look between ARDOP_WinTNC_Debug_20240702.log line 2459 and 3069, it's the same 4PSK.500.100.E data frame BETWEEN ARQ SESSIONS.

InitializeConnection(); in ARQ.c is supposed to reset this, but when going to the DISC state (in between arq sessions), it doesn't seem to, in the special case that we never decode a DISC frame but we also never timeout, like if the calling station disconnected early, (dirty disconnect) and then tried to reconnect, and then the same data frame is used again.

Setting bytLastACKedDataFrameType = 0; in void SetARDOPProtocolState()'s DISC switch case might fix this for ardopcf.

In a normal transaction, this is reset when we decode a DISC frame, because InitializeConnection() is called.

But I'm in a little bit over my head. @pflarue if you have any insight, you're a lot more familiar with this than I am. I am going to take a break.

Dinsmoor commented 2 months ago

I'll do a summary here since my last few comments were mostly exploratory on how I got to the conclusion I did.

I dug into this more today with a fresh set of eyes. I this bug was already fixed in the ardopc source on 3/15/2018 (ARQ.c:318), but it looks like it was never backported to the parent windows implementation. ardopcf does not has this bug, because it is reset whenever each station switches from being the ISS or IRS.

Disclaimer: This issue is present in the Windows ARDOP implementation that ships with RMS Trimode. Maybe in the future we can get the windows implementation source code and patch it for distribution, or see if the Winlink guys would be willing to use ardopcf instead in the future (it has significantly better decoding performance and a lot of other bug fixes).

But for anyone interested in this, otherwise, here's a description of it:

This bug happens in the second of two subsequent ARQ sessions, when:

For example: the last data frame decoded in session 1 is 4FSK.200.100.E, it is successfully decoded. The first data frame identified but NOT decoded in session 2 is also 4FSK.200.100.E. A DataACK is mistakenly sent instead of a DataNAK, and this breaks the B2F protocol. RMS Trimode is endlessly waiting for the calling station to identify itself, so it never loads data into the buffer to send back to the calling station, ergo the endless idle loop.

The root cause of this is because when ARDOP_WIN goes into the DISC or switches between being ISS/IRS, the variable bytLastACKedDataFrameType isn't always reset.

I'm retracting my PR because ardopcf does not have this bug, the windows implementation that ships with RMS trimode does.