obgm / libcoap

A CoAP (RFC 7252) implementation in C
Other
790 stars 422 forks source link

Unexpected token after long running block transfer #1465

Closed anyc closed 2 months ago

anyc commented 2 months ago

Hello,

I am currently investigating an issue with my libcoap-based server and client. The client sends a firmware update to the server and waits for a response if the update was successful. After the file was received, the server will process the update and send the answer after the update finished. On the client side, after the update finished I first see a XMIT_BLOCK_FAIL event and then I receive a PDU with an unexpected token. To me it looks like the original token is not correctly restored. Please see the following example where you can see the last transmitted block and then the errors:

Jul 11 18:38:14.548 DEBG ** 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: mid=0xdcf4: added to retransmit queue (2500ms)
Jul 11 18:38:14.548 DEBG *  192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: netif: recv   46 bytes
v:1 t:ACK c:2.31 i:dcf4 {4c90000000000003} [ Block1:85135/M/1024 ]
Jul 11 18:38:14.548 DEBG ** 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: mid=0xdcf4: removed (1)
Jul 11 18:38:14.548 DEBG found Block option, block size is 1024, block nr. 85135
Jul 11 18:38:14.548 DEBG *  192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: netif: sent  599 bytes
Jul 11 18:38:14.548 DEBG *  192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: dtls:  sent  570 bytes
v:1 t:CON c:GET i:dcf5 {4c91000000000003} [ Uri-Path:v1, Uri-Path:system, Uri-Path:fw_update, Block1:85136/_/1024, Size1:87179784, Request-Tag:0x030c538f ] :: binary data length 520
<<00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000>>
<<. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . >>
Jul 11 18:38:14.548 DEBG ** 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: mid=0xdcf5: added to retransmit queue (2813ms)
Jul 11 18:38:14.549 DEBG *  192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: netif: recv   33 bytes
v:1 t:ACK c:0.00 i:dcf5 {} [ ]
Jul 11 18:38:14.549 DEBG ** 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: mid=0xdcf5: removed (1)
Jul 11 18:38:51.449 DEBG ** 0.0.0.0:45416 <-> 224.0.1.187:5683 UDP : lg_crcv 0x135c840 released
Jul 11 18:38:51.450 DEBG ** 0.0.0.0:52842 <-> 224.0.1.187:5683 UDP : lg_crcv 0x132a850 released
Jul 11 18:38:51.450 DEBG ** 0.0.0.0:56228 <-> 224.0.1.187:5683 UDP : lg_crcv 0x132bb80 released
Jul 11 18:39:03.855 DEBG ** 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: lg_crcv 0x13e3cb0 released
Jul 11 18:39:47.552 DEBG ** 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: lg_xmit 0x13dfb10 released
Jul 11 18:39:47.552 DEBG ***EVENT: COAP_EVENT_XMIT_BLOCK_FAIL
Jul 11 18:39:54.906 DEBG *  192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: netif: recv   54 bytes
v:1 t:CON c:2.05 i:aeb5 {4c91000000000003} [ Block1:85136/_/1024 ] :: 'success'
unexpected rxtoken 216172782113821004 b'L\x91\x00\x00\x00\x00\x00\x03'
Jul 11 18:39:54.907 DEBG *  192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: netif: sent   33 bytes
Jul 11 18:39:54.907 DEBG *  192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: dtls:  sent    4 bytes
v:1 t:RST c:0.00 i:aeb5 {} [ ]

As far as I understood the process, libcoap only removed one byte from the token and not two bytes. Did I understand this correctly? If yes, do you have an idea what I might be doing wrong here? If I use coap-client I do not get an error regarding a wrong token but after looking at the coap-client code, I did not see something obvious.

On the server side I use block mode COAP_BLOCK_USE_LIBCOAP and the client uses COAP_BLOCK_USE_LIBCOAP | COAP_BLOCK_SINGLE_BODY.

Thank you!

anyc commented 2 months ago

Ah, I missed the 'L' character in my output. So the token still contains both bytes that are used for the block transfer.

anyc commented 2 months ago

If it helps, the output of coap-client:

Jul 12 11:02:19.859 DEBG ** [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: mid=0xcd14: removed (1)
Jul 12 11:02:19.859 DEBG found Block option, block size is 1024, block nr. 85135
Jul 12 11:02:19.859 DEBG *  [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: netif: sent  599 bytes
Jul 12 11:02:19.859 DEBG *  [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: dtls:  sent  570 bytes
v:1 t:CON c:GET i:cd15 {4c91000000000003} [ Uri-Path:v1, Uri-Path:system, Uri-Path:fw_update, Block1:85136/_/1024, Size1:87179784, Request-Tag:0xddef4589 ] :: binary data length 520
<<00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000>>
<<. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . >>
Jul 12 11:02:19.859 DEBG ** [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: mid=0xcd15: added to retransmit queue (2719ms)
Jul 12 11:02:19.859 DEBG *  [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: netif: recv   33 bytes
v:1 t:ACK c:0.00 i:cd15 {} [ ]
Jul 12 11:02:19.859 DEBG ** [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: mid=0xcd15: removed (1)
Jul 12 11:03:52.864 DEBG ** [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: lg_xmit 0x63b9b277ce60 released
Jul 12 11:03:52.864 DEBG ***EVENT: COAP_EVENT_XMIT_BLOCK_FAIL
Jul 12 11:04:02.047 DEBG *  [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: netif: recv   54 bytes
v:1 t:CON c:2.05 i:1d68 {4c91000000000003} [ Block1:85136/_/1024 ] :: 'success'
Jul 12 11:04:02.047 DEBG PDU presented to app.
v:1 t:CON c:2.05 i:1d68 {01} [ Block1:85136/_/1024 ] :: 'success'
Jul 12 11:04:02.051 DEBG ** [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: lg_crcv 0x63b9b277d440 released
Jul 12 11:04:02.051 DEBG ** process incoming 2.05 response:
successJul 12 11:04:02.051 DEBG *  [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: netif: sent   33 bytes
Jul 12 11:04:02.051 DEBG *  [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: dtls:  sent    4 bytes
v:1 t:ACK c:0.00 i:1d68 {} [ ]
Jul 12 11:04:02.051 DEBG *  [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: netif: sent   31 bytes
Jul 12 11:04:02.051 INFO *  [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: SSL3 alert write:warning:close notify
Jul 12 11:04:02.051 DEBG ***EVENT: COAP_EVENT_DTLS_CLOSED
Jul 12 11:04:02.051 DEBG ***[fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: session 0x63b9b2766e70: closed

The XMIT_BLOCK_FAIL seems to be "normal".

mrdeep1 commented 2 months ago

Several things here.

  1. We are not tracking all the blocks in the state token - so there will be a wrap around in the top 2 bytes in your case of a large file. Could be fixed, but does not help you here (85135 is 0x1c48f)
  2. I would expect you to be using PUT or POST for your file upload, not GET with a payload.
  3. It is the responsibility of the underlying libcoap client code to map the state token being used back to the original client token used at the start of the request - which is what has (correctly) happened here for coap-client (but not your client)
  4. In the client, the lg_xmittracking is released after MAX_TRANSMIT_WAIT (93 seconds) following the empty ACK response for the last transmitted block (No 2.xx code received). Hence COAP_EVENT_XMIT_BLOCK_FAIL.

I think the clue is (which does not happen with coap-client)

Jul 11 18:39:03.855 DEBG ** 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: lg_crcv 0x13e3cb0 released

Which has timed out only 49 secs after

Jul 11 18:38:14.549 DEBG *  192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: netif: recv   33 bytes
v:1 t:ACK c:0.00 i:dcf5 {} [ ]

Once lg_crcvgoes away, the logic has lost how to put back the correct token. However,

Jul 11 18:39:47.552 DEBG ** 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: lg_xmit 0x13dfb10 released

which happened 93 seconds later indicates no change to MAX_RETRANSMIT which is used by both the lg_crcv and lg_xmittimeout logic. Looking into why lg_crcvtimed out in your client code.

mrdeep1 commented 2 months ago

See #1469 which fixes the early timeout of lg_crcvafter 49 seconds (my guess is that the file transfer started 44 secs previously). Please confirm if this helps.

However, If the time before Success is received after receipt of the empty ACK is greater than MAX_RETRANSMIT (93 secs), then there are still likely to be issues.

It may help here to send something like "Installing" after receipt of the firmware, and then send an unsolicited "Success" (or "Failure" or whatever) when the operation has completed (without any Block1 options).

The Server will need to track the initial Token used for block 0 and then use this Token for the unsolicited response. The client will need to keep track of the initial Token to accept the unsolicited response.

Alternatively the server can maintain a status of the installation (perhaps on the same URI, but with ?status appended) which can then be "Observed" with the server sending the "Success" when ready (with Observe option removed to signal no more).

anyc commented 2 months ago

Thank you for looking into this!

Unfortunately, it looks the same to me:

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . >>
Jul 12 18:31:02.604 DEBG ** 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: mid=0xf0c2: added to retransmit queue (2406ms)
Jul 12 18:31:02.604 DEBG *  192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: netif: recv   33 bytes
v:1 t:ACK c:0.00 i:f0c2 {} [ ]
Jul 12 18:31:02.604 DEBG ** 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: mid=0xf0c2: removed (1)

Jul 12 18:31:33.273 DEBG ** 0.0.0.0:45388 <-> 224.0.1.187:5683 UDP : lg_crcv 0x13a4c00 released
Jul 12 18:31:33.274 DEBG ** 0.0.0.0:35935 <-> 224.0.1.187:5683 UDP : lg_crcv 0x19ae0c0 released
Jul 12 18:31:33.274 DEBG ** 0.0.0.0:45326 <-> 224.0.1.187:5683 UDP : lg_crcv 0x19ae970 released

Jul 12 18:32:35.608 DEBG ** 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: lg_crcv 0x1a333f0 released
Jul 12 18:32:35.611 DEBG ** 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: lg_xmit 0x1557340 released
Jul 12 18:32:35.611 DEBG ***EVENT: COAP_EVENT_XMIT_BLOCK_FAIL
Jul 12 18:32:49.701 DEBG *  192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: netif: recv   54 bytes
v:1 t:CON c:2.05 i:7f2e {4c89000000000005} [ Block1:85128/_/1024 ] :: 'success'
asd
unexpected rxtoken 360287970189674828 b'L\x89\x00\x00\x00\x00\x00\x05'
Jul 12 18:32:49.701 DEBG *  192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: netif: sent   33 bytes
Jul 12 18:32:49.701 DEBG *  192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: dtls:  sent    4 bytes
v:1 t:RST c:0.00 i:7f2e {} [ ]

Regarding:

It is the responsibility of the underlying libcoap client code to map the state token being used back to the original client token used at the start of the request - which is what has (correctly) happened here for coap-client (but not your client)

In the client code:

https://github.com/obgm/libcoap/blob/bfdbdc3e253099f7418dba7a31eb6dc1643dcf2f/examples/coap-client.c#L409

the response handler immediately checks if the token is in its list of tokens without applying a mask or something to ignore the most significant two bytes. Hence, I thought libcoap would prepare received_pdu->token for the application.

In the coap-client output, there seems to be another steps involved before the application callback is executed:

v:1 t:CON c:2.05 i:1d68 {4c91000000000003} [ Block1:85136/_/1024 ] :: 'success'
Jul 12 11:04:02.047 DEBG PDU presented to app.
v:1 t:CON c:2.05 i:1d68 {01} [ Block1:85136/_/1024 ] :: 'success'

I tried to follow the code flow around the DEBG line but I my knowledge about libcoap/CoAP is still too small to understand what happens why here.

I will try this again on Monday, thank you!

mrdeep1 commented 2 months ago

Now the lg_crcvis not being removed after 49 secs, but after 93 secs (MAX_TIMEOUT) so that bit is working.

So

Jul 12 18:31:02.604 DEBG *  192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: netif: recv   33 bytes
v:1 t:ACK c:0.00 i:f0c2 {} [ ]
... MAX_TIMEOUT (93 secs) later
Jul 12 18:32:35.608 DEBG ** 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: lg_crcv 0x1a333f0 released
Jul 12 18:32:35.611 DEBG ** 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: lg_xmit 0x1557340 released
Jul 12 18:32:35.611 DEBG ***EVENT: COAP_EVENT_XMIT_BLOCK_FAIL
... 107 seconds after empty ACK
Jul 12 18:32:49.701 DEBG *  192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: netif: recv   54 bytes
v:1 t:CON c:2.05 i:7f2e {4c89000000000005} [ Block1:85128/_/1024 ] :: 'success'

(107 secs > 93 secs) is why we are running into an issue.

In the coap-client output, there seems to be another steps involved before the application callback is executed:

If lg_crcvor lg_xmitare still available, then the original token is available and is used by the libcoap lib code to replace the token as indicated by DEBG PDU presented to app.. Is your client and coap-client running on the same server, or is coap-client running faster and so is able to do the transfer in less than 93 secs?

It appears that your client is also sending other (multicast) traffic as well, shown by the additional lg_crvc being removed.

Jul 12 18:31:33.273 DEBG ** 0.0.0.0:45388 <-> 224.0.1.187:5683 UDP : lg_crcv 0x13a4c00 released
Jul 12 18:31:33.274 DEBG ** 0.0.0.0:35935 <-> 224.0.1.187:5683 UDP : lg_crcv 0x19ae0c0 released
Jul 12 18:31:33.274 DEBG ** 0.0.0.0:45326 <-> 224.0.1.187:5683 UDP : lg_crcv 0x19ae970 released

This may be adding in additional delays which is taking the client over 93 secs.

You may want to consider changing MAX_RETRANSMIT (from which MAX_TIMEOUT is derived) from the default of 4 to 5 by using coap_session_set_max_retransmit(). Then MAX_TIMEOUT will become 186 secs.

anyc commented 2 months ago

The clients are running on the same machine, yes. My client also shows a simple GUI and initially sends a multicast PDU to discover servers in the network. I think there shouldn't be a performance difference between my client and coap-client.

So, am I understanding this correctly: if I use the async functions from libcoap, this just enables my client to postpone sending the response to a request and does not send some sort of "keepalive" message from time to time after the initial empty ACK response? So I could increase the timeout you mentioned but if the update procedure on the server could possibly take an arbitrary long time this is never truly safe and I should switch to a different signaling procedure like you suggested?

Thank you!

mrdeep1 commented 2 months ago

My suggestion is that you go for something like

POST /v1/system/fw_update
...
2.31
...
2.01 (or 2.04) Uploaded (Sent back as soon as upload has finished)
GET /v1/system/fw_update Set Observe option
2.05 Installing Observe count set
...
2.05 Rebooting Observe count set
...
2.05 Success Observe count set
GET /v1/system/fw_update Cancel Observe option
2.05 Success 

The Observe can then report back whenever there is a status change for the F/W update process for however long that process takes.

anyc commented 2 months ago

Sounds good, thank you!