obgm / libcoap

A CoAP (RFC 7252) implementation in C
Other
786 stars 422 forks source link

Client fails to ACK a retransmitted CON message #1306

Closed pptz closed 7 months ago

pptz commented 8 months ago

Environment

libcoap Configuration Summary

I can provide details if needed, but I don't think anything other than me using an unreliable protocol (DTLS) to establish resource observation by a client matters here. The version I observe the issue for is v4.3.1-dirty, but the relevant section of the source (https://github.com/obgm/libcoap/blob/develop/src/coap_net.c#L3438) seems to be identical to development tip.

Problem Description

Resource observation subscription over unreliable transport protocol is sometimes dropped while the session is OK.

Expected Behavior

https://datatracker.ietf.org/doc/html/rfc7252#section-4.5: The recipient SHOULD acknowledge each duplicate copy of a Confirmable message using the same Acknowledgement or Reset message but SHOULD process any request or response in the message only once.

Actual Behavior

A duplicate message from a server is correctly processed by the client only once, but wrongly only generates an ACK/RST once, leading the server to remove the observing client from the resource observers list if it does not receive that single ACK (which may occur due to network issues).

Steps to reproduce

Suppose a client C subscribes to observe a resource on a server S using an underlying unreliable protocol (DTLS in my case). Every six (6) resource updates (by the way, the value of 6 seems to be hardcoded? Couldn't find a way to configure it), S makes its message confirmable, expecting an ACK in response. Suppose S sends such a message M with id MID, it is processed by C, and an ACK is sent by C. Suppose further that the ACK to M is not delivered to S (e.g. network loss). This makes S retransmit M (with the same MID). Now consider the code around https://github.com/obgm/libcoap/blob/develop/src/coap_net.c#L3438 The first time M arrives, the check rcvd->mid == session->last_con_mid fails, and last_con_mid is set to MID in line 3446. The next time M arrives, the above check passes, and M is silently dropped without being acknowledged. This leads S to retransmit M until the max threshold (4 by default), and then give up and drop the subscription.

Code to reproduce this issue

Since network loss of an ACK packet is not easy to reproduce, special code may need to be put together in order to circumvent sending an ACK by C at some point, and then observe system behaviour. Though code inspection seems sufficient to observe the bug.

Debug Logs

The sequence of events described under "Steps to Reproduce" was inferred from debug-level logs on the Server and Client (and also inserting debug prints in the LOC mentioned above).

Other items if possible

A proper fix may include remembering whether C generated an ACK or a RST in response to M, and in line 3443 calling either coap_send_ack() or coap_send_rst() before returning. I verified that inserting coap_send_ack() (which is not fully correct in the general case, I believe, but is fine in mine) solves the issue for me.

mrdeep1 commented 7 months ago

Thanks for raising this.

To help reproduce this, there is a -l (lower L) option supported by the examples coap-client and coap-server. This can be used to define the transmit loss on UDP sessions on a per packet by packet basis which may help accurately reproducing the issue. So, here, I would expect the coap-client to be told to drop certain ACK response packets. If you cou try this out an report back, that would be great.

FYI: COAP_OBS_MAX_NON is used to determine how many NONs can be sent before a CON is sent. It is also possible to set COAP_RESOURCE_FLAGS_NOTIFY_NON_ALWAYS (violates RFC7252 though) for the resource flags when the resource is being set up so that NONs are always sent.

pptz commented 7 months ago

Thanks for raising this.

To help reproduce this, there is a -l (lower L) option supported by the examples coap-client and coap-server. This can be used to define the transmit loss on UDP sessions on a per packet by packet basis which may help accurately reproducing the issue. So, here, I would expect the coap-client to be told to drop certain ACK response packets. If you cou try this out an report back, that would be great.

FYI: COAP_OBS_MAX_NON is used to determine how many NONs can be sent before a CON is sent. It is also possible to set COAP_RESOURCE_FLAGS_NOTIFY_NON_ALWAYS (violates RFC7252 though) for the resource flags when the resource is being set up so that NONs are always sent.

Thanks for pointing me to coap_debug_set_packet_loss. Using it on the subscribing client, I can observe the issue very clearly, as well as verify that my proposed solution works reliably in my test environment.

Thanks for pointing to COAP_OBS_MAX_NON as well.

mrdeep1 commented 7 months ago

@pptz. Thanks for confirming that you can reliably verify the issue. It would be good to see your code changes fix (even if for 4.3.1), or for you to raise a PR so that we can get the code changes into the current develop branch.

pptz commented 7 months ago

@mrdeep1 Well, I pushed a PR that reflects my understanding. I tested a similar fix on 4.3.1 and it works well, though I only tested the ACK case, not the RST case. Hope I didn't make silly mistakes in the logic. Note the addition of a field in struct coap_session_t to track whether a duplicate CON should receive an ACK or a RST.

mrdeep1 commented 7 months ago

Thanks for raising the PR. Apart from a minor addition, it looks fine to me.