sfzhi commented 4 months ago

Environment

Build System: [CMake]
Operating System: [Linux]
Operating System Version: [N/A]
Hosted Environment: [None]

libcoap Configuration Summary

PACKAGE VERSION..................4.3.4
PACKAGE SOURCE...................548370d-dirty
LIBRARY API VERSION..............3
LIBRARY ABI VERSION..............3.1.2
ENABLE_DTLS:.....................ON
ENABLE_TCP:......................ON
ENABLE_IPV4:.....................ON
ENABLE_IPV6:.....................ON
ENABLE_AF_UNIX:..................ON
ENABLE_WEBSOCKETS:...............OFF
ENABLE_Q_BLOCK:..................ON
ENABLE_CLIENT_MODE:..............ON
ENABLE_SERVER_MODE:..............ON
ENABLE_OSCORE:...................ON
ENABLE_ASYNC:....................ON
ENABLE_DOCS:.....................OFF
ENABLE_EXAMPLES:.................OFF
DTLS_BACKEND:....................default
WITH_GNUTLS:.....................ON
WITH_TINYDTLS:...................OFF
WITH_OPENSSL:....................OFF
WITH_MBEDTLS:....................OFF
HAVE_LIBTINYDTLS:................
HAVE_LIBGNUTLS:..................1
HAVE_LIBOPENSSL:.................
HAVE_LIBMBEDTLS:.................
WITH_EPOLL:......................ON
WITH_OBSERVE_PERSIST:............ON
BUILD_SHARED_LIBS:...............ON
MAX_LOGGING_LEVEL:...............8
WARNING_TO_ERROR:................OFF
CMAKE_C_COMPILER:................/usr/bin/cc
CMAKE_CXX_COMPILER_ID:...........GNU
CMAKE_BUILD_TYPE:................Debug
CMAKE_SYSTEM_PROCESSOR:..........x86_64
CMAKE_HOST_SYSTEM_NAME:..........Linux
CMAKE_GENERATOR:.................Unix Makefiles

Problem Description

When sending a GET request with COAP_OBSERVE_CANCEL to cancel previously established resource observation, coap_send(...) "automagically" replaces the request token (set with coap_add_token(...)) with a nonsensical value (01000000000001) before actually sending the request. As a result, the server does not recognize this request as a cancellation of an existing resource observation, because the token does not match anything the server is aware of.

Expected Behavior

If I call coap_add_token(...) to set the request token, it means I know what token needs to be there and I expect exactly that token to be sent to the server.

Actual Behavior

The server receives token (01000000000001) that has no meaning to it.

Steps to reproduce

Create request using coap_new_pdu(...).
Generate unique random token (without using coap_session_new_token(...)).
Call coap_add_token(...) to add the token to the request.
Add COAP_OPTION_OBSERVE with the value COAP_OBSERVE_ESTABLISH.
Add resource URI path using coap_add_uri_path(...).
Call coap_send(...) and enter/resume I/O processing loop.
Watch the request being sent to the server with the token set in step 3.
Wait for response from the server confirming established resource observation.
Create new request using coap_new_pdu(...).
Call coap_add_token(...) to add the same token generated in step 2.
Add COAP_OPTION_OBSERVE with the value COAP_OBSERVE_CANCEL.
Add the same resource URI path as in step 5 using coap_add_uri_path(...).
Call coap_send(...) and enter/resume I/O processing loop.
Watch the request being sent to the server with 01000000000001 as the token instead of the value set in step 9.

Debug Logs

Nothing is logged except for this one line, confirming that the worng token is :

v:1 t:CON c:GET i:6736 {e07c2f6667458b6b} [ Observe:, Uri-Path:data, Request-Tag:0x7f42623c ]
v:1 t:ACK c:2.05 i:6736 {e07c2f6667458b6b} [ Observe:2 ]
...
v:1 t:CON c:GET i:6737 {01000000000001} [ Observe:1, Uri-Path:data, Request-Tag:0x7f42623d ]
v:1 t:ACK c:2.05 i:6737 {01000000000001} [ ]

sfzhi commented 4 months ago

The "magic" token replacement occurs here. I have trouble understanding what that piece of code is trying to achieve. The comments around it don't seem to make much sense to me:

"See if this token is already in use for large body responses" - What if there haven't been any large body responses? Then it should not find any matches, right? But it does find a match anyway.
"Need to update token to server's version" - What on Earth is "server's version"? In my case, there has ever been only one token in existence, no different "versions" of it. And in general, where would that "version" come from?

mrdeep1 commented 4 months ago

@sfzhi Thanks for looking into this.

The intention here is that RFC9175 Updated Token Processing Requirements for Clients is followed.

In particular if the Client request's payload spans multiple blocks, then an unique token must be used for each block sent.

Likewise, if a response's payload spans multiple blocks, the client must use a unique token when asking for the subsequent blocks.

So the intent behind the logic here is to generate tokens that should be unique for each request. This is done (assuming tokens are up to a maximum of 8 bytes - they can be larger) by taking the 48 bits of the client initially provided token and using the top 16 bits as a sequence number for each individual block request.

So, with your code, step 10 should normally never be used (apart from an Observe Cancel)

The easiest way to cancel an Observe (which handles an Observe Fetch request where there are multiple blocks in the fetch payload) is to use coap_cancel_observe() with the token generated in step 2 as that also tracks the appropriate options that need to be a part of the observe cancellation request.

In your case, the lg_crcvhas been set up as it is unknown whether any unsolicited observe responses are going to be sent back as a multi-block payload. This then sets up the stateless token, in your case, 01000000000001 as you have not initialized the libcoap logic using coap_session_init_token() to some other value.

When this unique token work was done initially, the generated stateless token was use in the GET request to set up Observe. And, so the code you have found was reverting the token to generated stateless token when doing the Observe cancel.

Subsequently, it was decided to send the original token for the initial request (setting up the Observe) as people were getting confused when debugging as to why the token was getting changed, but this piece of code was not updated. The particular piece of code you have found has a bug in it which I will look at getting fixed. It was not picked up as people were in general using coap_observe_cancel().

Do you still see the issue when using coap_cancel_observe(3) instead of steps 9 - 13?

mrdeep1 commented 4 months ago

@sfzhi Please test #1369 for a fix and let us know how it goes.

sfzhi commented 4 months ago

@mrdeep1 Thanks for the detailed explanation. Apparently, coap_cancel_observe(...) works as expected. I have to admit I wasn't aware of that function. Strangely, it's not mentioned in the Resource Observation section of the manual.

The PR works fine too. Although I have some doubts whether calling coap_cancel_observe(...) from coap_send(...) is the desired solution. It basically replaces the request composed by the caller with a clone of the original subscription request, differing only in the Observe option value. I guess in most practical cases that will be fine, but what if the client wants the cancellation request to be somewhat different? For example, it might want to include an application-specific option to indicate to the server the reason for cancellation - I'm just making things up. To me this kind of "bait and switch" approach violates the principle of least surprise. I think it would be reasonable to expect the user of the library to be able to send any request they wish, no matter how "wrong" that request might be, as long as it is syntactically correct. Wouldn't you agree?

As a side note, I noticed that coap_cancel_observe(...) reuses, among other things, the Request-Tag option from the original subscription request. RFC9175 says: "The Request-Tag is intended for use as a short-lived identifier for keeping apart distinct block-wise request operations on one resource from one client...". To me subscription and cancellation are two distinct operations, so they should not have the same Request-Tag. Am I missing something?

By the way, according to the same RFC, the Request-Tag option is meant only for block-wise requests and clients are encouraged to not include it unless necessary. As far as I have noticed, libcoap includes it in every request, even when it's known to not involve anything block-wise. Is that desired or useful in any way?

mrdeep1 commented 4 months ago

Although I have some doubts whether calling coap_cancel_observe(...) from coap_send(...) is the desired solution

Yes, I was thinking about that too. One can replicate code in coap_send() that handles all necessary use cases that coap_cancel_observe()currently does, but that to me was a bit of overkill and likely to lead to issues in the future where only 1 of the 2 code areas were updated (as you stumbled into this issue). I also did not like the fact that a new PDU was generated.

I am thinking about a new coap_cancel_observe_internal() which both coap_send() and coap_cancel_observe() call after the initial PDU has been set up - and this function checks that all the required options etc are in place.

If the Request-Tag is there in the initial setting up of the Observe, it is there as a part of the Cache Key that the server needs to match against when cancelling the Observe request. So, re-using the same request-tag is valid here.

There has been much debate over the usage of Request-Tag. I successfully argued that RFC9175 was made more generic in the use of Request-Tag (https://github.com/core-wg/echo-request-tag/issues/77#issuecomment-1026011003).

The issue is that if there is no Block1 in the request, 2 requests to the same URI have been sent (the server data has changed between the issue of the 2 requests) and the server is sending back the 2 responses separately using Bock2. When a client sends the request for the next block - does the server send back the response for stream 1 or stream 2?

Or if an unsolicited Observe response starts, the server data changes and a second Observe response starts (using a different Etag). Using Etag in a request to send the next block may not send any data back (because data has not changed for that particular Etag). Without Request-Tag, there is no way that a client can ask for the next data block of the appropriate stream.

So libcoap, by default, always sends the request tag. However, you can use COAP_BLOCK_NO_PREEMPTIVE_RTAGin coap_context_set_block_mode() if you are certain that there will be no Block2 transfer confusion.

sfzhi commented 4 months ago

I was thinking about having no magic in coap_send() at all (except for what may be needed to maintain any internal state) and send the request as-is, assuming the caller knows what they are doing. Those who prefer convenience over flexibility can always call coap_cancel_observe() directly.

Now I understand the reasoning behind always including the Request-Tag. Although, I don't see how it would help in your second example with two unsolicited Observe responses. On one hand, RFC9175 clearly forbids having Request-Tag in responses. On the other hand, even if responses were allowed to contain Request-Tag, there would be only one value both of them could use - the one from the original subscription request. How would that help the client to tell those two responses apart when asking for the next data block?

mrdeep1 commented 4 months ago

Unfortunately, coap_send() adds in Request-Tag if not there, allocating a new value. So, if there is no magic, Observe setup will use one Request-Tag value and Observe cancel will use a different Request-Tag value - which is a no-no. So need to think this through further.

Yes, you are right about the second example - too much multi-tasking and not thinking things though! In this case, the client would be timing out the Etag from the first unsolicited Observe response as the new Etag only would be coming in.

I will get coap_cancel_observe() added into https://libcoap.net/doc/reference/develop/group__observe.html.

mrdeep1 commented 3 months ago

@sfzhi I have found that for the simple usage of GET to set up an Observe request, the Request-Tag needs to be corrected in coap_send() when doing the Observe cancel.

If FETCH is used to set up an Observe request, its data must be identical in the Observe cancellation case. Furthermore, if Block1 is used for setting up the request, the usage of tokens is important. This means that he Block1 used for the cancellation needs to be of the same block size. It is easier to call coap_cancel_observe() in this case from coap_send() rather than replicate all the use cases.

Updated code pushed in #1369, which also moves the group for coap_cancel_observe() in the documentation.

sfzhi commented 3 months ago

@mrdeep1 Great, works for me. Thank you!

mrdeep1 commented 3 months ago

I still am a bit concerned in letting through the simple GET case when using coap_send() to cancel an Observe request rather the code always invokes coap_cancel_observe() (as it does for the non simple cases).

The reason for this is if something 'special' is added to the GET request (in terms of options etc.) then there is a good chance that a 'standard' CoAP server would not do the cancellation as the Cache-Key does not match. One has to assume that not all users are 100% CoAP literate.

As per RFC 7641 3.6 Cancellation

In this case, a client MAY explicitly deregister by issuing a GET
request that has the Token field set to the token of the observation
to be cancelled and includes an Observe Option with the value set to
1 (deregister).  All other options MUST be identical to those in the
registration request except for the set of ETag Options.

so there is no room for anyone trying to be clever in the cancellation.

I will be putting back the change that always calls coap_cancel_observe(), but keeping the other bug fixes I found when testing this code out.

sfzhi commented 3 months ago

I understand your concern. I just don't feel good about the idea of something happening behind my back resulting in something else being sent than what I intended to. If there is really no good use case for wanting to compose a custom cancellation request, perhaps it would be better to have coap_send() reject those requests entirely, forcing everyone to use coap_cancel_observe() by showing an error message to that end? Technically, that would be an API breakage, but considering that nobody has complained about coap_send() not working properly in this case until now, I don't think many people would notice.

mrdeep1 commented 3 months ago

There is likely to be a reasonable number of older apps using libcoap where they have hard coded the PDU to do the Observe cancellation, as this would be from before coap_observe_cancel() came on the scene.

With the more recent introduction of Request-Tag, the current (before the PR associated with this issue) code would be using different Request-Tags for the Establish and Cancel PDUs, potentially breaking the server actually doing the Cancel (it so happens that the libcoap CoAP Server still does the Cancel). That is a bug that needs to be fixed which came to light when looking at this issue.

Similarly, the incorrect Token update you found is a bug. I don't want there to be any other bugs in Cancel.

If coap_send() refused to work, forcing users to use coap_cancel_observe(), that would break the API forcing users to modify their (old) code to get it to work. I do not think that is a good move.

One of the goals here with libcoap code changes is that the API and ABI should not broken. If broken, then there should be a bug fix.

mrdeep1 commented 3 months ago

If coap_context_set_block_mode() is not called (i.e. COAP_BLOCK_USE_LIBCOAPis not set), then libcoap does not do any checks / updates on the PDU passed to coap_send(). This would allow for some strange, custom, Observe cancellation should it be required.

mrdeep1 commented 3 months ago

1369 Has now been merged. Can this issue now be closed?

sfzhi commented 3 months ago

Yes, I think so.

obgm / libcoap

Request token is unexpectedly replaced with nonsense #1368

Environment

libcoap Configuration Summary

Problem Description

Expected Behavior

Actual Behavior

Steps to reproduce

Debug Logs

1369 Has now been merged. Can this issue now be closed?