Clarify handling of empty strings

simon-budig commented 2 years ago

There is an ambiguity in the specification regarding empty strings.

Properties with a string value might be empty; the specification states that "An empty string ("") is a valid payload".

Section 7.2 also specifies that "Devices can remove old properties and nodes by publishing a zero-length payload on the respective topics." This raises the question of how a message deleting a property differs from a message setting a string property to the empty string.

The specification needs to specify what bytes to set as payload to represent the empty string. One option might be to set the payload to a single "zero" byte to represent an empty string.
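
To make the ambiguity concrete, here is a minimal Python sketch. The helper name `naive_string_payload` is hypothetical and not part of any Homie library; it only shows that an empty string encoded without special casing yields the same zero-length payload that section 7.2 uses for deletion.

```python
def naive_string_payload(value: str) -> bytes:
    """Encode a string property value as UTF-8 with no special casing (hypothetical helper)."""
    return value.encode("utf-8")


# Zero-length payload that section 7.2 uses to remove a property or node.
DELETE_PAYLOAD = b""

# The empty string encodes to zero bytes, so a receiver cannot tell the two apart.
ambiguous = naive_string_payload("") == DELETE_PAYLOAD
```

With this naive encoding, `ambiguous` is true, which is exactly the gap this issue asks the specification to close.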

Tieske commented 1 year ago

Setting a single zero-byte or 'a null-terminated empty string' makes sense. The MQTT server will recognize it as a payload, and the Homie client can transparently replace it with an empty string for its internal state.

Tieske commented 1 year ago

PS. a single 0-byte is also valid UTF-8 in case someone wonders 😄

Tieske commented 1 year ago

In the PR for v5 the 0 byte was changed to a 1 value, to guard against errors with null-terminated strings.

simon-budig commented 1 year ago

uh, what?

I suggested the 0 byte because it is a string terminator in a ton of languages, so there is no confusion about how to interpret it. There is no special casing needed at all to use all of the posix string manipulating functions etc.

By arbitrarily changing this to 1 you just made the string consisting of one "SOH" control character (ascii 0x01) unrepresentable in homie.

Now, we can argue about the usefulness of that string, but it is certainly weird and outside of any convention to assign the "there is no string here" meaning to the ascii-0x01 character.

The whole point of using 0x00 here is to piggyback on existing, well-established conventions.

MQTT payloads always are just byte arrays, so any software implementing the homie convention has to validate the payload anyways (before treating the bytes as a string datatype), so I don't see what the actual benefit would be here to check for 0x01 instead of 0x00.

Tieske commented 1 year ago

reopening.

To me the essence was having an escape sequence to represent the empty string. Then there are two options: implement generic escaping, or use a value that's unlikely to occur (and make it someone else's problem if it does occur). This proposal takes the latter approach (because even 0x00 is a valid payload for Homie, since it's valid UTF-8).

> I suggested the 0 byte because it is a string terminator in a ton of languages, so there is no confusion about how to interpret it. There is no special casing needed at all to use all of the posix string manipulating functions etc.

The 0x00 byte transmitted over MQTT is not system data; it is an application-level payload being transmitted. And imho the convention is to use 0x00 in system data, but also to stay away from 0x00 for any application data.

It is indeed the terminator, and hence it is dangerous, because it means different things on different levels. Because strings in Homie are NOT null-terminated, it is always a special case. The underlying implementations must make special arrangements, because the in-memory layout will be two 0x00 bytes right after each other, and this could easily lead to bugs.

> By arbitrarily changing this to 1 you just made the string consisting of one "SOH" control character (ascii 0x01) unrepresentable in homie.

Why would this be unrepresentable in Homie? (Then 0x00 sure would suffer the same problem.) Afaik it is valid UTF-8, and can be implemented easily in any Homie lib: just a single check-and-exchange whenever a value is received.

> Now, we can argue about the usefulness of that string, but it is certainly weird and outside of any convention to assign the "there is no string here" meaning to the ascii-0x01 character.
>
> The whole point of using 0x00 here is, to piggy back on existing well established conventions.

It's just an escape placeholder. No more, no less. And as mentioned above, because others have many conventions around 0x00 (on system level), I think Homie would be safer to stay away from specifically that one (on application level).

0x00 and 0x01 are interchangeable on application level (= homie), and both will need an extra check, as with all escaping. Both are equally uncommon to be expected in existing Homie payloads. It's just that 0x00 is more dangerous, because of its meaning in the underlying system.
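
The collision both sides refer to can be sketched in a few lines of Python. The function names here are hypothetical and only illustrate the argument: whichever single byte is reserved as the placeholder, the one-character string made of that same byte ends up with the same payload as the empty string unless further escaping is added.

```python
def encode(value: str, placeholder: bytes) -> bytes:
    """Encode a string value, substituting the placeholder for the empty string."""
    return placeholder if value == "" else value.encode("utf-8")


def collides(placeholder: bytes) -> bool:
    """True if the one-character string equal to the placeholder encodes like ""."""
    same_char = placeholder.decode("utf-8")
    return encode(same_char, placeholder) == encode("", placeholder)


collision_with_nul = collides(b"\x00")  # placeholder 0x00 shadows the string "\x00"
collision_with_soh = collides(b"\x01")  # placeholder 0x01 shadows the string "\x01"
```

Both results are true, so the choice between 0x00 and 0x01 does not change how many strings become unrepresentable, only which one.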

schaze commented 1 year ago

I am not sure I understand the issue with the 0x00 value on application vs system level. As stated by @simon-budig mqtt payloads are byte arrays. This specific part of the convention only applies to string properties, correct? In this case a homie library simply needs to check if it received a single byte with value 0 and then set the internal string representation of the property to an empty string. Same goes for publishing an empty string in reverse. Neither does the transmitted data have to be valid utf-8 or even have any relation to a string. It is simply a 0 byte and therefore should be interpreted as an empty string. @Tieske: do you have a more specific example where you might get even a theoretical issue with this approach?
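
The check described above might look roughly like this in Python. It is a sketch under the assumption that the placeholder is a single zero byte as proposed in this thread; `encode_string` and `decode_string` are hypothetical names, not functions from an existing Homie library.

```python
EMPTY_STRING_PLACEHOLDER = b"\x00"  # assumption: the single zero byte proposed in this thread


def encode_string(value: str) -> bytes:
    """Build the payload to publish for a string property value."""
    if value == "":
        return EMPTY_STRING_PLACEHOLDER
    return value.encode("utf-8")


def decode_string(payload: bytes) -> str:
    """Turn a received payload back into the internal string value."""
    if len(payload) == 0:
        # A zero-length payload means removal (section 7.2), not a string value.
        raise ValueError("zero-length payload is not a string value")
    if payload == EMPTY_STRING_PLACEHOLDER:
        return ""
    return payload.decode("utf-8")
```

For example, `decode_string(encode_string(""))` gives back the empty string, while a zero-length payload is rejected here so the caller can treat it as a deletion instead.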

Tieske commented 1 year ago

Many devices will be built on low-level hardware on which 0-terminated strings are used. Not all Homie developers will be savvy enough to realize what this really means when the placeholder is a 0 byte, and hence they might run into weird bugs.

Sticking to 0x01 will prevent that from happening. Using 0x00 is a footgun waiting to happen.

That said, maybe we should create a more elaborate placeholder that is also easier to debug? Both 0x00 and 0x01 are hard to send in a manual test transmission, and probably won't show up very nicely in GUIs.

schaze commented 1 year ago

But wouldn't the 0x00 for these people simply represent an empty string anyway? Or do I just prove your point with my assumption? Do you really think someone writing code for an ESP or even a library will not be able to handle this specific case?

I agree with the debugging statement, however. A 0 byte is probably not easy to send in a script or in a general-purpose UI like MQTT Explorer. It would also be the only part of the convention that uses binary values instead of string values. However, I do not have a better suggestion tbh. I have struggled with clearing a string property myself before, and the current spec is lacking here.

Tieske commented 1 year ago

> But wouldn't the 0x00 for these people not simply represent an empty string anyways

You cannot assume that. And it's dangerous precisely because people will assume that (as demonstrated in this discussion).

Let me turn it around: what is the drawback of using 0x01?

schaze commented 1 year ago

> You cannot assume that.

I really fail to see why but I think it is not worth the discussion - I am fine either way as long as there is a clear way to publish an empty string.

Tieske commented 1 year ago

@Thalhammer wdyt?

simon-budig commented 1 year ago

Let me elaborate a bit more... :-)

As a preface I'd like to postulate, that a good specification should be boring and should not present unnecessary surprises to the implementer. It should stick to existing conventions as much as possible and should take the most straightforward approach possible.

Now, using a 0x01 byte to represent an empty string to me is very much a surprise, and looking at the faces of my coworkers when I discussed this suggestion with them I don't seem to be alone there...

I don't know of any language or protocol implementation where 0x01 is used in that sense. So there is no precedent for it, and I think that this is a strong argument against it.

Now, ideally we would be able to use an empty payload for an empty string, but the MQTT spec prevents us from doing that, so we need to figure out a workaround that is as non-disruptive as possible. And I believe that a single 0x00 byte is the best solution here, because it is precisely what all the languages in the C family use to represent the empty string. So for example you can simply strncpy() the payload to a target memory without testing for the special case. You can use strnlen() on the payload to figure out its length, strncmp() for comparisons, strndup() for duplicating the string... - it just works correctly as expected.
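
The C functions themselves cannot be shown in Python, but `ctypes` can approximate how a NUL-terminated string view treats each payload. This is a rough sketch, assuming the receiver copies the payload into a buffer and appends a terminator; `c_string_view` is a hypothetical helper.

```python
import ctypes


def c_string_view(payload: bytes) -> bytes:
    """Return what a C string function would see: the bytes before the first NUL."""
    buf = ctypes.create_string_buffer(payload)  # copies the payload and appends a NUL
    return buf.value


empty = c_string_view(b"\x00")   # the 0x00 placeholder reads as the empty C string
text = c_string_view(b"hello")   # an ordinary payload is unchanged
soh = c_string_view(b"\x01")     # a 0x01 placeholder reads as a one-character string

# In-memory layout of the 0x00 placeholder plus its terminator: two zero bytes.
raw = ctypes.create_string_buffer(b"\x00").raw
```

The last line also shows the layout raised earlier in the thread: the placeholder byte followed directly by the terminator.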

Using 0x01 instead will break this, without giving any additional convenience.

When I originally discovered this issue I was tempted to ask for zero-terminated strings in the payload, because this would at some points be more convenient for me as a C programmer (it saves some memcpy()ing to add a zero byte for further handling of the string). However, I realized that this additional zero byte would make it extremely cumbersome to deal with "normal" string content in a lot of other languages. Typically, though, these other languages (I am thinking about python/javascript etc.) allow for embedding a zero byte in a string constant, so specifying a payload containing a single zero byte is as easy as using a single 0x01 byte.

Personally I actually do believe that the homie convention should forbid "string" values with embedded zero-bytes. I don't see a real benefit for allowing them. We already restrict strings to the UTF8-encoding, so asking for strings to be simple in the sense that they must not contain zero-bytes (with the notable exception for representing the empty string, because the MQTT spec forces us to) would IMHO improve the spec. If an application absolutely needs payloads with zero-bytes we could offer a "bytearray" payload, where we even might lift the UTF8-requirement.
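
A receiver enforcing this proposed restriction could be sketched as follows. Note that this is only the rule suggested in this comment, not something the thread confirms as part of the convention, and `validate_string_payload` is a hypothetical name.

```python
def validate_string_payload(payload: bytes) -> str:
    """Decode a string payload, rejecting embedded zero bytes (proposed rule, not spec)."""
    if len(payload) == 0:
        # Zero-length payloads are reserved for removing properties and nodes.
        raise ValueError("zero-length payload is not a string value")
    if payload == b"\x00":
        # The one allowed use of a zero byte: the empty-string placeholder.
        return ""
    if b"\x00" in payload:
        raise ValueError("embedded zero byte in string payload")
    # Raises UnicodeDecodeError (a ValueError subclass) on invalid UTF-8.
    return payload.decode("utf-8")
```

Under this rule the single zero byte keeps exactly one meaning, and any other payload containing 0x00 is rejected before it reaches string handling code.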

Tieske commented 1 year ago

0x00 it is then. See #251

homieiot / convention

Clarify handling of empty strings #223