peterhinch / micropython-mqtt

A 'resilient' asynchronous MQTT driver. Recovers from WiFi and broker outages.
MIT License
582 stars 127 forks source link

Support for MQTTv5 #127

Open bobveringa opened 1 year ago

bobveringa commented 1 year ago

We've been using this library for a while and have been really happy with the performance and usage. However, we recently took a look at how we could improve some of our cloud communication and discovered a need for MQTTv5 features like request-response, topic aliases and some of the expiry features.

From our perspective, it seems like we will need to develop these features anyway, but we would like to contribute these features back to the community. Given that this is (in our opinion) the go-to library for MicroPython MQTT support, It makes sense to work together to add support here.

Are you interested in adding support for MQTTv5? Our implementation could probably help from your experience with the current library, which would be beneficial for everyone using it.

peterhinch commented 1 year ago

In principle yes. When @kevinkk525 was involved we discussed this but never performed a detailed study of what was involved. Have you assessed the size of the problem in terms of effort and code size?

I do remember noting that V5 addresses the clean_init problem - if a node is downed for a long period then powered up with clean=False it can be overwhelmed with data. clean_init was our ad hoc fix.

A possible technical issue is RAM usage on ESP8266.

For personal reasons the amount of time I have available is quite limited.

bobveringa commented 1 year ago

As of right now, we are yet to decide on the scope for this project, and are just in the exploratory phase. This means that we don't have an idea of code size, and I think I labeled the effort internally as "a lot".

There are still some key questions to answer, like:

  1. Are we adding support to the existing mqtt_as.py file, or will this be a separate module (with compatible API?)
  2. From our initial (very brief) research, it wasn't 100% clear how compatible v3.1.1 and v5 are. If this mainly depended on the broker or if there are some other things.
    • If v3.1.1 brokers are compatible with v5 clients, then it would be possible to upgrade this library to only v5 support. But I'm finding mixed results online, so I'd just have to get a handful of clients and brokers to try this out.
  3. Which features to support, and how extensive that support should be.
  4. How to deal with the newer error/response codes that are returned
  5. RAM usage is an important one, our product only uses frozen byte code, so I must admit that I have no idea how big of an issue this would be.

So there is still a bit of work to do to figure all of that out. Any input you can provide on this is much appreciated.

I understand you have limited time available, I wouldn't ask you to start spending a lot of time on this topic. If you could provide some feedback from your lessons learned and other MicroPython knowledge during the development, that should (hopefully) be sufficient.

peterhinch commented 1 year ago

I would be glad to provide support as you describe.

Point 1. is clearly a major one which cannot be decided until there is a clearer view of the other points.

Re RAM use, mqtt_as was initially developed for ESP8266 as the only viable WiFi enabled platform. It has insufficient RAM to compile the module, so it must either be precompiled or preferably frozen. RAM is more plentiful on more modern platforms but people still use ESP8266. One option might be to retain the current code as a "legacy" branch for ESP8266 while advancing the master for V5.

bobveringa commented 12 months ago

Thoughts on v5 support

Okay, so I started this document out as a small, oh let's compare the 2 quickly and see what we get. And then well... I ended up with this. So, I'll summarize here and then have a very rough and general plan of what to do next.

It seems like it would be possible to support both 3.1.1 and v5 in the same class, with just a flag to indicate whether it is v5 or not. The reason for this is pretty simple, with the spec, they tried their best to keep things similar enough to make everything doable. So many changes are in the "variable header" of each packet type (both client to server and server to client). But all the v5 stuff comes after the 3.1.1 stuff. So this makes it fairly easy to add support to most methods. On the client processing side, we can however no longer rely on messages being a fixed size. And we will need to use the size attribute to read the entire packet, not that big of a deal either, the order of stuff hasn't changed.

The only thing that is really a lot different is the connect method. I think you can try to combine the functionality in a single method, but nobody would enjoy that. So this might have to be a method for 3.1.1 and v5 (or if we use inheritance, we can just override).

One thing is clear I think, a lot of stuff is now put as properties in the variable header. So, we need a method that can easily add these things in later. This is a bit difficult as these properties do change the length of the overall packet. For 3.1.1 this is not a problem as you only need to send a 0 byte.

So, after all that, here is my very rough, very general and basic plan for dealing with this:

  1. Just write a plain connect function that works, no will, no nothing, just connect (with certificate in my case)
  2. Update the receiving code to deal with packets of variable length (for suback, puback, etc)
  3. Update the sending code to indicate that they have no properties
  4. Slowly start to introduce the property of the connect function and write good utils to help the construction of the new packets.
  5. Introduces these new properties to other functions

As for support of the various features. Because most things are added in the variable header, if we come up with a clever smart way of exposing this to the user, we should be able to support most features, at least for publishing data. On the receiving side, I think the handler method would have to be changed to include some of these additional properties.

Based on how few brokers support the new Enhanced Authentication standard, I don't think we should worry about supporting it at this time. If this library is upgraded to the v5 standard, people who are advanced enough to be using those features can implement that themselves, and hopefully, contribute those changes back.

@peterhinch What do you think of this plan? You can also read my analysis below, maybe you have a different take than I do with how to proceed.

Comparison of 3.1.1 and v5

For this comparison, I used the following versions of the spec:

The goal of this analysis is to compare the specifications and see what changes need to be made to add support for MQTTv5.

MQTT Control Packets type changes

Most of the MQTT control packet types have remained the same. With 2 notable changes:

MQTT Control packet changes

CONNECT

Most of the CONNECT packet headers stays the same. They did introduce "CONNECT properties" which are encoded in the variable header. This is a breaking change, as you either need to include properties or specify that there are no properties (with a 0 byte).

The properties include:

There are also changes to the CONNECT payload. Most of these changes boil down to adding the same features, "publish" has to the will of the device. But this makes the payload different from the v3.1.1 spec in a difficult way, I think.

PUBLISH

The MQTTv5 introduced some changes to this packet, they are known in the spec as "PUBLISH properties" and they are part of the variable header for the packet. This seems to be a breaking change, as you must either include publish properties or explicitly specify that there are no properties. The paho.mqtt implementation of this is here. The example is the v5 spec is in figure 3-9 in section 3.3.2.3

The "publish properties" include a lot of the new features of the spec:

An interesting thing to note is the way topic aliases are handled. They are not a separate action that is executed, instead when publishing a message, you can include a topic alias (2 byte integer). In subsequent messages, the topic can be omitted and only the alias needs to be provided. However, this means that to have a guarantee that a topic alias is set, you need to be transmitting at least with QoS =1. Another interesting part is how distributed brokers handle these situations. The broker we use (AWS IoT Core) is highly distributed and the order of messages is "best effort". Given that the consequences of sending an invalid topic alias are the immediate termination of the connection by the broker (section 3.3.4 PUBLISH Actions) best for clients to thread lightly when using this feature. Additionally, topic alias only apply for the duration of the NETWORK connection, not the session, so if you lose the connection the aliases get lost.

But I don't think we need to think about that too much with the implementation. This library can essentially just serve as a wrapper for the protocol, and allow the user to specify the topic alias without any further processing.

When processing PUBLISH packets, there is no need to support topic aliases, and this can be indicated when connecting to the broker. So there should be only a little additional work on the processing side of this library (only the work in supporting publish properties).

PUBACK

The PUBACK packet has been expanded to include 2 new features: Reason codes and "PUBACK properties". They are appended to the variable header after the packet ID MSB and LSB see section 3.4.2 PUBACK Variable Header.

We could add support for the reason codes, but I don't really have a clear picture of where this information would go (other than a debug print). And I don't think I can fully test most of these features, as the AWS IoT Core broker is not that forgiving with most of these packets. For example, when you are not authorized to publish, the broker just immediately terminates the connection. But adding a debug print could be useful, and wouldn't be too much work.

There is also a reason string (a full UTF-8 string with a reason) and User Property field(s). But I'd say that for now we can ignore these features and keep things simple.

As for work on the library, now after a PUBACK only 2 bytes of the variable header are read (PID MSB and LSB). This needs to be modified so that the entire packet is read as it could now be variable due to properties that may, or may not, be included.

SUBSCRIBE

Much like the PUBLISH packet, the SUBSCRIBE packet was also expanded with additional properties, known in the spec as "SUBSCRIBE Properties" see section 3.8.2.1 SUBSCRIBE properties. These properties are, just like the "publish properties", appended at the end of the packet. This means that when subscribing as a v5 client, an additional byte needs to be sent that says that there are no properties.

The "subscribe properties" are:

Subscription Identifiers seem like they have a purpose, I think it when you pass this property, the publish packet the client receives might include this subscription identifier. You can then check the subscription identifier rather than trying to match the strings.

The v5 spec also introduces "Subscription Options" which are additional settings in what used to be the "Requested QoS" byte, see 3.8.3.1 Subscription Options.

The "subscription options" now contain the following settings:

  1. QoS
  2. Retain as published (seems to be related to forwarding messages?)
  3. Retain handling

Option 1 remains unchanged from previous versions. The retain as published might be useful, but given that it is just a single bit at subscribe time, doesn't hurt to implement it. Retain handling seems the most useful here. It allows the client to say what it wants to do with retained messages (0=normal, 1=send if not subscribed already, 2=don't send)

However, all the changes to the subscribe payload body are non-breaking. So leaving all these options at 0 is the same as current behavior.

SUBACK

With the SUBACK packet is much the same story as PUBACK They added new SUBACK properties. But these can all be ignored. They re-used the reason code they were already using to indicate what QoS the broker subscribed with.

There are changes in the variable header, so this needs to be handled in the same way as the PUBACK just making sure we read the entire packet, so things don't break.

UNSUBSCRIBE

The UNSUBSCRIBE packet mostly stays unchanged. However, they did add support for "UNSUBSCRIBE properties", these similar to SUBSCRIBE and PUBLISH, I think there is something wrong with the spec as it doesn't include a figure that shows the variable header, and the figures jump from 3.28 to 3.30. But given the description present for the v5 spec, you just need to send an additional byte indicating that there are no properties.

UNSUBACK

The UNSUBACK packet follows much the same story as the rest of the packets. They added User Properties, and reason codes. The packet does have payload now. But like other things, the PID is still in the first 2 bytes. The rest of the packet just needs to be read, but doesn't really need to be processed for an initial version.

PINGREQ / PINGRESP

No changes to these packets (Yay!)

DISCONNECT

The format of the DISCONNECT packet has not been changed and is fully compatible. There are some extensions in v5. For example, it is now possible for both the client and server to specify a reason for disconnecting. The "normal disconnection" is reason 0x00, which is the same value that the 3.1.1 requires. The full list of reason codes is listed in section 3.14.2.1 of the spec.

The server can now also send this packet, so this does need to be implemented. But it is a simple packet, so that is doable.

AUTH

The AUTH packet is a newly introduced packet in the v5 spec, it is part of the "Enhanced Authentication" introduced in v5, this article by HiveMQ is pretty good.

This packet does not need to be implemented unless we want to add support for enhanced authentication.

peterhinch commented 12 months ago

That is a very detailed review - it looks like a big job. A few broad-brush comments.

  1. The V5 client needs to be compatible with communications with arbitrary V5 clients. Any limitations to this need to be clearly identified.
  2. It is (in my opinion) OK to support a "micro" subset of V5 functionality so long as point 1 is maintained.
  3. Should we take this opportunity to support qos==2?
  4. Do you have a view on whether it is feasible to support V5 by subclassing? This could have the merit of saving RAM where users only need V3.1.1 support.
  5. Achieving resilience in the face of poor and intermittent WiFi was very time consuming and required a lot of testing. Options are either to maintain the existing mechanisms or to expect to have to repeat the testing. At one point late in the development of the library I was so alarmed by the complexity of this that I started a new project. The aim was simply to create a resilient socket-like object supporting communications between two endpoints. By the time I had achieved resilience it had acquired a mechanism of similar qualities and complexity. This is not intended to discourage you from revamping this, you may come up with something Kevin and I missed. But it is time consuming.
  6. @dpgeorge has expressed interest in MQTT on occasion. I think there is a case for a complete rewrite with objectives briefly outlined here. I have detected no sign that anyone is planning to do this.
bobveringa commented 11 months ago
  1. Do you mean that the V5 Client needs to be compatible with any V5 server? If not, could you clarify your point?
  2. The spec allows for a lot of flexibility, so, just adhering to the spec should allow us to maintain point 1.
  3. I, personally, don't have a need for this, as AWS IoT doesn't support it. So, I don't have an easy way to test this.
  4. I think it is feasible to add V5 support by subclassing, but not entirely, some changes will have to be made in the base class, but those are fairly limited I think, and shouldn't have an impact on RAM usage.
  5. (and .6) People are already used to the way the library works now, and the mechanisms have held up thus far. I also don't want to completely blow up the scope by reworking those systems (if not needed). A complete re-work of this package might be an option, but I don't currently have the kind of time to invest into that solution, as it would require a lot more re-testing. For now, the best strategy is probably to maintain as much of the existing code as possible to avoid having to re-test core parts of the library.

For the re-write (if this is something for a later date) I can start collecting data from the fleet of devices, we have deployed. Some are operating in truly horrific Wi-Fi conditions, which may produce interesting results. I intended to add this data collection anyway to our fleet just to help with support at the customer locations. So, if there is an interest in specific things that produce insights, I can see if it is possible to add those.

If there is interest in a re-write and I somehow find the time to properly look into that, I would like to help with the design and implementation. There are limitations we have run into with the current implementation, but they are minor enough that they don't warrant the engineering time spent on improving them. However, if it ever comes to a redesign, then we can allocate the required resources to investigate what these limitations are exactly, and maybe some possible solutions.

As for how to continue the MQTTv5 implementation. I think I have enough information on how to proceed at this time. I have some meetings next week that should (hopefully) provide me some insight into when there is enough space on the engineering calendar to start on the design and implementation.

peterhinch commented 11 months ago

Taking your points in order:

  1. I was thinking of IOT systems containing clients based on other software. We should declare any limitations on compatibility, such as lack of support for qos==2.
  2. OK.
  3. OK, I thought it worth mentioning. FWIW I don't remember qos==2 ever being requested. I did consider implementing it, testing using a local broker (mosquitto), but I'm entirely happy with leaving it unsupported.
  4. That sounds good.
  5. I mentioned the idea of a complete rewrite to provide context and to point out that there is a theoretical possibility that an official solution might emerge. I think it's unlikely that the maintainers will find the substantial amount of time required to do this. Meanwhile we have a working (if limited) solution, and upgrading it makes sense.

I'd be interested to hear details of your application (if you're in a position to divulge).

bobveringa commented 11 months ago

Alright, good to know your position on these things.

I'm happy to talk about the details of our application. Just not publicly. Do you have somewhere to reach you privately, be it a teams/voice call or just e-mail?

I understand, if you don't want to give out your information publicly (I would also rather not), if you send an e-mail to info@smartfloor.com. Your contact details should get to me (just mention my name in the e-mail).

peterhinch commented 9 months ago

FYI you might want to see https://github.com/peterhinch/micropython-mqtt/issues/132

bobveringa commented 8 months ago

Just an update on this. I've been very busy with other, more pressing tasks, but it is likely I will start to work on this over the next few months.

peterhinch commented 7 months ago

Thanks for the update.

bobveringa commented 7 months ago

I got bored over the Easter weekend and decided to give the implementation a go. Turns out getting most of it to work was not even that difficult. However, actually doing something with the new properties that are returned is difficult. Which is why, for now, all the properties that come from the broker, are ignored.

My current assessment, is that it would be very difficult to do something with the properties returned while adhering to the other goals that this library has. But I'll also discuss this internally to see what the options are, maybe I am just overlooking something.

While what I have now works, it does not look pretty. I'll spend the next couple of days cleaning up the code a bit and finalizing some of the features. You can expect an initial PR somewhere this week (or next week at the latest).

peterhinch commented 7 months ago

One issue I mentioned earlier is the clean_init arg discussed here in the docs. I provided this because I believed it to be necessary in a microcontroller context, although the way it is implemented is hacky. When Kevin and I discussed V5 we were relieved to see that the new version provided an official means of achieving this behaviour.

I would like to be able to remove the hack, ideally without breaking code that sets clean_init. I would be grateful if you could give this some thought and suggest a good way forward (whenever you get some time, of course).

bobveringa commented 7 months ago

Ah, I totally forgot about it.

Ok, I think something can be done, but only for MQTTv5, as just by reading the spec, there is no way to get the same behavior without this hack on MQTTv3.1.1.

Looking into this, I did find an answer to another question I had. The question I had was why rename Clean Session to Clean Start.

In the MQTTv3.1.1 spec, it says:

After the disconnection of a Session that had CleanSession set to 0, the Server MUST store further QoS 1 and QoS 2...

But this line is omitted in the MQTTv5 spec. And the additional clarification makes it clear that the disconnect hack is no longer needed in MQTTv5.

The code could look something like this:

            is_clean = self._clean
            if not self._has_connected and self._clean_init and not self._clean:
                is_clean = True
            await self._connect(is_clean)

But it becomes considerably uglier to have both MQTTv5 support and MQTTv3.1.1 support in the same function, as you would need to do something like this.

            is_clean = self._clean
            if not self._has_connected and self._clean_init and not self._clean:
                if self.mqttv5:
                    is_clean = True
                else:
                    await self._connect(True)  # Connect with clean session
                    try:
                        async with self.lock:
                            self._sock.write(b"\xe0\0")  # Force disconnect but keep socket open
                    except OSError:
                        pass
                    self.dprint("Waiting for disconnect")
                    await asyncio.sleep(2)  # Wait for broker to disconnect
                    self.dprint("About to reconnect with unclean session.")
            await self._connect(is_clean)

Also, a bit of an update regarding the status of this project internally. We have an ongoing internal discussion about doing a complete ground up rebuild for MQTTv5 support. We are still unsure about it is because we like working with open-source software that is actually used, it is just way less risk as more people are using it and more people are finding bugs. And we feel like splitting the community with our own thing is also not beneficial. For now, we will continue our development of MQTTv5 here as a rebuild and redesign will take a lot more time, so we will try our best to get as many features as reasonable added here.

peterhinch commented 7 months ago

For the benefit of your internal discussions it's worth pointing out that achieving resilience took a lot of time. It involved tests including outing brokers and AP's, putting a running client into a Faraday cage, slowly moving a running client out of wireless range and then back in. Deliberately creating radio noise. And repeating on a variety of host hardware.

As an aside my original industrial experience was in radio hardware development. Some software engineers find it hard to grasp that WiFi is subject to the laws of physics rather than conforming to the magic of TCP/IP...

At some point I'd appreciate an overview of your results. How much extra code and RAM use is involved? Do you think it is necessary to maintain support for V3.1.1? If the following conditions can be met, maybe V3.1.1 can be abandoned:

bobveringa commented 7 months ago

For the benefit of your internal discussions it's worth pointing out that achieving resilience took a lot of time.

This is also a big part of the reason why we don't want to move away from this library. We consider the Wi-Fi to be some sort of magic transportation layer that is not bound by any rules (be they programming or physics), and that looking at it the wrong way can change the behavior.

At some point I'd appreciate an overview of your results. How much extra code and RAM use is involved?

For now, it is pretty minimal (apart from actually using properties as you need to allocate memory for it).

Do you think it is necessary to maintain support for V3.1.1?

From what I can tell from the various brokers, most brokers seem to support at least basic features of MQTTv5. If support for v3.1.1 were to be dropped, it should be clear and to users that when they try to connect with the v5 client to the v3.1.1 broker, why this is not working and what to do to fix it (either using a specific branch or something else).

According to the spec, this is the behavior on v3.1.1 brokers.

The Server MUST respond to the CONNECT Packet with a CONNACK return code 0x01 (unacceptable protocol level) and then disconnect the Client if the Protocol Level is not supported by the Server

bobveringa commented 7 months ago

I have opened a PR for MQTTv5 support #139 It has been a lot more busy than I anticipated, so I have not had the time for a full cleanup. But the general concepts are in place.

I am not 100% happy with what I have right now, but after discussing this internally, this seemed like the least bad way of doing it. But I am open to any feedback.

bobveringa commented 4 months ago

@peterhinch Just writing this as a quick update to this topic.

We have continued working on adding MQTTv5 support to this library. We haven't published our latest work yet, as we are still working on finalizing everything. From my initial draft a couple of months ago, a lot has changed and most MQTTv5 features are supported.

There are still some open items which we are hoping to resolve within the next month. The most important one is that with MQTTv5, the broker can now disconnect clients by sending a disconnect packet. This still causes some unexpected behavior when reconnecting.

Overall, we are making good progress. I'll update here once we feel confident in the performance and stability of MQTTv5 support.

bobveringa commented 4 months ago

@peterhinch I've just pushed the latest update of our work on MQTTv5, it is available in the PR (#139). I would be interested in hearing your thoughts on the current implementation.

If you want to see a demo, I would be happy to set that up. This may be useful to show our implementation of topic aliases. We can arrange the further details of that through e-mail, if that is something you are interested in.

peterhinch commented 4 months ago

I've looked at the code and it seems good.

In order to review it properly I need to refresh my memory about V5 - we last looked at it nearly five years ago. I'll do that over the next couple of days. Meanwhile it would help if you could clarify any aspects of the protocol which are unsupported.

Assuming I don't throw a spanner in the works do you see this work as being nearly complete?

Re a demo that sounds a good idea, perhaps next week when I've got re-acquainted with V5. There are likely to e topics we could usefully discuss.

bobveringa commented 4 months ago

Unsupported Features

The new "Enhanced authentication" part of the v5 spec is not supported. The "AUTH" packet is not implemented. It is possible to set the Authentication Method and Authentication Data properties when connecting because, in general, properties are supported. And it doesn't make any sense to explicitly not allow them.

The "Will properties" are not supported. Implementing it would not be that difficult, but it would blow up the scope for testing.

For the subscribe packet, as part of "subscription options" we have not added support for the new options, and still only support setting the QoS when subscribing. This means we do not support: Retain Handling, RAP (Retain as Published) and NL (No Local). Adding support for this, would not be difficult (it is just some flags) but we opted out of implementing them as it would require a lot of additional testing.

The MQTTv5 spec allows you to set some properties more than once, like the "User Property". We opted to use a dictionary for handling the properties, which means that it is not possible to set a property more than once. This also applies when decoding properties, if the same property is included multiple times, we only take the last one.

Not everything that is returned in the CONNACK packet is exposed to the end user. This can be easily modified if it is required for new features.

I think this should cover all the unsupported features, but it is possible I might have missed something.

Uncompleted features

The last one is a bit on the edge of unsupported/not complete, this is something we can discuss. We currently don't do anything yet with the properties on received messages. There is still a TODO on line 807. The reason we have not yet implemented it yet is because we don't want to add a new argument to the callback or event, as that would be a breaking change.

There are several options available:

  1. Do the breaking change (we are not a fan of this)
  2. Do the breaking change, but only for MQTTv5. So if people want to use MQTTv5, they have to update callbacks and handlers and such.
  3. Do not support it (also not a great option)
  4. Try first with properties if that causes an error (because there are not enough arguments) then try without properties

I think it options 1 is out of the question. Not supporting it, is not really an option because some MQTTv5 features depend on it, like "subscription identifiers". So that basically leaves options 2 and 4. This does need some additional testing after deciding, but not a significant amount. We already handle all the properties, it is just that they are only printed at the moment.

We are leaning towards option 2, as MQTTv5 is already a bit of a breaking change, so it would not be unreasonable to ask that if people change the major version of a protocol they are using, that they also change a handler. But we have no strong opinions on either 2 or 4.

Usage of features

We decided to go with a fairly unopinionated implementation for some MQTTv5 features. Take for example topic aliases. This must be done by adding a topic alias as a property when publishing. In subsequent messages, you can then only include the topic alias property and omit the topic itself. In our current implementation, it is up to the application how to implement this. The library itself allows you to send the property, but you can't simply check something saying "use topic aliases" and have the library manage everything.

This is because, after doing some research, we found that there are just too many different use cases and edge cases that everyone would probably like to handle differently. The biggest problem with topic alias, for example, is that if you send a topic alias that is not valid, meaning that if the server has not received a topic to go with the alias before, the connection will be terminated. In our case, we are happy sending some messages with QoS 1, to ensure that messages that include both the topic and the alias arrive. But there are likely many people who want to make a different compromise.

Completion of the PR

It would say the work is nearly completed, there are some things to still review and receive feedback on, but I am not expecting major changes. We are still running validation checks internally, but we have devices that have been running some version of MQTTv5 for 2–3 months now, and everything has been very stable (apart from protocols errors in our application layer, which was why we found #147)

peterhinch commented 4 months ago

Thanks for that detailed response.

My initial response to "uncompleted features" above is as follows. Looking at the code, the user is able to choose the protocol. I assume that, if V5 is not selected, compatibility with the current version is 100%. Consequently, a user wishing to update the client but run existing code has a way to do so.

Consequently, if an application opts for V5, I think it is reasonable to introduce breaking changes.

If V5 introduces breaking changes, one question is whether, in that version, to remove support for clean_init. It's rather a hack to work round what I see as a weakness in the V3 protocol. That weakness has gone in V5 with session_expiry_interval so I'd happily see it removed.

I'll read up on MQTT5 with emphasis on the supported features, review the code, and report back in a few days.

bobveringa commented 4 months ago

Compatibility with the current version should be 100%, we are just starting re-validation of this, as we tested it a while ago but made some changes since then. I am not expecting that anything broke since we last tested it, but you never know.

I'll start the implementation of passing the properties to the handlers only for MQTTv5.

peterhinch commented 3 months ago

I've updated my local broker and done some work with mosquitto_pub and mosquittp_sub to re-acquaint myself with properties. Is https://github.com/bobveringa/micropython-mqtt/commit/63f20e716e7cd1b519552bc825965af801f15b26 essentially complete? If so I'll fetch the code and do some testing.

bobveringa commented 3 months ago

Yes, I'd say that version is essentially complete. I'm not expecting to make any additional changes (unless we find something during this review process).

peterhinch commented 3 months ago

I've grabbed the code. I was under the impression that existing applications would run unchanged, so my first test was to run the range_ex demo on an ESP8266, also on the ESP32 reference board (one aim being to measure the extra RAM usage on ESP8266).

In both cases it failed to connect to the broker (below is ESP32 output):

Connected to MicroPython at /dev/ttyUSB0
Use Ctrl-] or Ctrl-x to exit this shell

>>> import range_ex
Checking WiFi integrity.
Got reliable connection
Connecting to broker.
Connection failed.
>>> 

By contrast with the old version of mqtt_as.py on the same ESP32 board with no other changes:

>>> import range_ex
Checking WiFi integrity.
Got reliable connection
Connecting to broker.
Connected to broker.
We are connected to broker.
publish 0

Please could you indicate what changes are required to enable existing applications to run.

bobveringa commented 3 months ago

This not completely expected. We have run a number of tests with MQTTv5 but with all our old configurations, and everything works.

Could you give some more information about the setup you are running, I'll investigate what is going on.

bobveringa commented 3 months ago

I did a code analysis in addition to some other testing. When a last will was set, it was also adding will properties, this messed up the entire encoding of the packet (for 3.1.1) thus resulting in being unable to connect.

This somehow got by everyone and managed to stay in there for several code reviews.

We don't use 3.1.1 in combination with last will, so it was never discovered during testing.

https://github.com/peterhinch/micropython-mqtt/pull/139/commits/7d7743134254bc757474ae9f48a26519993929d3 adds a fix for this issue. I've analyzed the rest of the code, and it looks like this was the only instance of this happening.

peterhinch commented 3 months ago

Thank you, that fixes the problem.

RAM usage

On an ESP8266 free RAM is as follows (using precompiled mqtt_as.py, running range_ex.py). This was with no code changes, simply replacing mqtt_as.mpy.
Current version: 14752 bytes.
New version: 10416 bytes. Delta = 4336 bytes.

Running MQTT V5

I then modified range_ex.py to specify mqttv5

config["mqttv5"] = True

and adapted line 54 to read

async for topic, msg, retained, properties in client.queue:

The demo ran OK until I hit the broker with

$ mosquitto_pub -h 192.168.0.10 -t foo_topic -q 1 -m "hello hello" -D PUBLISH user-property foo bar  -V 5

The following occurred:

[correct tracebacks removed] 
publish 18
publish 19
publish 20
Task exception wasn't retrieved
future: <Task> coro= <generator object '_handle_msg' at 3ffd49d0>
Traceback (most recent call last):
  File "asyncio/core.py", line 1, in run_until_complete
  File "mqtt_as.py", line 967, in _handle_msg
  File "mqtt_as.py", line 810, in wait_msg
  File "mqtt_as.py", line 266, in decode_properties
ValueError: Unknown property identifier: 0
publish 21

Note that a mosquitto_sub session responded correctly to the publication:

$ mosquitto_sub -h 192.168.0.10 -t foo_topic  -V 5 -F "Properties %P payload %p"
Properties foo:bar payload hello hello

I have no doubt I'm missing something here - I'd be grateful for some guidance.

bobveringa commented 3 months ago

You seem to be pressing just where it hurts (which is a good thing) and seem to have uncovered some gaps in our testing.

It seems like the decode properties table used the wrong decode method for user properties, it was marked as a string, but it is actually a string pair. The latest commit fixes this.

Some refactoring seems to have caused the wrong method to be used. My apologies, this should not have made it in the released version.

I've validated that all the other properties are using the correct decoding methods.

peterhinch commented 3 months ago

To throw in a curveball I've been thinking about RAM usage. The overhead when running MQTT V3.1.1 applications is rather high, notably for Pico W and especially for ESP8266. Users wanting V5 are probably best advised to avoid minimal-RAM platforms where the overhead is entirely OK in my view. But there are many ESP8266 users, and users who have no need for V5.

I can see two options, but there may be others.

  1. Maintain the existing release code for 3.1.1 alongside the new version. Users on restricted platforms being invited to use the old protocol. Drawback: two codebases to maintain. However existing code is mature and seldom needs updating.
  2. Place all the property handling code, from format_properties to the start of the base class, in a second file. The base class constructor only imports the file if mqttv5 is set. A wildcard import ensures minimal (no?) code changes.

Unless you have another alternative I think 2. is worth an experiment. I'm willing to give this a try, measuring the outcome, if you think it's a reasonable approach.

bobveringa commented 3 months ago

Option 2 seems like a reasonable approach. I agree that people working on memory constraint platforms should not be impacted by a change this big. Based on the changes, it seems reasonable to assume that the bulk of this additional memory usage is coming from property decoding.

Just to confirm, you mentioned the following:

Current version: 14752 bytes. New version: 10416 bytes. Delta = 4336 bytes.

I'm going to assume that these should be flipped around, and that the current (released) version is the one using the least memory. Even though it would be nice if this new release somehow used less memory 😄.

peterhinch commented 3 months ago

The latest update has fixed received messages with user properties. I'm a bit stumped by the syntax required for the properties arg of the publish command. By symmetry with the received properties I'm using

        await client.publish(
            TOPIC,
            s.format(n, client.REPUB_COUNT, outages, rssi, m, client.queue.discards),
            qos=1,
            properties={38: {b"abc": b"def"}},
        )

but am getting

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "range_ex.py", line 121, in <module>
  File "asyncio/core.py", line 1, in run
  File "asyncio/core.py", line 1, in run_until_complete
  File "asyncio/core.py", line 1, in run_until_complete
  File "range_ex.py", line 101, in main
  File "mqtt_as.py", line 1089, in publish
  File "mqtt_as.py", line 613, in publish
  File "mqtt_as.py", line 638, in _publish
  File "mqtt_as.py", line 171, in format_properties
NotImplementedError: array/bytes required on right side

Please advise as to the correct syntax.

bobveringa commented 3 months ago

There is no encoding method for properties. When sending properties, it must be formatted according to the spec. In the case of user properties, this means the following.

            properties = {
                38: b'\x00\x03abc\x00\x03def',
            }

If the property is a string, then it must be formatted according to the spec (16 bit int for length, followed by the string)

bobveringa commented 3 months ago

I've deleted my previous comment, as after some more research it did not describe everything correctly.

As a correction, if a property is a UTF-8 String, then all that is needed to send it as send it as a "normal" python string. Our expectation at the time was that, this should cover the annoying bits of sending properties. As other properties are mostly fairly trivial.

Looking at the driver in a more complete state now, you could make an argument that users should not need to have knowledge of how the spec requires certain properties to be formatted, or at least not to the degree required now.

However, implementing property encoding is a bit more difficult, as a lot has to be known about them. We really want to prevent heap fragmentation and memory reallocation if at all possible. As we have noticed that this has significant performance and runtime impacts on our application. This is also the reason why format_properties looks that way that it does.

It is possible to implement a more general encode function that would require less knowledge from the user about how to format the properties. However, it would be a lot more complex to implement.

So, this then turns into 2 questions:

  1. Is properly encoding properties something the library should do
  2. How much effort can be spent doing that.

If I had to answer those questions myself.

  1. Maybe, it would be a nicer user experience, and prevents knowledge of the spec making its way to the application
  2. I don't know

I would like to hear your thoughts on the matter.

peterhinch commented 3 months ago

Just to confirm, you mentioned the following: Current version: 14752 bytes. New version: 10416 bytes. Delta = 4336 bytes.

The perils of written communication :) I should have clarified those are measures of free RAM while running range_ex. The ESP8266 requires precompilation even with the release version. Frozen bytecode gives a substantial saving, but many users are not equipped to do this. Hence my concern.

I'll think about property encoding.

[EDIT] I have hacked a version with the property code in a separate file. Performing the same test on an ESP8266 the free RAM is 13824 bytes. This is with the two files pre-compiled. The overhead for running old code on the new version is now only 928 bytes which is OK in my opinion.

bobveringa commented 3 months ago

Good to know. I'll make that change in the PR.

While working in the garden, I also thought of a way to better encode properties so that it is more user-friendly. Based on some mental programming, I think it doesn't cause any more memory re-allocation or fragmentation than the current solution. But programs do tend to run better in my head than on my hardware, so we'll see how that goes.

I'll also make those changes, test them to see if I didn't break anything, and I'll update the syntax here when I've pushed the new version.

bobveringa commented 3 months ago

I had to fight a bit with including the separate file, so I made it a relative import, I think in that case it should work for everyone, if they have it in a separate folder or not. If that is not the case, I have to think of something better to include it.

As for how properties are now encoded. You still need to know the spec, but it is not as relevant as it was before. Now integers can just be passed as regular integers, regardless if they are encoded as variable length int, 2 byte or 4 byte int.

Single byte properties still need to be passed as such.

Binary data, has to be a bytearray. But the 16-bit length at the front is automatically added when encoding. UTF-8 Strings can be passed as regular strings. UTF-8 String pairs are a dictionary of strings.

           properties = {
                0x26: {'value': 'test'},
                0x09: b'correlation_data',
                0x08: 'response_topic',
                0x02: 60,
            }
peterhinch commented 3 months ago

I made it a relative import

That sounds fine, but see my comment to the PR.

The new way to specify properties is much more intuitive.

peterhinch commented 3 months ago

A question remains, what to do about incoming properties. From an API point of view it makes sense for decoded properties to be returned, e.g.

properties = await client.subscribe("foo_topic", 1)

that way the caller can process them or ignore them. In the case of connect this is easy. The following is a broad-brush, untested, idea of how we might achieve this.

In the case of publish and subscribe there can be repeated attempts - I guess the properties of the final, successful, attempt should be returned. This could be done with a bound dict .pend_props whose key is the PID and the value the decoded properties. This holds the properties in the interval between the incoming message being received and the calling task being re-scheduled.

The .wait_msg would assign to the dict and ._await_pid would retrieve the properties. Under normal conditions the dict would hold only one entry, being removed when the calling routine got a successful result from ._await_pid or on failure. Multiple entries could briefly occur with overlapping concurrent operations.

async def _await_pid(self, pid):
    t = ticks_ms()
    while pid in self.rcv_pids:  # local copy
        if self._timeout(t) or not self.isconnected():
            self.rcv_pids.discard(pid)  # This should have been in my code
            del self.pend_props[pid]  # Discard old properties
            break  # Must repub or bail out
        await asyncio.sleep_ms(100)
    else:
        return self.pend_props.pop(pid, True)  # PID received. All done.
    return False

with .wait_msg having code like

async def wait_msg(self):
    # ...
    if puback_props_sz > 0:
        puback_props = await self._as_read(puback_props_sz)
        #decoded_props = decode_properties(puback_props, puback_props_sz)
        #self.dprint("PUBACK properties %s", decoded_props)
        self.pend_props[pid] = decode_properties(puback_props, puback_props_sz)

Re DISCONNECT In this case there is no PID since DISCONNECT can originate from the app or from the broker. I think the best approach is a bound variable disconnect_properties set to None in the constructor and also on connect. A disconnection caused by the broker will be trapped by the existing mechanism; the application's down task could read this bound variable.

Re UNSUBSCRIBE Currently we ignore the UNSUBACK packet (0xB0). According to the spec it can carry the user property (0x26) but I've no idea whether brokers make a habit of this and what its contents would mean. There is a reason code. I don't know whether we need to deal with this, but given that there is a PID the property could be handled as per publish and subscribe above.

I'd be keen to hear your views.

bobveringa commented 3 months ago

I don't complete understand what you mean by supporting decoded properties for "PUBLISH". Incoming properties for "PUBLISH" are already supported as discussed here https://github.com/peterhinch/micropython-mqtt/issues/127#issuecomment-2247864055

Properties on 2 other operations "CONNACK" and "DISCONNECT" or sort of supported. With connack the relevant properties are taken out and made available to users. And with the disconnect the reason string is printed for easier debugging.

As for my thoughts on supporting properties on these other operations. I am not convinced that is something we have to do. The properties that these operations have are not really relevant in my opinion. It seems to mostly consist of 2 properties. Reason String and User Property. In my opinion, the benefit that you can get from returning these properties does not outweigh the complexity that would be introduced with a change like this.

While working on MQTTv5 support together with my colleague, with had a lot of internal discussions about what to do with properties. Looking around at other libraries and code that is publicly available, people do not seem to care about properties on these other operations. Bigger MQTT libraries like paho.mqtt, also don't seem to expose properties on these other operations.

In my opinion, supporting properties on incoming PUBLISH packets is what most if not all people are actually interested in. Supporting and exposing some properties on CONNACK is needed for the library to function correctly. The rest is all a nice to have.

peterhinch commented 3 months ago

I appreciate that properties on incoming messages are supported.

I was querying those cases which currently produce only debug prints (PUBACK, UNSUBACK,DISCONNECT, CONNACK). However I take your point that there are good reasons for not attempting full support, and on further thought there are edge cases that make doing it nontrivial.

bobveringa commented 3 months ago

A summary from our call yesterday (again because I posted the other one with my wrong account)

The remaining work consists of adding documentation for the MQTTv5 implementation. This should include some examples and a list of the features that are not supported. I will add this documentation and submit it with the existing PR.

We decided against returning properties for operations like PUBACK, SUBACK, DISCONNECT, etc. because it is unclear if there is any desire for this. We determined that if there is a need to implement this later, it can be done without any breaking changes to the API.

bobveringa commented 3 months ago

@peterhinch I see #139 was just merged. That is a bit of a problem. My colleague just tried out the latest version and got the following error.

  File "lib/mqtt_as/mqtt_as.py", line 795, in connect
  File "lib/mqtt_as/mqtt_as.py", line 317, in _connect
NameError: name 'encode_properties' isn't defined

I have no idea how because both you and I tested this method of importing properties from an external file. But yes, it is now happening consistently.

I have a "solution". I could open another PR for this?

encode_properties = None
decode_properties = None

class MQTT_base:
    def __init__(self, config):
        ...
        if self.mqttv5:
            global encode_properties, decode_properties
            from .mqtt_v5_properties import encode_properties as ep
            from .mqtt_v5_properties import decode_properties as dp
            encode_properties = ep
            decode_properties = dp

EDIT: PR is open #148

peterhinch commented 3 months ago

In our meeting you asked me where I saw this going. I've now had a chance to give this some thought.

Main branch

New branch

I intend to experiment with returning properties from functions as discussed. While I take your point about e.g. PUBUCK properties being little used, I do think it's worth at least trying that approach. It strikes me as a logical API. It might even save bytes in the crucial case of V3.1.1 use on ESP8266 because we would be removing debug print statements.

Even if I judge the test a success I won't merge the branch until you've had a chance to review it.

Comments on this plan are welcome.

bobveringa commented 3 months ago

These all seem like good points.

I'd be happy to take a look at any tests you intend to do.