peterhinch / micropython-mqtt

A 'resilient' asynchronous MQTT driver. Recovers from WiFi and broker outages.
MIT License

Support for MQTTv5 #127

Open bobveringa opened 7 months ago

bobveringa commented 7 months ago

We've been using this library for a while and have been really happy with the performance and usage. However, we recently took a look at how we could improve some of our cloud communication and discovered a need for MQTTv5 features like request-response, topic aliases and some of the expiry features.

From our perspective, it seems like we will need to develop these features anyway, but we would like to contribute them back to the community. Given that this is (in our opinion) the go-to library for MicroPython MQTT support, it makes sense to work together to add support here.

Are you interested in adding support for MQTTv5? Our implementation could probably benefit from your experience with the current library, which would be beneficial for everyone using it.

peterhinch commented 7 months ago

In principle yes. When @kevinkk525 was involved we discussed this but never performed a detailed study of what was involved. Have you assessed the size of the problem in terms of effort and code size?

I do remember noting that V5 addresses the clean_init problem - if a node is downed for a long period then powered up with clean=False it can be overwhelmed with data. clean_init was our ad hoc fix.

A possible technical issue is RAM usage on ESP8266.

For personal reasons the amount of time I have available is quite limited.

bobveringa commented 7 months ago

As of right now, we are yet to decide on the scope for this project, and are just in the exploratory phase. This means that we don't have an idea of code size, and I think I labeled the effort internally as "a lot".

There are still some key questions to answer, like:

  1. Are we adding support to the existing mqtt_as.py file, or will this be a separate module (with a compatible API)?
  2. From our initial (very brief) research, it wasn't 100% clear how compatible v3.1.1 and v5 are, or whether this mainly depends on the broker or on other things.
    • If v3.1.1 brokers are compatible with v5 clients, then it would be possible to upgrade this library to support only v5. But I'm finding mixed results online, so I'd just have to get a handful of clients and brokers to try this out.
  3. Which features to support, and how extensive that support should be.
  4. How to deal with the newer error/response codes that are returned.
  5. RAM usage is an important one. Our product only uses frozen bytecode, so I must admit that I have no idea how big of an issue this would be.

So there is still a bit of work to do to figure all of that out. Any input you can provide on this is much appreciated.

I understand you have limited time available, and I wouldn't ask you to start spending a lot of time on this topic. If you could provide some feedback from your lessons learned and other MicroPython knowledge during the development, that should (hopefully) be sufficient.

peterhinch commented 7 months ago

I would be glad to provide support as you describe.

Point 1. is clearly a major one which cannot be decided until there is a clearer view of the other points.

Re RAM use, mqtt_as was initially developed for ESP8266 as the only viable WiFi enabled platform. It has insufficient RAM to compile the module, so it must either be precompiled or preferably frozen. RAM is more plentiful on more modern platforms but people still use ESP8266. One option might be to retain the current code as a "legacy" branch for ESP8266 while advancing the master for V5.

bobveringa commented 7 months ago

Thoughts on v5 support

Okay, so I started this document out as a small, oh let's compare the 2 quickly and see what we get. And then well... I ended up with this. So, I'll summarize here and then have a very rough and general plan of what to do next.

It seems like it would be possible to support both 3.1.1 and v5 in the same class, with just a flag to indicate whether it is v5 or not. The reason is pretty simple: in the spec, they tried their best to keep things similar enough to make everything doable. Many of the changes are in the "variable header" of each packet type (both client to server and server to client), and all the v5 stuff comes after the 3.1.1 stuff. This makes it fairly easy to add support to most methods. On the client processing side, however, we can no longer rely on messages being a fixed size, and we will need to use the size attribute to read the entire packet. That is not a big deal either, as the order of stuff hasn't changed.
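Reading by size could look roughly like the sketch below. It is a synchronous illustration only (the library itself is asynchronous), and `read_vbi` and `read_body` are hypothetical names, not functions from mqtt_as. It decodes the Remaining Length, which the spec encodes as a Variable Byte Integer, and then reads exactly that many bytes.

```python
import io


def read_vbi(stream):
    """Decode an MQTT Variable Byte Integer (1 to 4 bytes) from a stream."""
    value = 0
    for shift in (0, 7, 14, 21):
        data = stream.read(1)
        if not data:
            raise OSError("stream closed while reading length")
        value |= (data[0] & 0x7F) << shift
        if not data[0] & 0x80:  # Continuation bit clear: last length byte
            return value
    raise ValueError("malformed length: more than 4 bytes")


def read_body(stream):
    """Read the rest of a packet once its first (type) byte has been consumed."""
    size = read_vbi(stream)
    body = stream.read(size)
    if len(body) != size:
        raise OSError("short read")
    return body


# A v5 PUBACK after the type byte: length 4, PID 0x0001, reason 0x00, no properties
print(read_body(io.BytesIO(b"\x04\x00\x01\x00\x00")))
```

Because the whole body is consumed regardless of what it contains, unknown or ignored properties cannot leave stray bytes in the stream.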

The only thing that is really a lot different is the connect method. I think you can try to combine the functionality in a single method, but nobody would enjoy that. So this might have to be a method for 3.1.1 and v5 (or if we use inheritance, we can just override).

One thing is clear, I think: a lot of stuff is now carried as properties in the variable header. So we need a method that makes it easy to add these things later. This is a bit difficult, as these properties change the length of the overall packet. When no properties are sent (which matches 3.1.1 behaviour) this is not a problem, as you only need to send a single 0 byte.
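A helper for this might be as small as the sketch below. The names `encode_vbi` and `encode_properties` are hypothetical and only for illustration. The property length is a Variable Byte Integer, so an empty property set comes out as the single 0 byte.

```python
def encode_vbi(value):
    """Encode an int as an MQTT Variable Byte Integer (1 to 4 bytes)."""
    if not 0 <= value <= 0x0FFFFFFF:
        raise ValueError("value out of range for a Variable Byte Integer")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            byte |= 0x80  # More length bytes follow
        out.append(byte)
        if not value:
            return bytes(out)


def encode_properties(props=b""):
    """Prefix already-encoded property bytes with their total length."""
    return encode_vbi(len(props)) + props


print(encode_properties())         # No properties
print(encode_properties(b"\x01" * 200))[:2] if False else None  # (illustration disabled)
```

The caller still has to add the length of the returned bytes to the packet's Remaining Length before sending the fixed header.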

So, after all that, here is my very rough, very general and basic plan for dealing with this:

  1. Just write a plain connect function that works, no will, no nothing, just connect (with certificate in my case)
  2. Update the receiving code to deal with packets of variable length (for suback, puback, etc)
  3. Update the sending code to indicate that they have no properties
  4. Slowly start to introduce the properties of the connect function and write good utils to help the construction of the new packets.
  5. Introduce these new properties to other functions

As for support of the various features: because most things are added in the variable header, if we come up with a clever way of exposing this to the user, we should be able to support most features, at least for publishing data. On the receiving side, I think the handler method would have to be changed to include some of these additional properties.

Based on how few brokers support the new Enhanced Authentication standard, I don't think we should worry about supporting it at this time. If this library is upgraded to the v5 standard, people who are advanced enough to be using those features can implement that themselves, and hopefully, contribute those changes back.

@peterhinch What do you think of this plan? You can also read my analysis below, maybe you have a different take than I do with how to proceed.

Comparison of 3.1.1 and v5

For this comparison, I used the following versions of the spec:

The goal of this analysis is to compare the specifications and see what changes need to be made to add support for MQTTv5.

MQTT Control Packets type changes

Most of the MQTT control packet types have remained the same. With 2 notable changes:

MQTT Control packet changes

CONNECT

Most of the CONNECT packet header stays the same. They did introduce "CONNECT properties" which are encoded in the variable header. This is a breaking change, as you either need to include properties or specify that there are no properties (with a 0 byte).

The properties include:

There are also changes to the CONNECT payload. Most of these changes boil down to giving the will of the device the same features that "publish" has. But this makes the payload different from the v3.1.1 spec in a difficult way, I think.

PUBLISH

MQTTv5 introduced some changes to this packet. They are known in the spec as "PUBLISH properties" and they are part of the variable header for the packet. This seems to be a breaking change, as you must either include publish properties or explicitly specify that there are no properties. The paho.mqtt implementation of this is here. The example in the v5 spec is in figure 3-9 in section 3.3.2.3.

The "publish properties" include a lot of the new features of the spec:

An interesting thing to note is the way topic aliases are handled. They are not a separate action that is executed. Instead, when publishing a message, you can include a topic alias (2 byte integer). In subsequent messages, the topic can be omitted and only the alias needs to be provided. However, this means that to have a guarantee that a topic alias is set, you need to be transmitting with at least QoS=1. Another interesting part is how distributed brokers handle these situations. The broker we use (AWS IoT Core) is highly distributed and the order of messages is "best effort". Given that the consequence of sending an invalid topic alias is the immediate termination of the connection by the broker (section 3.3.4 PUBLISH Actions), it is best for clients to tread lightly when using this feature. Additionally, topic aliases only apply for the duration of the NETWORK connection, not the session, so if you lose the connection the aliases are lost.

But I don't think we need to think about that too much with the implementation. This library can essentially just serve as a wrapper for the protocol, and allow the user to specify the topic alias without any further processing.
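As a rough illustration of how little the wrapper needs to do, the sketch below builds a PUBLISH properties field that carries only a topic alias. The function name `alias_properties` is hypothetical. The identifier 0x23 and the two byte integer encoding come from the v5 spec, which also forbids an alias of 0.

```python
import struct

TOPIC_ALIAS = 0x23  # Property identifier for Topic Alias in the v5 spec


def alias_properties(alias):
    """Return a complete PUBLISH properties field carrying only a topic alias."""
    if not 1 <= alias <= 0xFFFF:
        raise ValueError("topic alias must be 1..65535 (0 is not permitted)")
    prop = struct.pack("!BH", TOPIC_ALIAS, alias)  # Identifier + big-endian value
    return bytes([len(prop)]) + prop  # Length is 3, so it fits in one byte


print(alias_properties(5))
```

Whether the alias is already known to the broker is left to the caller, in line with treating the library as a thin wrapper around the protocol.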

When processing PUBLISH packets, there is no need to support topic aliases, and this can be indicated when connecting to the broker. So there should be only a little additional work on the processing side of this library (only the work in supporting publish properties).

PUBACK

The PUBACK packet has been expanded to include 2 new features: reason codes and "PUBACK properties". They are appended to the variable header after the packet ID MSB and LSB; see section 3.4.2 PUBACK Variable Header.

We could add support for the reason codes, but I don't really have a clear picture of where this information would go (other than a debug print). And I don't think I can fully test most of these features, as the AWS IoT Core broker is not that forgiving with most of these packets. For example, when you are not authorized to publish, the broker just immediately terminates the connection. But adding a debug print could be useful, and wouldn't be too much work.

There is also a reason string (a full UTF-8 string with a reason) and User Property field(s). But I'd say that for now we can ignore these features and keep things simple.

As for work on the library: currently, after a PUBACK, only 2 bytes of the variable header are read (PID MSB and LSB). This needs to be modified so that the entire packet is read, as its length can now vary depending on the properties that may, or may not, be included.
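Once the whole body has been read, pulling out the PID and reason code is simple. The sketch below uses a hypothetical `parse_puback` helper and assumes `body` holds exactly the bytes counted by the Remaining Length. Per the v5 spec, a body of only 2 bytes means the reason code is 0x00 (success) and there are no properties.

```python
def parse_puback(body):
    """Return (pid, reason_code) from a v5 PUBACK body; properties are ignored."""
    if len(body) < 2:
        raise ValueError("PUBACK body too short")
    pid = body[0] << 8 | body[1]
    # The reason code is optional; when it is absent the publish succeeded
    reason = body[2] if len(body) > 2 else 0x00
    return pid, reason


print(parse_puback(b"\x00\x05"))          # Short form, as in 3.1.1
print(parse_puback(b"\x12\x34\x10\x00"))  # Reason 0x10 with an empty property set
```

Any bytes after the reason code (property length and properties) are simply skipped here, which matches the "read everything, process little" approach for an initial version.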

SUBSCRIBE

Much like the PUBLISH packet, the SUBSCRIBE packet was also expanded with additional properties, known in the spec as "SUBSCRIBE Properties"; see section 3.8.2.1 SUBSCRIBE properties. These properties are, just like the "publish properties", appended at the end of the variable header. This means that when subscribing as a v5 client, an additional byte needs to be sent that says that there are no properties.

The "subscribe properties" are:

Subscription Identifiers seem like they have a purpose. I think that when you pass this property, the publish packet the client receives might include this subscription identifier. You can then check the subscription identifier rather than trying to match the strings.

The v5 spec also introduces "Subscription Options" which are additional settings in what used to be the "Requested QoS" byte, see 3.8.3.1 Subscription Options.

The "subscription options" now contain the following settings:

  1. QoS
  2. Retain as published (seems to be related to forwarding messages?)
  3. Retain handling

Option 1 remains unchanged from previous versions. Retain as published might be useful, and given that it is just a single bit at subscribe time, it doesn't hurt to implement it. Retain handling seems the most useful here. It allows the client to say what it wants to do with retained messages (0=normal, 1=send if not subscribed already, 2=don't send).

However, all the changes to the subscribe payload body are non-breaking. So leaving all these options at 0 is the same as current behavior.
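A sketch of packing that byte is shown below. The function name `subscription_options` is hypothetical. The bit positions are taken from section 3.8.3.1 of the v5 spec, which also defines a No Local bit (bit 2) that is not in the list above.

```python
def subscription_options(qos=0, no_local=False, retain_as_published=False,
                         retain_handling=0):
    """Pack the v5 Subscription Options byte; all defaults give 3.1.1 behaviour."""
    if qos not in (0, 1, 2):
        raise ValueError("qos must be 0, 1 or 2")
    if retain_handling not in (0, 1, 2):
        raise ValueError("retain_handling must be 0, 1 or 2")
    return (
        qos                                   # Bits 0-1
        | int(bool(no_local)) << 2            # Bit 2
        | int(bool(retain_as_published)) << 3 # Bit 3
        | retain_handling << 4                # Bits 4-5; bits 6-7 stay reserved (0)
    )


print(hex(subscription_options(qos=1)))
print(hex(subscription_options(qos=1, retain_handling=2)))
```

With every option left at its default the byte equals the plain requested QoS, so existing callers would see no change.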

SUBACK

The SUBACK packet is much the same story as PUBACK. They added new SUBACK properties, but these can all be ignored. They re-used the reason code they were already using to indicate what QoS the broker subscribed with.

There are changes in the variable header, so this needs to be handled in the same way as the PUBACK, just making sure we read the entire packet so things don't break.

UNSUBSCRIBE

The UNSUBSCRIBE packet mostly stays unchanged. However, they did add support for "UNSUBSCRIBE properties", which are similar to those of SUBSCRIBE and PUBLISH. I think there is something wrong with the spec, as it doesn't include a figure that shows the variable header, and the figures jump from 3.28 to 3.30. But given the description present in the v5 spec, you just need to send an additional byte indicating that there are no properties.

UNSUBACK

The UNSUBACK packet follows much the same story as the rest of the packets. They added User Properties and reason codes. The packet does have a payload now. But like other things, the PID is still in the first 2 bytes. The rest of the packet just needs to be read, but doesn't really need to be processed for an initial version.

PINGREQ / PINGRESP

No changes to these packets (Yay!)

DISCONNECT

The format of the DISCONNECT packet has not been changed and is fully compatible. There are some extensions in v5. For example, it is now possible for both the client and server to specify a reason for disconnecting. The "normal disconnection" is reason 0x00, which is the same value that 3.1.1 requires. The full list of reason codes is in section 3.14.2.1 of the spec.

The server can now also send this packet, so this does need to be implemented. But it is a simple packet, so that is doable.

AUTH

The AUTH packet is newly introduced in the v5 spec. It is part of the "Enhanced Authentication" introduced in v5; this article by HiveMQ is pretty good.

This packet does not need to be implemented unless we want to add support for enhanced authentication.

peterhinch commented 7 months ago

That is a very detailed review - it looks like a big job. A few broad-brush comments.

  1. The V5 client needs to be compatible with communications with arbitrary V5 clients. Any limitations to this need to be clearly identified.
  2. It is (in my opinion) OK to support a "micro" subset of V5 functionality so long as point 1 is maintained.
  3. Should we take this opportunity to support qos==2?
  4. Do you have a view on whether it is feasible to support V5 by subclassing? This could have the merit of saving RAM where users only need V3.1.1 support.
  5. Achieving resilience in the face of poor and intermittent WiFi was very time consuming and required a lot of testing. Options are either to maintain the existing mechanisms or to expect to have to repeat the testing. At one point late in the development of the library I was so alarmed by the complexity of this that I started a new project. The aim was simply to create a resilient socket-like object supporting communications between two endpoints. By the time I had achieved resilience it had acquired a mechanism of similar qualities and complexity. This is not intended to discourage you from revamping this, you may come up with something Kevin and I missed. But it is time consuming.
  6. @dpgeorge has expressed interest in MQTT on occasion. I think there is a case for a complete rewrite with objectives briefly outlined here. I have detected no sign that anyone is planning to do this.

bobveringa commented 6 months ago

  1. Do you mean that the V5 Client needs to be compatible with any V5 server? If not, could you clarify your point?
  2. The spec allows for a lot of flexibility, so, just adhering to the spec should allow us to maintain point 1.
  3. I, personally, don't have a need for this, as AWS IoT doesn't support it. So, I don't have an easy way to test this.
  4. I think it is feasible to add V5 support by subclassing, but not entirely, some changes will have to be made in the base class, but those are fairly limited I think, and shouldn't have an impact on RAM usage.
  5. (and 6.) People are already used to the way the library works now, and the mechanisms have held up thus far. I also don't want to completely blow up the scope by reworking those systems (if not needed). A complete re-work of this package might be an option, but I don't currently have the kind of time to invest into that solution, as it would require a lot more re-testing. For now, the best strategy is probably to maintain as much of the existing code as possible to avoid having to re-test core parts of the library.

For the re-write (if this is something for a later date) I can start collecting data from the fleet of devices we have deployed. Some are operating in truly horrific Wi-Fi conditions, which may produce interesting results. I intended to add this data collection to our fleet anyway, just to help with support at the customer locations. So, if there is an interest in specific things that produce insights, I can see if it is possible to add those.

If there is interest in a re-write and I somehow find the time to properly look into that, I would like to help with the design and implementation. There are limitations we have run into with the current implementation, but they are minor enough that they don't warrant the engineering time spent on improving them. However, if it ever comes to a redesign, then we can allocate the required resources to investigate what these limitations are exactly, and maybe some possible solutions.

As for how to continue the MQTTv5 implementation, I think I have enough information on how to proceed at this time. I have some meetings next week that should (hopefully) provide me some insight into when there is enough space on the engineering calendar to start on the design and implementation.

peterhinch commented 6 months ago

Taking your points in order:

  1. I was thinking of IOT systems containing clients based on other software. We should declare any limitations on compatibility, such as lack of support for qos==2.
  2. OK.
  3. OK, I thought it worth mentioning. FWIW I don't remember qos==2 ever being requested. I did consider implementing it, testing using a local broker (mosquitto), but I'm entirely happy with leaving it unsupported.
  4. That sounds good.
  5. I mentioned the idea of a complete rewrite to provide context and to point out that there is a theoretical possibility that an official solution might emerge. I think it's unlikely that the maintainers will find the substantial amount of time required to do this. Meanwhile we have a working (if limited) solution, and upgrading it makes sense.

I'd be interested to hear details of your application (if you're in a position to divulge).

bobveringa commented 6 months ago

Alright, good to know your position on these things.

I'm happy to talk about the details of our application, just not publicly. Do you have somewhere to reach you privately, be it a Teams/voice call or just e-mail?

I understand if you don't want to give out your information publicly (I would also rather not). If you send an e-mail to info@smartfloor.com, your contact details should get to me (just mention my name in the e-mail).

peterhinch commented 4 months ago

FYI you might want to see https://github.com/peterhinch/micropython-mqtt/issues/132

bobveringa commented 3 months ago

Just an update on this. I've been very busy with other, more pressing tasks, but it is likely I will start to work on this over the next few months.

peterhinch commented 3 months ago

Thanks for the update.

bobveringa commented 3 months ago

I got bored over the Easter weekend and decided to give the implementation a go. Turns out getting most of it to work was not even that difficult. However, actually doing something with the new properties that are returned is difficult, which is why, for now, all the properties that come from the broker are ignored.

My current assessment is that it would be very difficult to do something with the properties returned while adhering to the other goals that this library has. But I'll also discuss this internally to see what the options are; maybe I am just overlooking something.

While what I have now works, it does not look pretty. I'll spend the next couple of days cleaning up the code a bit and finalizing some of the features. You can expect an initial PR somewhere this week (or next week at the latest).

peterhinch commented 2 months ago

One issue I mentioned earlier is the clean_init arg discussed here in the docs. I provided this because I believed it to be necessary in a microcontroller context, although the way it is implemented is hacky. When Kevin and I discussed V5 we were relieved to see that the new version provided an official means of achieving this behaviour.

I would like to be able to remove the hack, ideally without breaking code that sets clean_init. I would be grateful if you could give this some thought and suggest a good way forward (whenever you get some time, of course).

bobveringa commented 2 months ago

Ah, I totally forgot about it.

Ok, I think something can be done, but only for MQTTv5, as just by reading the spec, there is no way to get the same behavior without this hack on MQTTv3.1.1.

Looking into this, I did find an answer to another question I had. The question I had was why rename Clean Session to Clean Start.

In the MQTTv3.1.1 spec, it says:

After the disconnection of a Session that had CleanSession set to 0, the Server MUST store further QoS 1 and QoS 2...

But this line is omitted in the MQTTv5 spec. And the additional clarification makes it clear that the disconnect hack is no longer needed in MQTTv5.

The code could look something like this:

            is_clean = self._clean
            if not self._has_connected and self._clean_init and not self._clean:
                is_clean = True
            await self._connect(is_clean)

But it becomes considerably uglier to have both MQTTv5 support and MQTTv3.1.1 support in the same function, as you would need to do something like this.

            is_clean = self._clean
            if not self._has_connected and self._clean_init and not self._clean:
                if self.mqttv5:
                    is_clean = True
                else:
                    await self._connect(True)  # Connect with clean session
                    try:
                        async with self.lock:
                            self._sock.write(b"\xe0\0")  # Force disconnect but keep socket open
                    except OSError:
                        pass
                    self.dprint("Waiting for disconnect")
                    await asyncio.sleep(2)  # Wait for broker to disconnect
                    self.dprint("About to reconnect with unclean session.")
            await self._connect(is_clean)

Also, a bit of an update regarding the status of this project internally. We have an ongoing internal discussion about doing a complete ground-up rebuild for MQTTv5 support. We are still unsure about it because we like working with open-source software that is actually used: it carries far less risk, as more people are using it and more people are finding bugs. And we feel that splitting the community with our own thing is also not beneficial. For now, we will continue our development of MQTTv5 here, as a rebuild and redesign would take a lot more time, so we will try our best to get as many features as reasonable added here.

peterhinch commented 2 months ago

For the benefit of your internal discussions it's worth pointing out that achieving resilience took a lot of time. It involved tests including taking brokers and APs down, putting a running client into a Faraday cage, slowly moving a running client out of wireless range and then back in, and deliberately creating radio noise, all repeated on a variety of host hardware.

As an aside my original industrial experience was in radio hardware development. Some software engineers find it hard to grasp that WiFi is subject to the laws of physics rather than conforming to the magic of TCP/IP...

At some point I'd appreciate an overview of your results. How much extra code and RAM use is involved? Do you think it is necessary to maintain support for V3.1.1? If the following conditions can be met, maybe V3.1.1 can be abandoned:

bobveringa commented 2 months ago

For the benefit of your internal discussions it's worth pointing out that achieving resilience took a lot of time.

This is also a big part of the reason why we don't want to move away from this library. We consider the Wi-Fi to be some sort of magic transportation layer that is not bound by any rules (be they programming or physics), and that looking at it the wrong way can change the behavior.

At some point I'd appreciate an overview of your results. How much extra code and RAM use is involved?

For now, it is pretty minimal (apart from actually using properties as you need to allocate memory for it).

Do you think it is necessary to maintain support for V3.1.1?

From what I can tell, most brokers seem to support at least the basic features of MQTTv5. If support for v3.1.1 were to be dropped, it should be made clear to users who try to connect with the v5 client to a v3.1.1 broker why this is not working and what to do to fix it (either using a specific branch or something else).

According to the spec, this is the behavior on v3.1.1 brokers.

The Server MUST respond to the CONNECT Packet with a CONNACK return code 0x01 (unacceptable protocol level) and then disconnect the Client if the Protocol Level is not supported by the Server
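A client could turn that return code into a readable error. The sketch below is only an illustration: `check_connack_v311` and `CONNACK_ERRORS` are hypothetical names, and it assumes the broker answers with the 4 byte CONNACK format from the v3.1.1 spec.

```python
CONNACK_ERRORS = {
    1: "unacceptable protocol version (broker may not support MQTTv5)",
    2: "client identifier rejected",
    3: "server unavailable",
    4: "bad user name or password",
    5: "not authorized",
}


def check_connack_v311(resp):
    """Raise OSError with a readable message if a 3.1.1 CONNACK refuses us."""
    if len(resp) != 4 or resp[0] != 0x20 or resp[1] != 0x02:
        raise OSError("not a v3.1.1 CONNACK")
    code = resp[3]  # Byte 2 holds the acknowledge flags, byte 3 the return code
    if code:
        raise OSError(CONNACK_ERRORS.get(code, "connection refused, code %d" % code))


check_connack_v311(b"\x20\x02\x00\x00")  # Accepted: returns without raising
try:
    check_connack_v311(b"\x20\x02\x00\x01")
except OSError as exc:
    print(exc)
```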

bobveringa commented 2 months ago

I have opened a PR for MQTTv5 support: #139. It has been a lot busier than I anticipated, so I have not had the time for a full cleanup. But the general concepts are in place.

I am not 100% happy with what I have right now, but after discussing this internally, this seemed like the least bad way of doing it. But I am open to any feedback.