Open bobveringa opened 7 months ago
In principle yes. When @kevinkk525 was involved we discussed this but never performed a detailed study of what was involved. Have you assessed the size of the problem in terms of effort and code size?
I do remember noting that V5 addresses the clean_init problem - if a node is downed for a long period then powered up with clean=False it can be overwhelmed with data. clean_init was our ad hoc fix.
A possible technical issue is RAM usage on ESP8266.
For personal reasons the amount of time I have available is quite limited.
As of right now, we are yet to decide on the scope for this project, and are just in the exploratory phase. This means that we don't have an idea of code size, and I think I labeled the effort internally as "a lot".
There are still some key questions to answer, like:
So there is still a bit of work to do to figure all of that out. Any input you can provide on this is much appreciated.
I understand you have limited time available, and I wouldn't ask you to start spending a lot of time on this topic. If you could provide some feedback from your lessons learned and other MicroPython knowledge during the development, that should (hopefully) be sufficient.
I would be glad to provide support as you describe.
Point 1. is clearly a major one which cannot be decided until there is a clearer view of the other points.
Re RAM use, mqtt_as was initially developed for ESP8266 as the only viable WiFi enabled platform. It has insufficient RAM to compile the module, so it must either be precompiled or preferably frozen. RAM is more plentiful on more modern platforms but people still use ESP8266. One option might be to retain the current code as a "legacy" branch for ESP8266 while advancing the master for V5.
Okay, so I started this document out as a small, oh let's compare the 2 quickly and see what we get. And then well... I ended up with this. So, I'll summarize here and then have a very rough and general plan of what to do next.
It seems like it would be possible to support both 3.1.1 and v5 in the same class, with just a flag to indicate whether it is v5 or not. The reason is pretty simple: the spec authors tried their best to keep things similar enough to make this doable. Many of the changes are in the "variable header" of each packet type (both client to server and server to client), but all the v5 fields come after the 3.1.1 fields. This makes it fairly easy to add support to most methods. On the client processing side, however, we can no longer rely on messages being a fixed size, and we will need to use the size attribute (the Remaining Length field) to read the entire packet. That is not a big deal either, as the order of the fields hasn't changed.
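To make the "read the entire packet using the size attribute" point concrete, here is a minimal sketch of decoding the Remaining Length (a Variable Byte Integer) and splitting a packet on it. The function names `decode_varint` and `split_packet` are made up for illustration and are not part of mqtt_as; a real client would read from the socket rather than from a complete buffer.

```python
def decode_varint(buf, pos=0):
    """Return (value, next_pos) for the Variable Byte Integer at buf[pos]."""
    value = 0
    shift = 0
    for i in range(4):  # The spec allows at most 4 bytes
        byte = buf[pos + i]
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # Continuation bit clear: last byte
            return value, pos + i + 1
        shift += 7
    raise ValueError("malformed Variable Byte Integer")


def split_packet(buf):
    """Split one complete packet into (packet_type, flags, body)."""
    remaining, start = decode_varint(buf, 1)  # Length follows the first byte
    body = buf[start:start + remaining]
    if len(body) != remaining:
        raise ValueError("truncated packet")
    return buf[0] >> 4, buf[0] & 0x0F, body
```

With this shape, a 3.1.1 PUBACK (`b"\x40\x02\x00\x01"`) and a v5 PUBACK carrying a reason code (`b"\x40\x03\x00\x01\x10"`) go through the same code path; only the body length differs.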
The only thing that is really a lot different is the connect method. I think you can try to combine the functionality in a single method, but nobody would enjoy that. So this might have to be a method for 3.1.1 and v5 (or if we use inheritance, we can just override).
One thing is clear, I think: a lot of things are now carried as properties in the variable header. So we need a method that makes it easy to add these later. This is a bit difficult, as these properties change the length of the overall packet. When no properties are used this is not a problem, as you only need to send a single 0 byte.
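As a rough illustration of why properties change the packet length, the sketch below encodes a property block: a Variable Byte Integer length followed by the property bytes. The helper names and the choice of the two properties (Message Expiry Interval, identifier 0x02, and Response Topic, identifier 0x08) are only for the example; this is not the API proposed for the library.

```python
import struct


def encode_varint(value):
    """Encode an int as an MQTT Variable Byte Integer (1 to 4 bytes)."""
    if not 0 <= value <= 268435455:
        raise ValueError("value out of range for a Variable Byte Integer")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            byte |= 0x80  # More bytes follow
        out.append(byte)
        if not value:
            return bytes(out)


def encode_properties(message_expiry=None, response_topic=None):
    """Return the property length followed by the encoded properties."""
    body = bytearray()
    if message_expiry is not None:  # 0x02: Four Byte Integer
        body += b"\x02" + struct.pack("!I", message_expiry)
    if response_topic is not None:  # 0x08: UTF-8 Encoded String
        topic = response_topic.encode()
        body += b"\x08" + struct.pack("!H", len(topic)) + topic
    return encode_varint(len(body)) + bytes(body)
```

Calling `encode_properties()` with no arguments yields the single `b"\x00"` byte mentioned above.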
So, after all that, here is my very rough, very general and basic plan for dealing with this:
As for support of the various features: because most things are added in the variable header, if we come up with a clever way of exposing this to the user, we should be able to support most features, at least for publishing data. On the receiving side, I think the handler method would have to be changed to include some of these additional properties.
Based on how few brokers support the new Enhanced Authentication standard, I don't think we should worry about supporting it at this time. If this library is upgraded to the v5 standard, people who are advanced enough to be using those features can implement that themselves, and hopefully, contribute those changes back.
@peterhinch What do you think of this plan? You can also read my analysis below, maybe you have a different take than I do with how to proceed.
For this comparison, I used the following versions of the spec:
The goal of this analysis is to compare the specifications and see what changes need to be made to add support for MQTTv5.
Most of the MQTT control packet types have remained the same. With 2 notable changes:
DISCONNECT packet type is now both Client to Server and Server to Client, instead of only Client to Server. This means that support for this should be added to the client.
AUTH was introduced, this is both Client to Server and Server to Client. This is to support the newly introduced "Enhanced authentication" part of the MQTT specification. See section 4.12 of the v5 spec for more info.
Most of the CONNECT packet headers stay the same. They did introduce "CONNECT properties" which are encoded in the variable header. This is a breaking change, as you either need to include properties or specify that there are no properties (with a 0 byte).
The properties include:
There are also changes to the CONNECT payload. Most of these changes boil down to giving the will of the device the same features that PUBLISH has. But this makes the payload different from the v3.1.1 spec in a way that is difficult to handle, I think.
MQTTv5 introduced some changes to the PUBLISH packet. They are known in the spec as "PUBLISH properties" and they are part of the variable header for the packet. This seems to be a breaking change, as you must either include publish properties or explicitly specify that there are no properties. The paho.mqtt implementation of this is here. The example in the v5 spec is figure 3-9 in section 3.3.2.3.
The "publish properties" include a lot of the new features of the spec:
An interesting thing to note is the way topic aliases are handled. They are not a separate action that is executed; instead, when publishing a message, you can include a topic alias (2 byte integer). In subsequent messages, the topic can be omitted and only the alias needs to be provided. However, this means that to have a guarantee that a topic alias is set, you need to be transmitting with at least QoS=1. Another interesting part is how distributed brokers handle these situations. The broker we use (AWS IoT Core) is highly distributed and the order of messages is "best effort". Given that the consequence of sending an invalid topic alias is the immediate termination of the connection by the broker (section 3.3.4 PUBLISH Actions), it is best for clients to tread lightly when using this feature. Additionally, topic aliases only apply for the duration of the NETWORK connection, not the session, so if you lose the connection the aliases are lost.
But I don't think we need to think about that too much with the implementation. This library can essentially just serve as a wrapper for the protocol, and allow the user to specify the topic alias without any further processing.
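For reference, this is roughly what "just a wrapper for the protocol" could look like on the wire for a QoS 0 PUBLISH with an optional Topic Alias property (identifier 0x23, Two Byte Integer). The function `publish_qos0` is a hypothetical helper written for this comment, not existing mqtt_as code, and it deliberately does no bookkeeping of which aliases have been set.

```python
import struct


def _varint(n):
    """Encode n as an MQTT Variable Byte Integer."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            byte |= 0x80
        out.append(byte)
        if not n:
            return bytes(out)


def publish_qos0(topic, payload, topic_alias=None):
    """Build a v5 PUBLISH packet (QoS 0, no retain) as bytes."""
    t = topic.encode()
    props = b""
    if topic_alias is not None:
        props = b"\x23" + struct.pack("!H", topic_alias)
    # Variable header: topic name, then property length and properties
    var_header = struct.pack("!H", len(t)) + t + _varint(len(props)) + props
    remaining = len(var_header) + len(payload)
    return b"\x30" + _varint(remaining) + var_header + payload
```

The first message would be sent as `publish_qos0("a/b", b"hi", topic_alias=1)`; later ones on the same network connection could use `publish_qos0("", b"hi", topic_alias=1)`, leaving it to the user to decide whether that is safe with their broker.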
When processing PUBLISH packets, there is no need to support topic aliases, and this can be indicated when connecting to the broker. So there should be only a little additional work on the processing side of this library (only the work in supporting publish properties).
The PUBACK packet has been expanded to include 2 new features: reason codes and "PUBACK properties". They are appended to the variable header after the packet ID MSB and LSB; see section 3.4.2 PUBACK Variable Header.
We could add support for the reason codes, but I don't really have a clear picture of where this information would go (other than a debug print). And I don't think I can fully test most of these features, as the AWS IoT Core broker is not that forgiving with most of these packets. For example, when you are not authorized to publish, the broker just immediately terminates the connection. But adding a debug print could be useful, and wouldn't be too much work.
There is also a reason string (a full UTF-8 string with a reason) and User Property field(s). But I'd say that for now we can ignore these features and keep things simple.
As for work on the library: currently, after a PUBACK only 2 bytes of the variable header are read (PID MSB and LSB). This needs to be modified so that the entire packet is read, as its length can now vary depending on which properties are, or are not, included.
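A small sketch of what handling the longer PUBACK could look like once the whole variable header has been read into a buffer. `parse_puback` is a made-up name; it assumes the caller has already read exactly "remaining length" bytes, and it skips the properties entirely, in line with keeping things simple for now.

```python
def parse_puback(body):
    """Parse a PUBACK variable header; return (pid, reason_code)."""
    if len(body) < 2:
        raise ValueError("PUBACK too short")
    pid = (body[0] << 8) | body[1]
    # v5 may omit the reason code when it is 0x00 (Success) and there
    # are no properties, so a 2 byte body is still valid.
    reason = body[2] if len(body) > 2 else 0x00
    return pid, reason
```

The returned reason code could then simply be passed to a debug print, for example when it is 0x10 (No matching subscribers).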
Much like the PUBLISH packet, the SUBSCRIBE packet was also expanded with additional properties, known in the spec as "SUBSCRIBE Properties"; see section 3.8.2.1 SUBSCRIBE Properties. These properties are, just like the "publish properties", appended at the end of the variable header. This means that when subscribing as a v5 client, an additional byte needs to be sent that says that there are no properties.
The "subscribe properties" are:
Subscription Identifiers seem like they have a purpose. I think that when you pass this property, the publish packet the client receives might include this subscription identifier. You can then check the subscription identifier rather than trying to match the topic strings.
The v5 spec also introduces "Subscription Options" which are additional settings in what used to be the "Requested QoS" byte, see 3.8.3.1 Subscription Options.
The "subscription options" now contain the following settings:
Option 1 remains unchanged from previous versions. Retain As Published might be useful, and given that it is just a single bit at subscribe time, it doesn't hurt to implement it. Retain Handling seems the most useful here. It allows the client to say what it wants done with retained messages (0=normal, 1=send only if not subscribed already, 2=don't send).
However, all the changes to the subscribe payload body are non-breaking. So leaving all these options at 0 is the same as current behavior.
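To show why leaving everything at 0 matches current behavior, here is a sketch of packing the Subscription Options byte per section 3.8.3.1: bits 0-1 are the maximum QoS, bit 2 is No Local, bit 3 is Retain As Published and bits 4-5 are Retain Handling. The function name and keyword arguments are assumptions for the example only.

```python
def subscription_options(qos=0, no_local=False,
                         retain_as_published=False, retain_handling=0):
    """Pack the v5 Subscription Options into a single int (one byte)."""
    if qos not in (0, 1, 2):
        raise ValueError("qos must be 0, 1 or 2")
    if retain_handling not in (0, 1, 2):
        raise ValueError("retain_handling must be 0, 1 or 2")
    return (qos
            | (int(bool(no_local)) << 2)
            | (int(bool(retain_as_published)) << 3)
            | (retain_handling << 4))
```

With only `qos` set, the result is identical to the 3.1.1 "Requested QoS" byte, so existing subscribe calls would not need to change.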
The SUBACK packet is much the same story as PUBACK. They added new SUBACK properties, but these can all be ignored. They re-used the reason code they were already using to indicate what QoS the broker subscribed with.
There are changes in the variable header, so this needs to be handled in the same way as PUBACK: just making sure we read the entire packet, so things don't break.
The UNSUBSCRIBE packet mostly stays unchanged. However, they did add support for "UNSUBSCRIBE properties"; these are similar to those of SUBSCRIBE and PUBLISH. I think there is something wrong with the spec, as it doesn't include a figure that shows the variable header, and the figures jump from 3.28 to 3.30. But given the description present in the v5 spec, you just need to send an additional byte indicating that there are no properties.
The UNSUBACK packet follows much the same story as the rest of the packets. They added User Properties and reason codes. The packet does have a payload now. But like other things, the PID is still in the first 2 bytes. The rest of the packet just needs to be read, but doesn't really need to be processed for an initial version.
No changes to these packets (Yay!)
The format of the DISCONNECT packet has not been changed and is fully compatible. There are some extensions in v5. For example, it is now possible for both the client and server to specify a reason for disconnecting. The "normal disconnection" is reason 0x00, which is the same value that the 3.1.1 requires. The full list of reason codes is listed in section 3.14.2.1 of the spec.
The server can now also send this packet, so this does need to be implemented. But it is a simple packet, so that is doable.
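As a sketch of how little is needed to handle a server-sent DISCONNECT, the snippet below maps the reason code to text for a debug print. Only a handful of codes from section 3.14.2.1 are included, and `disconnect_reason` is a hypothetical helper that expects the packet body after the fixed header.

```python
# A few reason codes from section 3.14.2.1; the full table is longer.
DISCONNECT_REASONS = {
    0x00: "Normal disconnection",
    0x87: "Not authorized",
    0x8B: "Server shutting down",
    0x8D: "Keep Alive timeout",
    0x8E: "Session taken over",
}


def disconnect_reason(body):
    """Return (code, text) for a DISCONNECT body (may be empty)."""
    # An empty body means reason 0x00 with no properties.
    code = body[0] if body else 0x00
    return code, DISCONNECT_REASONS.get(code, "Reason 0x%02X" % code)
```

After logging the reason, the client would presumably treat the event like any other lost connection and go through its normal reconnect path.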
The AUTH packet is newly introduced in the v5 spec; it is part of the "Enhanced Authentication" feature. This article by HiveMQ is pretty good.
This packet does not need to be implemented unless we want to add support for enhanced authentication.
That is a very detailed review - it looks like a big job. A few broad-brush comments.
qos==2?
For the re-write (if this is something for a later date) I can start collecting data from the fleet of devices we have deployed. Some are operating in truly horrific Wi-Fi conditions, which may produce interesting results. I intended to add this data collection to our fleet anyway, just to help with support at the customer locations. So, if there is an interest in specific things that produce insights, I can see if it is possible to add those.
If there is interest in a re-write and I somehow find the time to properly look into that, I would like to help with the design and implementation. There are limitations we have run into with the current implementation, but they are minor enough that they don't warrant the engineering time spent on improving them. However, if it ever comes to a redesign, then we can allocate the required resources to investigate what these limitations are exactly, and maybe some possible solutions.
As for how to continue the MQTTv5 implementation. I think I have enough information on how to proceed at this time. I have some meetings next week that should (hopefully) provide me some insight into when there is enough space on the engineering calendar to start on the design and implementation.
Taking your points in order:
qos==2.
qos==2 ever being requested. I did consider implementing it, testing using a local broker (mosquitto), but I'm entirely happy with leaving it unsupported.
I'd be interested to hear details of your application (if you're in a position to divulge).
Alright, good to know your position on these things.
I'm happy to talk about the details of our application. Just not publicly. Do you have somewhere to reach you privately, be it a teams/voice call or just e-mail?
I understand if you don't want to give out your information publicly (I would also rather not). If you send an e-mail to info@smartfloor.com, your contact details should get to me (just mention my name in the e-mail).
FYI you might want to see https://github.com/peterhinch/micropython-mqtt/issues/132
Just an update on this. I've been very busy with other, more pressing tasks, but it is likely I will start to work on this over the next few months.
Thanks for the update.
I got bored over the Easter weekend and decided to give the implementation a go. Turns out getting most of it to work was not even that difficult. However, actually doing something with the new properties that are returned is difficult, which is why, for now, all the properties that come from the broker are ignored.
My current assessment is that it would be very difficult to do something with the properties returned while adhering to the other goals that this library has. But I'll also discuss this internally to see what the options are; maybe I am just overlooking something.
While what I have now works, it does not look pretty. I'll spend the next couple of days cleaning up the code a bit and finalizing some of the features. You can expect an initial PR somewhere this week (or next week at the latest).
One issue I mentioned earlier is the clean_init arg discussed here in the docs. I provided this because I believed it to be necessary in a microcontroller context, although the way it is implemented is hacky. When Kevin and I discussed V5 we were relieved to see that the new version provided an official means of achieving this behaviour.
I would like to be able to remove the hack, ideally without breaking code that sets clean_init. I would be grateful if you could give this some thought and suggest a good way forward (whenever you get some time, of course).
Ah, I totally forgot about it.
Ok, I think something can be done, but only for MQTTv5, as just by reading the spec, there is no way to get the same behavior without this hack on MQTTv3.1.1.
Looking into this, I did find an answer to another question I had: why rename Clean Session to Clean Start?
In the MQTTv3.1.1 spec, it says:
After the disconnection of a Session that had CleanSession set to 0, the Server MUST store further QoS 1 and QoS 2...
But this line is omitted in the MQTTv5 spec. And the additional clarification makes it clear that the disconnect hack is no longer needed in MQTTv5.
The code could look something like this:
is_clean = self._clean
if not self._has_connected and self._clean_init and not self._clean:
    is_clean = True
await self._connect(is_clean)
But it becomes considerably uglier to have both MQTTv5 support and MQTTv3.1.1 support in the same function, as you would need to do something like this.
is_clean = self._clean
if not self._has_connected and self._clean_init and not self._clean:
    if self.mqttv5:
        is_clean = True
    else:
        await self._connect(True)  # Connect with clean session
        try:
            async with self.lock:
                self._sock.write(b"\xe0\0")  # Force disconnect but keep socket open
        except OSError:
            pass
        self.dprint("Waiting for disconnect")
        await asyncio.sleep(2)  # Wait for broker to disconnect
        self.dprint("About to reconnect with unclean session.")
await self._connect(is_clean)
Also, a bit of an update regarding the status of this project internally. We have an ongoing internal discussion about doing a complete ground-up rebuild for MQTTv5 support. We are still unsure about it because we like working with open-source software that is actually used: it carries far less risk, as more people are using it and more people are finding bugs. And we feel that splitting the community with our own thing is also not beneficial. For now, we will continue our development of MQTTv5 here, as a rebuild and redesign would take a lot more time, so we will try our best to get as many features as is reasonable added here.
For the benefit of your internal discussions it's worth pointing out that achieving resilience took a lot of time. It involved tests including taking down brokers and APs, putting a running client into a Faraday cage, slowly moving a running client out of wireless range and then back in, deliberately creating radio noise, and repeating all of this on a variety of host hardware.
As an aside my original industrial experience was in radio hardware development. Some software engineers find it hard to grasp that WiFi is subject to the laws of physics rather than conforming to the magic of TCP/IP...
At some point I'd appreciate an overview of your results. How much extra code and RAM use is involved? Do you think it is necessary to maintain support for V3.1.1? If the following conditions can be met, maybe V3.1.1 can be abandoned:
> For the benefit of your internal discussions it's worth pointing out that achieving resilience took a lot of time.
This is also a big part of the reason why we don't want to move away from this library. We consider the Wi-Fi to be some sort of magic transportation layer that is not bound by any rules (be they programming or physics), and that looking at it the wrong way can change the behavior.
> At some point I'd appreciate an overview of your results. How much extra code and RAM use is involved?
For now, it is pretty minimal (apart from actually using properties as you need to allocate memory for it).
> Do you think it is necessary to maintain support for V3.1.1?
From what I can tell from the various brokers, most seem to support at least the basic features of MQTTv5. If support for v3.1.1 were to be dropped, it should be made clear to users who try to connect with the v5 client to a v3.1.1 broker why this is not working and what to do to fix it (either using a specific branch or something else).
According to the spec, this is the behavior on v3.1.1 brokers.
The Server MUST respond to the CONNECT Packet with a CONNACK return code 0x01 (unacceptable protocol level) and then disconnect the Client if the Protocol Level is not supported by the Server
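One way to make that failure clear to users would be to translate the CONNACK code into a message. This is only a sketch: `explain_connack` is a hypothetical helper, the wording of the messages is mine, and it assumes the return/reason code byte has already been extracted from the CONNACK. A v3.1.1 broker replies with 0x01 as quoted above, while a v5 broker that rejects the version uses reason code 0x84 (Unsupported Protocol Version).

```python
def explain_connack(code):
    """Return None if the connection was accepted, else a hint string."""
    if code == 0x00:
        return None  # Connection accepted
    if code == 0x01:
        # v3.1.1 style reply: unacceptable protocol level
        return ("Broker refused the protocol level; "
                "it probably does not support MQTTv5")
    if code == 0x84:
        # v5 style reply: Unsupported Protocol Version
        return "Broker reports Unsupported Protocol Version"
    return "Connection refused, code 0x%02X" % code
```

The hint could be raised in an OSError or shown via the library's debug print, whichever fits best with how connection failures are reported today.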
I have opened a PR for MQTTv5 support: #139. It has been a lot busier than I anticipated, so I have not had the time for a full cleanup. But the general concepts are in place.
I am not 100% happy with what I have right now, but after discussing this internally, this seemed like the least bad way of doing it. But I am open to any feedback.
We've been using this library for a while and have been really happy with the performance and usage. However, we recently took a look at how we could improve some of our cloud communication and discovered a need for MQTTv5 features like request-response, topic aliases and some of the expiry features.
From our perspective, it seems like we will need to develop these features anyway, but we would like to contribute them back to the community. Given that this is (in our opinion) the go-to library for MicroPython MQTT support, it makes sense to work together to add support here.
Are you interested in adding support for MQTTv5? Our implementation could probably benefit from your experience with the current library, which would be beneficial for everyone using it.