mqttjs / mqttjs-v5

Development repository for an overhaul of the MQTT.js client.

v.next architectural considerations #2

BertKleewein opened this issue 2 years ago

EDIT (@vishnureddy17): As we continue with our vNext work, our thoughts on some of these topics have changed. As such there are corrections and updates in this thread, and some of the earlier posts in the thread may be outdated.

We had a few architectural discussions today. I’m writing them here for posterity and tracking.

Message Stores as a separate component.

There are at least two different places we need to store messages. One is “in-flight” messages; the other is “unable-to-send” messages. These need to be two different stores. Deciding what to do with “in-flight” messages (e.g., re-publish, time out, etc.) is the responsibility of mqtt.js. Deciding what to do with “unable-to-send” messages (e.g., fail, persist to disk, keep in RAM and retry when connected) is more complex because different customers have different needs. We should support custom message stores for “unable-to-send” messages.
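To make the custom-store idea concrete, here is a minimal sketch of what a pluggable “unable-to-send” store could look like. All names here (`OfflineMessageStore`, `put`, `take`, and so on) are hypothetical, and the methods are async so that disk-backed implementations are possible:

```typescript
// Hypothetical sketch of a pluggable store for "unable-to-send" messages.
interface PublishPacket {
    topic: string;
    payload: Buffer;
    qos: 0 | 1 | 2;
}

interface OfflineMessageStore {
    // Called when a publish cannot be attempted (e.g., no connection).
    put(packet: PublishPacket): Promise<void>;
    // Called on reconnect to drain queued messages in order.
    take(): Promise<PublishPacket | undefined>;
    // Lets a store cap memory or disk usage.
    count(): Promise<number>;
}

// A trivial in-memory implementation; a user with stricter durability
// requirements could instead persist to disk.
class InMemoryStore implements OfflineMessageStore {
    private queue: PublishPacket[] = [];
    async put(packet: PublishPacket): Promise<void> { this.queue.push(packet); }
    async take(): Promise<PublishPacket | undefined> { return this.queue.shift(); }
    async count(): Promise<number> { return this.queue.length; }
}
```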

Architecting for QOS-2.

QOS-0 is easy, QOS-1 is a small addition, and QOS-2 is significantly more complex. We should figure out how we’re going to handle QOS-2 while we’re working on QOS-1. My intuition says that you might be able to shoehorn QOS-1 into the core transport, but QOS-2 is complex enough that it warrants a separate state machine to keep track of packet states. If you try to intermingle QOS-2 logic with other code, it will quickly turn into spaghetti.
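To illustrate why QOS-2 warrants its own state machine: even the sender side alone has to track several acknowledgement states per packet ID (the spec flow is PUBLISH → PUBREC → PUBREL → PUBCOMP). A rough sketch, not a proposed API:

```typescript
// Sender-side QOS-2 states for a single packet ID.
enum QoS2SenderState {
    PublishSent, // PUBLISH sent, waiting for PUBREC
    PubrelSent,  // PUBREC received, PUBREL sent, waiting for PUBCOMP
    Complete,    // PUBCOMP received; the packet ID can be released
}

// Transition on an incoming acknowledgement packet.
function onAck(state: QoS2SenderState, cmd: 'pubrec' | 'pubcomp'): QoS2SenderState {
    if (state === QoS2SenderState.PublishSent && cmd === 'pubrec') {
        return QoS2SenderState.PubrelSent; // and send PUBREL on the wire
    }
    if (state === QoS2SenderState.PubrelSent && cmd === 'pubcomp') {
        return QoS2SenderState.Complete;
    }
    throw new Error(`unexpected ${cmd} in state ${QoS2SenderState[state]}`);
}
```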

Re-publishing with the same MID.

For QOS-1 and QOS-2, if you need to re-publish a packet, you need to use the same MID. The current MQTT.JS doesn’t seem to support this, so you can get into a cascading failure scenario where PUBACK packets come a little late, and the code continuously re-publishes with new MIDs and ignores the PUBACKs with older MIDs. This realization has serious implications for design.

Efficient use of timers.

If MQTT.JS code keeps track of timeouts on a per-packet basis, there could potentially be 2^16 timers going. Instead of using an individual timer for each packet, we need a single timer that handles all of these packets. This timer could use a fixed period (e.g., fire once a second and handle all of the expirations that occurred in that one-second period) or a variable period (e.g., keep track of the next expiration, then re-arm with the expiration after that when the first one fires).
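For illustration, a minimal sketch of the variable-period approach, assuming a list of pending expirations kept sorted by deadline and a single Node.js timer (all names hypothetical):

```typescript
// One shared timer for all pending expirations, re-armed for the earliest deadline.
type Expiration = { deadline: number; onExpire: () => void };

class SharedTimer {
    private pending: Expiration[] = []; // kept sorted by deadline
    private timer?: NodeJS.Timeout;

    add(exp: Expiration): void {
        this.pending.push(exp);
        this.pending.sort((a, b) => a.deadline - b.deadline);
        this.rearm();
    }

    private rearm(): void {
        if (this.timer) clearTimeout(this.timer);
        const next = this.pending[0];
        if (!next) return;
        this.timer = setTimeout(() => {
            const now = Date.now();
            // Fire everything that has expired, then re-arm for the next deadline.
            while (this.pending.length > 0 && this.pending[0].deadline <= now) {
                this.pending.shift()!.onExpire();
            }
            this.rearm();
        }, Math.max(0, next.deadline - Date.now()));
    }
}
```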

External transports.

The core mqtt.js package should include some set of standard transport implementations (e.g., TCP, TCP-TLS, websockets). Other transports that exist in the current version should be supported via some sort of “pluggable transport interface”. It should not be necessary to update the core library to add a new transport.
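One possible shape for such a pluggable transport interface (hypothetical names); the core idea is that the client only ever needs a duplex byte stream plus connect/close semantics:

```typescript
import { Duplex } from 'stream';

// Hypothetical pluggable transport: the core library speaks MQTT over any
// duplex byte stream and never needs to know how the bytes are carried.
interface Transport {
    // Resolves with a duplex stream carrying raw MQTT bytes.
    connect(): Promise<Duplex>;
    close(): Promise<void>;
}

// A new transport (serial, QUIC, ...) could then live in its own package:
//   const client = new Client({ transport: new MyCustomTransport(options) });
```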

Using “this is more than P0 or P1” as a hint to make a separate component.

Many of the features, such as offline message store and QOS-2, are not included in initial releases because they’re “less important” or “not part of core functionality”. We should use this as an indicator that we might want this functionality separated out of the code and moved into a separate module (either a separate .ts file or a different package).

@mqtt namespace.

We already have mqtt-packet and mqtt.js packages. If we start adding more packages, we should consider namespacing.

vishnureddy17 commented 2 years ago

Thanks for this, @BertKleewein!

Regarding:

Many of the features, such as offline message store and QOS-2, are not included in initial releases because they’re “less important” or “not part of core functionality”. We should use this as an indicator that we might want this functionality separated out of the code and moved into a separate module (either a separate .ts file or a different package).

Keep in mind that you can mix different QoS levels with the same MQTT connection for different operations. It's hard for me to imagine how we'd handle this while separating out a QoS 2 specific module.

BertKleewein commented 2 years ago

Another discussion today. Some notes on important things:

What happens when connection drops (socket error or no pingresp)?

We don’t want to automatically reconnect (at the level of the Client object at least) for reasons that I'll outline below.

If we take this as a given, there are two things to consider about handling dropped connections.

  1. What happens to the Client object? Does it become somehow "invalid", forcing the caller to create a new Client object before reconnecting? Or is it marked as "disconnected", requiring the caller to call a reconnect method? We can't forget that reconnecting will require re-auth, and this might force us to go down the "new Client object" route.
  2. What happens to messages that are "in flight"? If the caller calls publish and the connection drops, does that publish call fail or does it somehow remain "in-flight"? This may not matter for QOS-1, where we can just re-send, but for QOS-2, we may need to keep "in-flight" status across connections.

    • (If the connection drops after we send PUBREL and before we receive PUBCOMP, we don't know if the message has been delivered or not. If we can keep this status across connections, we can re-send the PUBREL with the same messageId until we get a PUBCOMP. If not, we have to fail with an error that says "the message may or may not have been delivered".)

Session state may be different from Client object lifetime

If the user needs to create a new Client object when the connection drops, then we need some concept of session state that is separate from the client. Otherwise we can't properly support cleanSession == false. Part of this session state may be the list of in-flight messages, for some of the reasons discussed in "What happens when connection drops" above.

Need for external reconnect logic

We do not want to force the caller to accept our logic for "when to reconnect after a dropped connection". Opening a new socket may be a cheap operation (if the code is running in a data center, for instance) or it may be a very expensive operation (if the code needs to "make a call" or "establish an uplink"). I think we want to establish a "default policy" for reconnect, but give the caller a way to override this. This policy should be external to the client object. I have some ideas about how we might do this.
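A minimal sketch of what an external policy could look like, with hypothetical names; the default below is just one way the library might implement backoff:

```typescript
// Hypothetical external reconnect policy: after a drop, the client asks the
// policy when (or whether) to attempt to reconnect; the caller can supply
// their own implementation.
interface ReconnectPolicy {
    // Returns a delay in milliseconds before the next attempt,
    // or null to give up and surface the error to the caller.
    nextDelay(attempt: number, lastError: Error): number | null;
}

// A possible default: capped exponential backoff, giving up after 10 tries.
const defaultPolicy: ReconnectPolicy = {
    nextDelay: (attempt) => (attempt > 10 ? null : Math.min(30_000, 100 * 2 ** attempt)),
};
```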

Message store tied to client lifetime and reconnect

When we decide how message store works, we need to factor in a few things:

  1. We need the store to cover at least three cases:
    i. publish was attempted and failed,
    ii. publish was not attempted because the connection was not available, and
    iii. publish was in-flight when something catastrophic happened (power loss, app exit).
  2. Different callers have different store requirements. Some callers will want to persist messages locally before we try to send them. In this case, we must store the message before every publish and remove it from the store after the publish succeeds (see the sketch below).
  3. The message store implementation needs to be external to the Client implementation. How does the user pass one in? If the user passes in a store that has messages, how do we drain that store?
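A minimal sketch of the "persist before send" flow from point 2, with entirely hypothetical store and client APIs:

```typescript
// All of these interfaces are placeholders, not a proposed API.
interface Packet { topic: string; payload: Buffer }
interface Store {
    put(p: Packet): Promise<void>;
    remove(p: Packet): Promise<void>;
}
interface Client { publish(p: Packet): Promise<void> }

async function publishWithStore(client: Client, store: Store, packet: Packet): Promise<void> {
    await store.put(packet);      // persist first, so a crash cannot lose the message
    await client.publish(packet); // attempt the publish (throws on failure)
    await store.remove(packet);   // drop it only after the publish succeeded
}
```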

vishnureddy17 commented 2 years ago
  • (If the connection drops after we send PUBREL and before we receive PUBCOMP, we don't know if the message has been delivered or not. If we can keep this status across connections, we can re-send the PUBREL with the same messageId until we get a PUBCOMP. If not, we have to fail with an error that says "the message may or may not have been delivered".)

I think we can start the sequence all over again by using the dup flag.

  1. We need the store to cover at least three cases:

    i. publish was attempted and failed,
    ii. publish was not attempted because the connection was not available, and
    iii. publish was in-flight when something catastrophic happened (power loss, app exit).

Maybe I'm missing something here, but in case i., the user was presented with an error. At that point, the user is aware and has the agency to try it again. So would we need to keep it in the offline store?

vishnureddy17 commented 2 years ago

Some thoughts on packet sequencing

We believe that we are going to use the "packet sequence" model to inform our client implementation. Here's the PR which shows off the basic idea. Shout out to @BertKleewein for driving this.

TL;DR: We keep a container of state machines, where each state machine represents an in-flight MQTT operation. When a packet comes in or goes out, it will trigger a state transition for its corresponding operation. This allows us to make sense of all the packets going in and out and keep the code paths for different operations isolated.
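A minimal sketch of the container idea, with hypothetical names:

```typescript
// Each in-flight MQTT operation is a small state machine, looked up by a
// key when a packet comes in or goes out.
interface Sequence {
    // Feed a packet to the state machine; returns true when the sequence
    // has reached a terminal state and can be removed from the container.
    onPacket(packet: { cmd: string }): boolean;
}

class Sequencer {
    private inFlight = new Map<string | number, Sequence>();

    start(key: string | number, seq: Sequence): void {
        this.inFlight.set(key, seq);
    }

    dispatch(key: string | number, packet: { cmd: string }): void {
        const seq = this.inFlight.get(key);
        if (!seq) throw new Error(`no in-flight sequence for key ${key}`);
        if (seq.onPacket(packet)) this.inFlight.delete(key);
    }
}
```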

Client-initiated sequences

Some operations are initiated by the client (what I'm calling client-initiated sequences), while others are initiated by the server (server-initiated sequences). This post mainly concerns client-initiated sequences. My thinking is that there are six control packet types that can kick off a client-initiated sequence:

  1. CONNECT
  2. DISCONNECT
  3. PINGREQ
  4. SUBSCRIBE
  5. UNSUBSCRIBE
  6. PUBLISH (branching out into different cases depending on the QoS level)

Three main types of client-initiated sequences

I believe we can break up client-initiated sequences into three major cases. I will call them fire-and-forget, non-overlapping, and overlapping sequences.

Fire-and-forget

Fire-and-forget sequences are sequences where the client sends a single packet and doesn't care about any other packet with respect to that operation. In other words, they are singleton packet sequences. There are two fire-and-forget sequences:

  1. QoS 0 PUBLISH
  2. DISCONNECT

These operations are simple enough that we may not even need to store the state machine in the container.

Non-overlapping

Non-overlapping sequences are sequences where no other sequence of the same initial control packet type can be in-flight at the same time:

  1. CONNECT
  2. PINGREQ

Since these sequences are non-overlapping, we don't need unique numerical keys to store them; we can just use the name of the control packet type as the key instead.

Overlapping

Overlapping sequences are sequences where multiple instances are allowed to be in-flight at the same time. As such, they need unique identifiers, and the Packet ID of the initial packet in the sequence can serve this purpose. In other words, a client-initiated sequence is overlapping if and only if its initial packet has a packet ID (see the sketch after this list):

  1. SUBSCRIBE
  2. UNSUBSCRIBE
  3. QoS 1 and 2 PUBLISH
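A sketch of how the three cases could map onto container keys, using mqtt-packet-style cmd names (the function itself is hypothetical):

```typescript
// Derive a container key for a client-initiated sequence, or undefined for
// fire-and-forget sequences that never need to be stored.
function sequenceKey(packet: { cmd: string; messageId?: number }): string | number | undefined {
    switch (packet.cmd) {
        case 'connect':
        case 'pingreq':
            return packet.cmd;       // non-overlapping: at most one in flight, keyed by type
        case 'subscribe':
        case 'unsubscribe':
            return packet.messageId; // overlapping: keyed by packet ID
        case 'publish':
            return packet.messageId; // QoS > 0: packet ID; QoS 0: undefined (fire-and-forget)
        default:
            return undefined;        // e.g. 'disconnect' is fire-and-forget
    }
}
```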

YoDaMa commented 2 years ago

@vishnureddy17 your comment all looks good. The only consideration is that I'm trying to understand how Will Messages during disconnect affect the sequencing of packets... I don't think it will contradict what you are currently saying, though.

BertKleewein commented 2 years ago

@vishnureddy17, another non-overlapping sequence in MQTT 5 is the AUTH exchange used when re-authenticating. From what I can tell, this is like a CONNECT exchange, except without the CONNECT packet being sent. This doesn't affect your design, but it will affect the code around the sequencer. For example, I'm not sure if you can publish or subscribe between CONNECT and CONNACK, but you can continue to do these things while a re-authentication AUTH exchange is in flight.

vishnureddy17 commented 2 years ago

@BertKleewein Good to know. I just assumed that AUTH packets were always preceded by a CONNECT.

vishnureddy17 commented 2 years ago

Revisiting our previous thinking

After a more careful reading of the MQTT spec and discussion with @BertKleewein, it seems like the following was misguided:

Re-publishing with the same MID.

For QOS-1 and QOS-2, if you need to re-publish a packet, you need to use the same MID. The current MQTT.JS doesn’t seem to support this, so you can get into a cascading failure scenario where PUBACK packets come a little late, and the code continuously re-publishes with new MIDs and ignores the PUBACKs with older MIDs. This realization has serious implications for design.

Efficient use of timers.

If MQTT.JS code keeps track of timeouts on a per-packet basis, there could potentially be 2^16 timers going. Instead of using an individual timer for each packet, we need a single timer that handles all of these packets. This timer could use a fixed period (e.g., fire once a second and handle all of the expirations that occurred in that one-second period) or a variable period (e.g., keep track of the next expiration, then re-arm with the expiration after that when the first one fires).

In particular, the MQTT spec seems to forbid retrying operations (quote taken from MQTTv5 spec, but a similar statement is in v3.1.1):

When a Client reconnects with Clean Start set to 0 and a session is present, both the Client and Server MUST resend any unacknowledged PUBLISH packets (where QoS > 0) and PUBREL packets using their original Packet Identifiers. This is the only circumstance where a Client or Server is REQUIRED to resend messages. Clients and Servers MUST NOT resend messages at any other time [MQTT-4.4.0-1]

(emphasis mine)

Now that we know that we shouldn't retry within a connection, we no longer need to be concerned about timer efficiency. We only need a timer for the keepalive mechanism. (There might be some other places where a timer is needed, but the number of timers will nevertheless be small.)

vishnureddy17 commented 2 years ago

Some thoughts on error checking and validation

Checking that individual packets are well-formed and respect the MQTT specification

For checking individual packets, this should be done by mqtt-packet, which is one of the core dependencies of MQTT.js and a package that the maintainers of MQTT.js can update as needed. An error would be raised here if either we receive an invalid packet from the server (either maliciously or because of a bug in the server implementation) or the MQTT.js client attempts to send an invalid packet to the server (which should only happen due to a bug in the client implementation). In either case, we should immediately shut down the connection.
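As a rough sketch of the wiring: mqtt-packet's parser emits an error event on malformed input, which could be tied directly to a connection shutdown (the socket plumbing here is schematic):

```typescript
import * as mqtt from 'mqtt-packet';
import { Socket } from 'net';

function attachParser(socket: Socket): void {
    const parser = mqtt.parser({ protocolVersion: 5 });

    parser.on('packet', (packet) => {
        // Well-formed packet: hand it to the sequencer / upper layers.
    });

    parser.on('error', (err) => {
        // Invalid packet from the server: immediately shut down the connection.
        socket.destroy(err);
    });

    socket.on('data', (chunk) => parser.parse(chunk));
}
```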

Checking that packets make sense in context

It is possible that an individual packet is valid when looked at in isolation but problematic in context (for example, a PUBACK whose packet ID does not match any in-flight PUBLISH).

Such problematic packets should be detected by the packet sequencer. If the packet in question is being sent to the MQTT.js client by the server, we should immediately shut down the connection. Otherwise, if the packet is being sent by the MQTT.js client to the server (which would imply that there is a bug at a higher level in the MQTT.js implementation), we should bubble up the error to the caller, but keep the connection alive since no invalid packets were sent over the wire in that case.

Checking that expected packets are received in a reasonable amount of time

If the MQTT.js client is expecting a packet (e.g. waiting for a PINGRESP after sending a PINGREQ, waiting for a PUBACK after sending a QoS 1 PUBLISH) and doesn't receive the packet from the server in a reasonable amount of time (which should be configurable by the user), the client should disconnect and the user should have the ability to attempt to reconnect with the same session. Note that this can happen even if the user chooses to disable keepalive.
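A minimal sketch of one way to implement such a configurable wait, where `onTimeout` would trigger the disconnect; all names here are hypothetical:

```typescript
// Race an expected response against a user-configurable timeout.
function expectAck<T>(response: Promise<T>, timeoutMs: number, onTimeout: () => void): Promise<T> {
    let timer: NodeJS.Timeout | undefined;
    const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => {
            onTimeout(); // e.g. disconnect while preserving session state
            reject(new Error(`no response from server within ${timeoutMs} ms`));
        }, timeoutMs);
    });
    return Promise.race([response, timeout]).finally(() => clearTimeout(timer));
}
```

For example, a QoS 1 publish could be wrapped as `expectAck(pubackPromise, options.ackTimeout, () => disconnect())`, with all of those names being placeholders.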

User error

The user of the MQTT.js client will not directly use the packet sequencer but will instead interface with a layer above the sequencer. This upper layer should detect any user error and throw errors as appropriate.

vishnureddy17 commented 2 years ago

An open question on the MQTT spec

In MQTT, the PUBLISH (where QoS > 0), SUBSCRIBE, and UNSUBSCRIBE packets require a unique Packet Identifier. It is important to free up packet IDs for reuse, as there is a maximum of 2^16 − 1 unique packet identifiers that can be in use simultaneously.

If the client sends a publish where QoS > 0 and disconnects before the QoS handling is complete, the MQTT spec requires the client to retry the publish upon connecting with the existing session. This mechanism allows us to free up the packet ID, as it gives us an opportunity to restart the QoS process.

An open question we have is how to handle un-acked SUBSCRIBEs and UNSUBSCRIBEs. The MQTT spec does not mention the client having to retry un-acked SUBSCRIBEs and UNSUBSCRIBEs. If we disconnect before receiving the SUBACK or UNSUBACK, do we just lose those packet identifiers for the rest of that session? Are we allowed to retry the SUBSCRIBE or UNSUBSCRIBE to give us an opportunity to get those packet identifiers back?

hoIIer commented 2 years ago

@vishnureddy17 Dropping this here since I just encountered it, and it may be worth considering regarding timers: https://github.com/mqttjs/MQTT.js/issues/1257