peterhinch / micropython-mqtt

A 'resilient' asynchronous MQTT driver. Recovers from WiFi and broker outages.

MBEDTLS error on disconnect #153

Open bobveringa opened 1 month ago

bobveringa commented 1 month ago

While investigating an issue, I added some extra debug prints to most error handlers.

    # Launched by .connect(). Runs until connectivity fails. Checks for and
    # handles incoming messages.
    async def _handle_msg(self):
        try:
            while self.isconnected():
                async with self.lock:
                    await self.wait_msg()  # Immediate return if no message
                await asyncio.sleep_ms(0)  # Let other tasks get lock
        except MemoryError:
            self.dprint("Out of memory.")
        except OSError as ex:
            sys.print_exception(ex)
            print(ex)
            pass
        self._reconnect()  # Broker or WiFi fail.

This results in the following output on disconnect.

Traceback (most recent call last):
  File "lib/mqtt_as/mqtt_as.py", line 835, in _handle_msg
  File "lib/mqtt_as/mqtt_as.py", line 570, in wait_msg
OSError: (-28928, 'MBEDTLS_ERR_SSL_BAD_INPUT_DATA')
(-28928, 'MBEDTLS_ERR_SSL_BAD_INPUT_DATA')

As with pretty much all MBEDTLS errors, it is not clear what is happening, but errors like this cannot be good. Do you have any idea what is going on here? I'm happy to investigate this error and see if there is a solution.

peterhinch commented 1 month ago

I'm not sure I understand the scenario. Does this occur when connectivity is lost? If so, I'd suggest the following:

  1. It seems unsurprising. When connectivity is suddenly interrupted, an OSError may be thrown at many points in the chain, including in the TLS protocol. That is why, in the published code, the response to an OSError is to down the connection and reconnect.
  2. With the published code such MBEDTLS errors may well occur. If they only occur as a result of a loss of connectivity they would have gone undetected.

bobveringa commented 1 month ago

The error occurs when explicitly disconnecting from the broker using client.disconnect(), which is why I started to investigate it in more depth. I wouldn't expect these kinds of errors in the case of a controlled disconnect from the broker.

But it may very well be normal. If these errors are expected, maybe a comment could be added noting that this is one of them?
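
For reference, a minimal sketch of the sequence that produces the traceback above, using the standard mqtt_as API over TLS. The broker address, credentials and ssl_params value here are placeholders, not the configuration actually in use:

    import asyncio
    from mqtt_as import MQTTClient, config

    # Placeholder connection details.
    config['server'] = 'broker.example.com'
    config['ssid'] = 'my_ssid'
    config['wifi_pw'] = 'my_password'
    config['ssl'] = True
    config['ssl_params'] = {'server_hostname': 'broker.example.com'}

    async def main(client):
        await client.connect()
        await asyncio.sleep(5)      # exchange some traffic
        await client.disconnect()   # controlled disconnect; the MBEDTLS error
                                    # above is printed shortly after this call

    client = MQTTClient(config)
    try:
        asyncio.run(main(client))
    finally:
        client.close()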

peterhinch commented 1 month ago

Now I understand. I don't know the answer, however. I have limited experience of TLS.

peterhinch commented 1 month ago

@bobveringa I have pushed a minor change to the read_allocate branch - this was a PR providing a fix to issue #132. It seems that some brokers briefly emit a password fail error prior to a successful connection.

On the big picture, how is your testing progressing? Do you think the time is right to merge read_allocate - or should I wait?

bobveringa commented 1 month ago

Testing has been going well. I am pretty confident in the performance and stability of this solution, as I've managed to solve most of the weird problems we have been experiencing (they were related to an odd memory problem whose root cause I was unable to find, which is still bugging me).

Unfortunately, I've not been able to roll out the solution to our fleet of internal test sensors. That testing has been delayed a couple of times due to other projects. I hope to have time to do a large-scale test within the next couple of weeks, but these things take more time than I can allocate at the moment.

To summarize: I feel confident that read_allocate is stable, but I don't have the data to back that up yet. For me it is 50/50 whether to merge now, but I don't know how much testing of your own you have done.

peterhinch commented 1 month ago

Thanks for the feedback. I'll wait until your large scale test is complete.

bobveringa commented 3 weeks ago

Just wanted to provide a quick update.

I just started a large-scale test and the intermediate results are looking promising. Some of our tests are much more performant with the pre-allocated buffers, and the tests also show that we can handle larger messages than before.

The test is executed with the following settings:

    IBUFSIZE = 5600
    MSG_BYTES = False
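
For anyone reproducing this, a sketch of one way such settings could be applied, assuming (this isn't shown in the thread) that IBUFSIZE and MSG_BYTES are module-level constants in the read_allocate branch of mqtt_as.py that can be overridden after import; editing them at the top of the module would achieve the same effect:

    import mqtt_as

    # Assumed override mechanism: the names come from the branch under test,
    # but patching them like this rather than editing the module source is
    # an assumption.
    mqtt_as.IBUFSIZE = 5600     # size of the buffer pre-allocated up front
    mqtt_as.MSG_BYTES = False   # second flag under test in this run
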
peterhinch commented 3 weeks ago

Excellent, thanks for the feedback.

bobveringa commented 2 weeks ago

After some more extensive testing, we have not run into any issues with the pre-allocated buffers. We think that in our case, performance has significantly improved in certain situations (such as attempting to stream files over MQTT).
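
As an illustration of the streaming use case, a rough sketch of the pattern: publish the file in fixed-size chunks. The topic name, chunk size and qos are placeholders, and reassembly on the receiving side is not shown:

    async def stream_file(client, path, topic='files/stream', chunk=1024):
        # Publish a file as a sequence of fixed-size chunks. qos=1 means a
        # chunk is retransmitted if the link drops mid-transfer.
        with open(path, 'rb') as f:
            while True:
                block = f.read(chunk)
                if not block:
                    break
                await client.publish(topic, block, qos=1)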

I feel fairly confident that this code is stable and can be merged. It should be noted that we set the buffer to a large size ahead of time, so I'm not sure how this would affect people who sometimes receive large messages.

peterhinch commented 1 week ago

Thank you for the confirmation. I'll document IBUFSIZE and MSG_BYTES and merge it.

> not sure how this would affect people who sometimes receive large messages.

I will document how the mechanism works, making the point that setting a large buffer size is an insurance policy against a large allocation at run time (when RAM might be fragmented).
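
To illustrate the point (this is not mqtt_as code, and the size is arbitrary):

    # Allocating the buffer once at start-up, while the heap is still largely
    # contiguous, avoids asking for a single large block later, when the heap
    # may be fragmented and the allocation could fail even with plenty of
    # total RAM free.
    buf = bytearray(5600)         # one-off allocation at start-up

    def risky_late_allocation(n):
        # Allocating on demand instead risks a MemoryError on a fragmented heap.
        return bytearray(n)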

I suspect most applications use short messages, and my testing has concentrated on this usage. I've never attempted to stream files over MQTT. You might want to add a short doc explaining what you've learned about doing this.

peterhinch commented 1 week ago

Now merged, with docs.