richievos commented 1 year ago

I'm not sure what the 5000 is representing, but having it there breaks connectivity for me.

71

hsaturn commented 1 year ago

I do agree, there is still a problem with this. But removing it will break connectivity for others (me at least). Unfortunately, I've fixed my problem without writing a unit test (too bad). I have to rewrite here some part of code.

Thanks for the PR, I won't accept it, but it does shows me that there is a problem that I have to fix.

richievos commented 1 year ago

Can you elaborate on what the situation is that adding a 5000 to that check makes things better? I'm trying to understand if there's a different case I'm going to run into trouble in, because the main branch code fails 100% of the time for me without this change.

hsaturn commented 1 year ago

Not worked on this issue yet.

I think there is a problem around client vs broker client timeout

1 - client connects to broker and timeout after ka 2 - broker receives the connection and create a new client (broker client) with also a timeout 3 - both may expire at the same time, then if the broker checks for timeout, it disconnects the client

This is a case that happening on my roomba project using TinyMqtt.

My unit test there are bad, because there are some strange #if

#ifdef EPOXY_DUINO
  alive = millis()+500000;
#else
  alive = millis()+5000;  // TODO MAGIC client expires after 5s if no CONNECT msg
#endif

I have to review all this (after I've fixed the memory leak problem)

hsaturn commented 1 year ago

I think that the broker should increase the timeout when receiving a new connection to allow time for the client to send Alive message. Maybe this is as simple as this. I should be able to remove those crazy #if EPOXY that modify the behaviour on timeouts during unit tests !!!

richievos commented 1 year ago

I'm not a mqtt expert, but looking at https://m2msupport.net/m2msupport/mqtt-keep-alive-pingreq-pingresp/ I think there's a bigger bug here. I have a mqtt broker which I'm connecting to remotely. The ESP32 is running the broker.

Based on what I'm seeing, and stepping through the code, the MQTT Broker is sending a PingReq to the remote client. Is that intentional? I would expect the broker should only be sending responses.

The reason why the broker is sending the response is because it creates a client instance, giving it the tcp connection. That client then thinks it's talking to a remote broker, so it sends this pingreq.

void MqttBroker::loop()
{
#ifndef TINY_MQTT_ASYNC
  WiFiClient client = server->accept();
  if (client)
  {
    onClient(this, &client); // <--- creates the new client
  }
...

void MqttBroker::onClient(void* broker_ptr, TcpClient* client)
{
  debug("MqttBroker::onClient");
  MqttBroker* broker = static_cast<MqttBroker*>(broker_ptr);

  MqttClient* mqtt = new MqttClient(broker, client); <--- client is built
...

MqttClient::MqttClient(MqttBroker* local_broker, TcpClient* new_client)
  : local_broker(local_broker)
{
...
  tcp_client = new WiFiClient(*new_client);  <-- tcp_client is set
#endif

// now back in broker::loop
  for(size_t i=0; i<clients.size(); i++)
  {
    MqttClient* client = clients[i];
    if (client->connected())
    {
      client->loop(); <--- that new client runs
    }

void MqttClient::loop()
{
  if (keep_alive && (millis() >= alive - 5000))
  {
    if (tcp_client && tcp_client->connected())
    {
      // *****************BUG
      //  None of this code should actually be running when it's a remote connection to this broker
      debug("pingreq");
      static MqttMessage pingreq(MqttMessage::Type::PingReq); 
      pingreq.sendTo(this);
      clientAlive(0);

My removal of the 5000 coincidentally avoids that code running, because my client disconnects before that message would ever get sent.

richievos commented 1 year ago

Note, the fix isn't as simple as putting a if (!local_broker && at the beginning of the check. I believe that pingreq should always have a !local_broker check on it, but if you add that check there then the next conditional fires:

else if (local_broker)
    {
      debug(red << "timeout client");
      close();

That conditional then immediately kills the connection. That happens because alive is:

  alive = millis()+5000;  // TODO MAGIC client expires after 5s if no CONNECT msg

which combined with the - 5000 means as soon as you try and connect, the client is immediately marked as dead, and it's immediately closed.

I'm actually not sure how anything can connect to this broker remotely currently, outside of the async path. Completely possible I'm missing something, but it feels like there's a significant bug there.

I haven't looked into a fix, but I imagine the fix involves:

the client shouldn't ever be sending a pingreq, unless it's a real client, not a client that's there for helping the broker respond
redo the connection handling to only start a keep-alive based timeout after the first set of req/response occurred. To handle connection timeouts on the first request, it should be a separate code path/state machine.
The big fix seems like it'd involve separating the broker's response to a remote server client, form a real client that's connecting to a broker. If those were separated I imagine it'd be easier to keep the handling separate.

hsaturn commented 1 year ago

I'm at work, no time to expand, but you're right. I'm not a Mqtt expert too, but a broker is not expected to send a PingReq. The point is that I wanted TinyMqtt to be as small as possible thus the client is both used for a broker and a client. This is a tricky design not easyt to maintain, but this is the cost for an embedded device and that leads to this kind of error.

Thanks again for the report. You are making TinyMqtt evolve a lot.

I'll review in detail the PR later. No time yet, sorry

hsaturn / TinyMqtt

Remote connections to a broker appear to be broken (was Remove 5000ms offset from alive check) #72

71