philbowles / PangolinMQTT

PangolinMQTT - ArduinoIDE client library for ESP8266, ESP32 and STM32-NUCLEO
Other
70 stars 22 forks source link

Invalid MQTT credentials results in FATAL ERROR - POWER CYCLE REQUIRED #9

Closed leifclaesson closed 2 years ago

leifclaesson commented 4 years ago

...which is perhaps a bit harsh. "Try again" would have sufficed. :)

I'm working with on-mcu interactive menu configuration so typos happen.

Here's the PANGO_DEBUG output:

SEND CONNECT 80 bytes on wire TXQ=1 TCP ACK LENGTH=80 TXQ=0 TCP frag ACK LENGTH=80 acksize=149 amtleft=80 <---- FROM WIRE CONNACK 3FFF5F8A len=4 err1=2 err2=5 <--- printed from my own callback function TCP CHOPPED US! DISCONNECT FH=23680 r=0 FATAL ERROR 2 INFO=5 - POWER CYCLE REQUIRED

Also, the error message is not accurate. I simply commented out the following line in _fatal(): //while(1) PANGO::_HAL_feedWatchdog();

..and now my menu keeps working and I was able to enter the correct login credentials, and then Pangolin connected just fine and appears to work. A power cycle was most definitely not required.

Seriously though, this is not okay -- and here's the reason why.

We have no idea where our libraries get used. A library is a guest in another system. Suddenly freezing the system is not acceptable.

In my particular case, this is is going to be used in my gate controller, controlling the big sliding gate to enter the premises. (Same ones you see in my profile picture). If I'm out walking my dogs, and come back, and the gate won't open, I need to climb a 10-foot wall.

If Pangolin set an error flag and refused to attempt connection until reboot, that'd be bad enough, but I'd still be able to use the 433 MHz remote or the simple HTTP interface.. but if it were to actually lock up, I'd need to get a ladder, climb inside, and then get a screwdriver and open the gate control cabinet and push the reset button......

Or, better yet. What if it was in an MCU controlling a heater, and it got stuck in the ON cycle? It could literally burn a house down.

"Fatal Error - Halting" is a not a concept that may exist in a library. Worst you can do is not run. DOS'ing the system cannot happen under any circumstance.

philbowles commented 4 years ago

I understand your point. The "fatal halt" is to prevent a "boot loop" whish is just as destructive / dangerous / useless as here, but at least this gives you a reason and an obvious indication of the problem, which most libraries don't.

So for example if you have hard-coded wrong credentials in your source, the sketch will never connect. and will sit in a pointless and useless loop trying and failing....being told once and then "halting" is equally functional but far more obvious and efficient.

The intent is that only unrecoverable conditions call this, generally from incorrect code...such as invalid subscription topic etc for which the only solution is to fix the code. I absolutely agree that no run-time recoverable condition should enter this state.

So we are left with a discussion as to whether bad credentials is "terminally fatal". My current postion is that 98% (made up on the spot!) of folk will have their server credentials hard-coded and thus bad credentials are terminally fatal 98% of the time - should we remove this obvious indicator to allow the 2% of edge caeses to have another try?

I'm open to hearing further arguments.

philbowles commented 4 years ago

"DOS'ing the system cannot happen under any circumstance." Absolutely 100% disagree: for example where letting it continously fail would do worse harm.

Would you prefer your garage door to lock shut...or would you like it to slam open and shut repeatedly for a week while you were away?

leifclaesson commented 4 years ago

Hi Phil! Thanks for responding. I see what you mean about 98% vs 2%. Indeed, this is the first time I'm making MQTT runtime-configurable from the front panel. Previous times, I've been one of the 98%. I'll buy your made-up number, it's reasonable! ;)

MQTT isn't the only thing in the system, though. If I'm developing, and hardcode the wrong credentials, I'll notice that it's not working, and then I'd register an error callback with a Serial.printf to see what's going on. It won't remain a mystery for long. Freezing isn't helpful in that scenario either. In fact, if I code the wrong credentials in, or change the credentials on the MQTT server, or the MQTT server becomes otherwise unavailable.. the gate would still operate the remote just fine even if MQTT was unavailable. In fact, I've been using it for over a year with... drumroll..... AsyncMqttClient. Yeah. It has actually worked. Flaky, sure, but it has never locked me out. The band-aid code that I had to and did write around it probably has something to do with that.

As for the gate opening and closing repeatedly for a week... apples and oranges. Failure to connect to MQTT won't do that. The only way Pangolin could cause that would be to give me spurious messages.

A boot loop wouldn't cause that either in my case, but a boot loop would indeed prevent operation of the rest of the system.

Why would bad credentials cause a boot loop? I suppose it could happen if I ignored return values and just kept calling other library functions with bad pointers, but that wouldn't be very good programming on my part would it? :)

philbowles commented 4 years ago

"why would..." my bad wording, I meant a disconnect / reconnect loop. And usually because most people code a blind "reconnect after fail" loop. SO if it fails for bad credentials, then the code tries to reconenct and....rinse and repeat, which - again is just as unhelpful as a "hang" but far less informative.

I'm not too "hung up" :) on this, It's something I have never done before in this scenario and is easily removed back to the infinite discinnect / reconnect cycle, its just that I see such activity as pointless and annoying.

Apples & oranges to any of your examples by the same token - as both scenarios are initiated by the failure to connect. Our only discussion point is whether its safer to fail to connect once or to repeatedly fail to connect in an infinite loop. Either way, no device is gtetting controlled. IF the user code sets a device state before attemptong the connection and sets it off for safety when it disconnects, then the device will be crashing on and off in an infitie loop with "your" method and locking in one position using mine.

The best solution i think is to make a #define to allow user to choose startegy A or B ...#define HANG_ON_UNREOVERABLE_ERROR for example in config.h

leifclaesson commented 4 years ago

I'd say it depends on how tight this infinite connection loop is.

To immediately reconnect on a connection failure would be quite terrible programming in my humble opinion, because that has the potential of eating an inordinate amount of CPU time during prolonged failure, while blocking any useful work from being done.

In my own LeifHomieLib, I actually do the following:

unsigned long HomieDevice::GetReconnectInterval() { unsigned long interval=5000; if(ulMqttReconnectCount>=20) interval=60000; else if(ulMqttReconnectCount>=15) interval=30000; else if(ulMqttReconnectCount>=10) interval=20000; else if(ulMqttReconnectCount>=5) interval=10000; return interval; }

I attempt reconnection GetReconnectInternal() milliseconds after the MQTT client tells me connection has failed, while also incrementing the reconnect count. Thus, if it keeps failing, we slow down, so that we're not hammering.

Granted, this may be overkill, and we probably shouldn't expect that most Arduino developers would do it this way on their own :).

If you're looking for an eye-catching way of alerting less experienced programmers to an unrecoverable error.. why not simply Serial.printf the error every time it fails?

I believe people would notice, because as soon as something doesn't work, one makes sure one have console output. That's step one in debugging.. isn't it?

One more thing -- why recommend/require a power cycle if it has no chance of solving the problem? Next boot won't be any different unless we re-upload new code first.

SvizelPritula commented 4 years ago

This definitely needs to be looked into. In an application which consists only of sending values over MQTT, an MQTT error is fatal. However, if it does anything else, MQTT should never stop it from doing so. If you get sent to the store to get milk and eggs, and the store is out of eggs, do you stay at the store, forever, or do you get the milk and go home?

Example: Someone builds an automatic plant watering system. Is measures soil humidity hourly and if it falls below a certain threshold it opens a valve for five seconds. Then, he decides he wants to know how fast the would dries, so he adds MQTT. Every time soil humidity is measured, he connects to MQTT and sends a message. Sometime later he accidentally changes auth credentials to the MQTT broker. On the next hour, his ESP tries to connect to MQTT, fails and locks up. However, since this library is async, it will have already opened the valve. When he comes home in the evening, he find his entire garden flooded.

Another example: Someone uses MQTT to control a bunch of servos via a driver connected through serial. One day, he accidentally changes the broker password. When his device tries to connect, it fails, locks up and sends a message over serial. The servo controller interprets it as binary instructions for turning servos and jams up the entire machine.

Point is: Pangolin has no idea what it's being used for. Some apps may work without it and some at least need to crash properly. Sending an error message through a data line that can be used for anything is also a bad idea and cost NASA a rocket. Sure, some apps will just get stuck in a reconnect loop, but they need to sort that out anyway, if, say, the WiFi goes down.

rwagoner commented 4 years ago

My application allows the user to enter their MQTT server and credentials through a web interface. Hanging the device on invalid credentials causes a situation they can't fix even after a reboot. If anything a there should be an onFatal callback allowing the developer to decide what action should be taken.

gnalbandian commented 4 years ago

My application allows the user to enter their MQTT server and credentials through a web interface. Hanging the device on invalid credentials causes a situation they can't fix even after a reboot. If anything a there should be an onFatal callback allowing the developer to decide what action should be taken.

Agreed. I believe this behavior is not ideal. I am running a program where mqtt is optional, and configured through a web panel. So, the potential of misconfiguration is real, but this shouldn't turn the device unusable.

AcuarioCat commented 4 years ago

So for example if you have hard-coded wrong credentials in your source, the sketch will never connect. and will sit in a pointless and useless loop trying and failing....being told once and then "halting" is equally functional but far more obvious and efficient.

I'm open to hearing further arguments.

I hit this exact problem - incorrect details because of a typo. The problem for me was that my update is done OTA and relies on the OTA code being initialized and ArduinoOTA.handle(); being called in the loop. As Pangolin had hit 'FATAL ERROR...' nothing else worked so I had to resort to updating the code via USB.

I have numerous devices that all update OTA so if something changed on my MQTT setup (password/username change for example) all these devices would need to be physically re-programmed with a cable as the OTA would no longer be available.

From my perspective, better to fail the MQTT connect after a few attempts. This, at least, should be recoverable.

mrdc commented 3 years ago

Totally agree: MQTT is not the only one communication protocol a device can have and also MQTT is not the only function a device is executing, so blocking all other useful functions because of wrong MQTT credentials is bad. What to do if it’s wrong credentials in OTA? How to recover a device then? Imagine a car that stops braking as soon as it’s out of windscreen washer :) These sort of errors shouldn’t stop a device from doing useful job.

gnalbandian commented 3 years ago

What do you thinks @philbowles? Any chance we can modify this behavior?

kahoch commented 3 years ago

I use a web interface to config mqtt, once input wrong info, the whole program dead, I have to use cable to upload filesystem because OTA is dead too!

luebbe commented 3 years ago

That's why you don't use libraries like this in production...

leifclaesson commented 2 years ago

Yup, I moved off ages ago.