Better error handling in the HAL

Elizafox commented 6 months ago

This is going to be an involved process, but I think it's a good idea to begin doing error checking in the HAL and reporting errors back up the stack and then aborting appropriately (maybe a HAL abort() function when things really go south? also returning errors when relevant).

Although it's safe to assume on Arduinos that functions will either succeed or fail in an irrecoverable way and therefore it doesn't really matter, on Linux and such this may not be the case. For example, we might not be able to open the GPIO device, or we can't acquire a resource.

I think this would be immensely helpful to consumers of the library, especially on non-Arduino platforms.

jgromes commented 6 months ago

Could you elaborate a bit more on what this would look like?

It's true that on MCUs typically the only recovery method is a reset and/or power cycle. We also don't expect things like SPI initialization or GPIO writes to fail, because in a multithreaded environment (e.g. some RTOS), if the SPI/GPIO is to be considered a shared resource, it is up to the user of the library to guard that resource properly - RadioLib is agnostic about this, that's why HAL exists. However, I don't think it's up to the library itself to be deciding what is and is not recoverrable, that should be delegated to the HAL implementation for the specific platform.

For example, some Linux HAL might implement additional error checking when accessing system resources (GPIO and SPI), but if a radio transciever refuses to execute a command (e.g. attempting to set an out-of-range carrier frequency), I don't think the error handling on Linux will be different than on any MCU.

Elizafox commented 6 months ago

An example would be init(). Instead of returning void, it can return an int (which can be ignored on Arduino or other places we generally assume it's going to succeed). If something goes south during, say, GPIO init, we can report it to the user. Same with spiBegin() etc.

Things you might expect to be infallible can fail on Linux too. Even delay functions can fail, for instance, if a signal is received and we woke up too soon (though in that case it may be better to just handle EINTR). Or say an SPI transfer fails. Claiming/reading/writing pins also on Linux can fail.

It doesn't make sense of course to put error returns everywhere. For instance, in theory something like clock_gettime can fail; in practise, the only way it would do so is if there was a bug in the HAL itself and there was a colossal screwup somewhere (like requesting an invalid clock).

Generally speaking, the right thing to do if an operation fails and it's not recoverable is tell the application about it somehow. If that's not possible for whatever reason, maybe call an abort() function in the HAL and punt to the user. Maybe log it and call the C library abort(), maybe set a global event and flag, maybe call up a friend and let them know (kidding 😛). That'd be up to the application.

jgromes commented 6 months ago

You are correct that the current HAL interface is mostly based on the assumption that the library has exclusive access to the system resources. This assumption obviously does not hold in Linux, but the real problem is that this assumption falls apart in any multi-threaded environment. Even ones common on MCUs, like RTOS - for example, delay can fail very easily there, just by getting blocked by a higher-priority thread. While no exception will be generated, the behavior will not be what you intended. In other example, acquiring SPI access will likely involve obtaining some mutex, which may already be held by some other thread.

So this should probably be resolved in the HAL. Having at least init, spiBegin and spiTransfer return error code instead of nothing seems reasonable. There's actually not that many calls to these methods in the library source.

jgromes / RadioLib

Better error handling in the HAL #1076