helgeerbe / OpenDTU-OnBattery

Software for ESP32 to talk to Hoymiles Inverters and Victrons MPPT battery chargers (Ve.Direct)
GNU General Public License v2.0
254 stars 56 forks source link

Follow up tasks due to the introduction of TaskScheduler #565

Open helgeerbe opened 6 months ago

helgeerbe commented 6 months ago
          I appreciate the changes introduced with this PR, i.e., the work done in the upstream project to use TaskScheduler in favor of extending the `loop()` paradigm. This has the potential to make the project more efficient and responsive.

Where are the changes to the code you have written

@stefan123t They are part of the merge commit 7c11a5a03aebb197d240c73bd582b487c3aed4b5. @helgeerbe did all the work required to migrate all OpenDTU-OnBattery-specific components to TaskScheduler. Have a look at the changes to src/Battery.cpp in the aforementioned commit for an example.

He did it in the most backwards-compatible way, which makes perfect sense. This means the components have their loop() methods made private, but they are called periodically as often as possible, just like before. In the near future, we should update the respective tasks' frequency dynamically (which is explicitly supported) depending on the actually required frequency. The DPL uses a dynamic backoff time which should be used here to schedule the next iteration of the DPL's loop at the appropriate time instead of checking every time in the loop() whether or not the backoff elapsed. The same goes for the JK BMS Controller for sure. Also the VE.Direct loop() can be called at a reasonable frequency since we know only one message per second will arrive.

Regarding the mutex in BatteryClass: This protects _upProvider and must be part of the newly implemented updateSettings(). Currently that function accesses _upProvider without locking the mutex, which happens when the battery settings are changed from the WebUI. This is probably an oversight, as it was done the right way in VictronMppt.

And this is where my concers start: As far as I can tell, AsyncWebServer does indeed still run in a thread other than the main loop() thread, i.e., the TaskScheduler context.

As before, all `loop()´ functions that we care about run in the same context. Now they are part of the TaskScheduler. However, also as before, some callbacks run in the context of the respective caller, i.e., MQTT and WebServer callbacks in particular. Those are not part of the TaskScheduler context. Same with the WiFi driver (and potentially other drivers as well), which also may execute callbacks that access shared data structures.

As an example I looked closer at AsyncWebServer. I don't see that AsyncWebServer would have any knowledge about TaskScheduler such that it would ask TaskScheduler to execute callbacks as TaskScheduler tasks. Instead, the callbacks are (still) executed in the "async_tcp" FreeRTOS thread, which is scheduled pre-emptively alongside the TaskScheduler thread.

Hence I don't see how we could remove mutexes and locking mechanisms, as the concurrency issues are the same as before.

This is based on reading code as I am unable to conduct experiments right now. However, I am fairly confident that my assessment holds true.

I also used the original MessageOutput, since it seems to be thread safe.

It absolutely is not. Serial output will be garbled from time to time. Not within a single println() or printf() call, but in between words or formatting boundaries. And worse, we are back to the original limitation of the static buffer, which causes JK BMS and VE.Direct verbose logging messages to be cut (very) short in the web console. I will propose a new PR that restores the previous MessageOutput implementation by myself. It ran perfectly well for months now at all devices that used respectively recent releases.

For the GridProfileParser, the library "Frozen" was added. That's also very interesting and can serve performance improvements and code reduction for JK BMS and VE.Direct at least.

If this was not already released, I would argue to go ahead and do so. If users experience errors, we can address them as needed. This is a good time for experiments as sun is scarce anyways and people might have more time than usual to report, inspect and fix issues.

350 was effectively reverted. Let's keep that in mind in case spontaneous reboots or data corruption are reported while using the MQTT interface. We might need to have a look at all MQTT callbacks within OpenDTU-OnBattery.

Originally posted by @schlimmchen in https://github.com/helgeerbe/OpenDTU-OnBattery/issues/556#issuecomment-1871283945

helgeerbe commented 6 months ago

Hi @schlimmchen I really appreciate your valuable feedback. To track your suggestions I create a task list.

helgeerbe commented 6 months ago

@schlimmchen do you know when the tasks switch will happen in TaskScheduler loop?. Always after a full loop cycle or also in between? My assumption is (without any knowledge), that it runs at least one complete loop cycle. MQTT callbacks have an internal mutex to handle the complete data. So there might be no problem.

schlimmchen commented 6 months ago

Hi @helgeerbe,

I really appreciate your valuable feedback.

and I enjoy contributing to this project and working with you :blush:

do you know when the tasks switch will happen in TaskScheduler loop?. Always after a full loop cycle or also in between? My assumption is (without any knowledge), that it runs at least one complete loop cycle.

What task switch? As far as I understand the TaskScheduler, only its tasks will not preempt each other. However, other FreeRTOS threads like the MQTT client and AsyncWebServer will preempt each other and the TaskScheduler scheduler pass. So there are tasks (TaskScheduler tasks) and threads (FreeRTOS tasks). Tasks will run in the context of one thread, while other threads are still around (MQTT client and AsyncWebServer in particular). Threads are still scheduled preemptively by the underlying FreeRTOS.

However, your question suggests the idea that TaskScheduler might take care that one scheduler pass is never preempted by other threads. I am quite confident that this is not the case. In particular, this would force TaskScheduler to be aware of the type of OS it is running on and call respective guards or functions. That makes no sense.

The TaskScheduler documentation is not explicit about this issue. However, have a look at #define _TASK_THREAD_SAFE. This suggests to me that TaskScheduler is only concerned about its own tasks, not about other schedulers, i.e., the FreeRTOS scheduler which still preemptively schedules threads. Three of those threads are (1) the TaskScheduler scheduler, which runs in the main loop(), (2) the MQTT client, and (3) the web server. There certainly are more (Pylontech battery component uses a FreeRTOS task, the WiFi driver?) and there will be even more (thread that handles SoftwareSerial outside of the main loop for Victron devices (an idea I have)).

Are we on the same page now? I think I researched this now to a point where I feel confident to have understood it corretly: We still need to take care to lock FreeRTOS threads against concurrent access of shared data.

Related to researching this: https://github.com/tbnobody/OpenDTU/pull/1598.

MQTT callbacks have an internal mutex to handle the complete data. So there might be no problem.

Where? Within the lib? Or within the callbacks in our part of the code? The former I don't know about, the latter would surprise me...

helgeerbe commented 6 months ago

@schlimmchen Thanks for the deep dive. I hope, that I got the complete picture now.

Regarding the mqtt callback, ist should be in the lib. Ahh, it's in the esspressif MQTT (https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/protocols/mqtt.html).

This API is could be executed from a user task or from a MQTT event callback i.e. internal MQTT task (API is protected by internal mutex, so it might block if a longer data receive operation is in progress.

Too many frameworks. I have to check the our mqtt lib.

helgeerbe commented 6 months ago

@schlimmchen I tried to find some more information. In MQTT it seems, that the callbacks seems to be thread safe. So on subscribed messages the callbacks wouldn't interrupt themself with an updated message. I found the hint, that received objects should be synchronized with the main process.

Checking quickly the message handler of dpl and a battery, I found only switches or limits, but those are atomic data. Do you have an example, where we have to synchronize the instance a complete object, while otherwise changing only single values will corrupt the whole object?

schlimmchen commented 6 months ago

By grepping for "mqttsettings.subscribe" I found four users:

  1. Topics controlling inverters (MqttHandleInverter.cpp)
  2. The Power Limiter
  3. The Huawei AC Charger
  4. The Power Meter if reading MQTT value(s)

Assessment:

  1. If issues are found here, they must be reported upstream. Also, these callbacks will probably all use the public Hoymiles lib, and tbnobody did tigthen that interface down with regard to thread-safety. So let's assume everthing is beautiful here.
  2. Let's ignore that the mode might be set even though the PowerLimiter was just disabled. Then we could glance over the fact that the PowerLimiter mode is written from the MQTT callback without locking/synchronization in place because that MQTT callback is the only function that ever writes the PowerLimiter mode variable. We could also do it the right way and pre-process the MQTT topic's payload and enqueue the retrieved value into a thread-safe container... This is also preferrable over making the mode std::atomic or using a mutex in the PowerLimiter to protect the mode variable, simply because it is not obvious at all why thread-safety is required when looking at the PowerLimiter implementation, but it is quite obvious (or easy to note) in the MQTT handler implementation.
  3. Well, this is a problem... It uses setValue() and setMode(), both of which do digitalWrite() and set multiple private member variables and call other methods as well. This is certainly not safe. Given that it would be a nightmare to make Huawei_can.cpp thread-safe, I suggest a thread-safe queue (of bound functions maybe?) that is handled as part of a TaskScheduler task, i.e., within the MqttHandleHuaweiClass::loop() method.
  4. Oh no, this is unsafe in the first line within the callback: The complex variable _mqttSubscriptions (a map) is iterated here. That's simply not safe at all. Given that new values for the power meter MQTT topic may arrive quite frequently, changing the power meter settings might very well lead to problems. This must be fixed. [The fact that all power meter interfaces are implemented in the same CPP file is unsettling... It is partly the reason why the web api for the power meter is restarting the ESP rather than trying to apply the new settings... Which, of course, prevents the aforementioned map from being updated from a different thread :grin:]
schlimmchen commented 6 months ago

I am not touching the power meter implementation. The only thing we can improve is accessing the map of subscriptions. However, that map is only initialized at boot and is not written afterwards (since the Web App handler reboots the ESP when updating the power meter setting). Making the access to _powerMeter<n>Power and _lastPowerMeterUpdate safe requires a refactor. Currently, we can ignore that these variables are not protected since they are only written by a single thread at a time.

And if I am going to refactor this, it's going to be nuked altogether and the respective power meter implementations will move to their own classes. A whole bunch of problems will be gone, then. Especially the need to reboot the ESP on powermeter settings updates.

schlimmchen commented 6 months ago

To summarize:

  1. No action since we assume the Hoymiles lib is mostly thread-safe.
  2. Power Limiter: see #572
  3. AC charger: see #571
  4. Power Meter: There definitely are issues. However they can be ignored since the data structures in question are only written once or from a single thread.
stefan123t commented 6 months ago

Thanks for pointing that out @helgeerbe I probably did not spot the TaskScheduler implementations in the rather long commit and I am still unaware of most of the OpenDTU and onBattery code in general.

I just had the hunch that there is a lot more related to this change using the TaskScheduler than what meets the eye. But I think @schlimmchen made a very conclusive review of the necessary changes.

Regarding the ESPAsyncWebServer becoming synchronous again I would kind of object, because this was one of the main issues in earlier AhoyDTU and OpenDTU implementations. Parallel requests with loads of response data led to several reboots in the past. Also tbnodbody chose a special/newer fork of the ESPAsyncWebServer for a reason I can not remember.

I also quickly peeked into the following article to synchronize threads across multiple FreeRTOS "threads" and/or "fibers"

Use Both Cores on an ESP32: Easy Synchronization with Esp32SynchronizationContext https://www.codeproject.com/Articles/5295781/Use-both-cores-on-an-ESP32-Easy-synchronization-wi

Maybe this distinction is helpful, as the notion of "fibers" was new to me:

Don't confuse FreeRTOS tasks here with the .NET Task class. They're much different beasts. Tasks in FreeRTOS are basically either fibers (cooperatively scheduled) or threads (pre-emptively scheduled by the OS or running on another core). We'll be using them as threads.

schlimmchen commented 6 months ago

Regarding the ESPAsyncWebServer becoming synchronous again I would kind of object

Who mentioned that the web server should become synchronous again? That's not on my list at least. Instead, we do need to keep in mind that the web request run in a different thread (not fiber) and hence those callbacks must not corrupt shared data.

stefan123t commented 6 months ago

I may have misinterpreted your old comment as you were referencing https://github.com/helgeerbe/OpenDTU-OnBattery/pull/350#issuecomment-1666001905 some 5 months ago.

Also, remember that in #308 people suspected that using the web app heavily (for a longer time and/or on multiple clients) causes instability? Would you agree to castrate the webserver to run synchronously in the main loop() in OpenDTU-OnBattery? Would you need a real-world issue or a theoretical issue to accept such a change? Maybe it turns out that the project is unusable when making the webserver synchronous, but I really do think that for the time being, it is no issue.

schlimmchen commented 6 months ago

Oh, wow, your memory is excellent!

Well, the async web server still does run in its own thread AFAIK. And in reality this does not cause too much trouble, if any. Going forward, I think we should do the same thing as with the MQTT callbacks: analyze the web callbacks and make sure they operate on thread-safe methods/data or move the execution of the respective actions into the TaskScheduler context.