hn / ginlong-solis

Solis inverter ESP8266 data logger, S3 WiFi stick reverse engineering and ESPhome firmware
89 stars 16 forks source link

ModBus CRC check fails / UART traffic gets desynchronised #4

Closed hn closed 1 year ago

hn commented 1 year ago

Sometimes ModBus traffic fails with CRC errors after several hours of correct operation:

[W][modbus:109]: Modbus CRC Check failed! 8498!=10
[W][modbus:109]: Modbus CRC Check failed! 4F02!=10
[W][modbus:109]: Modbus CRC Check failed! 83D2!=10

Once these errors have started, the system is desynchronised and only a reboot will fix the problem.

This topic has been discussed in various places and I would like to collect all the information here.

hn commented 1 year ago

(Copied from here)

[...] Things are generally working very well. But after a few hours, sometimes the ModBus/UART traffic gets out of sync and cannot be resynchronised except by rebooting the stick.

More detail: The correct ModBus response string is 01 04 02 00 01 78 f0 (last two bytes are CRC). The problem is, that the very last byte (0xf0) comes 'late' and gets 'mixed' into the next ModBus response (t=millis() extra debug added by me):

[modbus_controller:035]: t=57137 Sending next modbus command to device 1 register 0xBFF count 1
[modbus:199]: Modbus write: 01.04.0B.FF.00.01.03.DE (8)
[modbus_controller:486]: Command sent 4 0xBFF 
[modbus:042]: t=57228 Modbus received Byte  240 (0xF0)   <----- this is from the previous response
[modbus:042]: t=57234 Modbus received Byte  1 (0x 1)
[modbus:042]: t=57239 Modbus received Byte  4 (0x 4)
[modbus:042]: t=57245 Modbus received Byte  2 (0x 2)
[modbus:042]: t=57251 Modbus received Byte  0 (0x 0)
[modbus:042]: t=57257 Modbus received Byte  1 (0x 1)
[modbus:042]: t=57263 Modbus received Byte  120 (0x78)
[modbus_controller:035]: t=57456 Sending next modbus command to device 1 register 0xBFF count 1
[modbus:199]: Modbus write: 01.04.0B.FF.00.01.03.DE (8)
[modbus_controller:486]: Command sent 4 0xBFF 
[modbus:042]: t=57535 Modbus received Byte  240 (0xF0)  <----- this is from the previous response
[modbus:042]: t=57541 Modbus received Byte  1 (0x 1)
[modbus:042]: t=57546 Modbus received Byte  4 (0x 4)
[modbus:042]: t=57552 Modbus received Byte  2 (0x 2)
[modbus:042]: t=57558 Modbus received Byte  0 (0x 0)
[modbus:042]: t=57564 Modbus received Byte  1 (0x 1)
[modbus:042]: t=57570 Modbus received Byte  120 (0x78)
[modbus_controller:029]: t=57775 Modbus command to device=1 register=0xBFF countdown=0 no response received - removed from send queue

I wonder if this is a problem of LibreTuya or the ModBus component. It's somewhat hard to debug because you have to wait more or less half a day until it appears.

The ModBus component just checks if bytes are available() (esphome/components/modbus/modbus.cpp Modbus::loop), so for me it's unclear where the time lag / desync comes from.

I have a very vague suspicion that the serial.available()=false status may be wrong (and thus the last byte remains in the buffer) and is only set to true again by the arrival of the next real ModBus response. Hmmm.

hn commented 1 year ago

Usually the ModBus CRC errors occur up to 2 or 3 times a day. Now I've added these extra debug messages:

  1. To monitor the Arduino serial ring buffer:

    --- .platformio/platforms/libretiny/cores/realtek-amb/arduino/libraries/Serial/Serial.cpp.orig  2023-05-10 20:41:19.453342550 +0200
    +++ .platformio/platforms/libretiny/cores/realtek-amb/arduino/libraries/Serial/Serial.cpp   2023-05-07 21:18:25.682830865 +0200
    @@ -64,6 +64,7 @@
    #endif
        data->buf->store_char(c);
    }
    +   LT_W("t=%lu peek: %d, available: %d, availableForStore: %d", millis(), data->buf->peek(), data->buf->available(), data->buf->availableForStore());
    
    uart->DLH_INTCR = intcr;
    return 0;
  2. To monitor the ModBus component receive buffer:

    --- esphome/components/modbus/modbus.cpp.orig   2023-05-10 20:39:03.237952577 +0200
    +++ esphome/components/modbus/modbus.cpp    2023-05-07 21:20:04.757293844 +0200
    @@ -41,6 +41,7 @@
    this->rx_buffer_.push_back(byte);
    const uint8_t *raw = &this->rx_buffer_[0];
    ESP_LOGV(TAG, "Modbus received Byte  %d (0X%x)", byte, byte);
    +  ESP_LOGD(TAG, "t=%lu Modbus received Byte  %d (0x%02X) at=%d", millis(), byte, byte, at);
    // Byte 0: modbus address (match all)
    if (at == 0)
     return true;

Curiously, no CRC error has occurred since then (at least for the last three days).

hn commented 1 year ago

@spuke113 When you get a chance, can you check if inserting the log line (just the one line LT_W... above) has any effect on the CRC error? Have you observed any other correlations after the CRC errors appear?

spuke113 commented 1 year ago

This issue has proven to be very hard to catch since it's very intermittent. Device can run days just fine. I rarely have physical access to the inverter (it's not mine), and within limited time window I was unable to catch anything useful with the logic analyzer.

I made somewhat destructive decision and heavily modified that Exceedconn connector to fit RS485 serial adapter and ESP32. Byebye warranty :smile: Stability seems way better - no hiccups so far.

hn commented 1 year ago

After 3 weeks of testing, I can 'prove' that inserting this single line fixes the ModBus problem:

--- .platformio/platforms/libretiny/cores/realtek-amb/arduino/libraries/Serial/Serial.cpp.orig  2023-05-10 20:41:19.453342550 +0200
+++ .platformio/platforms/libretiny/cores/realtek-amb/arduino/libraries/Serial/Serial.cpp   2023-05-07 21:18:25.682830865 +0200
@@ -64,6 +64,7 @@
 #endif
        data->buf->store_char(c);
    }
+   LT_W("t=%lu peek: %d, available: %d, availableForStore: %d", millis(), data->buf->peek(), data->buf->available(), data->buf->availableForStore());

    uart->DLH_INTCR = intcr;
    return 0;

Obviously there is no meaningful explanation for this (at least I don't have one).

Belaial commented 1 year ago

After about 2 hours 30 minutes uptime on my stick I encountered this problem,

[11:22:09][D][sensor:093]: 'WiFi Signal': Sending state -57.00000 dBm with 0 decimals of accuracy
[11:22:13][W][modbus:108]: Modbus CRC Check failed! BED8!=F4
[11:22:13][W][modbus:108]: Modbus CRC Check failed! B17D!=F4
[11:22:13][W][modbus:108]: Modbus CRC Check failed! B17D!=F4
[11:22:14][W][modbus:108]: Modbus CRC Check failed! B17D!=F4
[11:22:14][W][modbus:108]: Modbus CRC Check failed! B17D!=F4
[11:22:14][D][modbus_controller:029]: Modbus command to device=1 register=0xBBC countdown=0 no response received - removed from send queue
[11:22:14][W][modbus:108]: Modbus CRC Check failed! 2225!=C0B
[11:22:15][W][modbus:108]: Modbus CRC Check failed! E1AF!=C0B
[11:22:15][W][modbus:108]: Modbus CRC Check failed! E1AF!=C0B
[11:22:15][W][modbus:108]: Modbus CRC Check failed! D1B3!=C0B
[11:22:15][W][modbus:108]: Modbus CRC Check failed! 1CFB!=C0B
[11:22:16][D][modbus_controller:029]: Modbus command to device=1 register=0xBCD countdown=0 no response received - removed from send queue
[11:22:16][W][modbus:108]: Modbus CRC Check failed! EB7E!=92C
[11:22:16][W][modbus:108]: Modbus CRC Check failed! E97C!=92C
[11:22:16][W][modbus:108]: Modbus CRC Check failed! E97C!=92C
[11:22:17][W][modbus:108]: Modbus CRC Check failed! E97C!=92C
[11:22:17][W][modbus:108]: Modbus CRC Check failed! E97C!=92C
[11:22:17][D][modbus_controller:029]: Modbus command to device=1 register=0xBD9 countdown=0 no response received - removed from send queue
[11:22:17][W][modbus:108]: Modbus CRC Check failed! D04B!=4789
[11:22:18][W][modbus:108]: Modbus CRC Check failed! DE10!=4789
[11:22:18][W][modbus:108]: Modbus CRC Check failed! DE10!=4789
[11:22:18][W][modbus:108]: Modbus CRC Check failed! DE10!=4789
[11:22:19][W][modbus:108]: Modbus CRC Check failed! DE10!=4789
[11:22:19][D][modbus_controller:029]: Modbus command to device=1 register=0xBE1 countdown=0 no response received - removed from send queue
[11:22:20][D][modbus_controller:029]: Modbus command to device=1 register=0xBFF countdown=0 no response received - removed from send queue

Is this also a known problem if I were to run this on a Wemos D1 Mini with regular ESPhome software? Or is the CRC check fail a issue only on the modified Solis S3 stick?

Belaial commented 1 year ago

I guess the issue is not present on other hardware since @spuke113 seems to be running on a ESP32 just fine, any chance you could give me an update @spuke113, things still stable on ESP32?

hn commented 1 year ago

@Belaial Did you apply this patch ?

Belaial commented 1 year ago

I tried but could not find that exact location, could it have changed since you wrote the patch maybe?

I have

~/.platformio/platforms/libretiny/cores/realtek-amb/arduino/libraries$ ll
total 20
drwxrwxr-x 5 user user 4096 Jul 17 11:51 ./
drwxrwxr-x 4 user user 4096 Jul 17 11:51 ../
drwxrwxr-x 2 user user 4096 Jul 25 11:50 SoftwareSerial/
drwxrwxr-x 2 user user 4096 Jul 17 11:51 WiFi/
drwxrwxr-x 2 user user 4096 Jul 17 11:51 Wire/

If I go to SoftwareSerial folder

I have SoftwareSerial.cpp but I could not find the place in that file to apply your patch.

EDIT

Might take a day or two at worst before I reply, vacation from work so not that much around a computer.

EDIT2

Got the CRC Check failed again now, seems to happen every 2-3 hours for me. (only been running the stick since this morning so this is a rough estimate)

Belaial commented 1 year ago

I "solved" this for now with a work around in HA, an automation that reboots the stick if "Grid frequency" has not updated for 5 minutes.

This will track how often it happens and also hopefully prove or disprove if https://github.com/hn/ginlong-solis/issues/17 reboots are not working due to no WiFi connection, depending if they stick stays offline for longer periods of time after my automation reboots it.

hn commented 1 year ago

Interestingly it seems other people have ModBus problems as well: https://github.com/esphome/issues/issues/3912

Notes:

I think this problem is not specific to the Solis S3 stick.

kuba2k2 commented 1 year ago

From what you're saying, it seems that this problem is not related to LibreTiny at all. Am I reading this right?

hn commented 1 year ago

I think that the problem is not related to the Solis YAML file and/or the ESPhome ModBus controller component (speaking in a picture: the "upper" layers).

Therefore, I conclude that it must be related to the "lower" layers, whatever they are. My understanding is that LibreTiny is an implementation of the Arduino core for RTL and other MCUs. If the problem with SDM/Growatt inverters was fixed by fixing the Arduino Core (in 2.0.6), I think it can only be fixed in LibreTiny (but as I said, I don't know to what extent LibreTiny and Arduino Core share code or concepts).

@kuba2k2

kuba2k2 commented 1 year ago

Well, yes - LibreTiny is an implementation of the Arduino framework, as well as a build system based on PlatformIO. LT's initial Serial class implementation was based on Arduino cores for ESP32 or other chips.

It doesn't make sense to compare the codebases between LT and ESP32, as they're completely different. It would make sense, however, to compare what changed between 2.0.5 and 2.0.6 in the serial impl. It's possible that the same change should apply to LT as well. On the other hand, your previous message states that one user hasn't had success after upgrading to 2.0.6 either.

hn commented 1 year ago

I am pretty convinced that I have found the cause of this problem: The Arduino Ringbuffer API is not interrupt-safe. If the serial hardware callback writes to the buffer and the ESPhome component reads from it at the same time, an inconsistent state can result. I was very surprised that the state of the ring buffer is stored in three variables (numElems, iHead, iTail). Especially because of the redundancy, there is a high chance that the ringbuffer will break and it will remain so permanently (exactly the behavior observed in this issue).

By coincidence @KurtE reported exactly this problem last week as well: https://github.com/arduino/ArduinoCore-API/issues/195.

As a quick workaround, I reset the ring buffer when the last char has been read:

int RingBufferN<N>::read_char()
{
  if (isEmpty())
    return -1;

  uint8_t value = _aucBuffer[_iTail];
  _iTail = nextIndex(_iTail);
  _numElems--;

  if (_numElems == 0) {  // workaround: set all three variables to a consistent state
    _iHead = 0;
    _iTail = 0;
  }

  return value;
}

With this patch, ModBus has been working for ~2 days without problems.

@kuba2k2 Do you have any idea what to do? Use SafeRingBuffer, which is part of ArduinoCore-renesas? Wrapping all read calls in NoInterrupts? In theory, all ESPhome components reading serial data have to be changed.

KurtE commented 1 year ago

I am not sure if this helps or not, but I have a related issue: https://github.com/arduino/ArduinoCore-renesas/issues/75 Where the SafeRingBuffer slows the operations a lot.

I have a draft PR in ArduinoCore-renesas to rework some of the Serial code. Still testing it now, and unclear if/when it might be approved.

But as part of it, I included a different implementation for RingBuffer, which I named SaferRingBuffer. https://github.com/KurtE/ArduinoCore-renesas/blob/SerialX_use_write_fifo/cores/arduino/FasterSafeRingBuffer.h Which removed the _numElems member variable, and when asked would compute the available or availableForWrite values from the head and tail members.

hn commented 1 year ago

Redirecting discussion to https://github.com/kuba2k2/libretiny/issues/154

@KurtE Thanks for your suggestion!

hn commented 1 year ago

@KurtE From a performance view it's probably better not to compute the _numElems variable, because available(), isFull() etc. are likely called more often than the r/w methods.

Removing _iHead seems to be trivial (don't know if I'm missing an edge case, didn't double-check):

Workaround for the Arduino ringbuffer, so that it can no longer have a permanently
inconsistent state. Implemented by removing the redunant storage of the buffer
state in three variables.
--- a/packages/framework-arduino-api/api/RingBuffer.h   2023-08-12 13:07:59.074806708 +0200
+++ b/packages/framework-arduino-api/api/RingBuffer.h   2023-08-11 15:35:11.847630525 +0200
@@ -37,7 +37,6 @@
 {
   public:
     uint8_t _aucBuffer[N] ;
-    volatile int _iHead ;
     volatile int _iTail ;
     volatile int _numElems;

@@ -74,17 +73,12 @@
   // current location of the tail), we're about to overflow the buffer
   // and so we don't write the character or advance the head.
   if (!isFull())
-  {
-    _aucBuffer[_iHead] = c ;
-    _iHead = nextIndex(_iHead);
-    _numElems++;
-  }
+    _aucBuffer[(_iTail + _numElems++) % N] = c ;
 }

 template <int N>
 void RingBufferN<N>::clear()
 {
-  _iHead = 0;
   _iTail = 0;
   _numElems = 0;
 }
KurtE commented 1 year ago

Performance is sort of a tricky thing here... As for store_char, thi implementation I am using does not use isFull...

template <int N>
void SaferRingBufferN<N>::store_char( uint8_t c )
{
  // if we should be storing the received character into the location
  // just before the tail (meaning that the head would advance to the
  // current location of the tail), we're about to overflow the buffer
  // and so we don't write the character or advance the head.
  int newhead = nextIndex(_iHead);
  if (newhead != _iTail)
  {
    _aucBuffer[_iHead] = c ;
    _iHead = newhead;
  }
}

No need, simply setup to where the new head would be, if it equals the tail, we were full... If I were in charge of the API, I would probly have the store_char return something, like a bool, so where in the user code you might have a loop, waiting for the isFull to return false and then call store_char, you could simply loop waiting for a call to store_char to succeed...

read_char is also simple:

template <int N>
int SaferRingBufferN<N>::read_char()
{
  if (_iHead == _iTail)
    return -1;

  uint8_t value = _aucBuffer[_iTail];
  _iTail = nextIndex(_iTail);

  return value;
}

Since both of these methods do not share a common member like the _numElems they are reasonably safe to use where one of the two is used to store or retrieve data to handle an ISR like sending or receiving data from the serial port and the other one is used by the sketch to send or retrieve the data. If you need multiple clients to for example do Serial1.write() and you want that to be totally safe, that no data will be lost, you can then use part or all of their SafeRingBuffer code, which looks like:

template <int N>
void FasterSafeWriteRingBufferN<N>::store_char(uint8_t c) {
  synchronized {
    SaferRingBufferN<N>::store_char(c);
  }
}

Where their synchronized, is a macro the adds a for loop, which creates an instance of an object on the stack, whose constructor globally disabled interrupts, and whose destructor will restore the state of if global interrupts were enabled on entry.

With the released RingBuffer code, the UART code within UNO R4 needed to do this on all store and read chars, mainly just to keep that count consistent.

On some platforms you could probably keep the count in a safe way, like on some/all risc-v boards, which support atomic operations, you could replace the _numElems++; with something like atomic_add(&_numElems, 1);

Hope that makes sense.