stuartpittaway / diyBMSv4ESP32

diyBMS v4 code for the ESP32 and new controller hardware
Other
184 stars 80 forks source link

Controller stopped sending CAN messages #194

Closed deekoowee closed 1 year ago

deekoowee commented 1 year ago

Hi @stuartpittaway Installed firmware 4e72db94af375c391bad47b9b98507b3271b069d The CAN messaging stopped; 1) Battery SoC was over 100% (~140%, while charging, 'Override State of Charge' was checked) 2) Discharging battery 3) When battery SoC reached 100%, 'CAN sent:' stopped counting. I catch it, because inverter complained missing battery. However discharging continued. I reset controller and 'CAN sent' counting started again and inverter was happy again.

stuartpittaway commented 1 year ago

I've not seen that behaviour so thanks for reporting it.

I'm concerned your inverter did stop discharging if it lost communication with the BMS?

deekoowee commented 1 year ago

Inverter did not stop discharging, when communication to BMS was lost

deekoowee commented 1 year ago

Hi, same problem occured again. No CAN messages to inverter. Inverter continued discharging. When restarting the controller, screen shows a note 'Failed to save settings' Firmware version is same: 4e72db94af375c391bad47b9b98507b3271b069d

stuartpittaway commented 1 year ago

Hi @deekoowee nobody else has reported an issue with CAN comms stopping - this could be either:

  1. A problem with the inverter/charger you have, which stops receiving the messages
  2. A potential faulty ESP32 and/or CAN chip on the controller PCB

I've been running the CAN code for over 6 months now, most recently for 70+ days straight, with over 31 million CAN messages sent! I'm using a SOFAR ME3000SP inverter.

image

deekoowee commented 1 year ago

Hi @stuartpittaway I have continued the testing with another controller with new ESP32. I purchased Seeed Studio's USB-CANBUS logger and now I have canbus log for this issue if it helps.

Today, the new controller stopped sending the canbus messages. Attached the canbus log file, which shows that there's no messages from controller after 19:37:08:772. When CAN communication lost, There's no other messages than inverter ID 0x305 (can be seen at almost end of canbus log file).

To make log smaller, I have cut unnecessary lines from beginning of the log. CANBUS log 20230516_190333.txt

Time of inverter's log is about 00:01:43 ahead to canbus log. Time difference is my good guess, when comparing the point at canbus log, where inverter started to send ID 0x305.

Inverter's log have; 1) 19:26:40 The battery asked for a forced charge. #196 2) 19:38:53 Battery CAN communication lost.

Inverter send ID 0x305 at 19:37:10:371 and 19:37:10:425 Could two ID 0x305 with in 0,1 sec period cause the controller crash (Does the controller read 0x305) ?

Normally ID 0x305 are with ~10sec period except 19:34:41:888 to 19:37:10:371 and 19:37:10:371 to 19:37:47:171

Could there be a problem in data between 19:34:41:888 and 19:37:10:371 , causing the inverter to not sending ID 0x305 ? I mean, if data is making the inverter 'crazy'.

I have also the voltage curve from HA, which shows 'wild' at 19:25:30 (almost same time, when inverter's log have 'The battery asked for a forced charge.' diybms voltage

Because the inverter didn't get any messages from controller, it continued the battery discharging and start charging the battery after controller reset, when canbus started again.

stuartpittaway commented 1 year ago

Thanks for the detail @deekoowee. On the graph was the battery discharging at this point? I don't see an issue with the voltage graph - its just responding to loads on the battery - if you added current to the other axis, I suspect there are higher loads at that point as well.

DIYBMS ignores the 0x305 messages at the moment. Those should be sent by the inverter every few seconds, but shouldn't cause problems if quicker/slower.

stuartpittaway commented 1 year ago

Using the 19:26:40:106 as a time start point, using the CAN logs you supplied, this is the message that's sent from the controller to the inverter. At this point, you mention the inverter said it was being requested to "force charge"

7687   Receive      19:26:40:106      Data frame      Standard frame      00000351      8      0c 02 40 06 c4 09 ce 01       
7688   Receive      19:26:40:121      Data frame      Standard frame      00000355      4      29 00 64 00       
7689   Receive      19:26:40:137      Data frame      Standard frame      00000356      6      24 13 8a ff 2c 01       
7690   Receive      19:26:40:153      Data frame      Standard frame      00000359      8      00 00 00 00 0a 50 4e 00       
7691   Receive      19:26:40:175      Data frame      Standard frame      0000035C      1      c0       
7692   Receive      19:26:40:206      Data frame      Standard frame      0000035E      8      50 59 4c 4f 4e 20 20 20 

However, if we compare this message to an earlier one from 19:24 (when the inverter didn't report the force charge requested), nearly all the data is identical except for the first byte of 0x351 and 0x356

6808   Receive      19:24:01:573      Data frame      Standard frame      00000351      8      13 02 40 06 c4 09 ce 01       
6809   Receive      19:24:01:588      Data frame      Standard frame      00000355      4      29 00 64 00       
6810   Receive      19:24:01:620      Data frame      Standard frame      00000356      6      26 13 47 ff 2c 01       
6811   Receive      19:24:01:642      Data frame      Standard frame      00000359      8      00 00 00 00 0a 50 4e 00       
6812   Receive      19:24:01:657      Data frame      Standard frame      0000035C      1      c0       
6813   Receive      19:24:01:673      Data frame      Standard frame      0000035E      8      50 59 4c 4f 4e 20 20 20       

Lets look at those.... 0x351 – Battery voltage + current limits

6808   Receive      19:24:01:573      Data frame      Standard frame      00000351      8      13 02 40 06 c4 09 ce 01       
7687   Receive      19:26:40:106      Data frame      Standard frame      00000351      8      0c 02 40 06 c4 09 ce 01       

battery charge voltage = UINT16 - Little Endian (BA) 0x1302 = 53.1V 0x0c02 = 52.4V

40 06 = 160.0Amp charge max, c4 09 = 250.0Amp discharge max ce 01 = 46.20V min. voltage

Plotting all those on a graph produces this. chart

0x356 - Battery voltage - Voltage / Current / Temp

6810   Receive      19:24:01:620      Data frame      Standard frame      00000356      6      26 13 47 ff 2c 01       
7689   Receive      19:26:40:137      Data frame      Standard frame      00000356      6      24 13 8a ff 2c 01       

0x2613 =49.02V battery voltage 0x47ff = -185Amp current 0x2c01 = 30.0 temperature C

0x2413 = 49.00V 0x8aff = -118Amp current 0x2c01 = 30.0 temperature C

0x035c

6806   Receive      19:24:00:553      Data frame      Standard frame      0000035C      1      c0       

The only CANBUS command which can request a force a charge is "0x35c", which is a single byte. Every packet of data in the log file you recorded is 0xc0 = 11000000 in binary, this translates to charging is allowed & discharge is allowed. The controller doesn't actually have the code written to force a charge on the PylonTech emulation!

So in summary, the data the controller appears to send to the inverter is correct. The controller never asks for a "force charge" (it has no way to do so).

However, this doesn't answer why the controller stops CAN comms at "19:37:08:772" in the log file.

Unfortunately, for that, I'm going to need the serial console output from the ESP32 when this problem occurs.

stuartpittaway commented 1 year ago

Just to add, looking at the 0x305 messages received by the controller from the inverter,

10305 Receive      19:34:31:535      Data frame      Standard frame      00000305      8      00 00 00 00 00 00 00 00
10360 Receive      19:34:41:888      Data frame      Standard frame      00000305      8      00 00 00 00 00 00 00 00
11161 Receive      19:37:10:371      Data frame      Standard frame      00000305      8      00 00 00 00 00 00 00 00
11162 Receive      19:37:10:425      Data frame      Standard frame      00000305      8      00 00 00 00 00 00 00 00
11163 Receive      19:37:38:506      Data frame      Standard frame      00000305      8      00 00 00 00 00 00 00 00
11164 Receive      19:37:47:171      Data frame      Standard frame      00000305      8      00 00 00 00 00 00 00 00

It does look very suspicious that two messages are received at "19:37:10" in quick succession, and then no more CAN BUS data is sent from the controller. Hence the need for the serial console output - I suspect there is a software code bug.

deekoowee commented 1 year ago

Battery was discharging.

Here's the current graph from same period diybms current

I think, I can log the serial console output directly to file with Putty. I'll put it online together with canbus logger, then I have to wait. Is this log style ok ? putty

deekoowee commented 1 year ago

Hi @stuartpittaway Here's the serial log. I cutted it from the original log file. First error can be found in line 1289

serial log.txt

stuartpittaway commented 1 year ago

Thanks for this @deekoowee. I've so far, not received any other reports of this bug.

I'm wondering if @atanisoft could shed any light on this CAN issue?

atanisoft commented 1 year ago

@stuartpittaway From the log file above:

E (175194290) diybms: Fail to queue CANBUS message (0x103)

The error code 0x103 is ESP_ERR_INVALID_STATE which indicates the bus is not in a state where it can transmit. From a very quick review of the diyBMS code I don't see any error handling for bus-off conditions which can occur due to external factors and leaves the CAN controller in a state where TX will fail. You will need to add some checks similar to this (replace can -> twai and it should work). While the bus is in an off or recovery state TX is not possible and should be delayed. Once it recovers though you should be able to TX the packet(s) as normal.

stuartpittaway commented 1 year ago

@deekoowee I've followed @atanisoft code/advice.

Can I ask you to try out the next version of DIYBMS controller code?

If you go to this link... https://github.com/stuartpittaway/diyBMSv4ESP32/actions/runs/5119711405

download the file DIYBMS-Release-Artifact-2023-05-30-09-18

deekoowee commented 1 year ago

Hi, Yes, I can make testing for that version. Do you need canbus & serial logs ?

stuartpittaway commented 1 year ago

The serial logs would be useful to see what errors occur, and how the controller deals with them. Thanks

deekoowee commented 1 year ago

The controller has updated with next version code and logging is activated. Now just wait ...

deekoowee commented 1 year ago

Here's the serial log

serial log.txt

stuartpittaway commented 1 year ago

Thank you @deekoowee

So the code I added detected that the CANBUS went into an error state (recovery needed). The code automatically attempted to recover (Initiating recovery) however it didn't appear to work, and the CANBUS entered the STOPPED state.

@atanisoft is recovery normally possible? Once in STOPPED state, can I re-start ?

D (44203473) diybms: Sent CAN message 0x351
D (44203494) diybms: Sent CAN message 0x355
D (44203514) diybms: Sent CAN message 0x356
E (44203534) diybms: Fail to queue CANBUS message (0x103)
I (44203534) diybms: CAN STATUS: rx-q:0, tx-q:0, rx-err:0, tx-err:128, arb-lost:13, bus-err:44, state: OFF / RECOVERY NEEDED
W (44203539) diybms: Initiating recovery
E (44203562) diybms: Fail to queue CANBUS message (0x103)
I (44203562) diybms: CAN STATUS: rx-q:0, tx-q:0, rx-err:0, tx-err:0, arb-lost:13, bus-err:44, state: STOPPED
E (44203585) diybms: Fail to queue CANBUS message (0x103)
I (44203585) diybms: CAN STATUS: rx-q:0, tx-q:0, rx-err:0, tx-err:0, arb-lost:13, bus-err:44, state: STOPPED
E (44204589) diybms: Fail to queue CANBUS message (0x103)
I (44204589) diybms: CAN STATUS: rx-q:0, tx-q:0, rx-err:0, tx-err:0, arb-lost:13, bus-err:44, state: STOPPED
E (44204613) diybms: Fail to queue CANBUS message (0x103)
I (44204613) diybms: CAN STATUS: rx-q:0, tx-q:0, rx-err:0, tx-err:0, arb-lost:13, bus-err:44, state: STOPPED
E (44204636) diybms: Fail to queue CANBUS message (0x103)
I (44204636) diybms: CAN STATUS: rx-q:0, tx-q:0, rx-err:0, tx-err:0, arb-lost:13, bus-err:44, state: STOPPED
E (44204659) diybms: Fail to queue CANBUS message (0x103)
atanisoft commented 1 year ago

is recovery normally possible? Once in STOPPED state, can I re-start ?

Yes, it should be possible. The call to twai_initiate_recovery will start the recovery (as you have seen) and it seems instead of returning to a running state like the old CAN driver did it goes into a STOPPED state. You can add one more else if clause for TWAI_STATE_STOPPED and call twai_start and then try sending the message.

You could also use the alerts mechanism to have it tell you when certain events occur so you can initiate recovery / restart / etc. I've shifted to the lower level TWAI HAL APIs and when TWAI_HAL_EVENT_BUS_RECOV_CPLT is raised I immediately trigger the bus restart.

deekoowee commented 1 year ago

Controller shows this;

Capture

What does it mean ?

stuartpittaway commented 1 year ago

Controller shows this; What does it mean ?

This is a count of errors for RECEIVING from the inverter - the "heart beat" which the inverter sends the controller about every 10 seconds. This error counter is incremented when a message isn't received (such as when the CANBUS is stopped/in error).

stuartpittaway commented 1 year ago

@deekoowee thanks for trying out the various attempts to get this working for you. Heres the latest one.

https://github.com/stuartpittaway/diyBMSv4ESP32/suites/13292392686/artifacts/725613861

@atanisoft thanks for the help.

deekoowee commented 1 year ago

Latest one uploaded to controller. Logging on and just waiting...

deekoowee commented 1 year ago

Controller shows this; Capture

4 errors and controller has passed them without a problem.

stuartpittaway commented 1 year ago

Great, so it looks like the recovery on the CANBUS works.

I still think there is a software fault in the inverter, but if this helps then great.