Closed Argafal closed 1 year ago
MQTT has stopped reported production values now and only reports uptime. Here are two pages of recent serial output. Right at half way through the log the pattern of messages changes, related? I don't know the ahoy code well enough to understand what causes the change in output.
I have now rebooted ahoy.
I: Payload (62): 00 01 01 63 01 43 01 3B 04 7B 04 5E 00 04 65 72 00 03 F8 15 01 1E 01 19 01 70 01 BB 01 F2 06 5B 07 26 00 04 75 6A 00 04 7C 83 01 64 01 8F 09 4D 13 8A 15 3B 00 DD 00 E4 03 E7 00 F7 00 02
I: enqueued cmd failed/timeout
I: (#2) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 2
I: (#2) Requesting Inv SN 1161XXX4
I: (#2) enqueuedCmd: 0xb
I: (#2) sendTimePacket
I: TX 27B Ch61 | 15 74 40 42 54 86 99 51 75 80 0B 00 63 E2 43 D7 00 00 00 01 00 00 00 00 DB 0D 45
W: (#2) nothing received
I: enqueued cmd failed/timeout
I: (#0) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 0
I: (#0) Requesting Inv SN 1141XXX9
I: (#0) enqueuedCmd: 0xb
I: (#0) sendTimePacket
I: TX 27B Ch75 | 15 72 22 17 79 86 99 51 75 80 0B 00 63 E2 43 F5 00 00 00 01 00 00 00 00 7A 8D 5A
W: (#0) nothing received
I: resetPayload: id: 1
I: (#1) Requesting Inv SN 1161XXX0
I: (#1) enqueuedCmd: 0xb
I: (#1) sendTimePacket
I: TX 27B Ch3 | 15 74 40 35 30 86 99 51 75 80 0B 00 63 E2 44 13 00 00 00 02 00 00 00 00 C4 9A 1E
RX whatHappened Ch61 001
RX whatHappened Ch61 001
RX whatHappened Ch40 001
RX whatHappened Ch40 001
I: RX 27B Ch61 | 95 74 40 35 30 74 40 35 30 01 00 01 01 64 01 3F 01 37 04 6D 04 51 00 04 65 75 D0
I: RX 27B Ch61 | 95 74 40 35 30 74 40 35 30 02 00 03 F8 18 01 21 01 1C 01 70 01 B5 01 EC 06 47 20
I: RX 27B Ch40 | 95 74 40 35 30 74 40 35 30 03 07 12 00 04 75 6E 00 04 7C 87 01 68 01 93 09 4B DA
I: RX 27B Ch40 | 95 74 40 35 30 74 40 35 30 84 13 8A 14 FD 00 DC 00 E2 03 E7 00 F6 00 02 AC 21 C2
I: procPyld: cmd: 0xb
I: procPyld: txid: 0x95
I: Payload (62): 00 01 01 64 01 3F 01 37 04 6D 04 51 00 04 65 75 00 03 F8 18 01 21 01 1C 01 70 01 B5 01 EC 06 47 07 12 00 04 75 6E 00 04 7C 87 01 68 01 93 09 4B 13 8A 14 FD 00 DC 00 E2 03 E7 00 F6 00 02
I: enqueued cmd failed/timeout
I: (#2) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 2
I: (#2) Requesting Inv SN 1161XXX4
I: (#2) enqueuedCmd: 0xb
I: (#2) sendTimePacket
I: TX 27B Ch23 | 15 74 40 42 54 86 99 51 75 80 0B 00 63 E2 44 31 00 00 00 01 00 00 00 00 65 5E 49
W: (#2) nothing received
I: enqueued cmd failed/timeout
I: (#0) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 0
I: (#0) Requesting Inv SN 1141XXX9
I: (#0) enqueuedCmd: 0xb
I: (#0) sendTimePacket
I: TX 27B Ch40 | 15 72 22 17 79 86 99 51 75 80 0B 00 63 E2 44 4F 00 00 00 01 00 00 00 00 C7 77 A0
W: (#0) nothing received
I: resetPayload: id: 1
I: (#1) Requesting Inv SN 1161XXX0
I: (#1) enqueuedCmd: 0xb
I: (#1) sendTimePacket
I: TX 27B Ch61 | 15 74 40 35 30 86 99 51 75 80 0B 00 63 E2 44 6D 00 00 00 02 00 00 00 00 66 B3 EB
RX whatHappened Ch23 001
RX whatHappened Ch3 001
RX whatHappened Ch3 001
RX whatHappened Ch3 001
I: RX 27B Ch23 | 95 74 40 35 30 74 40 35 30 01 00 01 01 63 01 40 01 38 04 71 04 54 00 04 65 78 B3
I: RX 27B Ch3 | 95 74 40 35 30 74 40 35 30 02 00 03 F8 1B 01 24 01 1F 01 70 01 B6 01 ED 06 4B 2B
I: RX 27B Ch3 | 95 74 40 35 30 74 40 35 30 03 07 16 00 04 75 72 00 04 7C 8B 01 6C 01 97 09 4B CE
I: RX 27B Ch3 | 95 74 40 35 30 74 40 35 30 84 13 87 15 0B 00 DC 00 E2 03 E7 00 F7 00 02 45 E0 11
I: procPyld: cmd: 0xb
I: procPyld: txid: 0x95
I: Payload (62): 00 01 01 63 01 40 01 38 04 71 04 54 00 04 65 78 00 03 F8 1B 01 24 01 1F 01 70 01 B6 01 ED 06 4B 07 16 00 04 75 72 00 04 7C 8B 01 6C 01 97 09 4B 13 87 15 0B 00 DC 00 E2 03 E7 00 F7 00 02
I: enqueued cmd failed/timeout
I: (#2) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 2
I: (#2) Requesting Inv SN 1161XXX4
I: (#2) enqueuedCmd: 0xb
I: (#2) sendTimePacket
I: TX 27B Ch75 | 15 74 40 42 54 86 99 51 75 80 0B 00 63 E2 44 8B 00 00 00 01 00 00 00 00 02 15 DF
W: (#2) nothing received
I: enqueued cmd failed/timeout
I: (#0) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 0
I: (#0) Requesting Inv SN 1141XXX9
I: (#0) enqueuedCmd: 0xb
I: (#0) sendTimePacket
I: TX 27B Ch3 | 15 72 22 17 79 86 99 51 75 80 0B 00 63 E2 44 A9 00 00 00 01 00 00 00 00 A3 95 C0
W: (#0) nothing received
I: resetPayload: id: 1
I: (#1) Requesting Inv SN 1161XXX0
I: (#1) enqueuedCmd: 0xb
I: (#1) sendTimePacket
I: TX 27B Ch23 | 15 74 40 35 30 86 99 51 75 80 0B 00 63 E2 44 C7 00 00 00 02 00 00 00 00 C1 35 60
RX whatHappened Ch61 001
RX whatHappened Ch61 001
RX whatHappened Ch40 001
I: RX 27B Ch61 | 95 74 40 35 30 74 40 35 30 02 00 03 F8 1D 01 26 01 21 01 70 01 B6 01 ED 06 48 12
I: RX 27B Ch61 | 95 74 40 35 30 74 40 35 30 03 07 14 00 04 75 76 00 04 7C 90 01 70 01 9C 09 4E C1
I: RX 27B Ch40 | 95 74 40 35 30 74 40 35 30 84 13 89 14 F4 00 E2 00 E1 03 E7 00 F7 00 02 60 50 49
W: Frame 1 missing: Request Retransmit
I: TX 11B Ch40 | 15 74 40 35 30 86 99 51 75 81 9E
RX whatHappened Ch75 001
I: RX 27B Ch75 | 95 74 40 35 30 74 40 35 30 01 00 01 01 62 01 3C 01 38 04 60 04 52 00 04 65 7A DB
I: procPyld: cmd: 0xb
I: procPyld: txid: 0x95
I: Payload (62): 00 01 01 62 01 3C 01 38 04 60 04 52 00 04 65 7A 00 03 F8 1D 01 26 01 21 01 70 01 B6 01 ED 06 48 07 14 00 04 75 76 00 04 7C 90 01 70 01 9C 09 4E 13 89 14 F4 00 E2 00 E1 03 E7 00 F7 00 02
I: enqueued cmd failed/timeout
I: (#2) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 2
I: (#2) Requesting Inv SN 1161XXX4
I: (#2) enqueuedCmd: 0xb
I: (#2) sendTimePacket
I: TX 27B Ch61 | 15 74 40 42 54 86 99 51 75 80 0B 00 63 E2 44 E5 00 00 00 01 00 00 00 00 60 F1 37
W: (#2) nothing received
I: enqueued cmd failed/timeout
I: (#0) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 0
I: (#0) Requesting Inv SN 1141XXX9
I: (#0) enqueuedCmd: 0xb
I: (#0) sendTimePacket
I: TX 27B Ch75 | 15 72 22 17 79 86 99 51 75 80 0B 00 63 E2 45 03 00 00 00 01 00 00 00 00 C1 42 DE
I: resetPayload: id: 1
I: (#1) Requesting Inv SN 1161XXX0
I: (#1) enqueuedCmd: 0xb
I: (#1) sendTimePacket
I: TX 27B Ch3 | 15 74 40 35 30 86 99 51 75 80 0B 00 63 E2 45 21 00 00 00 02 00 00 00 00 60 86 95
I: enqueued cmd failed/timeout
I: (#2) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 2
I: (#2) Requesting Inv SN 1161XXX4
I: (#2) enqueuedCmd: 0xb
I: (#2) sendTimePacket
I: TX 27B Ch23 | 15 74 40 42 54 86 99 51 75 80 0B 00 63 E2 45 3F 00 00 00 01 00 00 00 00 C0 43 FE
I: enqueued cmd failed/timeout
I: (#0) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 0
I: (#0) Requesting Inv SN 1141XXX9
I: (#0) enqueuedCmd: 0xb
I: (#0) sendTimePacket
I: TX 27B Ch40 | 15 72 22 17 79 86 99 51 75 80 0B 00 63 E2 45 5D 00 00 00 01 00 00 00 00 A2 F2 53
I: enqueued cmd failed/timeout
I: (#1) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 1
I: (#1) Requesting Inv SN 1161XXX0
I: (#1) enqueuedCmd: 0xb
I: (#1) sendTimePacket
I: TX 27B Ch61 | 15 74 40 35 30 86 99 51 75 80 0B 00 63 E2 45 7B 00 00 00 02 00 00 00 00 C3 04 EE
I: enqueued cmd failed/timeout
I: (#2) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 2
I: (#2) Requesting Inv SN 1161XXX4
I: (#2) enqueuedCmd: 0xb
I: (#2) sendTimePacket
I: TX 27B Ch75 | 15 74 40 42 54 86 99 51 75 80 0B 00 63 E2 45 99 00 00 00 01 00 00 00 00 67 90 2C
I: enqueued cmd failed/timeout
I: (#0) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 0
I: (#0) Requesting Inv SN 1141XXX9
I: (#0) enqueuedCmd: 0xb
I: (#0) sendTimePacket
I: TX 27B Ch3 | 15 72 22 17 79 86 99 51 75 80 0B 00 63 E2 45 B7 00 00 00 01 00 00 00 00 C6 45 6A
I: enqueued cmd failed/timeout
I: (#1) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 1
I: (#1) Requesting Inv SN 1161XXX0
I: (#1) enqueuedCmd: 0xb
I: (#1) sendTimePacket
I: TX 27B Ch23 | 15 74 40 35 30 86 99 51 75 80 0B 00 63 E2 45 D5 00 00 00 02 00 00 00 00 A4 B0 93
I: enqueued cmd failed/timeout
I: (#2) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 2
I: (#2) Requesting Inv SN 1161XXX4
I: (#2) enqueuedCmd: 0xb
I: (#2) sendTimePacket
I: TX 27B Ch40 | 15 74 40 42 54 86 99 51 75 80 0B 00 63 E2 45 F3 00 00 00 01 00 00 00 00 C5 46 32
I: enqueued cmd failed/timeout
I: (#0) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 0
I: (#0) Requesting Inv SN 1141XXX9
I: (#0) enqueuedCmd: 0xb
I: (#0) sendTimePacket
I: TX 27B Ch61 | 15 72 22 17 79 86 99 51 75 80 0B 00 63 E2 46 11 00 00 00 01 00 00 00 00 6E 66 44
I: enqueued cmd failed/timeout
I: (#1) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 1
I: (#1) Requesting Inv SN 1161XXX0
I: (#1) enqueuedCmd: 0xb
I: (#1) sendTimePacket
I: TX 27B Ch75 | 15 74 40 35 30 86 99 51 75 80 0B 00 63 E2 46 2F 00 00 00 02 00 00 00 00 0F 3A 4B
I: enqueued cmd failed/timeout
I: (#2) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 2
I: (#2) Requesting Inv SN 1161XXX4
I: (#2) enqueuedCmd: 0xb
I: (#2) sendTimePacket
I: TX 27B Ch3 | 15 74 40 42 54 86 99 51 75 80 0B 00 63 E2 46 4D 00 00 00 01 00 00 00 00 6D CF AE
I: enqueued cmd failed/timeout
I: (#0) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 0
I: (#0) Requesting Inv SN 1141XXX9
I: (#0) enqueuedCmd: 0xb
I: (#0) sendTimePacket
I: TX 27B Ch23 | 15 72 22 17 79 86 99 51 75 80 0B 00 63 E2 46 6B 00 00 00 01 00 00 00 00 0C 7D 47
I: enqueued cmd failed/timeout
I: (#1) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 1
I: (#1) Requesting Inv SN 1161XXX0
I: (#1) enqueuedCmd: 0xb
I: (#1) sendTimePacket
I think we have some kind of memory problem. The website issue comes up because the API only returns null
. There is no recover from this state and happens to me from time to time.
The log shows that there is data received RX
but suddenly there will be no new data.
Yes, I agree. I documented an OOM in issue #644, may or may not be related.
How can I help to narrow this down further? Is building the esp8266-debug target the right way of getting more debug output? (Building this target currently fails for me, so I can't try it yet, see issue #653)
Saw both of the related issues. Don't know about that, maybe digging deeper the next days.
I try to reduce code which dynamically allocates memory, so yesterday I removed another new
. I think it cannot be solved with one commit, it's more like a process over several of commits.
You can try to review the code and check for memory leaks. Maybe you find some section where the code can be optimized to reduce memory.
We use several subroutines which enlarges the stack once they're called. Maybe we have to think about a desing which is more flat
without a deep hierachy.
@Argafal in short: dont't know if it is related to the problem, but I saw two strange things in your log file:
@beegee3 Much appreciate that you commented on the log file. The first point you make is most curious. Maybe @lumapu understands it better?
About the second point, I recently changed retransmits from 5 to 50. The reason was: I noticed that if any inverter didn't come online within the first 10-15 minutes after boot it never seems to come online. The only thing that helps is a reboot. I assumed that this means 5 connection attempts failed and ahoy has permanently given up on that inverter. Increasing the number of attempts to infinite (or in my case 50) seemed the smart thing to do: I know the inverter is there and producing, thus: "Ahoy, please keep on trying." Do I misunderstand what retransmits actually does/means?
@beegee3 your point 1. is the same I was about to make after looking thoroughly at the log file. The inverter sent a complete payload and Ahoy should process that
iv->addValue(i, payload, rec);
...
iv->doCalculations();
notify(mPayload[iv->id].txCmd);
...
iv->setQueuedCmdFinished();
decode and add the values to the internal inverter structure using memcpy do some calculations on various composite metrics and notify other parts of the code about the changes it received and processed.
But then instead of finishing the enqueuedCmd with success, it complains about the command being failed/timeout ?
See here: https://github.com/lumapu/ahoy/blob/development03/src/hm/hmPayload.h#L261-L322
@Argafal regarding Retransmits & Retries, it has been shown in the earliest HackRF traces between Hoymiles DTU WLite and HM-600 inverters that the NRF24L01+ in the inverter is only sending 15 Retries of the same packet in sequence.
But I guess you have modified the Retransmits, which is the number of times the DTU will send another 0x8N packet to retrieve the Nth packet of the complete message. So here you will remain in the same Send loop for 50 iterations and try on receiving any of the up to 4 packets each complete message / payload can be made up.
According to the logfiles you posted via the Discord / Pastebin I have seen that one of your inverters has some difficulty getting the message over without Retransmits.
I am a bit indifferent whether this is a smart / bad thing to do, if it helps :+1:
However there might be difficulties in the order of the packets that the Hoymiles inverter expects in a new communication with the DTU. We may not have the correct order yet, so this may contribute to the success of reaching the inverter.
In the past we used 0x0B
RealTimeRunData_Debug (4 packets) first and now we ask for 0x01
InverterDevInform_Simple (should be only 1) in the beginning.
I have experienced this as well. For what it's worth, I was able to reboot ahoy remotely by calling
@Argafal @stefan123t delving deeper into the programming I found the explanation for 1.: the timeout message belongs to the next inverter (no. 2). It's here: https://github.com/lumapu/ahoy/blob/development03/src/hm/hmPayload.h#L125-L129 Inverter no. 1 enqueuedCmd ended successfully and the process on inverter no. 2 started. This one hasn't answered before (cf. a few lines above "(# 2) nothing received").
The timeout message is confusing because of the missing inverter no. and some hint like "last time ...", but I think @lumapu liked to replace lines 125 and 126 with 127 - 129 and just forgot to delete the first two lines 😆
Sorry for wasting your time 😒
@beegee3 @Argafal I gave the timestamps in the sendTimePacket(s) some more attention:
# 63 E2 43 D7
$ date --date="@$(echo 'ibase=16; 63E243D7'|bc)" '+%F %H:%M:%S'
2023-02-07 13:28:07
# 63 E2 44 E5
$ date --date="@$(echo 'ibase=16; 63E244E5'|bc)" '+%F %H:%M:%S'
2023-02-07 13:32:37
But this seems to have been somewhere around lunch time, so the inverters should have eventually received enough sunshine.
I can rule out to little sun light as a part of the problem. Super sunny the last two days.
However, I think you are onto something with the retransmits/intervals. With retransmits=5, interval=30s I waited a long time for the inverters and sometimes never caught them. With retransmits=50, interval=30s it was probably the same. Maybe worse, hard to measure. I now use retransmits=5, interval=10s, and boy that seems to have made for a big improvement. Inverters are now found much more reliably and quickly upon boot-up. Is there a reason to have a 30s default? @Knickohr what is your interval time?
Related question: does ahoy ever give up on an inverter entirely, and if so, when?
On the negative side, whilst I now quickly find the inverters the webserver breaks away much more quickly than before. I don't make it through even a few minutes now before the website becomes an empty template (screenshot above). If there is anything to test or any additional debug to activate to narrow the issue down, I now seem to reproduce the problem easily.
Additional symptom: I notice that the website occasionally recovers after a minute or two.
I‘m still at default 30s interval time.
@Argafal interval describes the time period between two requests. Since a loop iterates over the inverters it needs (number of inverters) interval seconds that the same inverter is requested again (with your current setting 3 10 = 30 seconds). The default setting interval=30 comes from older times where the logic was different in a way. retransmits is the maximum number of single requests if a frame is missing. That is the inverter has a counter in his answers (=frames) starting with 1. In the last frame 0x80 is added to the counter. That way it is easy to detect missing frames. These will be requested again immediately (up to x times with x = your retransmits value). If there is no answer at all there is no retransmit.
@Argafal please check again with latest version 0.5.89
I confirm that this issue still occurs with 0.5.89.
@Argafal How are you using Ahoy? Is more than one client accessing the website (maybe browser + a script collecting values from the API) and are you writing limit to the inverter using either MQTT or the API?
I have run into the same issue with a single inverter and multiple clients, but you don't seem to be hammering the website nearly as much as I was at the time.
It might not be a memory leak as such, but more an issue with memory fragmentation that can occur if one is creating and destroying larger String objects. Dynamically allocating and releasing blocks of memory will lead to 'holes' in the heap that are simply too small the accommodate the Strings that you're trying to put in there. I'm guessing that's also why the web server is returning 'null'. It can't allocate a large enough block of memory to hold the response. That would also explain why the site sometimes receivers, if a larger block of memory is released elsewhere.
Quick primer on the subject: https://cpp4arduino.com/2018/11/06/what-is-heap-fragmentation.html
To diagnose the problem, the same site links to a code fragment that's able to check the size of the largest available block of memory and calculate the level of fragmentation on and ESP8266. It may be wise to include this piece of code to get a better idea of what's causing the issue.
As a 'quick fix' that would be very quick to implement, we may want to check whether it's possible to allocate a large enough block of memory and reset the micro if that's not the case.
With a bit more time and effort we could use the String.reserve() function for suspicious Strings to avoid much of the dynamic memory allocation.
To properly fix the problem permanently, we may have to avoid dynamically allocated String altogether.
@tastendruecker123: the latest version 0.5.89 worked on the problem you describe. It seems to be fine now🤞
@lumapu: because of @Argafal's comment
I have enabled serial output and read the serial log
I looked at the code for creating the serial log output (web.h
). Maybe there is a possible issue in the function serialCb
where the serial buffer mSerialBuf
of size WEB_SERIAL_BUF_SIZE
is filled. There a two lines if((...) <= WEB_SERIAL_BUF_SIZE) {...}
to avoid overflow errors. So it is possible that the buffer can be filled completely and does not contain a '0' indicating the end of the buffer, i.e. an overflow error occurs on sending it to the website.
In summary the two if clauses should be corrected to if((...) < WEB_SERIAL_BUF_SIZE) {...}
(delete the =
sign).
the messages at discarded or not collected. I don't think the serial buffer will lead to a problem
I have the same problem as mentioned above. I'm on the latest developer version. after +- 24 hours of working properly, a reboot takes place. Before that, the web server is half/unreachable.
should be solved with current dev version eg. 0.5.96
So cool, well done. I'd love to hear what the solution was in the end. Do you care to share? :)
main issue were strtok
functions. It generates huge heap fragmentation which leads to the problems.
Never thought that before, functionality is now the same but working around strtok
Great work, @lumapu!
Problem ist scheinbar doch wieder aufgetaucht.... Schade.
Könnten Betroffene (und auch Nicht-Betroffene) bitte kurz diese Fragen beantworten:
10
Absturz wäre der falsche Begriff: WLAN Verbindung für WEB UI geht nicht mehr (MQTT Datentransport funktioniert weiterhin). Nach WLAN reconnect (via WLAN Überwachung wird der AHOY zum reconnect gezwungen) ist WEB UI wieder normal bis zum nächsten "Problem" erreichbar.
Meiner Einschätzung nach existiert ein Problem mit dem Port des Webservers der UI.
Running 0.5.80 on an ESP8266. This problem was also present in 0.5.78 (and maybe before).
Error description: the website stops working. There is no obvious user interaction or inverter condition that I can identify as a possible cause of the problem. The problem shows like this:
MQTT data keeps coming, so AHOY seems to be running just fine. I do not get any error messages related to the webserver on the serial console. This might be related to #645, however in #645 it is reported that the website is not reachable at all, whereas I still see parts of it.