letscontrolit / ESPEasy

Easy MultiSensor device based on ESP8266/ESP32
http://www.espeasy.com

[Timing] Overview of timing issues per plugin #1557

Closed TD-er closed 4 years ago

TD-er commented 6 years ago

Starting with build mega-20180708, some timing statistics are logged. These can currently only be seen when looking at the log via the serial port (a web log view will be added too, but it doesn't work yet; that's another issue). The log level should be set to Debug.

These logs show the min/avg/max durations in usec of several calls to plugins. A few examples are shown below.

70772995 : LoopStats: shortestLoop: 46 longestLoop: 1058298 avgLoopDuration: 90.23 systemTimerDuration: 25.35 systemTimerCalls: 31 loopCounterMax: 652173 loopCounterLast: 324928 countFindPluginId: 0
70773002 : PluginStats P_22_Extra IO - PCA9685 READ                 Count: 1 Avg/min/max 15.00/15/15 usec
70773011 : PluginStats P_22_Extra IO - PCA9685 ONCE_A_SECOND        Count: 30 Avg/min/max 15.87/11/27 usec
70773020 : PluginStats P_22_Extra IO - PCA9685 TEN_PER_SECOND       Count: 300 Avg/min/max 12.05/3/38 usec
70773029 : PluginStats P_22_Extra IO - PCA9685 WRITE                Count: 10 Avg/min/max 453.00/387/796 usec
70773039 : PluginStats P_22_Extra IO - PCA9685 FIFTY_PER_SECOND     Count: 1489 Avg/min/max 15.60/10/46 usec
70773048 : Plugin call 50 p/s   stats: Count: 1489 Avg/min/max 587.28/504/922 usec
70773056 : Plugin call 10 p/s   stats: Count: 300 Avg/min/max 538.97/500/857 usec
70773063 : Plugin call 10 p/s U stats: Count: 300 Avg/min/max 3551.89/3204/3978 usec
70773070 : Plugin call  1 p/s   stats: Count: 30 Avg/min/max 9547.20/1782/203298 usec
70773078 : checkSensors()       stats: Count: 30 Avg/min/max 882.50/637/2346 usec
70773085 : WD   : Uptime 1179 ConnectFailures 2 FreeMem 13424

Some help in reading this: data is collected over a 30-second window, after which this report is generated. For example, 300 calls to "Plugin call 10 p/s" is exactly right. To be able to serve all calls to "Plugin call 50 p/s", no other call should take more than 20 msec (20,000 usec). An occasional peak is not really a problem, but average values over 20 msec may prevent the 1/50th-second calls from running on time. The same goes for calls taking over 100 msec, which may delay the 1/10th-second calls.
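For plugin authors who want to check this from within their own code, here is a minimal sketch of such a budget check (callMyPluginHandler() is just a placeholder, not an ESPEasy function; addLog() and LOG_LEVEL_DEBUG are assumed to be the usual ESPEasy logging names): time one invocation with micros() and log a warning when it exceeds the 20 msec budget.

     // Minimal sketch of the budget check described above (illustrative only).
     unsigned long start = micros();
     callMyPluginHandler();                       // placeholder for whatever work the plugin does per call
     unsigned long duration = micros() - start;
     if (duration > 20000UL) {                    // 20 msec budget needed to keep the 50 p/s calls on schedule
       String log = F("Timing: handler took ");
       log += duration;
       log += F(" usec, over the 20 msec budget");
       addLog(LOG_LEVEL_DEBUG, log);
     }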

Not many plugins use the 1/50th call, but when they do, they are probably time critical.

The idea of this issue is to collect data on plugins that may exceed these timing restrictions. For example, reading the buffer from software serial is rather timing-critical.

I will later add the same stats to calls to controllers. For now only plugin (task) calls are processed.

s0170071 commented 6 years ago

That's excellent. I spent a lot of time debugging those timing issues.

I found it to be good practice to issue a debug message from the plugin if it was called later than expected. E.g. my MT681 plugin (reads from serial) notifies me if it was last called more than 100 ms ago. This way any misbehaviour of a plugin is preceded by a timing-violation warning, which gives a very valuable hint that the source of the trouble is not where it appears to be.

micropet commented 6 years ago

Here is another log with the PMS7003 and MH-Z19. The MH-Z19 often has a read error, and the PMSx003 reports an invalid frame length.

(screenshot: device_page_1)

(attached log: ESP-206 with PMS7003.zip)

TD-er commented 6 years ago

Hmm, strange, I don't see the PMSx003 stats in the log. Maybe those lines appear too fast to be sent to the log server? You should be able to see them in the serial log output.

micropet commented 6 years ago

Why should only the stat messages from the PMS be too fast? The MH-Z19 is connected to the serial port.

TD-er commented 6 years ago

It is not that those are too fast, but rather that the burst of log lines may come too fast, so a number of lines get dropped. The order of these lines is always the same, and I guess it is the order in which they were first called.

clumsy-stefan commented 6 years ago

@TD-er: just because you mention the time-criticality of the software serial buffer readout in your first post... I see on my nodes that the two Sonoff POWs restart much more frequently (between 60 sec and 5 hours) than the rest of the nodes... this could point to an issue in (my) _P077 plugin, which basically reads software serial whenever it can... I couldn't find the issue yet, but if someone else could have a closer look at it, it would probably help to identify issues... especially with the new timing statistics...

TD-er commented 6 years ago

You could check whether the reboot cause was the watchdog timer (a software watchdog is also possible). The reboot cause is shown on the sysinfo page.

clumsy-stefan commented 6 years ago

Pow01: Boot: Manual reboot (17) Reset Reason: Exception. That's from 52 min ago.

Pow02: Boot: Manual reboot (53) Reset Reason: Exception. That's from 32 min ago.

micropet commented 6 years ago

I have never rebooted the device manually.

Uptime | 0 days 0 hours 21 minutes
Load | 50% (LC=9303)
Free Mem | 12264 (10744 - sendContentBlocking)
Boot | Manual reboot (13)
Reset Reason | Hardware Watchdog

s0170071 commented 6 years ago

@s0170071 @clumsy-stefan I did have a look at your plugin but couldn't spot an error. Still, I have some thoughts on it:

  1. I successfully used the 50-per-second case to read from serial. Just don't do anything blocking there.
  2. Check the Serial.available() value. 128 means you may have a buffer overflow; discard any serial data then.
  3. You're not the only one with reboots at the moment, so there may be something else going on. Try a debug statement (write directly to serial) each time you enter or exit your code. Also try running a blank unit with no plugins and see whether it stays up in your environment.
  4. Especially when playing with serial reads, I found it good practice to monitor whether your code was called in time. If some other task delays you, you start missing data.

To monitor how long ago your code was called, try this. Global variable: unsigned long lastInvocation; then, in your reading code:

     unsigned long now = millis();
     unsigned int numberOfBytes = 0;
     if ((now - lastInvocation) > 200) {
       // Called later than expected: log it so timing violations become visible.
       String logStringEHZ = F("MT681: last invocation was ");
       logStringEHZ += (now - lastInvocation);
       logStringEHZ += F(" ms ago - that's too long!");
       addLog(LOG_LEVEL_INFO, logStringEHZ);
     }

     lastInvocation = now;
     if (!Serial.available()) return 0;           // nothing to receive

     while (Serial.available()) {
       if ((millis() - lastInvocation) > 5) return 0;  // time is up, return control to the system
       buffer[bufferPtr++] = Serial.read();
       // ... do your stuff with the received byte here.
     }

clumsy-stefan commented 6 years ago

@s0170071 I see your points, thanks for reflecting on them. Some thoughts:

  1. Even if there is a buffer overflow, it would just mean the data can't be calculated; the plugin's own buffers can't overflow because of that. However, I'll add the check.

  2. Yes, all my units reboot from time to time... the POWs just reboot much more often than the others. And there are clean units, only running the switch and energy plugins...

  3. Agreed. But why should "missing data" lead to a reboot? I'll double-check the conversion routines, as that is the only point I can think of where "wrong" data could cause a memory violation or similar... I'll add some further debug code... just haven't had the time yet...

clumsy-stefan commented 6 years ago

@s0170071 I just tried to read a bit about serial communication, and what I can't find is any documentation saying that Serial.available() will return 128 on a buffer overflow... Especially as SoftwareSerial uses a ring buffer of 64 bytes and discards everything after that. Can you point me to some useful information on such an implementation?

Also, I always return control to the system directly after reading one byte, so why should there be a need to check whether the plugin runs for more than 5 millis (as in your example)? Thanks for any explanation on this; it seems I'm missing something currently...

TD-er commented 6 years ago

@clumsy-stefan I would also like to know of a well-working way to query the state of the serial buffer. I looked into that a while ago, to stop writing log output to the serial port when nothing is reading the data from it. The thing is, the ESP gets noticeably slower when serial port logging is still enabled and nothing is reading from it.

clumsy-stefan commented 6 years ago

I found at least one small issue in the P077 plugin: even when Serial.available() is not > 0, calculations are done instead of just returning from the function. However, this does not seem to be the cause of the (more) frequent reboots...

s0170071 commented 6 years ago

@clumsy-stefan The ESP has a 128-byte hardware FIFO. As of core version 2.4 there is also a software buffer to extend the hardware buffer, and you can set its size to fit your needs. Nonetheless, an overflow will happen if the buffer is full. This does not crash the ESP, but you will lose data. Once you start losing data, there is a high probability that the remaining data in the buffer is either old or incomplete. What I usually do in that case is read() until there is no more data left (discard the buffer), as sketched below.
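A minimal sketch of that discard step, assuming the standard Arduino Serial object (the 128-byte threshold is the hardware FIFO size mentioned above; with the core 2.4 software buffer the right value may differ):

     // If the RX buffer looks full, an overflow may have occurred and the
     // remaining bytes are likely stale or incomplete, so drain and discard them.
     if (Serial.available() >= 128) {
       while (Serial.available()) {
         Serial.read();   // throw the byte away
       }
     }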

About the 5 ms timeout: whether you need this or not depends on how fast your data arrives and what you do with it. If you get it at 230 kbps in bursts of 5 kbytes and only yield now and then, you may end up blocking the system longer than you should. The same applies if you wait for a data frame to complete (reading in a for loop).

Neither is a problem in your code; I just happen to do it in mine, as I read bursts of 500 bytes at 9600 bps, which would take about 500 ms. Anyway, if you don't want to calculate how long your code takes to execute, this is the way to make sure it behaves. If everyone did it, @TD-er would not have to tackle the timing issues he currently faces.

There have also been numerous bugs in the core associated with the serial buffer. It somehow shares an interrupt routine with a timer, I think... But from what I can tell, they are all fixed in the current git version. At least I have no problems with it at the moment.

Also a good hint is to monitor the available memory. Low memory is a nasty source of errors... Make sure there are at least 3k free at all times. That's how much the IP stack requires.
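A minimal sketch of such a memory check, using the standard ESP8266 Arduino ESP.getFreeHeap() call (the 3 kB threshold is the figure from this comment, not an official limit; addLog() and LOG_LEVEL_INFO are assumed to be the usual ESPEasy logging names):

     // Warn when the free heap drops below roughly what the IP stack needs.
     if (ESP.getFreeHeap() < 3000) {
       String log = F("Low memory: ");
       log += ESP.getFreeHeap();
       log += F(" bytes free");
       addLog(LOG_LEVEL_INFO, log);
     }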

s0170071 commented 6 years ago

@TD-er what do you think about enforcing correct timing behaviour for plugins? If a plugin is caught executing longer than it should, e.g. 10 ms, it gets flagged and deactivated...

That would trigger some hiccups in some of the plugins at first, but eventually the problem would be gone for good...

This could be combined with a flag/checkbox in which the user confirms that he wants to exempt one or all plugins from timing enforcement, with all the consequences that come with it (warranty and support void...)
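A rough sketch of what such enforcement could look like (purely illustrative; TaskTiming, callTaskHandler() and the 3-violation threshold are hypothetical, not existing ESPEasy structures):

     // Purely illustrative per-task bookkeeping for the proposed enforcement.
     struct TaskTiming {
       bool    enabled = true;           // switched off after repeated violations
       bool    timingExempt = false;     // user-confirmed exemption checkbox
       uint8_t timingViolations = 0;
     };

     void runWithTimingEnforcement(TaskTiming& task) {
       unsigned long start = micros();
       callTaskHandler();                           // hypothetical plugin invocation
       if (micros() - start > 10000UL) {            // 10 msec limit proposed above
         task.timingViolations++;
         if (task.timingViolations > 3 && !task.timingExempt) {
           task.enabled = false;                    // flag and deactivate the offending plugin
           String log = F("Plugin disabled after repeated timing violations");
           addLog(LOG_LEVEL_ERROR, log);
         }
       }
     }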

Grovkillen commented 6 years ago

@s0170071 that's a pretty nice idea, in fact. Let's see what the backend chief says :)

TD-er commented 6 years ago

Enforcing may not be the best way, I guess. Flagging it and putting a notification in the web GUI could help find issues sooner and give a better direction for fixing them.

TD-er commented 6 years ago

I was just thinking: @s0170071, could you put the serial interface knowledge mentioned here into a separate issue? I will make a new project for these, and keeping knowledge on-topic would be nice :)

s0170071 commented 6 years ago

@TD-er flagging works for me, as long as the flag is sticky. What do you mean by "a separate issue" and "I will make a new project for these"? Was I too off-topic?

TD-er commented 6 years ago

Nope, you were absolutely not off-topic, but the information is so valuable that I don't want it to get lost. In GitHub you can create "projects" and assign issues to them to keep track. So when you put that into a new issue, I can track it more easily. The project will be about "blocking communications", I guess. So maybe we should make some abstraction layer dealing with those communications and let the plugins just communicate with that layer. Then we only have to deal with the timing issues in one place, instead of every function trying to do the same thing.
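Just to illustrate the idea, a very rough sketch of what such a layer's interface might look like (all names are hypothetical, nothing like this exists in ESPEasy yet):

     // Hypothetical non-blocking communication layer: plugins queue writes and
     // poll reads here instead of touching the port directly, so timing limits
     // only need to be enforced in one place.
     class CommLayer {
     public:
       virtual size_t write(const uint8_t* data, size_t len) = 0;  // queued, returns immediately
       virtual size_t read(uint8_t* dest, size_t maxLen) = 0;      // returns whatever is buffered
       virtual size_t available() const = 0;
       virtual ~CommLayer() {}
     };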

clumsy-stefan commented 6 years ago

@s0170071 thanks... Agree with everything....

just had another look at the P077 plugin and I think I found a (rarely) occurring situation where it could run into a buffer overflow (in one of the plugin variables), so I rearranged some of the code and made sure this can't happen anymore (also fixed some other small things)... I will test it some more and, if it stays stable, open a pull request with the changes...

TD-er commented 4 years ago

It is even documented, so closing this.