openvehicles / Open-Vehicle-Monitoring-System-3

Open Vehicle Monitoring System - Version 3
http:///www.openvehicles.com/
Other
602 stars 228 forks source link

Duktape: Randomly unable to access functions #474

Closed mjkapkan closed 2 years ago

mjkapkan commented 3 years ago

Recently my scripts started crashing after running 10-15 minutes. This is what I see in the logs: E (6840862) script: [int/PubSub.js:1] TypeError: not object coercible

For now I made an exception for this as it looks like it happens randomly and then works again. But this might cause some more issues if we want to have a reliable JS engine.

EDIT: It looks like it does not help. It also seem to cause the unit to crash:

Backtrace:
0x4008e4f2 0x4008e78d 0x400e61e4 0x40083fce
Event: ticker.1@EventScript 0 secs
WDT tasks: IDLE1
Version: 3.2.015-155-g14c18286/ota_0/edge (build idf v3.3.4-845-gd59ed8bba Nov 13 2020 11:03:01)
dexterbg commented 3 years ago

I've never seen that before. Possibly an out of memory situation?

Duktape has been running crash free over months for me. Only crashes recently have been due to lack of memory, probably due to long network disconnects filling the notification buffers.

mjkapkan commented 3 years ago

Hm, did not notice any more disconnects than before. Also no changes in the scripts I have been using. My module is without modem so I only use WIFI for connectivity and the settings and location of APs have not changed as well.

So there seems to be 2 problems.

  1. Crashing
  2. PubSub.subscribe() with OvmsEvents.Raise() stop evoking the script after some time and throwing this error in logs: E (6840862) script: [int/PubSub.js:1] TypeError: not object coercible

What would be the best way to see the state of memory before the crash?

dexterbg commented 3 years ago

The TypeError may well be due to out of memory. There's a bunch of memory monitoring options in the module commands.

mjkapkan commented 3 years ago

Thanks @dexterbg will try that out. The crash of the module is not that big of an issue. A bigger problem is the crashed PubSub.js where the script just stops working.

KommyKT commented 3 years ago

@dexterbg I see in logs the following : script: [lib/pwrmon.js:52] TypeError: object required, found undefined (stack index -1)

mjkapkan commented 3 years ago

Was just able to replicate it again, seems like memory is not an issue: I (60283515) webcommand: HttpCommandStream[0x3f82d190]: 3830008 bytes free, executing: power ext12v status E (60283935) script: [int/PubSub.js:1] TypeError: not object coercible I (60285405) webserver: HTTP POST /api/execute I (60285455) webcommand: HttpCommandStream[0x3f82d27c]: 3828556 bytes free, executing: power ext12v status

dexterbg commented 3 years ago

You're both running a recent build using the fixed toolkit release 97?

mjkapkan commented 3 years ago

I am running OTA on edge: Version: 3.2.015-155-g14c18286/ota_0/edge (build idf v3.3.4-845-gd59ed8bba Nov 13 2020 11:03:01)

KommyKT commented 3 years ago

I am running: Toolchain path: /Users/kommykt/esp/xtensa-esp32-elf/bin/xtensa-esp32-elf-gcc Toolchain version: crosstool-ng-1.22.0-97-gc752ad5 Compiler version: 5.2.0 App "ovms3" version: 3.2.015-172-g0e346a41

Firmware: 3.2.015-172-g0e346a41/ota_0/eap (build idf v3.3.4-845-gd59ed8bba Nov 25 2020 17:04:38)

dexterbg commented 3 years ago

Which JS plugins do you use? Have you tried enabling them separately to see if one causes the issue?

mjkapkan commented 3 years ago

I have 2 scripts, both seem to stop updating after some time. But the one that executes each 3 seconds (heatassist) seems to stop quicker than the one executing each 20 seconds (cctimer). Here are the scripts I use: https://github.com/mjkapkan/OVMS-Plugins

mjkapkan commented 3 years ago

Created a separate script to debug this, it prints the number of loops it made in the log. Below are my tests with different timeouts between loops. What I noticed is that the script tends to run longer after module reboot compared to just "script reload". What I also noticed is that it kind of depends how often the script is invoked. For example if I subscribe to an event each 1 second it tends to crash much faster than if each 30 seconds. I have alsto tested it with empty vehicle to make sure it is not caused by NL vehicle code - same result.

10ms: I (7238723) script: [lib/test.js:20:assistEngage] TestCrash:1923 E (7238793) script: [int/PubSub.js:1] TypeError: not object coercible (After restart)

I (59112) script: [lib/test.js:20:assistEngage] TestCrash:443 E (59202) script: [int/PubSub.js:1] TypeError: not object coercible this one also caused a module crash: Last crash: abort() was called on core 0 Backtrace: 0x4008e4f2 0x4008e78d 0x40084833 0x40084959 0x4011af39 0x4010d18b 0x4010cc1a 0x40093673 0x400842e1 0x4008473d 0x4000bec7 0x401cad79 0x400f898d 0x400923ef

I (374802) script: [lib/test.js:20:assistEngage] TestCrash:727 E (374842) script: [int/PubSub.js:1] TypeError: not object coercible also crashed: Last crash: abort() was called on core 0 Backtrace: 0x4008e4f2 0x4008e78d 0x400e61e4 0x40083fce WDT tasks: IDLE1

I (301232) script: [lib/test.js:20:assistEngage] TestCrash:2552 *after restart module stalled after this, had to unplug from power to make it work again.

100ms: I (66203) script: [lib/test.js:20:assistEngage] TestCrash:377 E (66313) script: [int/PubSub.js:1] TypeError: not object coercible I (440143) script: [lib/test.js:20:assistEngage] TestCrash:95 E (440253) script: [int/PubSub.js:1] TypeError: not object coercible I (550853) script: [lib/test.js:20:assistEngage] TestCrash:632 E (550963) script: [int/PubSub.js:1] TypeError: not object coercible (5785073) script: [lib/test.js:20:assistEngage] TestCrash:953 E (5785193) script: [int/PubSub.js:1] TypeError: not object coercible

30000ms: No Error or crash after an hour. I (3064322) script: [lib/test.js:20:assistEngage] TestCrash:102

mjkapkan commented 3 years ago

Also, maybe unrelated but is there a way to reload the js engine from inside js? I tried passing "script reload" to OvmsCommand.Exec but I guess it leaves something behind and no longer runs properly afterwards: I (106002) webserver: HTTP POST /api/execute I (106012) webcommand: HttpCommandStream[0x3f83784c]: 3947160 bytes free, executing: script eval 'OvmsCommand.Exec("script reload")' OVMS# script eval 'OvmsCommand.Exec("script reload")' undefined W (139402) script: DuktapeDispatch: msg type 3 lost, queue full I (140102) webserver: HTTP POST /api/execute I (140112) webcommand: HttpCommandStream[0x3f837d9c]: 3935848 bytes free, executing: script reload W (140402) script: DuktapeDispatch: msg type 3 lost, queue full

dexterbg commented 3 years ago

Confirmed. I can reproduce the PubSub TypeError on my desk module just running the above test script with a 10 ms interval:

…
D (1292827) events: Signal(testevent.heartbeat)
I (1292847) script: [lib/testevents.js:27:triggerTest] testevent.: 1356
E (1293077) script: [int/PubSub.js:1] TypeError: not object coercible
…
D (2475827) events: Signal(testevent.heartbeat)
I (2475837) script: [lib/testevents.js:27:triggerTest] testevent.: 6873
E (2475857) script: [int/PubSub.js:1] TypeError: not object coercible

Memory usage isn't too high and gets cut down as expected by the regular gargabe collection run. Script run time up to the TypeError is totally random, may run for hours or minutes.

As there is no continuous interaction with the PubSub registry from the test script, this seems to occur on a lower level. Maybe it's still the PSRAM bug…

biddster commented 3 years ago

This is probably what's causing the issues I see. I raised this on the forums https://www.openvehicles.com/node/2634

I've written a slack plugin that sends engine.on, telemetry and engine off events. No purpose to it really other than to familiarise myself with the JS engine and OVMS events.

When I receive engine.on, I subscribe to a 60 second interval which publishes some telemetry to a slack channel. The interval is unsubscribed from when I receive engine.off.

It randomly doesn't work and I often might get engine.on, or engine.off, or it stops working after a few intervals. I also see a crash and restart whilst driving.

I'm just getting my OVMS feet wet so I need to sort out some persistent logging etc.

I saw on the @glynhudson ovms abrp youtube video that he said the abrp plugin was crashing so I wonder if that was the issue there too?

dexterbg commented 3 years ago

The bug isn't in PubSub, and PubSub doesn't lose the subscription: a simple event raise testevent.heartbeat is sufficient to get the tester running again. Using the stack backtrace feature I added, I found the error occurs here:

E (3249189) script: [int/PubSub.js:17] TypeError: not object coercible
……:     at [anon] (duk_api_stack.c:3661) internal
……:     at hasKeys (int/PubSub.js:17) strict
……:     at messageHasSubscribers (int/PubSub.js:72) strict
……:     at publish (int/PubSub.js:90) strict
……:     at [anon] (int/PubSub.js:111) strict preventsyield

…which is the in operator in PubSub.hasKeys():

function messageHasSubscribers( message )
  {
  var topic = String( message ),
      found = Boolean(messages.hasOwnProperty( topic ) && hasKeys(messages[topic])),    ← :72
--------------------------------------------------------------------
function hasKeys(obj)
  {
  var key;
  for (key in obj)        ← :17
--------------------------------------------------------------------
DUK_EXTERNAL void duk_to_object(duk_hthread *thr, duk_idx_t idx) {
…
    case DUK_TAG_UNDEFINED:
    case DUK_TAG_NULL: {
        DUK_ERROR_TYPE(thr, DUK_STR_NOT_OBJECT_COERCIBLE);    ← :3661

So the in operator fails because obj is no object at that time. But obj must be an object, because the method is only called if it is.

It looks like Duktape sometimes cannot read a well defined object from memory. It's not an out of memory issue, and I think a stack issue can be ruled out as well, as we build Duktape with stack checking, so we would see an error message on this.

This really smells a lot like the PSRAM bug.

dexterbg commented 3 years ago

@mjkapkan I'm afraid this will be very hard to fix.

But, as PubSub isn't broken, a simple workaround for this could be to use a standard ticker event to check if your custom ticker is still running, and reactivate it as needed by simply triggering a new custom ticker event.

mjkapkan commented 3 years ago

Thanks for looking in to this @dexterbg. By using the standard ticker you mean placing script in the events directory with corespoinding ticker name?

dexterbg commented 3 years ago

No need to attach a file script, you can simply subscribe to the event from JS. For example, subscribe to ticker.300 and check some "last run" timestamp, then fire your custom event if the time doesn't match your event period.

mjkapkan commented 3 years ago

@dexterbg do you mean like this? PubSub.subscribe("ticker.10", myFunction) OvmsEvents.Raise("ticker.10")

Cause I just tried this and the unit kept constantly rebooting: Last boot was 31 second(s) ago Time at boot: 2020-12-15 01:21:08 EET This is reset #13 since last power cycle Detected boot reason: Crash (12/12) Reset reason: Exception/panic (4) Crash counters: 13 total, 0 early Last crash: LoadProhibited exception on core 1 Registers: PC : 0x40109f9e PS : 0x00060630 A0 : 0x8010a40d A1 : 0x3ffe8440 A2 : 0x3f8517ec A3 : 0x00000001 A4 : 0x00000000 A5 : 0x3ffe84e0 A6 : 0x00000001 A7 : 0x3f84482c A8 : 0x00000001 A9 : 0x3ffe8420 A10 : 0x00000000 A11 : 0x00000004 A12 : 0x3ffe8660 A13 : 0x3ffe8640 A14 : 0x00000008 A15 : 0x00000008 SAR : 0x00000001 EXCCAUSE: 0x0000001c EXCVADDR: 0x00000000 LBEG : 0x4008b112 LEND : 0x4008b11d LCOUNT : 0x00000000 Backtrace: 0x40109f9e 0x4010a40a 0x40108bb1 0x40108dc2 0x4010a893 0x4010ac46 0x40106df6 0x400f28b7 0x400f2965 Event: ticker.1@EventScript 0 secs Version: 3.2.015-212-g86346abe/ota_0/edge (build idf v3.3.4-845-gd59ed8bba Dec 14 2020 20:29:54)

I also had my eventheatassist script running from ticker.1 so not sure if this is related. I also found it crashes more quickly if I check ext12v status with command: 'power ext12v status'. For example one of web plugins runs this command each second as there is no other way to get the ext12v status in GUI.

dexterbg commented 3 years ago

I don't understand your code quote. Do you mean you had myFunction() defined to do OvmsEvents.Raise("ticker.10"), and then subscribed that to the very same event? And then, why would you manually raise a system event at all? And then, if you subscribed to the event you're raising with your handler, you would have created an endless event loop… which would explain your crash, I guess…

I meant you could subscribe to ticker.10 or another suitable system event to regularly check if your custom event loop is still running. You need to introduce a timestamp variable to be able to check when if was run last. If it died, you simple restart it by raising your custom event in that handler.

mjkapkan commented 3 years ago

@dexterbg, Yes, it looks like I did. So let me know if I get this right now: If I create custom event and subscribe to it, I need to run the following once and then in every function call, otherwise it will stop running: OvmsEvents.Raise(myEvent,delayInMilis). But if I subscribe to ticker.10, then my function will be called automatically each time as the ticker event is raised from elsewhere?

dexterbg commented 3 years ago

Yes.

With the exception of not exactly "each" time, as this event will also sometimes fail to get published to javascript handlers due to the same bug. But chances are your handler will get the next ticker event in that case.

mjkapkan commented 3 years ago

Thanks, seems to work now.

dexterbg commented 2 years ago

As suspected previously…

TL;DR: the software PSRAM fix as provided by Espressif in toolkit version 1.22.0-97-gc752ad5-5.2.0 is actually not a complete fix of the PSRAM hardware error on ESP32 revision 1 CPUs.

The hardware bug has been fixed by Espressif in revision 3 of the ESP32 CPU. Revision 3 ESP32 WROVER modules are now being used to build the new OVMS v3.3 modules. Some of the latest v3.2 modules may have a revision 3 ESP32, but that's very unlikely. To check your ESP32 revision, read metric m.hardware:

OVMS# metrics list m.hardware
m.hardware      OVMS WIFI BLE BT cores=2 rev=ESP32/3
                                                   ^ revision 3

The software fix for revision 1 CPUs is enabled for all official OVMS firmware builds and significantly reduces the issue, up to a point we thought it's really solved.

But apparently the issue still shows when both cores have continuous non-uniform load.

It will normally manifest in random memory read corruptions leading to various random strange effects. Duktape pins down the issue by the internal object validation; it stops script execution when it detects a memory corruption, usually emitting a TypeError.

This can be triggered reliably within a short runtime by this simple timer event loop:

// Setup:
script eval 'testcnt=0; PubSub.subscribe("usr.testev", function(ev) { var ms=Number(ev.substr(11))||10; if (++testcnt % (3*1000/ms) == 0) print(ev + ": " + testcnt); OvmsEvents.Raise("usr.testev."+ms, ms); })'

// Start with 10 ms interval:
script eval 'testcnt=0; OvmsEvents.Raise("usr.testev.10")'

The loop outputs checkpoint log entries every 3 seconds and should, in theory, run indefinitely.

On an ESP32 revision 1 with SPIRAM enabled, it will actually abort within a few thousand cycles, terminating with something like…

E (63022) ovms-duktape: [int/PubSub.js:1] TypeError: not writable
|    at [anon] (duk_hobject_props.c:4306) internal
|    at hasKeys (int/PubSub.js:1) strict
|    at messageHasSubscribers (int/PubSub.js:
1) strict
|    at publish (int/PubSub.js:1) strict
|    at [anon] (int/PubSub.js:1) strict preventsyield

On an ESP32 rev 3, the test loop runs indefinitely (as it should) without any issues.

On an ESP32 rev 1 with SPIRAM disabled, the test loop also runs correctly.

To reproduce, check out branch duktape-intram, install support/sdkconfig.duktape-intram. Erase your ovmsmain.js, flash & run the firmware, start the test. Then enable SPIRAM and redo the test.

There currently seems to be no way to get rid of the PSRAM issue -- and the instabilities caused by it -- other than upgrading the hardware to ESP32 revision 3 (OVMS 3.3).

I'm afraid there is nothing we can do about this, so I'm closing this issue.

Regards, Michael