jomjol / AI-on-the-edge-device

Easy to use device for connecting "old" measuring units (water, power, gas, ...) to the digital world
https://jomjol.github.io/AI-on-the-edge-device-docs/
5.92k stars 647 forks source link

Out of memory #2200

Closed wg25 closed 1 year ago

wg25 commented 1 year ago

The Problem

Seit der Version 15.1.0 habe ich nach kurzer Laufzeit folgende Fehlermeldungen im LOG:

[0d09h18m50s] 2023-03-16T17:34:39 [TFLITE] CTfLiteClass::ReadFileToModel: Can't allocate enough memory: 1200904 [0d09h18m50s] 2023-03-16T17:34:39 [CNN] Can't load tflite model /config/dig-cont_0600_s3.tflite -> Exec aborted this round!

Restarts wurden mehrfach durchgeführt. Das Problem ist bereits auf zwei ESP32 aufgetreten.

Bilder werden gemacht, aber die Digitalisierung bricht wegen Speichermangel ab.

Version

Release: v15.1.0 (Commit: 7d33c3e+)

Logfile

[0d00h00m04s] 2023-03-16T07:14:21 <INF> [MAIN] =================================================
[0d00h00m04s] 2023-03-16T07:14:21 <INF> [MAIN] ==================== Start ======================
[0d00h00m04s] 2023-03-16T07:14:21 <INF> [MAIN] =================================================
[0d00h00m04s] 2023-03-16T07:14:21 <INF> [SDCARD] Basic R/W check started...
[0d00h00m04s] 2023-03-16T07:14:21 <INF> [SDCARD] Basic R/W check successful
[0d00h00m04s] 2023-03-16T07:14:22 <INF> [SNTP] TimeServer not defined, using default: pool.ntp.org
[0d00h00m05s] 2023-03-16T07:14:22 <INF> [SNTP] Configuring NTP Client...
[0d00h00m05s] 2023-03-16T08:14:22 <INF> [SNTP] Time zone set to CET-1CEST,M3.5.0,M10.5.0/3
[0d00h00m05s] 2023-03-16T08:14:22 <INF> [SNTP] Time is already set: 2023-03-16 08:14:22
[0d00h00m05s] 2023-03-16T08:14:22 <INF> [MAIN] CPU frequency: 160 MHz
[0d00h00m06s] 2023-03-16T08:14:23 <INF> [SDCARD] Folder/file presence check started...
[0d00h00m06s] 2023-03-16T08:14:23 <INF> [SDCARD] Folder/file presence check successful
[0d00h00m06s] 2023-03-16T08:14:23 <INF> [MAIN] Tag: 'v15.1.0', Release: v15.1.0 (Commit: 7d33c3e+), Date/Time: 2023-03-12 21:54, Web UI: Release: v15.1.0 (Commit: 7d33c3e+)
[0d00h00m06s] 2023-03-16T08:14:23 <INF> [MAIN] Reset reason: Via esp_restart
[0d00h00m06s] 2023-03-16T08:14:23 <INF> [WLANINI] SSID: IoT_2
[0d00h00m06s] 2023-03-16T08:14:23 <INF> [WLANINI] Password: XXXXXXXX
[0d00h00m06s] 2023-03-16T08:14:24 <INF> [WLANINI] Hostname: watermeter
[0d00h00m07s] 2023-03-16T08:14:24 <INF> [MAIN] WLAN config loaded, init WIFI...
[0d00h00m07s] 2023-03-16T08:14:24 <INF> [WIFI] Automatic interface config --> Use DHCP service
[0d00h00m07s] 2023-03-16T08:14:24 <INF> [WIFI] Set hostname to: watermeter
[0d00h00m07s] 2023-03-16T08:14:24 <INF> [WIFI] Init successful
[0d00h00m09s] 2023-03-16T08:14:26 <INF> [MAIN] PSRAM size: 8388608 byte (8MB / 64MBit)
[0d00h00m09s] 2023-03-16T08:14:26 <INF> [MAIN] Total heap: 4165642 byte
[0d00h00m09s] 2023-03-16T08:14:27 <INF> [MAIN] Camera info: PID: 0x26, VER: 0x42, MIDL: 0x7f, MIDH: 0xa2
[0d00h00m10s] 2023-03-16T08:14:27 <INF> [MAIN] Device info: CPU cores: 2, Chip revision: 1
[0d00h00m10s] 2023-03-16T08:14:27 <INF> [MAIN] SD card info: Name: SU16G, Capacity: 15193MB, Free: 15001MB
[0d00h00m10s] 2023-03-16T08:14:27 <INF> [WIFI] Connected to: IoT_2, RSSI: -66
[0d00h00m11s] 2023-03-16T08:14:28 <INF> [WIFI] Assigned IP: 192.168.1.25
[0d00h00m12s] 2023-03-16T08:14:29 <INF> [MAIN] Initialization completed successfully! Starting flow task ...
[0d00h00m17s] 2023-03-16T08:14:34 <INF> [LOGFILE] Set log level to ERROR
[0d00h00m52s] 2023-03-16T08:16:42 <ERR> [TFLITE] CTfLiteClass::ReadFileToModel: Can't allocate enough memory: 1200904
[0d00h00m52s] 2023-03-16T08:16:42 <ERR> [CNN] Can't load tflite model /config/dig-cont_0600_s3.tflite -> Exec aborted this round!
[0d00h03m49s] 2023-03-16T08:19:40 <ERR> [TFLITE] CTfLiteClass::ReadFileToModel: Can't allocate enough memory: 1200904
[0d00h03m50s] 2023-03-16T08:19:40 <ERR> [CNN] Can't load tflite model /config/dig-cont_0600_s3.tflite -> Exec aborted this round!
[0d00h06m50s] 2023-03-16T08:22:40 <ERR> [TFLITE] CTfLiteClass::ReadFileToModel: Can't allocate enough memory: 1200904
[0d00h06m50s] 2023-03-16T08:22:40 <ERR> [CNN] Can't load tflite model /config/dig-cont_0600_s3.tflite -> Exec aborted this round!

Expected Behavior

No response

Screenshots

No response

Additional Context

No response

github-actions[bot] commented 1 year ago

Please provide a logfile! Make sure to first enable the DEBUG level in Settings->Configuration->Debug->Logfile Log Level! Then wait until the issue arises again. When you copy the log into here, please make sure to use Fenced code blocks by wrapping it into separate lines with ```, see https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks#fenced-code-blocks

Slider0007 commented 1 year ago

@wg25: Please provide logfile with activated debug level (one round is enough) and your config file.

@caco3: It seems that we have a framgentation issue of PSRAM...

caco3 commented 1 year ago

hmm, I see. The used is by far the largest model: grafik

Can you try to check if it works ok with another model, eg dig-cont_0611_s3_q.tflite (or what version you have with the _s3_q)?

@Slider0007 I guess we have to add a fallback mode to go back to the internal RAM if PSRAM allocation fails. But this means that on those devices we will earlier get heap problems! :(

i am also wondering why this model is so much larger.

Slider0007 commented 1 year ago

@Slider0007 I guess we have to add a fallback mode to go back to the internal RAM if PSRAM allocation fails. But this means that on those devices we will earlier get heap problems! :(

As long we have the PSRAM in one block the size of the model (even it's that large) is not an issue, but I assume due to allocation of some of the new implemented functions a block in between is allocated with bad timing, so PSRAM largest block avialability is too small now. We should investigate what is allocating the block in between and try to get rid of this block again...

caco3 commented 1 year ago

Yes, I agree We should add a debug log before each allocation, so we can trace it. But this might be a bit tricky as it is not only application code to use the PSRAM...

Also there is the function heap_caps_get_largest_free_block() which we should use before calling heap_caps_malloc(), see https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/system/mem_alloc.html#_CPPv432heap_caps_get_largest_free_block8uint32_t

Slider0007 commented 1 year ago

I've just had a quick look. With heap_caps_dump(MALLOC_CAP_8BIT | MALLOC_CAP_SPIRAM) the allocation history can be printed to serial log.

It seems that sometimes a large block is in between. But I'm not sure if this is the issue. It's quite large...

Showing data for heap: 0x3f800000
Block 0x3f80086c data, size: 40 bytes, Free: No
Block 0x3f800898 data, size: 20 bytes, Free: No
Block 0x3f8008b0 data, size: 61440 bytes, Free: No
Block 0x3f80f8b4 data, size: 55880 bytes, Free: No
Block 0x3f81d300 data, size: 921600 bytes, Free: No
Block 0x3f8fe304 data, size: 128004 bytes, Free: No
**Block 0x3f91d70c data, size: 2049424 bytes, Free: Yes**
Block 0x3fb11ca0 data, size: 1920 bytes, Free: No
Block 0x3fb12424 data, size: 4860 bytes, Free: No
Block 0x3fb13724 data, size: 1920 bytes, Free: No
Block 0x3fb13ea8 data, size: 4860 bytes, Free: No
Block 0x3fb151a8 data, size: 1920 bytes, Free: No
Block 0x3fb1592c data, size: 4860 bytes, Free: No
Block 0x3fb16c2c data, size: 3072 bytes, Free: No
Block 0x3fb17830 data, size: 25392 bytes, Free: No
Block 0x3fb1db64 data, size: 3072 bytes, Free: No
Block 0x3fb1e768 data, size: 25392 bytes, Free: No
Block 0x3fb24a9c data, size: 3072 bytes, Free: No
Block 0x3fb256a0 data, size: 25392 bytes, Free: No
Block 0x3fb2b9d4 data, size: 3072 bytes, Free: No
Block 0x3fb2c5d8 data, size: 25392 bytes, Free: No
Block 0x3fb3290c data, size: 841452 bytes, Free: Yes

I think we have to wait for the log of @wg25 to see value of largest block after allocation failed (this gets printed with DEBUG LEVEL automatically whenever allocation fails). Im my config I do not get the allocation issue at all even using the same large model file...

BTW: The heap_caps_get_largest_free_block gets printed using the heap handler /heap Heap Total: 2919354 | SPI Free: 2844175 | SPI Larg Block: 2818048 | SPI Min Free: 821907 | Int Free: 75179 | Int Larg Block: 71680 | Int Min Free: 62615

SPI larg block value should always close to SPI free value, otherwise we have heap framgmentation....

caco3 commented 1 year ago

Better use heap_caps_dump_all() ass we not always use MALLOC_CAP_8BIT!

wg25 commented 1 year ago

Hi,

following my config and a logfile with DEBUG but only 2 rounds. If you need more, let me know.

config_and_logfile.zip

DEBUG is still on, as soon as I get the failure again, I will send the logfile again.

Slider0007 commented 1 year ago

@wg25: Thanks. So this means now it's ok again? What was the action before the cycle where all the errors occured. Did you something special? Was it maybe the update process or just a regular reboot?

Yes, please provide log whenever it occurs again.

caco3 commented 1 year ago

I did a quick test as of how much memory I can allocate on PSRAM:

        uint32_t kbytes = 1000;
        while(1) {
            void *x = heap_caps_malloc(kbytes * 1024, MALLOC_CAP_8BIT | MALLOC_CAP_SPIRAM);
            if (x != NULL) {
                LogFile.WriteToFile(ESP_LOG_INFO, TAG, "Allocated " + to_string(kbytes) + " kbytes");
                heap_caps_free(x);
            }
            else {
                LogFile.WriteToFile(ESP_LOG_ERROR, TAG, "Failed to allocate " + to_string(kbytes) + " kbytes");
                break;
            }
            kbytes += 200;
        }

Result:

I (9746) MAIN: PSRAM size: 8388608 byte (8MB / 64MBit)
I (9776) MAIN: Allocated 1000 kbytes
I (9806) MAIN: Allocated 1200 kbytes
I (9826) MAIN: Allocated 1400 kbytes
I (9856) MAIN: Allocated 1600 kbytes
I (9876) MAIN: Allocated 1800 kbytes
I (9906) MAIN: Allocated 2000 kbytes
I (9936) MAIN: Allocated 2200 kbytes
I (9956) MAIN: Allocated 2400 kbytes
I (9986) MAIN: Allocated 2600 kbytes
I (10006) MAIN: Allocated 2800 kbytes
I (10036) MAIN: Allocated 3000 kbytes
I (10066) MAIN: Allocated 3200 kbytes
I (10096) MAIN: Allocated 3400 kbytes
I (10126) MAIN: Allocated 3600 kbytes
I (10226) MAIN: Allocated 3800 kbytes
E (10256) MAIN: Failed to allocate 4000 kbytes

I did it directly after startup, so no other PSRAM allocations in use yet.

It worked as expected up to 4 MBytes (the max of addressable PSRAM).

@Slider0007 One solution would be that we allocate a large enough buffer for the model at startup and do not free it again so it can be reused. Fragmentation is something you only can work around static allocation or using a re-allocation mechanism.

Slider0007 commented 1 year ago

@Slider0007 One solution would be that we allocate a large enough buffer for the model at startup and do not free it again so it can be reused. Fragmentation is something you only can work around static allocation or using a re-allocation mechanism.

@caco3: This could be an option and most likely the easiest way. I assume this is a timing issue in terms of allocations times. I had same topic when I implemented JPG to RAM functionality. There I got it right by ordered the allocation in right way. Now we have some system libs which allocates in PSRAM. This means it's more difficult to reorder or have it under full control.

We would have to define max model filesize and need an error message whenever model is larger than allocated memory to avoid buffer overrun. Right now, we have to keep memory free for largest model anyway...

wg25 commented 1 year ago

@wg25: Thanks. So this means now it's ok again? What was the action before the cycle where all the errors occured. Did you something special? Was it maybe the update process or just a regular reboot?

@Slider0007: I made the update from 14.0.3 to 15.1.0. Worked for at least 36hrs. I changed nothing in my normal config. The only thing I tried was the 240MHz option. Then it occured the first time. I switched back to 160MHz and failure occured again. I always made only soft reboots, no power up cycles. Up to now it seems to be ok, but I'll keep you posted.

caco3 commented 1 year ago

The CPU frequency should not have an impact on the PSRAM usage. And keep in mind that the higher CPU frequency might lead to general instabilities!

I will try to modify the model memory allocation to be static to work around the issue you found.

caco3 commented 1 year ago

@Slider0007

Based on my investigations in

I think we should use a statically allocated memory for following variables:

image

Slider0007 commented 1 year ago

First of all, great evaluation and visualization of the topic! 👍

I think we should use a statically allocated memory for following variables:

Thinking again and recap why I had the same topic during JPG implementation, unfortunately it seems not possible to just statically allocate all large files (modelfile, tensor momory, ImageBasis, ImageTMP) in early stage due to shortage of memory.

In aligning state CImageTMP is needed to do alignment stuff and generating alg_roi image (this temporarily consumes 921kB). After that this memory has to be mandatorily freed to ensure enough memory for the following state "ROI evaluation". In this step model and tensor memory needs to be allocated (ca. 1200kb + 820kB --> ca. 2000kB). Some of the freed memory of ImageTMP needs to be reused here to prevent running out of PSRAM. This is the reason why CImageTMP needs to be deleted at the end of alignment step.

Biggest consumers of PSRAM: CImageBasis: 920kB --> needed all the time CImageTMP: 920KB --> only needed in alignment Tensor memory ca. 820kB --> only needed in ROI processing Model up to 1200kb --> only needed in ROI processing ALG_ROI JPG 128KB --> needed all the time, mandatory to show image ROI images (size depending on number of ROIs) --> needed all the time, mandatory to show ROI info CAM buffer ca. 65kB --> needed all the time, mandatory further stuff we moved to PSRAM

If you sum up you see that's not sufficient anymore.Therefore some memory needs to be shared (ImageTMP and model/tensor in actual configuration. This you can see in your screenshot very well).

I assume we unfortunately have to go deeper in actual allocation history. If I need to guess I'd say that's it is maybe related to MQTT outbox allocation, because this part of allocation is done quite late in comparison other relocated services like WIFI stacks, etc...

Maybe for a test, we can give up the outbox memory and switch that part back to internal and see what's the reaction of the users. I was able to reproduce the issue only one time, so it' seems not easily traceable, maybe related to the variable time when wifi is connecting and derived from this the time where MQTT connection get established. We only have to ensure that the "breathing part" of memory is at the end in one block. But as I said, this is just my feeling without have tjhis proved...

caco3 commented 1 year ago

unfortunately it seems not possible to just statically allocate all large files (modelfile, tensor momory, ImageBasis, ImageTMP) in early stage due to shortage of memory.

model and tensor_arena and the initial rgb_image are always allocated during the same time. they are the larges blocks by far. Are there other blocks of PSRAM which are used outside of those allocations? if not I think we would be on the save side to not free thos elarge blocks. -> needs testing

caco3 commented 1 year ago

Biggest consumers of PSRAM: CImageBasis: 920kB --> needed all the time CImageTMP: 920KB --> only needed in alignment Tensor memory ca. 820kB --> only needed in ROI processing Model up to 1200kb --> only needed in ROI processing ALG_ROI JPG 128KB --> needed all the time, mandatory to show image ROI images (size depending on number of ROIs) --> needed all the time, mandatory to show ROI info CAM buffer ca. 65kB --> needed all the time, mandatory further stuff we moved to PSRAM

Thanks for that overview!

CImageTMP: 920KB --> only needed in alignment

But it seems that also never gets freed again!

Slider0007 commented 1 year ago

CImageTMP: 920KB --> only needed in alignment

But it seems that also never gets freed again!

It gets freed. You have it printed in your overview (correction: my first labeled line is wrong labeled, but I'm too lazy to replace the screenshot ;-)). From timing exactly before ROI processing begins and model + tensor gets loaded. image

Hmmm, what is the difference between the screenshot in this post and the screenshot in PR post. Because in this post's screenshot no free lines are visualized anymore. Did you make further changes in comparison to the state of PR. I assume it's just missing in your screenshot...

MrReSc commented 1 year ago

I have the same problem with 0611 and 0610. Model 0600 works.

[0d00h00m53s] 2023-03-19T06:17:46 <ERR> [TFLITE] CTfLiteClass::ReadFileToModel: Can't allocate enough memory: 1200956
[0d00h00m53s] 2023-03-19T06:17:46 <ERR> [CNN] Can't load tflite model /config/dig-cont_0611_s3.tflite -> Exec aborted this round!
[0d00h00m53s] 2023-03-19T06:24:22 <ERR> [TFLITE] CTfLiteClass::ReadFileToModel: Can't allocate enough memory: 1200904
[0d00h00m53s] 2023-03-19T06:24:22 <ERR> [CNN] Can't load tflite model /config/dig-cont_0610_s3.tflite -> Exec aborted this round!
caco3 commented 1 year ago

We did some improvements, can you test https://github.com/jomjol/AI-on-the-edge-device/actions?query=branch%3Adisable-custom-MQTT-Outbox-and-move-it-back-to-internal-RAM and let us know if when the issue occurred again?

caco3 commented 1 year ago

we will also shortly add this to a new release

Slider0007 commented 1 year ago

@caco3, @jomjol: Some more research shows that it's most likely NOT the MQTT outbox, but the WIFI / network stack part in combination of loading and freeing model /tensor in init phase which fragments to PSRAM (same issue mentioned here without any solution: https://github.com/espressif/esp-idf/issues/9088)

Enabled option: WIFI to PSRAM

Showing data for heap: 0x3f800000 Block 0x3f80086c data, size: 40 bytes, Free: No Block 0x3f800898 data, size: 20 bytes, Free: No Block 0x3f8008b0 data, size: 61440 bytes, Free: No Block 0x3f80f8b4 data, size: 55880 bytes, Free: No Block 0x3f81d300 data, size: 80 bytes, Free: No Block 0x3f81d354 data, size: 64 bytes, Free: No Block 0x3f81d398 data, size: 16 bytes, Free: No Block 0x3f81d3ac data, size: 48 bytes, Free: Yes Block 0x3f81d3e0 data, size: 16 bytes, Free: No Block 0x3f81d3f4 data, size: 12 bytes, Free: No Block 0x3f81d404 data, size: 24 bytes, Free: No Block 0x3f81d420 data, size: 2080 bytes, Free: No Block 0x3f81dc44 data, size: 136 bytes, Free: No Block 0x3f81dcd0 data, size: 136 bytes, Free: No Block 0x3f81dd5c data, size: 1696 bytes, Free: No Block 0x3f81e400 data, size: 1696 bytes, Free: No Block 0x3f81eaa4 data, size: 1696 bytes, Free: No Block 0x3f81f148 data, size: 1696 bytes, Free: No Block 0x3f81f7ec data, size: 1696 bytes, Free: No Block 0x3f81fe90 data, size: 1696 bytes, Free: No Block 0x3f820534 data, size: 1696 bytes, Free: No Block 0x3f820bd8 data, size: 1696 bytes, Free: No Block 0x3f82127c data, size: 1696 bytes, Free: No Block 0x3f821920 data, size: 1696 bytes, Free: No Block 0x3f821fc4 data, size: 1696 bytes, Free: No Block 0x3f822668 data, size: 1696 bytes, Free: No Block 0x3f822d0c data, size: 1696 bytes, Free: No Block 0x3f8233b0 data, size: 1696 bytes, Free: No Block 0x3f823a54 data, size: 1696 bytes, Free: No Block 0x3f8240f8 data, size: 1696 bytes, Free: No Block 0x3f82479c data, size: 552 bytes, Free: No Block 0x3f8249c8 data, size: 768 bytes, Free: No Block 0x3f824ccc data, size: 184 bytes, Free: No Block 0x3f824d88 data, size: 184 bytes, Free: No Block 0x3f824e44 data, size: 80 bytes, Free: No Block 0x3f824e98 data, size: 56 bytes, Free: No Block 0x3f824ed4 data, size: 12 bytes, Free: No Block 0x3f824ee4 data, size: 12 bytes, Free: No Block 0x3f824ef4 data, size: 56 bytes, Free: No Block 0x3f824f30 data, size: 12 bytes, Free: No Block 0x3f824f40 data, size: 80 bytes, Free: No Block 0x3f824f94 data, size: 12 bytes, Free: No Block 0x3f824fa4 data, size: 12 bytes, Free: No Block 0x3f824fb4 data, size: 16 bytes, Free: No Block 0x3f824fc8 data, size: 76 bytes, Free: No Block 0x3f825018 data, size: 56 bytes, Free: No Block 0x3f825054 data, size: 80 bytes, Free: No Block 0x3f8250a8 data, size: 320 bytes, Free: Yes Block 0x3f8251ec data, size: 60 bytes, Free: No Block 0x3f82522c data, size: 120 bytes, Free: No Block 0x3f8252a8 data, size: 16 bytes, Free: No Block 0x3f8252bc data, size: 16 bytes, Free: No Block 0x3f8252d0 data, size: 16 bytes, Free: No Block 0x3f8252e4 data, size: 16 bytes, Free: No Block 0x3f8252f8 data, size: 16 bytes, Free: No Block 0x3f82530c data, size: 24 bytes, Free: No Block 0x3f825328 data, size: 16 bytes, Free: No Block 0x3f82533c data, size: 12 bytes, Free: No Block 0x3f82534c data, size: 16 bytes, Free: No Block 0x3f825360 data, size: 12 bytes, Free: No Block 0x3f825370 data, size: 136 bytes, Free: No Block 0x3f8253fc data, size: 921600 bytes, Free: No Block 0x3f906400 data, size: 128004 bytes, Free: No Block 0x3f925808 data, size: 1920 bytes, Free: No Block 0x3f925f8c data, size: 4860 bytes, Free: No Block 0x3f92728c data, size: 1920 bytes, Free: No Block 0x3f927a10 data, size: 4860 bytes, Free: No Block 0x3f928d10 data, size: 1920 bytes, Free: No Block 0x3f929494 data, size: 4860 bytes, Free: No Block 0x3f92a794 data, size: 3072 bytes, Free: No Block 0x3f92b398 data, size: 17788 bytes, Free: No Block 0x3f92f918 data, size: 3072 bytes, Free: No Block 0x3f93051c data, size: 17788 bytes, Free: No Block 0x3f934a9c data, size: 3072 bytes, Free: No Block 0x3f9356a0 data, size: 17788 bytes, Free: No Block 0x3f939c20 data, size: 3072 bytes, Free: No Block 0x3f93a824 data, size: 17788 bytes, Free: No Block 0x3f93eda4 data, size: 3072 bytes, Free: No Block 0x3f93f9a8 data, size: 17788 bytes, Free: No Block 0x3f943f28 data, size: 208 bytes, Free: No Block 0x3f943ffc data, size: 2035932 bytes, Free: Yes --> This could be the part where model + tensor gets allocated and free again (ca. 1200kb + 820kb --> ca. 2020kb). Before they got freed again further ?WIFI stack? allocations take place and split the free space into two blocks. --> That's the not so nice part, because it splits the large free RAM blocks Block 0x3fb350dc data, size: 56 bytes, Free: No Block 0x3fb35118 data, size: 16 bytes, Free: No Block 0x3fb3512c data, size: 208 bytes, Free: No Block 0x3fb35200 data, size: 164 bytes, Free: Yes Block 0x3fb352a8 data, size: 208 bytes, Free: No Block 0x3fb3537c data, size: 208 bytes, Free: No Block 0x3fb35450 data, size: 208 bytes, Free: No Block 0x3fb35524 data, size: 64 bytes, Free: No Block 0x3fb35568 data, size: 208 bytes, Free: Yes Block 0x3fb3563c data, size: 56 bytes, Free: No Block 0x3fb35678 data, size: 72 bytes, Free: Yes Block 0x3fb356c4 data, size: 16 bytes, Free: No Block 0x3fb356d8 data, size: 4628 bytes, Free: Yes Block 0x3fb368f0 data, size: 208 bytes, Free: No Block 0x3fb369c4 data, size: 12 bytes, Free: No Block 0x3fb369d4 data, size: 1540 bytes, Free: Yes Block 0x3fb36fdc data, size: 56 bytes, Free: No <-- Block 0x3fb37018 data, size: 823264 bytes, Free: Yes

In general, it's not really nice to see that this functionality allocates every small variable in that way...

Disabled option WIFI to PSRAM (activated MQTT outbox + bss segments):

Showing data for heap: 0x3f802c70 Block 0x3f8034dc data, size: 40 bytes, Free: No Block 0x3f803508 data, size: 20 bytes, Free: No Block 0x3f803520 data, size: 61440 bytes, Free: No Block 0x3f812524 data, size: 55880 bytes, Free: No Block 0x3f81ff70 data, size: 921600 bytes, Free: No Block 0x3f900f74 data, size: 128004 bytes, Free: No Block 0x3f92037c data, size: 1920 bytes, Free: No Block 0x3f920b00 data, size: 4860 bytes, Free: No Block 0x3f921e00 data, size: 1920 bytes, Free: No Block 0x3f922584 data, size: 4860 bytes, Free: No Block 0x3f923884 data, size: 1920 bytes, Free: No Block 0x3f924008 data, size: 4860 bytes, Free: No Block 0x3f925308 data, size: 3072 bytes, Free: No Block 0x3f925f0c data, size: 17788 bytes, Free: No Block 0x3f92a48c data, size: 3072 bytes, Free: No Block 0x3f92b090 data, size: 17788 bytes, Free: No Block 0x3f92f610 data, size: 3072 bytes, Free: No Block 0x3f930214 data, size: 17788 bytes, Free: No Block 0x3f934794 data, size: 3072 bytes, Free: No Block 0x3f935398 data, size: 17788 bytes, Free: No Block 0x3f939918 data, size: 3072 bytes, Free: No Block 0x3f93a51c data, size: 17788 bytes, Free: No Block 0x3f93ea9c data, size: 12 bytes, Free: No Block 0x3f93eaac data, size: 2889036 bytes, Free: Yes

Heap Total: 2941862 | SPI Free: 2889047 | SPI Large Block: 2883584 | SPI Min Free: 868367 | Int Free: 52815 | Int Large Block: 34816 | Int Min Free: 31347

Disabled option WIFI to PSRAM, no custom outbox, no bss segments (situation before PSRAM mods):

Showing data for heap: 0x3f802c70 Block 0x3f8034dc data, size: 40 bytes, Free: No Block 0x3f803508 data, size: 20 bytes, Free: No Block 0x3f803520 data, size: 61440 bytes, Free: No Block 0x3f812524 data, size: 55880 bytes, Free: No Block 0x3f81ff70 data, size: 921600 bytes, Free: No Block 0x3f900f74 data, size: 128004 bytes, Free: No Block 0x3f92037c data, size: 1920 bytes, Free: No Block 0x3f920b00 data, size: 4860 bytes, Free: No Block 0x3f921e00 data, size: 1920 bytes, Free: No Block 0x3f922584 data, size: 4860 bytes, Free: No Block 0x3f923884 data, size: 1920 bytes, Free: No Block 0x3f924008 data, size: 4860 bytes, Free: No Block 0x3f925308 data, size: 3072 bytes, Free: No Block 0x3f925f0c data, size: 17788 bytes, Free: No Block 0x3f92a48c data, size: 3072 bytes, Free: No Block 0x3f92b090 data, size: 17788 bytes, Free: No Block 0x3f92f610 data, size: 3072 bytes, Free: No Block 0x3f930214 data, size: 17788 bytes, Free: No Block 0x3f934794 data, size: 3072 bytes, Free: No Block 0x3f935398 data, size: 17788 bytes, Free: No Block 0x3f939918 data, size: 3072 bytes, Free: No Block 0x3f93a51c data, size: 17788 bytes, Free: No Block 0x3f93ea9c data, size: 2889052 bytes, Free: Yes

Short-term: Only deactivating WIFI stack allocation to PSRAM can improve the situation. Long-term: Avoid unloading of models after init + preallocating and sharing memory for tensor / ImageTMP could be a solution.

caco3 commented 1 year ago

sharing memory for tensor / ImageTMP could be a solution.

I already started some work for this, but lost it somehow, so I would have to do it again. For now I just would reserve shared memory for this and keep the possible rework to keep all models in RAM for later.

wg25 commented 1 year ago

We did some improvements, can you test https://github.com/jomjol/AI-on-the-edge-device/actions?query=branch%3Adisable-custom-MQTT-Outbox-and-move-it-back-to-internal-RAM and let us know if when the issue occurred again?

I uploaded this version now to one of my cams. Debug is on. Waiting... But one behaviour is new (at least for me). The model dig-cont_0611_s3_q is on the ESP but it's not selectable. I removed it and uploaded it again, nothing. Reboot, nothing. Also a bug?

caco3 commented 1 year ago

Based on the analysis in https://github.com/jomjol/AI-on-the-edge-device/pull/2215 we are thinking of revert some PSRAM optimizations and go back to the state in 15.0.1, see https://github.com/jomjol/AI-on-the-edge-device/pull/2224

caco3 commented 1 year ago

There is a new release 15.1.1 which addresses this issue.

srLinux commented 1 year ago

idk if it is important, i fixed the issue after create new filesystem on the sdcard and copy the data again. but it seems like to be a reproducible problem. i will provide my logfile if it happens again to me. Happens on 15.1.1 too. @caco3 it make sense to get another board with 4M PSRAM?

srLinux commented 1 year ago

[0d00h00m11s] 2023-03-26T16:23:14 [PSRAM] Failed to allocate 0 bytes in PSRAM for 'TFLITE->modelfile'! [0d00h00m11s] 2023-03-26T16:23:14 [TFLITE] CTfLiteClass::ReadFileToModel: Can't allocate enough memory: 0 [0d00h00m11s] 2023-03-26T16:23:14 [CNN] Can't load tflite model -> Init aborted!

after rebooting the device, camera can't be initialized.

[0d00h00m07s] 2023-03-26T16:26:04 [MAIN] PSRAM size: 8388608 byte (8MB / 64MBit) [0d00h00m07s] 2023-03-26T16:26:04 [MAIN] Total heap: 4273266 byte [0d00h00m07s] 2023-03-26T16:26:04 [MAIN] Camera init failed (0x105), retrying...

i noticed it just happens if i use the reboot option from the webpanel. Now i need to copy the data back to sdcard and start configuration over again.

caco3 commented 1 year ago

@caco3 it make sense to get another board with 4M PSRAM?

4 MB or 8 MB are ok (the firmware only is able to use 4 MB).

after rebooting the device, camera can't be initialized.

Does that still happen with 15.1.1?

i noticed it just happens if i use the reboot option from the webpanel. Now i need to copy the data back to sdcard and start configuration over again.

Hmm, that is odd, I can reproduce this! Note that the camera initialization and memory allocation issues are independent of the SD-Card.

Please provide full debug log!

srLinux commented 1 year ago

next time it will happens i will provide the full log, sorry for just the snippets. Yip this happens in the 15.1.1 I am with you, that is not SD-Card related. But the wiping did the trick, and i have no idea why. I am still investigating! I wondered about the log entry is: Can't allocate enough memory: 0 why 0? maybe one of the tflites are corrupted? I did a backup from the card, and i will diff it next time. So stay tuned @caco3 and thank you for your participation :-)

caco3 commented 1 year ago

maybe one of the tflites are corrupted? I did a backup from the card, and i will diff it next time. So stay tuned @caco3 and thank you for your participation :-)

Yes, that might be a possible issue.

Please create a new bug report since it is not the same issue as this issue is about.