mocleiri / tensorflow-micropython-examples

A custom micropython firmware integrating tensorflow lite for microcontrollers and ulab to implement the tensorflow micro examples.
MIT License

See if we can use esp-dsp to accelerate esp32 math ops #9

Open mocleiri opened 2 years ago

mocleiri commented 2 years ago

I found out that the esp32-s3 will have improved hardware support for some DSP functions.

Espressif also has an esp-dsp module that provides improved/optimized implementations of regular esp32 math ops.

Let's investigate how to use these functions and whether it's possible for tensorflow lite to use them.

https://docs.espressif.com/projects/esp-dsp/en/latest/esp-dsp-apis.html
https://docs.espressif.com/projects/esp-dsp/en/latest/esp-dsp-benchmarks.html

mocleiri commented 2 years ago

I think this would need to be done within tensorflow lite micro: write optimized kernels implemented with these functions, and then on our end customize micropython to enable that feature within the esp-idf build.

mocleiri commented 2 years ago

I'm coordinating with Vikram Dattu at Espressif on how to do this. He plans to implement esp32 s3 optimized kernels within the https://github.com/espressif/tflite-micro-esp-examples repo.

At the moment I plan to add the custom kernels upstream in tflite-micro, but I may just add a submodule pointing at his repo if the upstream work gets too complicated.

mocleiri commented 2 years ago

In order to configure this on the tensorflow lite micro side, we first need to create a makefile.inc file here: https://github.com/tensorflow/tflite-micro/tree/main/tensorflow/lite/micro/tools/make/ext_libs

and also an esp_download.sh script to go and check out the esp-dsp repo.
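
As a rough illustration, the download step could look something like the sketch below. It is written in Python rather than shell, and the downloads directory, the shallow clone, and the lack of a pinned commit are all assumptions rather than the actual tflite-micro convention:

    #!/usr/bin/env python3
    # Hypothetical sketch of the esp-dsp download step. The downloads
    # directory below is an assumption about the tflite-micro tree layout.
    import pathlib
    import subprocess

    downloads = pathlib.Path("tensorflow/lite/micro/tools/make/downloads")
    dest = downloads / "esp-dsp"

    if not dest.exists():
        downloads.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["git", "clone", "--depth", "1",
             "https://github.com/espressif/esp-dsp.git", str(dest)],
            check=True,
        )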

MichaelDu9226 commented 2 years ago

Hi @mocleiri, thanks for your great work. Is there any update on "esp-dsp to accelerate esp32 math ops"? We have tested TF Lite micropython on the esp32 S3 and the speed is disappointing. Espressif has launched the official esp-dl library, which is much faster than TF Lite. Can we consider using esp-dl for acceleration?

mocleiri commented 2 years ago

There is a gcc bug triggered when building tensorflow/lite/micro/kernels/l2_pool_2d.cc at -O2.

I have all the esp32 boards building at -O3, which is probably also slowing things down.

In order to use esp-dsp or esp-dl the tflm ops need to be rewritten to use them.

Do you know which tflm ops your model uses?

You should be able to see them if you look at the model in Netron.
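
If it's easier to script than to open Netron, something like this should also list the ops. This is a sketch that runs on desktop Python with the tensorflow package installed (not on the board); tf.lite.experimental.Analyzer is an experimental API, and model.tflite is a placeholder for your model file:

    # Sketch: print the operators used by a .tflite model, as an
    # alternative to opening it in Netron. Output format may vary by
    # TensorFlow version since this is an experimental API.
    import tensorflow as tf

    tf.lite.experimental.Analyzer.analyze(model_path="model.tflite")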

I think there are ways within tflm to track which ops are slowest.

My current thought is to put the custom kernels upstream in tflm, but they could start locally as a proof of concept.

MichaelDu9226 commented 2 years ago

Thanks for your reply. I noticed that Vikram Dattu has launched ESP-NN; do we have a plan to merge it into tensorflow-micropython-examples? https://github.com/espressif/esp-nn This is the performance comparison from the tflite-micro-esp-examples readme, which looks very exciting.

A quick summary of ESP-NN optimisations, as measured on ESP32-S3:

TFLite Micro Example | without ESP-NN | with ESP-NN
-- | -- | --
Person Detection | 2300ms | 54ms

mocleiri commented 2 years ago

@MichaelDu9226 thanks for letting me know about esp-nn appearing. It's used from https://github.com/espressif/tflite-micro-esp-examples, which also has certain tflm kernels that use the optimized functions.

I was planning on making a place for these kernels in tflm, but since he is ahead of me I will pick up the custom kernels from his tflm C++ examples repo directly first.

The optimized tflm kernels are here: https://github.com/espressif/tflite-micro-esp-examples/tree/master/components/tflite-lib/tensorflow/lite/micro/kernels/esp_nn

MichaelDu9226 commented 2 years ago

Thanks for the information. Looking forward to the speed improvement on the esp32 S3.

mocleiri commented 2 years ago

@MichaelDu9226 can you test the esp32s3 firmware from this build? https://github.com/mocleiri/tensorflow-micropython-examples/actions/runs/1759424265

I have it building using the esp_nn module and the custom tensorflow kernels. When I run it on the MICROLITE board it runs hello-world fine but gets stuck running micro-speech: the watchdog timer fires and the board restarts after one or two yes/no matches.

I need to debug with my esp-prog jtag board. But it's the PRO cpu timer that is going off, which is strange because micropython is supposed to be running only on the APP cpu.

MicroPython e891475-dirty on 2022-01-27; ESP32 module (microlite) with ESP32
Type "help()" for more information.
>>> import run
interpreter_make_new: model size = 18712, tensor area = 8208
Starting
found - no at 0 seconds -> 1

W (171) boot.esp32: PRO CPU has been reset by WDT.
W (172) boot.esp32: WDT reset info: PRO CPU PC=0x40094c83
W (173) boot.esp32: WDT reset info: APP CPU PC=0x402453aa (waiti mode)

There are different optimized functions for the esp32s3.

mocleiri commented 2 years ago

@TGiles1998 Can you try this branch with some of your esp32s3 boards and report back whether the s3-accelerated ops work OK?

There is a bug somewhere for regular esp32 where micro-speech gets into an infinite loop. If it works ok for esp32s3 I could merge it just for those boards.

There is supposed to be a significant improvement in inference performance.

https://github.com/mocleiri/tensorflow-micropython-examples/actions/runs/1759424265

TGiles1998 commented 2 years ago

@mocleiri Hi, I've just tested the firmware you linked on the ESP32S3 and it worked. There was no infinite loop. I used the nanosecond timer time.time_ns(), which gave 0.85289 s for the model to run:

t = time.time_ns()

print("time step,y")
for c in range(steps):
    interp.invoke()
    counter = counter + 1

elapsed = time.time_ns() - t
print(elapsed)

mocleiri commented 2 years ago

@TGiles1998 thanks for taking the time to test this.

On regular ESP32 it also worked fine for me on hello-world.

Are you set up to run it on a larger model like micro-speech (where the esp32 hangs), or perhaps you have another model? It needs to use the Convolution, Fully Connected and Softmax tensorflow ops.

TGiles1998 commented 2 years ago

@mocleiri Very sorry, I misread your comment earlier. I've now applied the micro-speech example to both an ESP32-S3 with SPIRAM and one without. Unfortunately both produced the errors you mentioned previously.

SPIRAM:

18712
interpreter_make_new: model size = 18712, tensor area = 8208
Starting
found - no at 0 seconds -> 1

Guru Meditation Error: Core 1 panic'ed (InstructionFetchError). Exception was unhandled.

Core 1 register dump: PC : 0x3fce9f00 PS : 0x00060430 A0 : 0x8200cbf2 A1 : 0x3fce48a0
A2 : 0x3fceaa64 A3 : 0x3fce3ed4 A4 : 0x00000003 A5 : 0x00000000
A6 : 0x00000000 A7 : 0x3d83f218 A8 : 0x8200c119 A9 : 0x3fce4880
A10 : 0x3fceaa04 A11 : 0x3d83f218 A12 : 0x3fce9f00 A13 : 0x00000000
A14 : 0x3d83e2c0 A15 : 0x3d83cc2c SAR : 0x0000000b EXCCAUSE: 0x00000002
EXCVADDR: 0x3fce9f00 LBEG : 0x42025065 LEND : 0x42025084 LCOUNT : 0x00000000

Backtrace:0x3fce9efd:0x3fce48a0 |<-CORRUPTED

Normal (no SPIRAM):

18712
interpreter_make_new: model size = 18712, tensor area = 8208
Starting
found - no at 0 seconds -> 1

Guru Meditation Error: Core 1 panic'ed (InstructionFetchError). Exception was unhandled.

Core 1 register dump: PC : 0x3fce6f98 PS : 0x00060830 A0 : 0x8200cbf2 A1 : 0x3fce4610
A2 : 0x3fce7bac A3 : 0x3fce3c44 A4 : 0x00000003 A5 : 0x00000000
A6 : 0x00000000 A7 : 0x3fcb20b8 A8 : 0x8200c119 A9 : 0x3fce45f0
A10 : 0x3fce7b4c A11 : 0x3fcb20b8 A12 : 0x3fce6f98 A13 : 0x00000000
A14 : 0x3fcb1160 A15 : 0x3fcafacc SAR : 0x0000000b EXCCAUSE: 0x00000002
EXCVADDR: 0x3fce6f98 LBEG : 0x42025065 LEND : 0x42025084 LCOUNT : 0x00000000

Backtrace:0x3fce6f95:0x3fce4610 |<-CORRUPTED

I will have another go at it in case I missed something, but it does seem to enter the infinite loop.

mocleiri commented 2 years ago

@TGiles1998 thanks for your detailed test results.

I'll have to debug this on the board, but at least the errors are the same, so a fix for the esp32 will likely also fix the esp32s3.

MichaelDu9226 commented 2 years ago

@mocleiri I made some custom changes on top of your code, like octal PSRAM and SD card support. The person_detection example seems OK, but when I use my custom model trained on Edge Impulse the results are wrong. It seems this bugfix should solve it: https://github.com/espressif/tflite-micro-esp-examples/pull/11. But when I merged it into this repo it doesn't work. As you know, micropython uses SPIRAM_USE_MEMMAP; maybe that conflicts with heap_caps_malloc. Can you give me some advice? Thanks

MichaelDu9226 commented 2 years ago

This is my custom model: custom_model.zip

MichaelDu9226 commented 2 years ago

Hi @mocleiri, I have solved this by referring to your pull request https://github.com/micropython/micropython/pull/8219. Thanks.

mocleiri commented 2 years ago

@MichaelDu9226 can you add a comment on https://github.com/mocleiri/tensorflow-micropython-examples/issues/72 for exactly what you did?

I want to improve the documentation.

What I think you did: your model is too big for regular RAM, so it lives in PSRAM. Because the ML model is large, the scratch pad also needs to be large, and it is too much to fit into DRAM.

So you fixed it by changing the sdkconfig to use the SPIRAM_USE_MALLOC=y strategy instead of SPIRAM_USE_MEMMAP.

This splits the SPIRAM in half: one half is used by micropython for its heap and the remainder is available to other IDF components through malloc, or in this case the IDF capabilities-aware heap_caps_malloc.
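
For reference, the sdkconfig change would look something like the fragment below. These are the standard ESP-IDF option names, but which board sdkconfig file to edit depends on your build, so treat this as a sketch:

    # Let IDF components allocate from PSRAM via heap_caps_malloc
    CONFIG_SPIRAM_USE_MALLOC=y
    # CONFIG_SPIRAM_USE_MEMMAP is not set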