tensorflow / tflite-micro

Infrastructure to enable deployment of ML models to low-power resource-constrained embedded targets (including microcontrollers and digital signal processors).
Apache License 2.0

Feature request: Add support for multiple tensor arenas #2627

Closed · andresovela closed this issue 3 weeks ago

andresovela commented 1 month ago

I'd like to propose a feature to allow the interpreter to use more than one tensor arena.

The use case is as follows: the platform running the interpreter has more than one memory region. Some regions are faster to access than others, but the faster regions are also smaller than their slower counterparts. Imagine a microcontroller with 128 kB of internal SRAM that also has access to 32 MB of external RAM on a separate IC.

If you have a model that requires a tensor arena of 140 kB, it unfortunately can't use the internal SRAM because the entire arena would not fit, so the arena would need to be placed in external RAM.

It would be good if the interpreter could use both the internal SRAM and the external RAM to get better inference performance. Maybe we could even make it possible to specify which ops use which tensor arena, so that the most frequently executed ops can use the fast-access arena.
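
To illustrate the constraint, here is a minimal sketch of the single-arena setup as it stands today (the linker section name, sizes, and the RunInference helper are placeholders for this example; how you actually place a buffer in external RAM depends on your toolchain):

```cpp
#include <cstddef>
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Today the arena is one contiguous buffer, so it must fit entirely in a
// single memory region. With a 140 kB arena and only 128 kB of internal SRAM,
// the whole buffer gets pushed out to external RAM, e.g. via a linker section
// attribute (the section name below is a placeholder for your toolchain/BSP):
constexpr size_t kArenaSize = 140 * 1024;

static uint8_t tensor_arena[kArenaSize]
    __attribute__((section(".ext_ram.bss"), aligned(16)));

// model and op_resolver are assumed to be set up in the usual way.
TfLiteStatus RunInference(const tflite::Model* model,
                          const tflite::MicroOpResolver& op_resolver) {
  // Standard single-arena interpreter construction.
  tflite::MicroInterpreter interpreter(model, op_resolver, tensor_arena,
                                       kArenaSize);
  if (interpreter.AllocateTensors() != kTfLiteOk) return kTfLiteError;
  // ... fill interpreter.input(0) with input data ...
  return interpreter.Invoke();
}
```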

rascani commented 1 month ago

Thank you for filing this request! This has come up in the past, and I do think it would be a great idea. It's a pretty complex feature to implement though, so I don't foresee it being tackled by the TFLM team any time soon. Choosing an optimal memory placement strategy with multiple memory segments at runtime probably requires a lot of additional context that the runtime just doesn't currently have.

I would also like to point out that it is currently possible to use two tensor arenas, but how they are used is fixed. Specifically, you can set up a persistent arena and a non-persistent arena. The non-persistent arena will be used for scratch buffers and activation tensors. The persistent arena would be used for storing any objects that need to survive beyond a single inference call, such as node data structures, resource variables, framework bookkeeping, etc. I find this most useful for memory savings in multi-model scenarios, where you can have overlapping temporary tensor arenas so long as inference is not performed concurrently.

You can configure the Interpreter this way by constructing a MicroAllocator using this create method: https://github.com/tensorflow/tflite-micro/blob/ff5c090ecac692689506e5598b0f3b8391ef6be0/tensorflow/lite/micro/micro_allocator.h#L149
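
Roughly, that setup might look like the sketch below. This is only illustrative: the sizes and the RunOnce helper are made up, the two buffers would be placed in different memory regions via your toolchain, and the exact argument order of this Create overload should be checked against the header linked above.

```cpp
#include <cstddef>
#include <cstdint>

#include "tensorflow/lite/micro/micro_allocator.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Illustrative sizes; tune per model. The persistent arena is usually much
// smaller, since it only holds data that must outlive a single Invoke().
constexpr size_t kPersistentArenaSize = 16 * 1024;
constexpr size_t kNonPersistentArenaSize = 128 * 1024;

// Each buffer can live in a different memory region (e.g. via linker section
// attributes), which is what makes the two-arena split useful here.
alignas(16) static uint8_t persistent_arena[kPersistentArenaSize];
alignas(16) static uint8_t non_persistent_arena[kNonPersistentArenaSize];

TfLiteStatus RunOnce(const tflite::Model* model,
                     const tflite::MicroOpResolver& op_resolver) {
  // Create overload that takes both arenas (check the linked header for the
  // exact signature and argument order).
  tflite::MicroAllocator* allocator = tflite::MicroAllocator::Create(
      persistent_arena, kPersistentArenaSize, non_persistent_arena,
      kNonPersistentArenaSize);
  if (allocator == nullptr) return kTfLiteError;

  // Constructor overload that accepts a MicroAllocator*.
  tflite::MicroInterpreter interpreter(model, op_resolver, allocator);
  if (interpreter.AllocateTensors() != kTfLiteOk) return kTfLiteError;
  // ... fill interpreter.input(0), then run inference ...
  return interpreter.Invoke();
}
```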

andresovela commented 1 month ago

Thank you for the tip! I'll look into it and evaluate how useful this is for my application.

In the case where this persistent/non-persistent split is not enough to take advantage of the available SRAM, I'd probably be willing to work on this provided there's some mentoring available, since I'm new to this codebase. Would that be of interest?

github-actions[bot] commented 3 weeks ago

"This issue is being marked as stale due to inactivity. Remove label or comment to prevent closure in 5 days."

github-actions[bot] commented 3 weeks ago

"This issue is being closed because it has been marked as stale for 5 days with no further activity."

nicklasb commented 1 week ago

@rascani Hi, either I have misunderstood something, or it is not possible to instantiate MicroAllocator because it is internal. Do you know of any example?

Either way, this functionality is very relevant for me: the ESP32-S3 I use runs inference prohibitively slowly, in large part because everything ends up in PSRAM, leaving the much faster internal SRAM unused. I would think this affects almost everyone trying to run real deep learning on the ESP32, and I suspect the P4 will make the scenario even more common.
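
For reference, on ESP-IDF the region an arena buffer comes from can at least be chosen explicitly with the capabilities-aware heap allocator. A minimal sketch (sizes are examples only, and AllocArena/SetUpArenas are just names for this illustration):

```cpp
#include <cstddef>
#include <cstdint>

#include "esp_heap_caps.h"

// Allocate an arena either from internal SRAM or from external PSRAM.
// MALLOC_CAP_8BIT asks for byte-addressable memory in both cases.
static uint8_t* AllocArena(size_t size, bool internal) {
  const uint32_t caps = internal ? (MALLOC_CAP_INTERNAL | MALLOC_CAP_8BIT)
                                 : (MALLOC_CAP_SPIRAM | MALLOC_CAP_8BIT);
  return static_cast<uint8_t*>(heap_caps_malloc(size, caps));
}

// Example: a small fast arena in internal SRAM plus a large slow one in PSRAM.
void SetUpArenas() {
  uint8_t* fast_arena = AllocArena(200 * 1024, /*internal=*/true);
  uint8_t* slow_arena = AllocArena(5 * 1024 * 1024, /*internal=*/false);
  // Today the interpreter can only use one buffer per arena role (e.g. the
  // persistent/non-persistent split discussed above), which is exactly the
  // limitation this issue is asking to lift.
  (void)fast_arena;
  (void)slow_arena;
}
```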

nicklasb commented 1 week ago

@andresovela, @rascani So, for the record: I have tried instantiating MicroAllocator with separate persistent and non-persistent arenas. The problem is that the non-persistent arena would have to be about 5 MB, and the SRAM I have available (~250 kB) is nowhere near that, so the fast memory still ends up unused. I am now trying to create a custom allocator/planner for this purpose instead. All ideas are welcome!

nicklasb commented 1 day ago

@andresovela, @rascani Basically, my attempts did not make much progress because TFLite Micro always ends up making one huge allocation that I am unable to affect. I have some proposals in #2675. With the ESP32-P4 arriving now (one is literally en route to me), a host of opportunities for running real deep learning models are just on the verge of becoming viable.