robertknight / rten

ONNX neural network inference engine

Enable re-using pool across graph executions #122

Open · robertknight opened 4 months ago

robertknight commented 4 months ago

https://github.com/robertknight/rten/pull/108 added a tensor pool that enables re-use of output buffers for different steps of graph execution. The entire pool is currently freed at the end of the run. For recurrent / autoregressive models where the caller invokes Model::run in a loop, buffer reuse could be further improved by persisting the pool across runs.
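To make the allocation pattern concrete, here is a rough sketch of the kind of loop this affects. The exact `Model::run` signature and the helper functions are simplified stand-ins for illustration, not rten's actual API:

```rust
// Simplified sketch of an autoregressive decoding loop. `Model::run`'s
// real signature and the helpers are stand-ins for illustration.
let model = Model::load_file("decoder.rten")?;
let mut inputs = build_initial_inputs();
for _step in 0..max_tokens {
    // Today each call creates a TensorPool internally and drops it on
    // return, so buffers freed during this run cannot be reused by the
    // next one.
    let outputs = model.run(inputs, &output_ids, None)?;
    inputs = build_next_inputs(&outputs);
}
```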

Possible APIs:

  1. Add an optional pool parameter to Model::run which allows the caller to supply a pool that outlives the run (see the sketch after this list).
  2. Make the pool a field of the Model or Graph. This would require some changes to the pool to enable it to be used from multiple threads concurrently.
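A hedged sketch of what option 1 might look like; `run_with_pool` is a hypothetical method and the parameter types are illustrative only:

```rust
// Hypothetical API for option 1: the caller owns the pool and lends it
// to each run, so buffers survive across Model::run invocations.
impl Model {
    pub fn run_with_pool(
        &self,
        inputs: Vec<(NodeId, Input)>,
        outputs: &[NodeId],
        pool: &TensorPool,
    ) -> Result<Vec<Output>, RunError> {
        // Allocate intermediate and output buffers from `pool` instead
        // of from a pool created and dropped inside this call.
        todo!()
    }
}

// Caller side: one pool lives across the whole loop.
let pool = TensorPool::new();
for _step in 0..max_tokens {
    let outputs = model.run_with_pool(next_inputs(), &output_ids, &pool)?;
    // ... consume `outputs`, feeding them into the next step ...
}
```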
robertknight commented 4 months ago

As an extension of this, it would be useful to be able to pass owned tensors as inputs to graph execution, rather than views, so that their buffers can be added to the pool and used to fulfill allocation requests. An example of where this matters is the KV-cache outputs returned by transformer decoder models, which are fed back as inputs to the next graph execution. Currently, new KV-cache buffers are allocated on each run; it would be more efficient to recycle them.
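A sketch of what owned inputs could enable; the `InputOrOwned` enum and the helper functions here are hypothetical illustrations, not existing rten types:

```rust
// Hypothetical input type: a borrowed view, or an owned tensor whose
// buffer the executor may move into the pool after its last use.
enum InputOrOwned<'a> {
    View(TensorView<'a>),
    Owned(Tensor),
}

// KV-cache recycling in a decoding loop (helpers are made up for
// illustration).
let mut kv_cache: Vec<Tensor> = initial_kv_cache();
loop {
    let mut inputs: Vec<(NodeId, InputOrOwned)> = Vec::new();
    for (input_id, cache) in kv_input_ids.iter().zip(kv_cache.drain(..)) {
        // Handing over ownership lets the buffer be returned to the
        // pool once the consuming operator has read it, so the next
        // run's cache output can often reuse the same allocation.
        inputs.push((*input_id, InputOrOwned::Owned(cache)));
    }
    let outputs = model.run(inputs, &output_ids, None)?;
    kv_cache = extract_kv_outputs(outputs);
    if done() { break; }
}
```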

robertknight commented 4 weeks ago

This was done for sharing between the main graph and subgraphs in https://github.com/robertknight/rten/pull/312. That case is simpler because the interpreter loop for a subgraph runs on the same thread as the loop for the parent graph, so it doesn't require making TensorPool usable across threads.
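The reason no thread-safety work was needed can be sketched as follows; this is illustrative structure, not the code from the PR:

```rust
// Illustrative only: subgraphs (e.g. the bodies of control-flow
// operators) are interpreted recursively on the same thread as the
// parent graph, so a plain shared reference to the pool suffices and
// TensorPool does not need to be Send or Sync.
fn run_graph(graph: &Graph, pool: &TensorPool) -> Vec<Output> {
    for op in graph.operators() {
        if let Some(subgraph) = op.subgraph() {
            // Same thread, same pool: the recursion shares allocations
            // without any locking.
            let _sub_outputs = run_graph(subgraph, pool);
        }
        // ... execute `op`, drawing output buffers from `pool` ...
    }
    todo!()
}
```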