Open robertknight opened 4 months ago
https://github.com/robertknight/rten/pull/108 added a tensor pool that enables re-use of output buffers across different steps of graph execution. The entire pool is currently freed at the end of the run. For recurrent / autoregressive models where the caller invokes `Model::run` in a loop, buffer reuse could be further improved by persisting the pool across runs.

Possible APIs:

- A `pool` parameter to `Model::run` which allows the user to specify a pool.
- A pool owned by the `Model` or `Graph`. This would require some changes to the pool to enable it to be used from multiple threads concurrently.

As an extension of this, it would be useful to be able to pass owned tensors as inputs to graph execution, rather than views, so that their buffers can be added to the pool and used to fulfill allocation requests. An example of when this matters is the KV-cache outputs returned by transformer decoder models. These caches are then fed back as inputs to the next graph execution. Currently new KV-cache buffers are allocated on each run, but it would be more efficient if they could simply be recycled.

This was done for sharing between the main graph and subgraphs in https://github.com/robertknight/rten/pull/312. That case is simpler because the interpreter loop for a subgraph runs on the same thread as the loop for the parent graph, so it doesn't require making `TensorPool` usable across threads.
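The reuse pattern being proposed can be sketched with a toy pool. Note this is a minimal illustration under assumed names: `BufferPool`, `alloc`, and `release` are hypothetical and do not reflect rten's actual `TensorPool` API; real tensors also carry shape and dtype, which are omitted here.

```rust
use std::cell::RefCell;

/// Toy buffer pool (hypothetical; not rten's real `TensorPool`).
/// Released buffers are kept and reused to satisfy later allocations.
pub struct BufferPool {
    free: RefCell<Vec<Vec<f32>>>,
}

impl BufferPool {
    pub fn new() -> Self {
        BufferPool { free: RefCell::new(Vec::new()) }
    }

    /// Allocate a zeroed buffer of `len` elements, reusing a pooled
    /// buffer if one with sufficient capacity is available.
    pub fn alloc(&self, len: usize) -> Vec<f32> {
        let mut free = self.free.borrow_mut();
        if let Some(pos) = free.iter().position(|b| b.capacity() >= len) {
            let mut buf = free.swap_remove(pos);
            buf.clear();
            buf.resize(len, 0.0); // no reallocation: capacity >= len
            buf
        } else {
            vec![0.0; len]
        }
    }

    /// Return an owned buffer to the pool. This is what passing owned
    /// tensors as inputs would enable: the caller gives up the buffer
    /// so the next run can recycle it.
    pub fn release(&self, buf: Vec<f32>) {
        self.free.borrow_mut().push(buf);
    }

    pub fn num_free(&self) -> usize {
        self.free.borrow().len()
    }
}

fn main() {
    let pool = BufferPool::new();

    // Simulate an autoregressive loop: each "run" produces a KV-cache
    // buffer which is fed back into the next run as an owned input.
    let mut kv_cache = pool.alloc(1024);
    let first_ptr = kv_cache.as_ptr();
    for _step in 0..4 {
        // Handing the cache back by value lets the pool reclaim its storage...
        pool.release(kv_cache);
        // ...so the next run's allocation reuses it instead of calling malloc.
        kv_cache = pool.alloc(1024);
    }
    // The same heap storage was recycled on every step.
    assert_eq!(kv_cache.as_ptr(), first_ptr);
}
```

If the pool were instead owned by the `Model` and shared across threads, the `RefCell` here would have to become a `Mutex` (or a lock-free structure), which is the concurrency change the second API option implies.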