stl3 opened 1 year ago
I was just wondering: would the methods used in OnnxStream further benefit a tiny language model like TinyLlama? I just wanted to know how far resource usage could be brought down (I know it's not an SD model, but I'm wondering if the same techniques could be applied to other types of models). TinyLlama uses about 550MB of RAM with the 4-bit-quantized TinyLlama-1.1B weights, which seems quite enticing for lower-end devices.

hi,

It would be interesting to try running TinyLlama with OnnxStream, but the problem would be latency: at the generation of each token, all the weights would be read from disk again (1.1GB of data, using 8-bit quantization). This could be prevented by implementing a simple WeightsProvider that caches all the weights in RAM, but then the total memory consumption would be at the same level as other frameworks/libraries, making the use of OnnxStream meaningless. However, it could be an interesting experiment :-)
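For the record, here is a minimal sketch of what such a caching provider could look like in C++. The `WeightsProvider` interface and the one-file-per-tensor layout shown below are assumptions for illustration, not OnnxStream's actual API:

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical provider interface: the runtime requests the raw bytes of a
// named tensor every time a layer needs its weights. (Illustrative only;
// OnnxStream's real WeightsProvider API differs.)
struct WeightsProvider {
    virtual ~WeightsProvider() = default;
    virtual std::vector<std::uint8_t> get_weights(const std::string& name) = 0;
};

// Reads each tensor file from disk on first use, then serves it from RAM.
// This removes the per-token disk reads at the cost of keeping the whole
// quantized model (~1.1GB at 8 bits) resident.
class CachingWeightsProvider : public WeightsProvider {
public:
    explicit CachingWeightsProvider(std::string dir) : m_dir(std::move(dir)) {}

    std::vector<std::uint8_t> get_weights(const std::string& name) override {
        auto it = m_cache.find(name);
        if (it == m_cache.end()) {
            // Cache miss: load the whole tensor file once and keep it.
            std::ifstream f(m_dir + "/" + name, std::ios::binary);
            if (!f)
                throw std::runtime_error("cannot open weights file: " + name);
            std::vector<std::uint8_t> data(
                (std::istreambuf_iterator<char>(f)),
                std::istreambuf_iterator<char>());
            it = m_cache.emplace(name, std::move(data)).first;
        }
        return it->second; // served from RAM after the first load
    }

private:
    std::string m_dir;  // folder containing one file per tensor (assumed)
    std::unordered_map<std::string, std::vector<std::uint8_t>> m_cache;
};
```

After the first pass over the model, token generation would no longer touch the disk, but peak memory would grow to the full size of the quantized weights, which is exactly the trade-off described above.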
Thanks, Vito