vitoplantamura / OnnxStream

Lightweight inference library for ONNX files, written in C++. It can run SDXL on a Raspberry Pi Zero 2, as well as Mistral 7B on desktops and servers.

OnnxStream and TinyLlama? #15

Open · stl3 opened 1 year ago

stl3 commented 1 year ago

I was just wondering: would the methods used in OnnxStream also benefit a tiny language model like TinyLlama? I'd like to know how far resource usage could be brought down (I know it's not an SD model, I'm just wondering whether the same techniques could be applied to other types of models). TinyLlama uses about 550MB of RAM with the 4-bit-quantized TinyLlama-1.1B weights, which seems quite enticing for lower-end devices.

vitoplantamura commented 1 year ago

hi,

It would be interesting to try running TinyLlama with OnnxStream, but the problem would be latency: for the generation of each token, all the weights would be read from disk again (1.1GB of data, using 8-bit quantization). This could be avoided by implementing a simple WeightsProvider that caches all the weights in RAM, but then total memory consumption would be at the same level as other frameworks/libraries, making the use of OnnxStream meaningless. It could still be an interesting experiment though :-)
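For reference, here is a minimal sketch of that caching idea. The `WeightsProvider` interface below is hypothetical and written only for illustration; OnnxStream's actual class and method names may differ. The point is simply that the first access to each tensor pays the disk read, and every later access (i.e. every subsequent token) is served from RAM:

```cpp
// Sketch of a RAM-caching weights provider. Assumes weights are exported as
// one raw binary file per tensor in a folder, as in OnnxStream's exported
// model format; the base-class shape here is illustrative, not the real API.
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical interface: one call per tensor, identified by file name.
struct WeightsProvider {
    virtual ~WeightsProvider() = default;
    virtual const std::vector<uint8_t>& get_weights(const std::string& name) = 0;
};

// Caching implementation: the first access loads the tensor from disk, and
// later accesses return the cached bytes. This removes the per-token disk
// reads at the cost of holding all the weights resident in RAM.
class RamCachingWeightsProvider : public WeightsProvider {
public:
    explicit RamCachingWeightsProvider(std::string dir) : m_dir(std::move(dir)) {}

    const std::vector<uint8_t>& get_weights(const std::string& name) override {
        auto it = m_cache.find(name);
        if (it != m_cache.end())
            return it->second; // cache hit: no disk I/O

        std::ifstream f(m_dir + "/" + name, std::ios::binary | std::ios::ate);
        if (!f)
            throw std::runtime_error("cannot open weights file: " + name);

        std::vector<uint8_t> data(static_cast<size_t>(f.tellg()));
        f.seekg(0);
        f.read(reinterpret_cast<char*>(data.data()), data.size());

        return m_cache.emplace(name, std::move(data)).first->second;
    }

private:
    std::string m_dir; // folder containing the exported tensor files
    std::unordered_map<std::string, std::vector<uint8_t>> m_cache; // name -> bytes
};
```

This is exactly the trade-off described above: for TinyLlama-1.1B at 8-bit quantization the cache would hold roughly 1.1GB resident, which is the same order of memory use as other inference runtimes.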

Thanks, Vito
