stl3 opened 1 year ago
I was just wondering: would the methods used in OnnxStream further benefit a tiny language model like TinyLlama? I just wanted to know how far resource usage could be brought down (I know it's not an SD model, but I'm wondering if the same techniques could be applied to other types of models). TinyLlama uses about 550MB of RAM with the 4-bit-quantized TinyLlama-1.1B weights, which seems quite enticing for lower-end devices.

hi,

It would be interesting to try running TinyLlama with OnnxStream, but the problem would be latency: at the generation of each token, all the weights would be read from disk again (1.1GB of data, using 8-bit quantization). This could be prevented by implementing a simple WeightsProvider that caches all the weights in RAM, but then the total memory consumption would be at the same level as other frameworks/libraries, making the use of OnnxStream meaningless. However, it could be an interesting experiment :-)
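For the record, here is a minimal sketch of what such a caching provider could look like in C++. The `WeightsProvider` interface and the one-file-per-tensor layout shown below are assumptions for illustration, not OnnxStream's actual API:

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical provider interface: the runtime requests the raw bytes of a
// named tensor every time a layer needs its weights. (Illustrative only;
// OnnxStream's real WeightsProvider API differs.)
struct WeightsProvider {
    virtual ~WeightsProvider() = default;
    virtual std::vector<std::uint8_t> get_weights(const std::string& name) = 0;
};

// Reads each tensor file from disk on first use, then serves it from RAM.
// This removes the per-token disk reads at the cost of keeping the whole
// quantized model (~1.1GB at 8 bits) resident.
class CachingWeightsProvider : public WeightsProvider {
public:
    explicit CachingWeightsProvider(std::string dir) : m_dir(std::move(dir)) {}

    std::vector<std::uint8_t> get_weights(const std::string& name) override {
        auto it = m_cache.find(name);
        if (it == m_cache.end()) {
            // Cache miss: load the whole tensor file once and keep it.
            std::ifstream f(m_dir + "/" + name, std::ios::binary);
            if (!f)
                throw std::runtime_error("cannot open weights file: " + name);
            std::vector<std::uint8_t> data(
                (std::istreambuf_iterator<char>(f)),
                std::istreambuf_iterator<char>());
            it = m_cache.emplace(name, std::move(data)).first;
        }
        return it->second; // served from RAM after the first load
    }

private:
    std::string m_dir;  // folder containing one file per tensor (assumed)
    std::unordered_map<std::string, std::vector<std::uint8_t>> m_cache;
};
```

After the first pass over the model, token generation would no longer touch the disk, but peak memory would grow to the full size of the quantized weights, which is exactly the trade-off described above.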
Thanks, Vito