Open hassanzadeh opened 6 months ago
Hey guys, this is a great library, but I have a question: is this library able to use memory as efficiently as Llama.cpp? In other words, if I'm using a checkpoint with Llama.cpp on a small iOS device, will the same checkpoint work with swift-transformers (after conversion to Core ML), or is there a possibility that more memory is needed?

The main difference is that we are not doing quantization yet, so you need to have enough memory to run the model weights in 16-bit mode. Llama.cpp can run models in 16-bit, but it can also quantize down to 4-bit, which drastically reduces memory needs.
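The 16-bit vs. 4-bit difference can be put in rough numbers. A minimal sketch, assuming a 7B-parameter checkpoint as an illustrative size (the model size and helper function are assumptions, not part of either library), counting weight storage only — activations, KV cache, and runtime overhead come on top:

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return num_params * bits_per_weight / 8 / (1024 ** 3)

# Assumed example: a 7B-parameter checkpoint.
params = 7e9

fp16 = weight_memory_gib(params, 16)  # 16-bit, i.e. no quantization
q4 = weight_memory_gib(params, 4)     # 4-bit quantized weights

print(f"16-bit: {fp16:.1f} GiB")  # ~13.0 GiB
print(f" 4-bit: {q4:.1f} GiB")    # ~3.3 GiB
```

So on a device with a few GiB of usable memory, a checkpoint that fits in 4-bit form may simply not fit in 16-bit form.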