Closed: SinanAkkoyun closed this issue 1 year ago
I haven't looked at the architecture yet, but it probably isn't hard. If you want a fast model, there's also TinyLlama to consider. They've only released the first snapshot, but it is producing text. I've gotten it to 635 tokens/second on V2 so far, and given how tiny it is, it can probably be tuned to run faster still. (I.e., the implementation really isn't optimized for anything smaller than 3B.)
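For anyone who wants to reproduce a number like that, here's a rough sketch of timing simple generation with the exllamav2 Python API (the model path is hypothetical, and the API details may differ between versions):

```python
import time
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Hypothetical local path to a TinyLlama checkpoint
config = ExLlamaV2Config()
config.model_dir = "/models/tinyllama-1.1b"
config.prepare()

model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
generator.warmup()  # exclude one-time CUDA initialization from the timing

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

max_new_tokens = 512
start = time.time()
output = generator.generate_simple("Once upon a time,", settings, max_new_tokens)
elapsed = time.time() - start

print(output)
print(f"~{max_new_tokens / elapsed:.0f} tokens/second")
```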
Thank you so much for the fast response! 635 t/s sounds really amazing :D I am glad that you got V2 working!
Can't wait for it to release... ♥.♥
It's beautiful :o THANK YOU SO MUCH FOR ALL THE WORK!!!
You were also able to fit the 70B model on a single GPU, implemented your own mixed-precision quantization method, and now offer such awesome speeds?! Just mind-blowingly crazy...
Love it!
Hey! I wanted to ask: how difficult would it be to add phi-1.5 support? I would be super interested in running it; its small size would yield even faster generation speeds than the 3B OpenLLaMA model :)
I am just curious whether it would be much work. I totally get it if this is of no priority while exllama 2 is still so new. Thank you!
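In the meantime, for anyone who just wants to try the model, here's a minimal sketch of running phi-1.5 through plain Hugging Face transformers (assuming the microsoft/phi-1_5 checkpoint, which required trust_remote_code at the time); this is only a stopgap, not the exllama integration being asked about:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-1_5"  # Hugging Face checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")

# phi-1.5 is trained heavily on code, so a code-style prompt shows it off
inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```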