turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

phi-1.5 support? #284

Closed SinanAkkoyun closed 1 year ago

SinanAkkoyun commented 1 year ago

Hey! I wanted to ask: how difficult would it be to add phi-1.5 support? I would be super interested in running it; the small size would yield even faster generation speeds than the 3B OpenLLaMA model :)

I am just curious whether that would be much work. I totally get it if this is of no priority and exllama 2 is the focus right now. Thank you!

turboderp commented 1 year ago

I haven't looked at the architecture yet, but it probably isn't hard. If you want a fast model, there's also TinyLlama to consider. They've only released the first snapshot, but it is producing text. I've gotten it to 635 tokens/second on V2 so far, but given how tiny it is, it can probably be tuned to run faster still. (I.e., the implementation really isn't optimized for anything smaller than 3B.)
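
For reference, a tokens/second figure like that can be measured with a simple wall-clock benchmark around generation. Below is a minimal sketch following the example scripts in the exllamav2 repo; the model path, prompt, and sampler settings are assumptions for illustration, not details from this thread:

```python
# Rough tokens/second benchmark sketch for exllamav2.
# Assumes a quantized model converted for exllamav2 at the (hypothetical)
# path below; API usage follows the repo's example scripts.
import time

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/TinyLlama-1.1B"  # hypothetical local model path
config.prepare()

model = ExLlamaV2(config)
model.load()

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8  # arbitrary example setting

max_new_tokens = 512
generator.warmup()  # exclude one-time CUDA initialization from the timing

start = time.time()
output = generator.generate_simple("Once upon a time,", settings, max_new_tokens)
elapsed = time.time() - start

print(f"{max_new_tokens / elapsed:.1f} tokens/second")
print(output)
```

Warming up first matters for small models, since otherwise one-off kernel compilation and allocation costs dominate the measurement.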

SinanAkkoyun commented 1 year ago

Thank you so much for the fast response, 635 t/s sounds really amazing :D I am glad that you got V2 working!

SinanAkkoyun commented 1 year ago

Can't wait for it to release... ♥.♥

turboderp commented 1 year ago

https://github.com/turboderp/exllamav2

SinanAkkoyun commented 1 year ago

It's beautiful :o THANK YOU SO MUCH FOR ALL THE WORK!!!

You also managed to fit the 70B model on a single GPU, implemented your own mixed-precision quantization method, and now offer such awesome speeds?! Just mind-blowingly crazy...

Love it!