miguel-arrf opened 11 months ago
Hello @miguel-arrf!
Thanks a lot for the detailed report, much appreciated 🙌 I agree that Phi-2 is a very exciting model to try! There are additional quantization techniques that we could apply, but I'd suggest we debug `float16` first. Let me try to retrace your steps and I'll get back to you soon :)
Regarding speed in `float32`, it could be for a variety of reasons: perhaps some layers are being scheduled to run on CPU, or perhaps the model is using too much memory and your system is swapping. I'll take a look too. In addition, there are some performance optimization techniques for LLMs (KV caching, in particular) that we are currently working on, and those should help a lot. I'll keep you posted about that as well.
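In the meantime, one quick way to probe the CPU-fallback hypothesis is to load the converted model with different compute units and compare latencies. A rough coremltools sketch (the model path and input name/shape below are placeholders and depend on how the model was exported):

```python
import time

import numpy as np
import coremltools as ct

# Placeholder path: adjust to wherever the exported package lives.
MODEL_PATH = "exported/phi-2.mlpackage"

def time_prediction(compute_units):
    # Restricting compute units makes CPU fallbacks easier to spot: if CPU_ONLY
    # is about as fast as ALL, the model is probably running on CPU anyway.
    model = ct.models.MLModel(MODEL_PATH, compute_units=compute_units)
    # Input name and shape are assumptions; check model.get_spec() for the real ones.
    inputs = {"input_ids": np.ones((1, 64), dtype=np.int32)}
    start = time.perf_counter()
    model.predict(inputs)
    return time.perf_counter() - start

for cu in (ct.ComputeUnit.CPU_ONLY, ct.ComputeUnit.CPU_AND_GPU, ct.ComputeUnit.ALL):
    print(cu, f"{time_prediction(cu):.2f}s")
```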
Finally, if you used the latest version of `exporters`, I believe the tokenizer should have been picked up automatically by `swift-transformers` / `swift-chat`. I'll check that out too.
Hi, I wanted to know if this Swift Transformers conversion of Phi-2 is available on Hugging Face yet.
Hi @omkar806: not yet, but soon. We found some problems during conversion of the model. As @miguel-arrf described, `float16` inference does not work after conversion; we probably need to keep some layers in `float32`. I didn't have time to debug in depth, but I want to do it soon. We'll post here when it's done.
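To give an idea of what "keep some layers in `float32`" could mean in practice: coremltools has a selective float16 pass where you choose which ops stay in full precision. This is just a sketch of the mechanism, not the final fix (`traced_model` and `inputs` are assumed to come from the regular export step, and the op filter is a placeholder):

```python
import coremltools as ct

# `traced_model` and `inputs` come from the regular export step
# (torch.jit.trace on Phi-2 plus the Core ML input descriptions).
mlmodel = ct.convert(
    traced_model,
    inputs=inputs,
    convert_to="mlprogram",
    # Cast everything to float16 except ops that tend to be numerically
    # sensitive. Which ops actually need float32 for Phi-2 is exactly what
    # still has to be debugged; this filter is a placeholder.
    compute_precision=ct.transform.FP16ComputePrecision(
        op_selector=lambda op: op.op_type not in ("softmax", "layer_norm")
    ),
)
mlmodel.save("phi-2-mixed-precision.mlpackage")
```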
Hi @pcuenca, may I ask whether this is finished now?
Working on it this week
Hello @pcuenca, any updates on this?
Hi @pcuenca, anything on this yet?
Hi!
I'm converting Microsoft's Phi-2 model to use with `swift-transformers`. The conversion process is actually very seamless:
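Roughly, it boils down to something like this (a sketch: the config class and task name below are placeholders, the exact ones I used may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from exporters.coreml import export
from exporters.coreml.models import GPT2CoreMLConfig  # placeholder config class

model_ckpt = "microsoft/phi-2"
base_model = AutoModelForCausalLM.from_pretrained(
    model_ckpt, torchscript=True, trust_remote_code=True
)
preprocessor = AutoTokenizer.from_pretrained(model_ckpt)

# Core ML export config for causal LM / text generation (task name approximate).
coreml_config = GPT2CoreMLConfig(base_model.config, task="text-generation")

# Default export: float32.
mlmodel = export(preprocessor, base_model, coreml_config)
mlmodel.save("exported/phi-2.mlpackage")
```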
Note that by default the `export` function uses `float32`.

Then, I'm using the swift-chat repo to run the model. I'm using the Llama-2 tokenizer. It works perfectly well out of the box. There was only one missing token, the space character (' '), but apart from that it works.
The issue is that it is super, super slow (I have a MacBook Pro with an M1 and 16 GB of RAM) and it's using close to 11 GB of memory. Although the inference is slow, the output makes sense.
Given that it is so slow, I converted the model using `float16`.
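The only difference with respect to the first conversion is the `quantize` argument (again, a sketch):

```python
# Same setup as the float32 export above, only quantizing weights to float16.
mlmodel = export(preprocessor, base_model, coreml_config, quantize="float16")
mlmodel.save("exported/phi-2-float16.mlpackage")
```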
The model is now 5 GB, but the inference is giving me gibberish (the output before was something that made sense; now it's just a bunch of exclamation marks). I downloaded the model (the 5 GB one) onto my iPhone 14 Pro, and after a few seconds, while it is loading, the app just closes itself.
So, my questions are:

- Why is the model so slow (in `float32`)?
- Why is the model converted with `quantize="float16"` basically instantaneous, but outputting gibberish?

Thank you so much for the help!