huggingface / exporters

Export Hugging Face models to Core ML and TensorFlow Lite
Apache License 2.0

Export Phi-2 #67

Open miguel-arrf opened 11 months ago

miguel-arrf commented 11 months ago

Hi!

I'm converting Microsoft's Phi-2 model to use with swift-transformers.

The conversion process is actually very seamless:

from transformers import AutoTokenizer, AutoModelForCausalLM
from exporters.coreml import CoreMLConfig
from exporters.coreml import export

model = "microsoft/phi-2"

# Load tokenizer and PyTorch weights from the Hub
tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)
pt_model = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True, torchscript=True)

class Phi2CoreMLConfig(CoreMLConfig):
    modality = "text"

coreml_config = Phi2CoreMLConfig(pt_model.config, task="text-generation")
mlmodel = export(tokenizer, pt_model, coreml_config)
mlmodel.save("Phi2.mlpackage")

Note that by default the export function uses float32.

Then I'm using the swift-chat repo to run the model, with the Llama-2 tokenizer. It works out of the box; the only missing token was the space (' '), but apart from that it works.

The issue is that inference is very slow (I have an M1 MacBook Pro with 16 GB of RAM) and it uses close to 11 GB of memory. Although the inference is slow, the output makes sense.

Given that it is so slow, I converted the model using float16:

mlmodel = export(tokenizer, pt_model, coreml_config, quantize="float16")

The model is now 5 GB, but inference gives me gibberish (the output used to make sense; now it's just a bunch of exclamation marks). I also downloaded the 5 GB model onto my iPhone 14 Pro, and after a few seconds, while it is still loading, the app just closes itself.

  1. How can I further decrease the model size? Can we quantize the model even more using CoreML?
  2. Why is the inference speed so slow (with the default float32)?
  3. Why is the model with quantize="float16" basically instantaneous, but outputting gibberish?

Thank you so much for the help!

pcuenca commented 11 months ago

Hello @miguel-arrf!

Thanks a lot for the detailed report, much appreciated 🙌 I agree that Phi-2 is a very exciting model to try! There are additional quantization techniques that we could apply, but I'd suggest we debug float16 first. Let me try to retrace your steps and I'll get back to you soon :)
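To give a partial answer to your first question in the meantime: once you have the mlpackage, coremltools can compress the weights further, for example with palettization. Below is a minimal sketch, assuming coremltools >= 7; the 6-bit setting and the output filename are just placeholders, not something we have validated for Phi-2:

import coremltools as ct
import coremltools.optimize.coreml as cto

# Load the exported package and palettize its weights to 6 bits using k-means.
mlmodel = ct.models.MLModel("Phi2.mlpackage")
op_config = cto.OpPalettizerConfig(mode="kmeans", nbits=6)
config = cto.OptimizationConfig(global_config=op_config)
compressed = cto.palettize_weights(mlmodel, config)
compressed.save("Phi2-6bit.mlpackage")

Keep in mind that weight compression mainly reduces size on disk and weight memory; it won't fix the float16 accuracy issue by itself.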

Regarding speed in float32, it could be for a variety of reasons: perhaps some layers are being scheduled to run on CPU, perhaps the model is using too much memory and your system swaps. I'll take a look too. In addition to that, there are some performance optimization techniques for LLMs (kv caching, in particular) that we are currently working on, and that should help a lot. I'll keep you posted about that as well.
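If you want a quick experiment on your side, one hedged way to check scheduling is to load the same package with different compute-unit settings in Python and compare load time and prediction latency (the Xcode performance report gives a more detailed per-layer view). The input names and shapes for predict depend on your export, so the prediction call is only indicated in a comment:

import coremltools as ct

# Force different compute-unit placements to see how they affect load time and latency.
cpu_only = ct.models.MLModel("Phi2.mlpackage", compute_units=ct.ComputeUnit.CPU_ONLY)
cpu_gpu = ct.models.MLModel("Phi2.mlpackage", compute_units=ct.ComputeUnit.CPU_AND_GPU)
everything = ct.models.MLModel("Phi2.mlpackage", compute_units=ct.ComputeUnit.ALL)

# To time inference, call `model.predict(inputs)` on each of these with inputs
# matching your export's names/shapes; inspect `model.get_spec()` to confirm them.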

Finally, if you used the latest version of exporters, I believe that the tokenizer should have been picked up automatically by swift-transformers / swift-chat. I'll check that out too.

omkar806 commented 9 months ago

Hi

omkar806 commented 9 months ago

I wanted to know if this swift-transformers version of Phi-2 is available on Hugging Face.

pcuenca commented 9 months ago

Hi @omkar806: not yet, but soon. We found some problems during conversion of the model. As @miguel-arrf described, float16 inference does not work after conversion; we probably need to keep some layers in float32. I didn't have time to debug in depth, but want to do it soon. We'll post here when it's done.
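For reference, when converting directly with coremltools (rather than through exporters' quantize="float16"), it's possible to cast most ops to float16 while keeping selected ops in float32 via an op_selector. This is only a sketch of the idea; the op types to protect are an assumption we still need to verify for Phi-2:

import coremltools as ct

# Cast most ops to float16, but keep numerically sensitive ones in float32.
# Returning True means "cast this op to float16". The list below is an
# assumption, not a verified fix for Phi-2.
def cast_to_fp16(op):
    return op.op_type not in {"softmax", "layer_norm"}

fp16_with_exceptions = ct.transform.FP16ComputePrecision(op_selector=cast_to_fp16)

# This would be passed when converting the traced model directly, e.g.:
# mlmodel = ct.convert(traced_model, inputs=[...], compute_precision=fp16_with_exceptions)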

baozzz1 commented 7 months ago

Hi @pcuenca, may I ask if it is finished now?

pcuenca commented 7 months ago

Working on it this week

Jaswanth-Devarinti commented 4 months ago

Hello @pcuenca, any updates on this?

Intiserahmed commented 3 months ago

Hi @pcuenca, anything on this yet?