tloen / llama-int8

Quantized inference code for LLaMA models
GNU General Public License v3.0

Tracking issue for Mac support #4

Open · pannous opened this issue 1 year ago

pannous commented 1 year ago

M1 / M2 with 32GB … 128GB of RAM: any hope of running this?

remixer-dec commented 1 year ago

No luck with this repo: the "bitsandbytes" dependency relies heavily on CUDA. There is, however, a repo for CPU inference; just change prompts to prompts[0] so it doesn't crash with max_batch_size=1.
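For reference, the prompts[0] change might look roughly like the sketch below, assuming the CPU-inference fork keeps the usual LLaMA example-script interface (generator, load, and the prompt strings here are illustrative, not the fork's exact code):

```python
# Illustrative sketch only. Assumes the example script already built a
# generator, e.g. something like:
#   generator = load(ckpt_dir, tokenizer_path, max_seq_len=512, max_batch_size=1)

prompts = [
    "The capital of France is",
    "Simply put, the theory of relativity states that",
]

# With max_batch_size=1 the full prompt list overflows the batch, so pass
# only the first prompt as a one-element batch (the prompts[0] change above).
results = generator.generate(
    [prompts[0]],
    max_gen_len=20,
    temperature=0.8,
    top_p=0.95,
)
print(results[0])
```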
It takes more than 10 minutes to produce output with max_gen_len=20; even GPT-J 6B only took me around a minute on CPU. I also tried to make an MPS port with GPU acceleration. It runs faster, but the output is not good enough in my opinion, and I'm not sure whether the CPU output is always this good or I just got lucky on my first generation. UPDATE: the model gives good outputs with Python 3.10 + pytorch-nightly.
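For context, the device selection in an MPS port typically follows the standard PyTorch pattern sketched below. This is not the actual patch, just an illustration; torch.backends.mps.is_available() requires Apple Silicon, macOS 12.3+, and a recent PyTorch build such as the nightly mentioned above:

```python
import torch

# Pick the fastest available backend: Apple MPS, then CUDA, then CPU.
# Illustrative only; not the actual port.
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# Model weights and every input tensor must live on the same device
# before generation; a toy tensor stands in for them here.
x = torch.randn(4, 4, device=device)
print(device, x.sum().item())
```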

pannous commented 1 year ago

thanks!

remixer-dec commented 1 year ago

Actually, I was wrong. After trying my port with a newer Python + PyTorch, the outputs were as good as the CPU ones. I'm happy it worked after all!