tloen / alpaca-lora

Instruct-tune LLaMA on consumer hardware

How to run with M2 MPS #444

Open elcolie opened 1 year ago

elcolie commented 1 year ago

The code base relies on NVIDIA/CUDA. How can I use Apple MPS?

crux82 commented 1 year ago

It may not work correctly.

The code seems to support MPS (see https://github.com/tloen/alpaca-lora/blob/8bb8579e403dc78e37fe81ffbb253c413007323f/generate.py#L53), but it also seems that PyTorch/MPS does not support 16-bit operations: https://github.com/pytorch/pytorch/issues/78168
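
For reference, the device-selection logic in generate.py amounts to roughly the following (a sketch, not a verbatim copy):

    import torch

    # Prefer CUDA, fall back to MPS on Apple Silicon, otherwise use the CPU
    if torch.cuda.is_available():
        device = "cuda"
    elif torch.backends.mps.is_available():
        device = "mps"
    else:
        device = "cpu"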

I ran the same code on a CUDA machine and it gives a result; on the M2 it also runs (and is quite fast), but the results are different.

Anyway, if I replace the loading procedure with

    from peft import PeftModel
    from transformers import LlamaForCausalLM

    # Load the base model and the LoRA adapters in full precision on the CPU
    model = LlamaForCausalLM.from_pretrained(
        BASE_MODEL,
    )
    model = PeftModel.from_pretrained(
        model,
        ADAPTERS_MODEL,
    )

the results are numerically correct, but it is quite slow.
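
A quick way to quantify the drift is to compare the logits of the two variants on the same prompt. A minimal sketch, assuming a loaded tokenizer, and with model_cpu / model_mps as hypothetical names for the fp32 model above and the fp16 model from generate.py:

    import torch

    # tokenizer, model_cpu (fp32 on CPU), and model_mps (fp16 on "mps") are assumed
    inputs = tokenizer("The quick brown fox", return_tensors="pt")
    with torch.no_grad():
        logits_cpu = model_cpu(**inputs).logits.float()
        logits_mps = model_mps(**{k: v.to("mps") for k, v in inputs.items()}).logits.float().cpu()
    # A large max difference points at fp16/MPS numerics rather than decoding settings
    print((logits_cpu - logits_mps).abs().max())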

Does anyone else have the same problem?

x4080 commented 1 year ago

@crux82 any progress on this?

crux82 commented 1 year ago

No, I am still trying to figure out the problem here: https://github.com/pytorch/pytorch/issues/96610#issuecomment-1564265703

x4080 commented 1 year ago

@crux82 thanks

crux82 commented 1 year ago

It actually runs with the latest update of PyTorch, even though the results are not consistent with the ones obtained on the GPU :-(

I updated to the very latest (nightly) version of PyTorch:

pip3 install --upgrade --no-deps --force-reinstall --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu

The software also runs using FP16:

    import torch
    from peft import PeftModel
    from transformers import LlamaForCausalLM

    # device is "mps" here, selected as in generate.py
    model = LlamaForCausalLM.from_pretrained(
        BASE_MODEL,
        device_map={"": device},
        torch_dtype=torch.float16,
    )
    model = PeftModel.from_pretrained(
        model,
        ADAPTERS_MODEL,
        device_map={"": device},
        torch_dtype=torch.float16,
    )

I have a model trained on a GPU that I would like to apply to new data. Unfortunately, model.generate() produces outputs different from those in the GPU environment. Moreover, when I remove torch_dtype=torch.float16 from the code above, the code hangs during generation.
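
One thing worth ruling out is sampling noise: with do_sample=False, generate() is deterministic, so any remaining GPU-vs-MPS mismatch is numerical rather than sampling-related. A minimal sketch, assuming tokenizer, model, device, and a prompt string from the snippets above:

    import torch

    # Greedy decoding: no sampling, so repeated runs give identical output
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(tokenizer.decode(output[0], skip_special_tokens=True))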

Has anyone been able to use this model?