haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Can LLaVA run inference on CPU? #865

Open wenli135 opened 11 months ago

wenli135 commented 11 months ago

Question

I was trying to run LLaVA inference on CPU, but it complains "Torch not compiled with CUDA enabled". I noticed that cuda() is called when loading the model. If I remove all the cuda() invocations, is it possible to run inference on CPU?

Thanks.

papasanimohansrinivas commented 11 months ago

You need to install the CPU build of torch and set the device map to CPU when loading the model, @wenli135.
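For reference, a minimal sketch of what that looks like with this repo's load_pretrained_model helper (the builder already accepts a device argument). It assumes a CPU-only torch build and uses liuhaotian/llava-v1.5-7b as an example checkpoint; swap in whichever model path you actually use:

# Minimal CPU-loading sketch, not an official recipe; the model path is an example.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
    device_map={"": "cpu"},  # keep every module on the CPU
    device="cpu",            # ask the builder to place everything on CPU instead of CUDA
)
model = model.float()  # weights default to fp16, which is slow/limited on CPU; fp32 is safer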

morteza102030 commented 11 months ago

You need to install the CPU build of torch and set the device map to CPU when loading the model, @wenli135.

Is it possible for you to give a complete example of how to run LLaVA_13b_4bit_vanilla_colab without a GPU?

akkimind commented 11 months ago

I made some changes in the code to run inference on CPU. The model loads, but while trying to optimize it with model = ipex.optimize(model, dtype=torch.bfloat16) I get the error: "BF16 weight prepack needs the cpu support avx512bw, avx512vl and avx512dq, please set dtype to torch.float or set weights_prepack to False". If I set dtype to torch.float, the model doesn't support it, and if I set weights_prepack to False, the model takes forever to return a response. Is there a specific CPU I should use?
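For what it's worth, the error message itself points at the two knobs. A rough sketch, assuming model is an already-loaded LLaVA model (e.g. from the CPU-loading sketch earlier in this thread) and that intel-extension-for-pytorch is installed:

import torch
import intel_extension_for_pytorch as ipex

# get_cpu_capability() (available in recent PyTorch) reports the best ISA level, e.g. "AVX512".
if "AVX512" in torch.backends.cpu.get_cpu_capability():
    # AVX512-capable CPUs can use the BF16 weight prepack path.
    model = ipex.optimize(model, dtype=torch.bfloat16)
else:
    # Older CPUs: either fall back to fp32, or keep bf16 without weight prepacking
    # (functional, but noticeably slower, as observed above).
    model = ipex.optimize(model, dtype=torch.bfloat16, weights_prepack=False)

Roughly speaking, the prepack path needs AVX512 (avx512bw/vl/dq), and real BF16 speedups only show up on parts with AVX512-BF16 or AMX, such as recent Xeon Scalable CPUs; on anything older, fp32 is usually the practical choice.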

ratan commented 10 months ago

Was anyone able to run LLaVA inference on CPU without installing the Intel Extension for PyTorch? Any pointer would be really helpful.

feng-intel commented 9 months ago

Hi Ratan, here is a bare-metal Intel CPU solution for LLMs, intel xFasterTransformer, but there is no LLaVA support yet; you can try it first. llama.cpp also supports CPU. We will enable Intel dGPU/iGPU later.

Could you tell us why you don't want to use the Intel Extension for PyTorch? Thanks.
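If you'd rather avoid the Intel extension entirely, one llama.cpp-based route is the llama-cpp-python bindings, which can run LLaVA GGUF checkpoints on plain CPU. A hedged sketch (the file names are placeholders, and the Llava15ChatHandler API comes from the llama-cpp-python multimodal docs, not from this repo):

import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Quantized LLaVA weights plus the matching CLIP/mmproj file, both in GGUF format.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="llava-v1.5-7b.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,      # leave room for the image embedding tokens
    n_gpu_layers=0,  # CPU only
)

with open("example.jpg", "rb") as f:
    data_uri = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

out = llm.create_chat_completion(messages=[
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": data_uri}},
        {"type": "text", "text": "Describe this image."},
    ]},
])
print(out["choices"][0]["message"]["content"])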

drzraf commented 2 months ago

I tried some of these paths:

So, natively, from HF:

  1. With low_cpu_mem_usage = False

    transformers/modeling_utils.py; ValueError: Passing along a device_map requires low_cpu_mem_usage=True

  2. With low_cpu_mem_usage = True

    You can't pass load_in_4bit or load_in_8bit as a kwarg when passing quantization_config argument at the same time (mentioned/replied to in https://github.com/haotian-liu/LLaVA/issues/1638)

  3. Fixing the above, we get:

    transformers/quantizers/quantizer_bnb_4bit.py : ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model

which I can't explain, because using load_pretrained_model(load_4bit=True, device='cpu') leads to device_map = {'': 'cpu'}, which is quite clear. Still, we can bypass this by adding llm_int8_enable_fp32_cpu_offload=True to the BitsAndBytesConfig (but does that make any sense with load_4bit?). Well, anyway, it loads (it took only 10 minutes).

=> blocked here.
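Since the bitsandbytes 4-bit/8-bit kernels are, as far as I know, CUDA-oriented, the quantized HF path may simply be a dead end on pure CPU; a plainer fallback is to load the llava-hf checkpoint in fp32 and accept the memory cost. A sketch under those assumptions (model id and prompt template follow the llava-hf model card):

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float32,   # no bitsandbytes quantization on CPU
    low_cpu_mem_usage=True,
    device_map={"": "cpu"},
)
processor = AutoProcessor.from_pretrained(model_id)

prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"
image = Image.open("example.jpg")
inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))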

Finally, regarding https://github.com/intel/xFasterTransformer, it's not quite clear whether it replaces or complements intel-extension-for-pytorch [CPU/XPU], and for which specific hardware.

If anyone could come up with answers/solutions for at least some of these, that'd be great.

feng-intel commented 2 months ago
  1. For intel-extension-for-pytorch: it supports LLaVA in fp32, bf16, int8 and int4 on Intel CPU, iGPU and dGPU. If you hit any issue, you can report it there; someone from Intel will help you.
  2. Ollama now supports Intel CPU, iGPU and dGPU. You need to build it from source. Below are the llama3.1 steps for your reference.
    
    $ git clone https://github.com/ollama/ollama.git
    $ source intel/oneapi/setvars.sh

Install go

$ wget https://go.dev/dl/go1.23.0.linux-amd64.tar.gz
$ mkdir ~/go_1.23.0 && tar zxf go1.23.0.linux-amd64.tar.gz -C ~/go_1.23.0
$ export PATH=$PATH:~/go_1.23.0/go/bin

$ cd ollama
$ go generate ./...
$ go build .  # the ollama binary will be generated

Optionally, stop any previously running ollama service first

$ ps -A | grep ollama
$ netstat -aon | grep 11434
$ sudo service ollama stop

Start ollama server

$ OLLAMA_INTEL_GPU=1 ./ollama serve  # without OLLAMA_INTEL_GPU=1, it will run on the CPU

Start ollama client to test

Option 1

$ ./ollama run llama3.1

Option 2

$ curl --noproxy "localhost" http://localhost:11434/api/generate -d '{ "model": "llama3.1", "prompt":"Why is the sky blue?" }'
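For a scripted test against a LLaVA model on the same server (after ollama pull llava), the generate endpoint also accepts base64 images. A small standard-library sketch, with the endpoint and fields taken from the ollama API docs:

import base64
import json
import urllib.request

with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "llava",
    "prompt": "What is in this picture?",
    "images": [image_b64],   # multimodal models accept base64-encoded images
    "stream": False,         # return one JSON object instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])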