Open · wenli135 opened this issue 11 months ago
@wenli135 you need to install the CPU build of torch and set the device map to CPU on the model-loading side.
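A minimal sketch of what that advice means in practice, assuming a transformers-style from_pretrained loader and a CPU-only torch build; the model id is a placeholder, and AutoModelForCausalLM stands in for the LLaVA-specific class (LLaVA's own load_pretrained_model exposes the same idea through its device argument, as discussed further down the thread):

import torch
from transformers import AutoModelForCausalLM

# Load the weights directly onto the CPU (requires the CPU build of torch).
model = AutoModelForCausalLM.from_pretrained(
    "path/or/id-of-your-llava-checkpoint",  # placeholder
    device_map={"": "cpu"},
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
)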
Is it possible for you to give a complete example of how to run LLaVA_13b_4bit_vanilla_colab without a GPU?
I made some changes in the code to run inference on CPU. The model loads, but I am getting an error:
BF16 weight prepack needs the cpu support avx512bw, avx512vl and avx512dq, please set dtype to torch.float or set weights_prepack to False
while trying to optimize the model (model = ipex.optimize(model, dtype=torch.bfloat16)).
If I set dtype to torch.float, the model doesn't support it, and if I set weights_prepack to False, the model takes forever to return a response. Is there a specific CPU I should use?
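For reference, the two fallbacks the error message points at look like this in code; a minimal sketch assuming model is the already-loaded LLaVA model, not a confirmed fix for the slowness:

import torch
import intel_extension_for_pytorch as ipex

model = model.eval()

# Option 1: drop to float32 so no BF16 weight prepacking is needed.
model = ipex.optimize(model, dtype=torch.float32)

# Option 2 (alternative): keep bfloat16 but skip weight prepacking.
# model = ipex.optimize(model, dtype=torch.bfloat16, weights_prepack=False)

Per the error message, the BF16 prepack path needs avx512bw, avx512vl, and avx512dq; on CPUs without those instructions, float32 is the realistic setting rather than tweaking weights_prepack.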
Did anyone manage to run LLaVA inference on CPU without installing the Intel Extension for PyTorch environment? Any pointers would be really helpful.
Hi Ratan, here is a bare-metal Intel CPU solution for LLMs, intel xFasterTransformer, but there is no LLaVA support yet. You can try that first. llama.cpp also supports CPU. We will enable Intel dGPU/iGPU later.
Could you tell us why you don't want to use Intel Extension for PyTorch? Thanks.
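On the llama.cpp route mentioned above, a minimal CPU-only sketch with the llama-cpp-python bindings looks roughly like this; the GGUF path is a placeholder, and a converted GGUF of the model (plus, for LLaVA, its multimodal projector) is assumed to already exist:

from llama_cpp import Llama

# n_gpu_layers=0 keeps all layers on the CPU.
llm = Llama(model_path="./models/model-q4_k_m.gguf", n_gpu_layers=0, n_ctx=2048)

out = llm("Q: Why is the sky blue? A:", max_tokens=128)
print(out["choices"][0]["text"])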
Tried some of these paths:
convert_hf_to_gguf.py on the HF model (to process LlavaMistralForCausalLM the way it does for LlamaForCausalLM), but stumbled upon other problems (Can not map tensor 'model.image_newline').
So, natively, from HF:
With low_cpu_mem_usage = False, transformers/modeling_utils.py raises ValueError: Passing along a device_map requires low_cpu_mem_usage=True.
With low_cpu_mem_usage = True, we get: You can't pass load_in_4bit or load_in_8bit as a kwarg when passing quantization_config argument at the same time (mentioned/replied to in https://github.com/haotian-liu/LLaVA/issues/1638).
Fixing the above, transformers/quantizers/quantizer_bnb_4bit.py raises ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. I can't explain this, because using load_pretrained_model(load_4bit=True, device='cpu') leads to a device_map = {'': 'cpu'}, which is quite clear. Still, we can bypass it by adding llm_int8_enable_fp32_cpu_offload=True to the BitsAndBytesConfig (but does it make any sense with load_4bit?). Well, anyway, it loads (it took only 10 minutes).
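For concreteness, a sketch of roughly that load path written directly against transformers; the model id is illustrative, AutoModelForCausalLM stands in for the LLaVA-specific class, and whether bitsandbytes 4-bit actually executes efficiently on CPU is exactly the open question raised above:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "liuhaotian/llava-v1.6-mistral-7b"  # illustrative checkpoint id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    # bypasses the "enough GPU RAM" check when modules are dispatched to CPU
    llm_int8_enable_fp32_cpu_offload=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,  # do not also pass load_in_4bit as a kwarg
    device_map={"": "cpu"},
    low_cpu_mem_usage=True,            # required when a device_map is given
)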
Now comes intel-extension-for-pytorch, which indeed has a config for this model. Whether ipex.optimize(inplace=True) is passed or not (if not, the memory footprint is doubled), we get RuntimeError: could not create a primitive descriptor for a convolution forward propagation primitive => blocked here.
Finally, regarding https://github.com/intel/xFasterTransformer, it's not quite clear whether it replaces or complements intel-extension-for-pytorch [CPU/XPU], and for which specific hardware.
If anyone could come up with answers/solutions for at least some of these, that would be great.
$ git clone https://github.com/ollama/ollama.git
$ source intel/oneapi/setvars.sh
$ wget https://go.dev/dl/go1.23.0.linux-amd64.tar.gz
$ mkdir ~/go_1.23.0 && tar zxf go1.23.0.linux-amd64.tar.gz -C ~/go_1.23.0
$ export PATH=$PATH:~/go_1.23.0/go/bin
$ cd ollama
$ go generate ./...
$ go build .  # the ollama binary will be generated
$ ps -A | grep ollama
$ netstat -aon | grep 11434
$ sudo service ollama stop
$ OLLAMA_INTEL_GPU=1 ./ollama serve  # without OLLAMA_INTEL_GPU=1, it will run on the CPU
$ ./ollama run llama3.1
$ curl --noproxy "localhost" http://localhost:11434/api/generate -d '{ "model": "llama3.1", "prompt":"Why is the sky blue?" }'
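The same /api/generate call can be made from Python; a small convenience sketch that only assumes the server started above is listening on the default port 11434:

import requests

# Non-streaming request to the local ollama server started above.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1", "prompt": "Why is the sky blue?", "stream": False},
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])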
Question
I was trying to run LLaVA inference on CPU, but it complains "Torch not compiled with CUDA enabled". I noticed that cuda() is called when loading the model. If I remove all the cuda() invocations, is it possible to run inference on CPU?
Thanks.
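Removing the .cuda() calls by hand is usually not enough; the weights also need to be loaded onto the CPU. A minimal sketch of the idea, using the repo's load_pretrained_model with its device argument as mentioned earlier in the thread (the checkpoint id is illustrative, and the builder may still cast weights to float16, which is slow on CPU):

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "liuhaotian/llava-v1.5-13b"  # illustrative checkpoint id

# device="cpu" asks the loader to build a CPU device map (see the discussion above).
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
    device="cpu",
)
model.eval()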