haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] M1 Max still has issues in the final step #704

Open bioinfomagic opened 10 months ago

bioinfomagic commented 10 months ago

Describe the issue

Issue:

I have enabled the M1 chip using --device mps but still get the errors.

Command:

python3 -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file "https://llava-vl.github.io/static/images/view.jpg" \
    --load-4bit --device mps

Log:

/opt/homebrew/lib/python3.11/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
'NoneType' object has no attribute 'cadam32bit_grad_fp32'
[2023-10-30 01:27:00,711] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.28s/it]
USER: hello
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/bidetime/Research/Projects/LLaVA/llava/serve/cli.py", line 125, in <module>
    main(args)
  File "/Users/bidetime/Research/Projects/LLaVA/llava/serve/cli.py", line 87, in main
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
ASSISTANT: %                 

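For context, the traceback above points at the root cause: in this version, cli.py (line 87 here) moves the input IDs with a hard-coded .cuda() call, so --device mps alone cannot help on a Mac build of PyTorch that has no CUDA support. A minimal sketch of the device-agnostic pattern, with illustrative names rather than the actual LLaVA code:

import torch

# Illustrative sketch, not the actual LLaVA source: instead of forcing CUDA
# with .cuda(), move the tensor to whatever device was requested
# (cuda, mps, or cpu), so the same code path runs on Apple Silicon.
device = "mps" if torch.backends.mps.is_available() else "cpu"

input_ids = torch.tensor([1, 2, 3])            # stand-in for tokenizer_image_token(...)
input_ids = input_ids.unsqueeze(0).to(device)  # .to(device) instead of .cuda()
print(input_ids.device)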

haotian-liu commented 10 months ago

macOS support was just updated, with quantization coming later. Please pull the latest code base and install/run following the instructions here. You may also try https://github.com/ggerganov/llama.cpp/pull/3436.

bioinfomagic commented 10 months ago

> macOS support was just updated, with quantization coming later. Please pull the latest code base and install/run following the instructions here. You may also try ggerganov/llama.cpp#3436.

Thanks very much. I have run the following to update to the latest git repo: git pull and pip install -e .

And got the output: Successfully installed llava-1.1.3

Then I started the process again, but the same issue persists: text interaction works, but once I load the picture, the error pops up: NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.

haotian-liu commented 10 months ago

@bioinfomagic what about trying with the CLI?

python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file "https://llava-vl.github.io/static/images/view.jpg" \
    --device mps

bioinfomagic commented 10 months ago

> @bioinfomagic what about trying with the CLI?
>
> python -m llava.serve.cli \
>     --model-path liuhaotian/llava-v1.5-7b \
>     --image-file "https://llava-vl.github.io/static/images/view.jpg" \
>     --device mps

Oh, thanks very much Haotian, it works on my Mac now.

haotian-liu commented 10 months ago

@bioinfomagic

Btw, how is the speed on M1 Max and how much RAM do you have?

bioinfomagic commented 10 months ago

> @bioinfomagic
>
> Btw, how is the speed on M1 Max and how much RAM do you have?

The web UI works perfectly; I can now run LLaVA on my Mac with output similar to the LLaVA online demo, just at a slower speed. It somehow used 85 GB to 125 GB of RAM (depending on the run) and the speed is around 2 to 3 words per second. According to Activity Monitor, CPU usage is about 105% and GPU usage is around 35%. Regarding speed, in other LLM apps, when I enable Metal the CPU load drops to 30%, the GPU load goes up to 100% or 200%, and the speed is much faster, around 10 words per second, similar to the online ChatGPT.
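(Aside: to check whether PyTorch can dispatch work to the Apple GPU at all, independent of LLaVA, a generic MPS smoke test like the following can help; this is plain PyTorch, not part of the LLaVA code base.)

import torch

# Generic PyTorch check (not LLaVA-specific): verify the MPS backend is
# available, then run a small matmul on the Apple GPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
    a = torch.randn(2048, 2048, device=device)
    b = torch.randn(2048, 2048, device=device)
    c = a @ b                    # executes on the GPU via Metal
    torch.mps.synchronize()      # wait for the GPU work to finish (recent PyTorch)
    print("MPS matmul OK:", c.shape)
else:
    print("MPS backend not available; PyTorch will fall back to CPU")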

The CLI still seems to have issues, not sure if it is my problem, but the web UI works very well.

-m llava.serve.cli --model-path liuhaotian/llava-v1.5-7b --image-file "https://llava-vl.github.io/static/images/view.jpg" --device mps
[2023-10-31 20:28:04,996] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Loading checkpoint shards: 100%|████████████████████████| 2/2 [00:03<00:00, 1.82s/it]
USER: hello
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/LLaVA/llava/serve/cli.py", line 125, in <module>
    main(args)
  File "/LLaVA/llava/serve/cli.py", line 87, in main
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/cuda/__init__.py", line 289, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
ASSISTANT: %

But if I add --load-4bit, it won't load:

python3 -m llava.serve.cli --model-path liuhaotian/llava-v1.5-7b --image-file "https://llava-vl.github.io/static/images/view.jpg" --load-4bit --device mps
[2023-10-31 20:26:52,172] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.11/3.11.6/Frameworks/Python.framework/Versions/3.11/lib/python3.11/importlib/metadata/__init__.py", line 563, in from_name
    return next(cls.discover(name=name))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "//LLaVA/llava/serve/cli.py", line 125, in <module>
    main(args)
  File "/LLaVA/llava/serve/cli.py", line 32, in main
    tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name, args.load_8bit, args.load_4bit, device=args.device)
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/LLaVA/llava/model/builder.py", line 33, in load_pretrained_model
    kwargs['quantization_config'] = BitsAndBytesConfig(
                                    ^^^^^^^^^^^^^^^^^^^
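(The --load-4bit failure is consistent with the earlier bitsandbytes warning: bitsandbytes provides CUDA-only quantization, so building a BitsAndBytesConfig on an MPS-only machine cannot work, which matches the "quantization coming later" note above. A rough, purely illustrative sketch of guarding that path, not the actual builder.py logic:)

import torch

# Illustrative sketch only, not LLaVA's builder.py: request bitsandbytes
# 4-bit quantization only when running on CUDA, since bitsandbytes has no
# Apple-GPU (MPS) backend.
def build_model_kwargs(device: str, load_4bit: bool) -> dict:
    kwargs = {"torch_dtype": torch.float16}
    if load_4bit and device == "cuda":
        from transformers import BitsAndBytesConfig  # CUDA-only quantization path
        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
        )
    return kwargs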

haotian-liu commented 10 months ago

It seems that you haven't pulled the latest code base:

https://github.com/haotian-liu/LLaVA/blob/main/llava/serve/cli.py#L87

bioinfomagic commented 10 months ago

> It seems that you haven't pulled the latest code base:
>
> https://github.com/haotian-liu/LLaVA/blob/main/llava/serve/cli.py#L87

Thank you very much, you are right. I somehow didn't update the code base correctly; now that I've updated it, it works perfectly for both the CLI and the web UI.

python3 -m llava.serve.cli --model-path liuhaotian/llava-v1.5-7b --image-file "https://llava-vl.github.io/static/images/view.jpg" --device mps
[2023-10-31 20:50:05,977] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Loading checkpoint shards: 100%|████████████████████████| 2/2 [00:03<00:00, 1.69s/it]
USER: hey
ASSISTANT: Hello! How can I help you today?
USER: summarize the pic

haotian-liu commented 10 months ago

Hmmm. I previously thought it was because of my poor M2 16GB, so it seems that the MPS still needs some optimization.

bioinfomagic commented 10 months ago

> Hmmm. I previously thought it was because of my poor M2 16GB, so it seems that the MPS still needs some optimization.

I totally agree. I can see LM Studio has much better Metal support; it runs a 70B model much faster, and with Metal enabled the speed difference is clearly noticeable.