haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

Multiple GPU inference is broken with LLaVA 1.6 #1050

Closed: hp1337 closed this issue 6 months ago

hp1337 commented 7 months ago

Describe the issue

Issue: Multiple-GPU inference is broken with LLaVA 1.6. The same command with the model liuhaotian/llava-v1.5-13b works fine.

Command:

CUDA_VISIBLE_DEVICES=0,1 python -m llava.serve.cli --model-path ../models/liuhaotian_llava-v1.6-34b --load-4bit --image-file foo.png

Log:

You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:20<00:00, 1.38s/it]
USER: What is the document?
Traceback (most recent call last):
  File "/home//miniconda3/envs/llava/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home//miniconda3/envs/llava/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home//LLaVA/llava/serve/cli.py", line 126, in <module>
    main(args)
  File "/home//LLaVA/llava/serve/cli.py", line 95, in main
    output_ids = model.generate(
  File "/home//miniconda3/envs/llava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home//LLaVA/llava/model/language_model/llava_llama.py", line 125, in generate
    ) = self.prepare_inputs_labels_for_multimodal(
  File "/home//LLaVA/llava/model/llava_arch.py", line 181, in prepare_inputs_labels_for_multimodal
    image_feature = torch.cat((
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument tensors in method wrapper_CUDA_cat)

I will update if I can figure out where the bug is. Thank you.
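
(The error itself is a generic cross-device concatenation failure. A minimal standalone repro of the same class of error, assuming a machine with at least 2 GPUs:)

import torch

a = torch.randn(4, device="cuda:0")
b = torch.randn(4, device="cuda:1")
# Raises: RuntimeError: Expected all tensors to be on the same device,
# but found at least two devices, cuda:1 and cuda:0!
torch.cat((a, b))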

nkarpovdb commented 7 months ago

+1 same

(llava) root@0119-194908-8ernbatz-10-68-129-235:~/LLaVA# CUDA_VISIBLE_DEVICES=0,1 python -m llava.serve.cli --model-path liuhaotian/llava-v1.6-34b --load-4bit --image-file "https://llava-vl.github.io/static/images/view.jpg"
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:14<00:00,  1.03it/s]
USER: describe this image
Traceback (most recent call last):
  File "/root/miniconda3/envs/llava/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/llava/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/LLaVA/llava/serve/cli.py", line 126, in <module>
    main(args)
  File "/root/LLaVA/llava/serve/cli.py", line 95, in main
    output_ids = model.generate(
  File "/root/miniconda3/envs/llava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/LLaVA/llava/model/language_model/llava_llama.py", line 125, in generate
    ) = self.prepare_inputs_labels_for_multimodal(
  File "/root/LLaVA/llava/model/llava_arch.py", line 181, in prepare_inputs_labels_for_multimodal
    image_feature = torch.cat((
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument tensors in method wrapper_CUDA_cat)
samidten commented 7 months ago

+1 same yes, please help fix this!

Silviase commented 6 months ago

+1 same, please fix this

thisthrowaway commented 6 months ago

+1. But I got it working by passing --device cuda:0 when creating the model_worker for testing the model. This doesn't fix the issue; it only circumvents the bug by using a single GPU.
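
(Concretely, the single-GPU workaround looks something like this; the controller/port flags are the usual model_worker arguments, shown here only for illustration:)

CUDA_VISIBLE_DEVICES=0 python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:20001 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.6-34b --device cuda:0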

yhygta commented 6 months ago

+1 same, please fix this

wassimea commented 6 months ago

+1 same

haotian-liu commented 6 months ago

Hi all, sorry for the inconvenience. Please pull the latest code; this should be fixed by https://github.com/haotian-liu/LLaVA/pull/1057.
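
(For anyone unsure what "pull the latest code" involves for an editable install, the usual update sequence is:)

cd LLaVA
git pull
pip install -e .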

samidten commented 6 months ago

Thanks for working on a fix. The model now loads fine across multiple GPUs; however, during inference it throws this error:

2024-02-02 22:43:31 | ERROR | stderr | Exception in thread Thread-4:
2024-02-02 22:43:31 | ERROR | stderr | Traceback (most recent call last):
2024-02-02 22:43:31 | ERROR | stderr |   File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
2024-02-02 22:43:31 | ERROR | stderr |     self.run()
2024-02-02 22:43:31 | ERROR | stderr |   File "/usr/lib/python3.9/threading.py", line 892, in run
2024-02-02 22:43:31 | ERROR | stderr |     self._target(*self._args, **self._kwargs)
2024-02-02 22:43:31 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-02-02 22:43:31 | ERROR | stderr |     return func(*args, **kwargs)
2024-02-02 22:43:31 | ERROR | stderr |   File "/root/LLaVA/llava/model/language_model/llava_llama.py", line 125, in generate
2024-02-02 22:43:31 | ERROR | stderr |     ) = self.prepare_inputs_labels_for_multimodal(
2024-02-02 22:43:31 | ERROR | stderr |   File "/root/LLaVA/llava/model/llava_arch.py", line 157, in prepare_inputs_labels_for_multimodal
2024-02-02 22:43:31 | ERROR | stderr |     image_features = self.encode_images(concat_images)
2024-02-02 22:43:31 | ERROR | stderr |   File "/root/LLaVA/llava/model/llava_arch.py", line 141, in encode_images
2024-02-02 22:43:31 | ERROR | stderr |     image_features = self.get_model().get_vision_tower()(images)
2024-02-02 22:43:31 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2024-02-02 22:43:31 | ERROR | stderr |     return forward_call(*args, **kwargs)
2024-02-02 22:43:31 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/accelerate/hooks.py", line 165, in new_forward
2024-02-02 22:43:31 | ERROR | stderr |     output = old_forward(*args, **kwargs)
2024-02-02 22:43:31 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-02-02 22:43:31 | ERROR | stderr |     return func(*args, **kwargs)
2024-02-02 22:43:31 | ERROR | stderr |   File "/root/LLaVA/llava/model/multimodal_encoder/clip_encoder.py", line 54, in forward
2024-02-02 22:43:31 | ERROR | stderr |     image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
2024-02-02 22:43:31 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2024-02-02 22:43:31 | ERROR | stderr |     return forward_call(*args, **kwargs)
2024-02-02 22:43:31 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/accelerate/hooks.py", line 160, in new_forward
2024-02-02 22:43:31 | ERROR | stderr |     args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
2024-02-02 22:43:31 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/accelerate/hooks.py", line 290, in pre_forward
2024-02-02 22:43:31 | ERROR | stderr |     return send_to_device(args, self.execution_device), send_to_device(
2024-02-02 22:43:31 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/accelerate/utils/operations.py", line 151, in send_to_device
2024-02-02 22:43:31 | ERROR | stderr |     return honor_type(
2024-02-02 22:43:31 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/accelerate/utils/operations.py", line 83, in honor_type
2024-02-02 22:43:31 | ERROR | stderr |     return type(obj)(generator)
2024-02-02 22:43:31 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/accelerate/utils/operations.py", line 152, in <genexpr>
2024-02-02 22:43:31 | ERROR | stderr |     tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
2024-02-02 22:43:31 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/accelerate/utils/operations.py", line 167, in send_to_device
2024-02-02 22:43:31 | ERROR | stderr |     return tensor.to(device, non_blocking=non_blocking)
2024-02-02 22:43:31 | ERROR | stderr | NotImplementedError: Cannot copy out of meta tensor; no data!
haotian-liu commented 6 months ago

@samidten

Can you please share your package list (pip list) as well as the command you use to run inference? Thank you.

samidten commented 6 months ago

python3 -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:20001 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.6-34b

python3 -m llava.serve.gradio_web_server --controller http://localhost:20001 --model-list-mode reload --port 8000

Inference was done via the Gradio site.

# pip list
Package                   Version
------------------------- ------------
accelerate                0.21.0
aiofiles                  23.2.1
aiohttp                   3.9.1
aiosignal                 1.3.1
altair                    5.2.0
anyio                     4.2.0
async-timeout             4.0.3
attrs                     23.2.0
bitsandbytes              0.41.0
certifi                   2023.11.17
charset-normalizer        3.3.2
click                     8.1.7
cmake                     3.28.1
contourpy                 1.2.0
cycler                    0.12.1
dbus-python               1.2.16
distro-info               1.0
einops                    0.6.1
einops-exts               0.0.4
exceptiongroup            1.2.0
fastapi                   0.109.0
ffmpy                     0.3.1
filelock                  3.13.1
fonttools                 4.47.2
frozenlist                1.4.1
fsspec                    2023.12.2
gradio                    3.35.2
gradio-client             0.2.9
h11                       0.14.0
httpcore                  0.17.3
httpx                     0.24.0
huggingface-hub           0.20.3
idna                      3.6
importlib-resources       6.1.1
Jinja2                    3.1.3
joblib                    1.3.2
jsonschema                4.21.1
jsonschema-specifications 2023.12.1
kiwisolver                1.4.5
linkify-it-py             2.0.2
lit                       17.0.6
llava                     1.2.0
markdown-it-py            2.2.0
markdown2                 2.4.12
MarkupSafe                2.1.4
matplotlib                3.8.2
mdit-py-plugins           0.3.3
mdurl                     0.1.2
mercurial                 5.6.1
mpmath                    1.3.0
multidict                 6.0.4
networkx                  3.2.1
numpy                     1.26.3
nvidia-cublas-cu11        11.10.3.66
nvidia-cuda-cupti-cu11    11.7.101
nvidia-cuda-nvrtc-cu11    11.7.99
nvidia-cuda-runtime-cu11  11.7.99
nvidia-cudnn-cu11         8.5.0.96
nvidia-cufft-cu11         10.9.0.58
nvidia-curand-cu11        10.2.10.91
nvidia-cusolver-cu11      11.4.0.1
nvidia-cusparse-cu11      11.7.4.91
nvidia-nccl-cu11          2.14.3
nvidia-nvtx-cu11          11.7.91
orjson                    3.9.12
packaging                 23.2
pandas                    2.2.0
peft                      0.4.0
pillow                    10.2.0
pip                       20.3.4
psutil                    5.9.8
pycurl                    7.43.0.6
pydantic                  1.10.14
pydub                     0.25.1
pygments                  2.17.2
PyGObject                 3.38.0
pyparsing                 3.1.1
python-apt                2.2.1
python-dateutil           2.8.2
python-multipart          0.0.6
pytz                      2023.3.post1
PyYAML                    6.0.1
referencing               0.32.1
regex                     2023.12.25
requests                  2.31.0
rpds-py                   0.17.1
safetensors               0.4.2
scikit-learn              1.2.2
scipy                     1.12.0
semantic-version          2.10.0
sentencepiece             0.1.99
setuptools                52.0.0
shortuuid                 1.0.11
six                       1.16.0
sniffio                   1.3.0
starlette                 0.35.1
svgwrite                  1.4.3
sympy                     1.12
threadpoolctl             3.2.0
timm                      0.6.13
tokenizers                0.15.0
toolz                     0.12.1
torch                     2.0.1
torchvision               0.15.2
tqdm                      4.66.1
transformers              4.36.2
triton                    2.0.0
typing-extensions         4.9.0
tzdata                    2023.4
uc-micro-py               1.0.2
unattended-upgrades       0.1
urllib3                   2.1.0
uvicorn                   0.27.0
wavedrom                  2.0.3.post3
websockets                12.0
wheel                     0.34.2
yarl                      1.9.4
zipp                      3.17.0
haotian-liu commented 6 months ago

@samidten

I just tried building the env from scratch and using the same command as yours, and I do not hit the issue. I've also tried using 4x 3090s to serve llava-v1.6-34b, as well as using 1x 3090 to serve it in 4-bit mode. Both work.

Can you confirm the hardware you are working with and how much VRAM you have?

I would recommend re-building the env from scratch as well. Also, you can check whether 4-bit works: add --load-4bit to the end of your command.
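
(Concretely, that is the worker command from above with the flag appended:)

python3 -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:20001 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.6-34b --load-4bit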

samidten commented 6 months ago

I tried with a clean venv but still hit the same issue. 4-bit works with 2 cards... let's see if others have a similar issue with multiple GPUs on the latest branch.

# nvidia-smi
Sat Feb  3 00:21:05 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A30                     On  | 00000000:17:00.0 Off |                    0 |
| N/A   28C    P0              31W / 165W |  22533MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A30                     On  | 00000000:CA:00.0 Off |                    0 |
| N/A   30C    P0              31W / 165W |  23733MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
haotian-liu commented 6 months ago

@samidten

This should be related to an OOM issue. The 34B model requires at least 80GB of VRAM to serve, and given your GPUs (2x 24GB A30s), it may be better to just use them with 4-bit quantization.
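
(A rough back-of-the-envelope sketch of the numbers, assuming 2 bytes per parameter in 16-bit and 0.5 bytes in 4-bit; overhead for activations and KV cache comes on top:)

# Rough VRAM estimate for a 34B-parameter model (not an exact accounting).
params = 34e9
fp16_weights_gib = params * 2 / 1024**3    # ~63 GiB just for the 16-bit weights
int4_weights_gib = params * 0.5 / 1024**3  # ~16 GiB for 4-bit weights
print(f"fp16: ~{fp16_weights_gib:.0f} GiB, 4-bit: ~{int4_weights_gib:.0f} GiB")
# Two 24 GiB A30s (~48 GiB total) cannot hold the fp16 weights,
# which is consistent with 4-bit working and fp16 failing above.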

LumenYoung commented 6 months ago

Sorry to reopen this issue, but I have the same problem again, and I'm sure my machine is not OOM, since I previously ran the 13B model smoothly on it.

The problem occurs here: your patch only added the .to(device) call in the other branch.

I'm using LLaVA 1.6 13B at 4-bit quantization, and it is quite interesting that I failed to change the device of self.model.image_newline with the .to method:

(Pdb) print(self.model.image_newline.to("cuda"))
tensor([-0.0226, -0.0078, -0.0162,  ..., -0.0112,  0.0264, -0.0170],
       device='cuda:1', dtype=torch.float16)
(Pdb) p self.model.image_newline.device
device(type='cuda', index=0)
(Pdb) torch.cat((image_feature,self.model.image_newline[None]),dim=0)
*** RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument tensors in method wrapper_CUDA_cat)

I would like to seek your advice on how to fix this. @haotian-liu
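
(For context, the failing concatenation can be made device-safe by moving the parameter to the feature's device at the point of use. A minimal sketch of that pattern, not necessarily the exact change in the fix that follows:)

                        if 'unpad' in mm_patch_merge_type:
                            image_feature = torch.cat((
                                image_feature,
                                # use the RETURNED tensor: .to() does not move in place
                                self.model.image_newline[None].to(image_feature.device)
                            ), dim=0)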

haotian-liu commented 6 months ago

Thanks for reporting. Does https://github.com/haotian-liu/LLaVA/commit/b42a13d14b6118381a667430a5b8c50f9790dee3 fix it?

LumenYoung commented 6 months ago

> Thanks for reporting. Does b42a13d fix it?

Thanks a lot, Haotian. Yes, this fixed my problem. Interestingly though, my previous patch was only a bit different from yours:

                        if 'unpad' in mm_patch_merge_type:
                            self.model.image_newline.to(image_feature.device)
                            image_feature = torch.cat((
                                image_feature,
                                self.model.image_newline[None]
                            ), dim=0)

which was pretty similar, in my opinion, but it didn't fix it; the device problem still persisted after my patch. Do you know if there is a reason behind this different behavior?

haotian-liu commented 6 months ago

tensor_on_device = tensor.to(device)

.to is not an in-place operator.
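
(A minimal standalone illustration of that distinction, assuming one CUDA device is available:)

import torch

t = torch.randn(3)     # created on the CPU
t.to("cuda:0")         # returns a NEW tensor on cuda:0; t itself is unchanged
print(t.device)        # still cpu
t = t.to("cuda:0")     # rebind the name to the returned tensor
print(t.device)        # now cuda:0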

LumenYoung commented 6 months ago

> tensor_on_device = tensor.to(device)
>
> .to is not an in-place operator.

Thanks for your prompt reply. Okay, I should have remembered that :).

shllgtca commented 4 months ago

Hi!

I've been having the same problem. Like others, I worked around it by using just one GPU, although there are 2 available. Could you help me out with how to run on multiple GPUs?

My workaround:

import os
os.environ["PYTORCH_USE_CUDA_DSA"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "TRUE"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import sys
sys.path.append('/home/experiment/volume/rag_multimodal/main/models/LLaVA')
print(sys.path)

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.eval.run_llava import eval_model
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
import cv2
import torch
torch.cuda.empty_cache()
from PIL import Image

def llavaRunner():
    # cv2 loads BGR; convert to RGB before handing the array to PIL
    image = cv2.imread("/home/experiment/volume/rag_multimodal/main/models/DeepSeek-VL/images/training_pipelines.jpg")
    images = [Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))]

    model_path = "liuhaotian/llava-v1.5-7b"
    # workaround: pin everything to a single GPU
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path=model_path,
        model_base=None,
        model_name=get_model_name_from_path(model_path),
        device=device,
    )

    # the prompt must contain the image token so that tokenizer_image_token
    # can splice IMAGE_TOKEN_INDEX into the input ids
    prompt = DEFAULT_IMAGE_TOKEN + '\ndescribe the image'
    image_sizes = [x.size for x in images]
    images_tensor = process_images(
        images,
        image_processor,
        model.config
    ).to(model.device, dtype=torch.float16)

    input_ids = (
        tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
        .unsqueeze(0)
        .to(device)
    )

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=images_tensor,
            image_sizes=image_sizes,
            do_sample=True,  # True if temperature > 0
            temperature=0.2,
            top_p=None,
            num_beams=1,
            max_new_tokens=512,
            use_cache=True,
        )

    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
    print(outputs)


I also have a recent transformers version, which you can check in my pip freeze:

# pip freeze
accelerate==0.21.0
aiofiles==23.2.1
altair==5.3.0
annotated-types==0.6.0
anyio==4.3.0
asttokens==2.4.1
attrs==23.2.0
backcall==0.2.0
bitsandbytes==0.43.1
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
comm==0.2.2
contourpy==1.1.1
cycler==0.12.1
debugpy==1.8.1
decorator==5.1.1
einops==0.6.1
einops-exts==0.0.4
exceptiongroup==1.2.0
executing==2.0.1
fastapi==0.110.1
ffmpy==0.3.2
filelock==3.13.4
fonttools==4.51.0
fsspec==2024.3.1
gradio==4.16.0
gradio_client==0.8.1
h11==0.14.0
httpcore==0.17.3
httpx==0.24.0
huggingface-hub==0.22.2
idna==3.7
importlib_metadata==7.1.0
importlib_resources==6.4.0
ipykernel==6.29.4
ipython==8.12.3
jedi==0.19.1
Jinja2==3.1.3
joblib==1.4.0
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
jupyter_client==8.6.1
jupyter_core==5.7.2
kiwisolver==1.4.5
# Editable Git install with no remote (llava==1.2.2.post1)
-e /home/experiment/volume/rag_multimodal/main/models/LLaVA
markdown-it-py==3.0.0
markdown2==2.4.13
MarkupSafe==2.1.5
matplotlib==3.7.5
matplotlib-inline==0.1.7
mdurl==0.1.2
mpmath==1.3.0
nest-asyncio==1.6.0
networkx==3.1
numpy==1.24.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.1.105
opencv-python==4.9.0.80
orjson==3.10.0
packaging==24.0
pandas==2.0.3
parso==0.8.4
peft==0.10.0
pexpect==4.9.0
pickleshare==0.7.5
pillow==10.3.0
pkgutil_resolve_name==1.3.10
platformdirs==4.2.0
prompt-toolkit==3.0.43
protobuf==5.26.1
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
pydantic==2.7.0
pydantic_core==2.18.1
pydub==0.25.1
Pygments==2.17.2
pyparsing==3.1.2
python-dateutil==2.9.0.post0
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
pyzmq==26.0.0
referencing==0.34.0
regex==2023.12.25
requests==2.31.0
rich==13.7.1
rpds-py==0.18.0
ruff==0.3.7
safetensors==0.4.3
scikit-learn==1.2.2
scipy==1.10.1
semantic-version==2.10.0
sentencepiece==0.1.99
shellingham==1.5.4
shortuuid==1.0.13
six==1.16.0
sniffio==1.3.1
stack-data==0.6.3
starlette==0.37.2
svgwrite==1.4.3
sympy==1.12
threadpoolctl==3.4.0
timm==0.6.13
tokenizers==0.15.1
tomlkit==0.12.0
toolz==0.12.1
torch==2.1.2
torchvision==0.16.2
tornado==6.4
tqdm==4.66.2
traitlets==5.14.2
transformers==4.37.2
triton==2.1.0
typer==0.12.3
typing_extensions==4.11.0
tzdata==2024.1
urllib3==2.2.1
uvicorn==0.29.0
wavedrom==2.0.3.post3
wcwidth==0.2.13
websockets==11.0.3
zipp==3.18.1
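
(A possible direction for the multi-GPU question above: LLaVA's load_pretrained_model forwards a device_map argument to Hugging Face accelerate, so letting it shard the model automatically may work once the device-placement fixes discussed earlier are in. A minimal sketch, assuming the builder's current signature; not a verified fix:)

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
    device_map="auto",  # let accelerate place layers across all visible GPUs
)
# When the model is sharded, accelerate records the placement here:
print(getattr(model, "hf_device_map", "single device"))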