salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
BSD 3-Clause "New" or "Revised" License

Load InstructBLIP Across Multiple GPUs? #321

Open petermcc0807 opened 1 year ago

petermcc0807 commented 1 year ago

Hi,

Is it possible to load InstructBLIP (Vicuna 13B) across multiple (e.g. 4x16GB) GPUs?

LLaVA (which also uses Vicuna 13B) lets you specify the number of GPUs. InstructBLIP's load_model_and_preprocess() doesn't appear to support this, from what I can tell.

Thanks!

petermcc0807 commented 1 year ago

Hi Hangyu,

Many thanks for your help. Based on your sample code (for Vicuna 13B), I have the following:

import sys
import torch
from lavis.models import load_model_and_preprocess
from accelerate import infer_auto_device_map, dispatch_model

device = torch.device('cpu')

model, vis_processors, txt_processors = load_model_and_preprocess(name='blip2_vicuna_instruct', model_type='vicuna13b', is_eval=True, device=device)

# device_map = infer_auto_device_map(model)

# Based on my 4x16GB GPU instance
max_memory = { 0: '13GiB', 1: '13GiB', 2: '13GiB', 3: '13GiB', 'cpu': '32GiB' }

device_map = infer_auto_device_map(model, max_memory=max_memory)
model = dispatch_model(model, device_map=device_map)
torch.cuda.empty_cache()

model.eval()

if torch.__version__ >= '2' and sys.platform != 'win32':
    model = torch.compile(model)

device = torch.device('cuda')

# `image` is a PIL image loaded earlier
image = vis_processors['eval'](image).unsqueeze(0).to(device)

Everything works fine. However, when I perform an inference:

model.generate({ 'image': image, 'prompt': 'What is unusual about this image?' })

I get the following error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA_gather)

How can I fix this? Thanks again.

pyogher commented 1 year ago

Hi there,

I'm glad to see that the loading issue has been resolved. As you've noticed, this new error arises from a discrepancy between an intermediate variable and the model, which are not located on the same device. This could be attributed to the accelerate library not being fully compatible with all the latest models. To tackle this kind of problem, I recommend the following:

petermcc0807 commented 1 year ago

This is great. Thanks again for your help, Hangyu.

jun297 commented 1 year ago

@petermcc0807 @pyogher Hi, may I get the code? I'm having trouble with the same issue as well. It would be a great help.

petermcc0807 commented 1 year ago

Hi @jun297,

Unfortunately, I could not get this to work. I tried everything, even manually shifting around device map entries - for example:

device_map = infer_auto_device_map(model)

device_map['llm_proj'] = 0
device_map['llm_model.model.norm'] = 0
device_map['llm_model.lm_head'] = 0
...

Nothing worked, so I gave up. I am now using LLaVA with Vicuna 13B instead. Sorry I could not be of more help.

jun297 commented 1 year ago

Thanks for the reply @petermcc0807. I succeeded with LLaVA too, but this one is tricky; I failed as well.

pyogher commented 1 year ago

Hi! This is my previous code for loading InstructBLIP, and it works fine for me. You can give it a try.

import re
import subprocess
import sys

import torch
from lavis.models import load_model_and_preprocess
from accelerate import infer_auto_device_map, dispatch_model

def get_gpu_with_max_free_memory():
    command = "nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits"
    output = subprocess.check_output(command.split(), universal_newlines=True)
    gpu_memory_info = output.strip().split("\n")
    gpu_memory_info = [int(re.findall(r"\d+", info)[0]) for info in gpu_memory_info]
    gpu_with_max_free_memory = gpu_memory_info.index(max(gpu_memory_info))
    return gpu_with_max_free_memory

def get_blip_model(device='cuda', dtype=torch.bfloat16, use_multi_gpus=False):
    cuda_number = get_gpu_with_max_free_memory()
    print(f'cuda number: {cuda_number}')
    model, vis_processors, txt_processors = load_model_and_preprocess(
                    name='blip2_vicuna_instruct',
                    model_type='vicuna7b',
                    is_eval=True,
                    device=f'cuda:{cuda_number}',
                )
    if use_multi_gpus:
        device_map = infer_auto_device_map(model, max_memory={1: "3GiB", 2: "3GiB", 3: "4GiB", 4: "4GiB", 5: "4GiB"}, no_split_module_classes=['LlamaDecoderLayer', 'VisionTransformer'])
        device_map['llm_model.model.embed_tokens'] = device_map['llm_model.lm_head'] = device_map['llm_proj'] = 1
        print(device_map)
        model = dispatch_model(model, device_map=device_map)
        torch.cuda.empty_cache()
    model.eval()
    if torch.__version__ >= "2" and sys.platform != "win32":
        model = torch.compile(model)
    return model, (txt_processors, vis_processors)
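
For completeness, a hypothetical usage sketch for the function above (the image path and prompt are placeholders, and it assumes the visual encoder and llm_proj ended up on cuda:1 in the printed device_map; adjust to whatever your map shows):

import torch
from PIL import Image

model, (txt_processors, vis_processors) = get_blip_model(use_multi_gpus=True)

raw_image = Image.open('example.jpg').convert('RGB')  # placeholder image
image = vis_processors['eval'](raw_image).unsqueeze(0).to('cuda:1')  # same GPU as the visual encoder / llm_proj

with torch.no_grad():
    answer = model.generate({'image': image, 'prompt': 'What is unusual about this image?'})
print(answer)
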
jun297 commented 1 year ago

@pyogher Thank you!

joskfg commented 1 year ago

@pyogher I was not able to load it on 2x16GB GPUs. What was your setup?

pyogher commented 1 year ago

@joskfg Hi, you can use the following code:

def get_blip_model(device='cuda', dtype=torch.bfloat16, use_multi_gpus=True):
    model, vis_processors, txt_processors = load_model_and_preprocess(
                    name='blip2_vicuna_instruct',
                    model_type='vicuna7b',
                    is_eval=True,
                )
    model.to(dtype)
    if use_multi_gpus:
        device_map = infer_auto_device_map(model, max_memory={0: "8GiB", 1: "9GiB"}, no_split_module_classes=['LlamaDecoderLayer', 'VisionTransformer'])
        device_map['llm_model.model.embed_tokens'] = device_map['llm_model.lm_head'] = device_map['llm_proj'] = 1
        print(device_map)
        model = dispatch_model(model, device_map=device_map)
        torch.cuda.empty_cache()
    else:
        model.to('cuda:0')
    model.eval()
    if torch.__version__ >= "2" and sys.platform != "win32":
        model = torch.compile(model)
    return model, (txt_processors, vis_processors)
LixDemon commented 1 year ago

@pyogher Have you tried loading the vicuna13b version of InstructBLIP? I encountered the problem of tensors on different devices. Could you please release the code? That would be really helpful!!! Thanks a lot!!!

petermcc0807 commented 1 year ago

My original question was about Vicuna 13B (vicuna13b), not 7B.

I've tried all the code changes suggested previously and tried a range of different device_map configurations. Although I can load the 13B model, inference fails every time with errors.

I'm trying to run this on an AWS g4dn.12xlarge instance (4 GPUs, 64GB in total).

I can run LLaVA (also Vicuna 13B) on this instance.

pyogher commented 1 year ago

@LixDemon Hi,

I haven't used the 13B version of InstructBLIP myself, but if you're facing the "tensors on different devices" issue, it's likely the same root cause as with instructblip-vicuna-7b and other LVLMs.

Here are a couple of points to consider:

  1. I recommend checking the relevant source code, which should be in LAVIS/lavis/models/blip2_models and the modeling_llama.py file in the official transformers repository. You need to identify which classes in the model cannot be split. For example, in InstructBLIP, the LlamaDecoderLayer class should not be placed on different GPUs, as its residual connections may cause the "tensors on different devices" issue. So, for InstructBLIP, I suggest passing no_split_module_classes=['LlamaDecoderLayer', 'VisionTransformer'] to infer_auto_device_map, like this: device_map = infer_auto_device_map(model, max_memory={0: "13GiB", 1: "14GiB"}, no_split_module_classes=['LlamaDecoderLayer', 'VisionTransformer']).

  2. You should ensure that the LLM's projection layer, embedding layer, and lm_head are on the same device, so that the input and output data end up on the same device. Therefore, you need to set device_map['llm_model.lm_head'] = device_map['llm_proj'] = device_map['llm_model.model.embed_tokens'].

Based on these two principles, I guess the following code should correctly load the 13B model:

def get_blip_model(device='cuda', dtype=torch.bfloat16, use_multi_gpus=True):
    model, vis_processors, txt_processors = load_model_and_preprocess(
                    name='blip2_vicuna_instruct',
                    model_type='vicuna13b',
                    is_eval=True,
                )
    model.to(dtype)
    if use_multi_gpus:
        device_map = infer_auto_device_map(model, max_memory={0: "13GiB", 1: "14GiB"}, no_split_module_classes=['LlamaDecoderLayer', 'VisionTransformer'])
        device_map['llm_model.lm_head'] = device_map['llm_proj'] = device_map['llm_model.model.embed_tokens']
        print(device_map)
        model = dispatch_model(model, device_map=device_map)
        torch.cuda.empty_cache()
    else:
        model.to('cuda:0')
    model.eval()
    if torch.__version__ >= "2" and sys.platform != "win32":
        model = torch.compile(model)
    return model, (txt_processors, vis_processors)
saffie91 commented 1 year ago

I have managed to load the model using this code @pyogher, but now I get this error when I run model.generate:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument tensors in method wrapper_CUDA_cat)

Is it possible some other modules need to not be split? Or does it have something to do with the input?

petermcc0807 commented 1 year ago

@saffie91 I'm in the same boat as you.

Back to a previous comment of mine, I've tried all the code changes suggested and tried a range of different device_map configurations. Although I've managed to load the Vicuna 13B model across GPUs, inference fails every time with errors (i.e. "Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!"). And I've been trying to run this on an AWS g4dn.12xlarge instance (4 GPUs, 64GB in total).

InstructBLIP is now available over at Hugging Face. Although I can get the 13B model to perform inference, it's CPU only - so, it's sloooow. When I try the GPU route, inference fails with the same old errors.

Interestingly, LLaVA (a similar multimodal model which also uses Vicuna 13B) works fine across multiple GPUs.

pyogher commented 1 year ago

@saffie91 Hi!

There shouldn't be any other modules that need to stay unsplit, but you could try adding BertLMHeadModel (the Q-Former) to no_split_module_classes as well, as shown below. You can also print your device_map for us to take a look.
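
A sketch (the memory limits are just copied from the earlier example; adjust to your GPUs):

device_map = infer_auto_device_map(
    model,
    max_memory={0: "13GiB", 1: "14GiB"},
    no_split_module_classes=['LlamaDecoderLayer', 'VisionTransformer', 'BertLMHeadModel'],
)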

saffie91 commented 1 year ago

@pyogher Hi! Thanks for getting back to me so quickly!

I did add BertLMHeadModel as well and the results are the same.

I am using a g5.12xlarge on AWS, with 4 A10Gs.

this is my device map:

{'query_tokens': 1, 'visual_encoder': 1, 'ln_vision': 1, 'Qformer': 1, 'llm_model.model.embed_tokens': 1, 'llm_model.model.layers.0': 1, 'llm_model.model.layers.1': 1, 'llm_model.model.layers.2': 1, 'llm_model.model.layers.3': 1, 'llm_model.model.layers.4': 1, 'llm_model.model.layers.5': 1, 'llm_model.model.layers.6': 1, 'llm_model.model.layers.7': 1, 'llm_model.model.layers.8': 1, 'llm_model.model.layers.9': 1, 'llm_model.model.layers.10': 1, 'llm_model.model.layers.11': 2, 'llm_model.model.layers.12': 2, 'llm_model.model.layers.13': 2, 'llm_model.model.layers.14': 2, 'llm_model.model.layers.15': 2, 'llm_model.model.layers.16': 2, 'llm_model.model.layers.17': 2, 'llm_model.model.layers.18': 2, 'llm_model.model.layers.19': 2, 'llm_model.model.layers.20': 2, 'llm_model.model.layers.21': 2, 'llm_model.model.layers.22': 2, 'llm_model.model.layers.23': 2, 'llm_model.model.layers.24': 2, 'llm_model.model.layers.25': 2, 'llm_model.model.layers.26': 2, 'llm_model.model.layers.27': 3, 'llm_model.model.layers.28': 3, 'llm_model.model.layers.29': 3, 'llm_model.model.layers.30': 3, 'llm_model.model.layers.31': 3, 'llm_model.model.layers.32': 3, 'llm_model.model.layers.33': 3, 'llm_model.model.layers.34': 3, 'llm_model.model.layers.35': 3, 'llm_model.model.layers.36': 3, 'llm_model.model.layers.37': 3, 'llm_model.model.layers.38': 3, 'llm_model.model.layers.39': 3, 'llm_model.model.norm': 3, 'llm_model.lm_head': 1, 'llm_proj': 1}

pyogher commented 1 year ago

@saffie91 Hi!

I'm not sure, but I noticed that your 'llm_model.model.norm' is on GPU cuda:3. I suggest setting device_map['llm_model.lm_head'], device_map['llm_proj'], device_map['llm_model.model.norm'], and device_map['llm_model.model.embed_tokens'] to the same device. Additionally, please ensure that your input data is also on cuda:1.

If you still encounter this issue, I suggest modifying the source code based on the error. Generally, a simple modification that moves the offending tensor onto the same device may fix this bug.
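
For the first suggestion, that would look something like this before calling dispatch_model (1 here just mirrors the rest of your map; use whichever device your inputs live on):

device_map['llm_model.lm_head'] = 1
device_map['llm_proj'] = 1
device_map['llm_model.model.norm'] = 1
device_map['llm_model.model.embed_tokens'] = 1
model = dispatch_model(model, device_map=device_map)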

saffie91 commented 1 year ago

@pyogher this ended up with the same error.

The image is on cuda:1 but perhaps the prompt text is not? Would that be possible?

CCYChongyanChen commented 1 year ago

I came across this error: We need an offload_dir to dispatch this model according to this device_map, the following submodules need to be offloaded: llm_model.model.embed_tokens, llm_model.model.layers.39, llm_model.model.norm, llm_model.lm_head, llm_proj.


Adding the offload_dir parameter to the dispatch_model call solves the issue.
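
For reference, that looks like this (the folder name is arbitrary):

model = dispatch_model(model, device_map=device_map, offload_dir="offload")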

CCYChongyanChen commented 1 year ago

@pyogher this ended up with the same error.

The image is on cuda:1 but perhaps the prompt text is not? Would that be possible?

I also have this question: how do we load the prompt onto the same device? Also, since the model is converted to bfloat16, do we need to cast the text and image inputs to bfloat16 as well?

jameswan commented 1 year ago

Which file has load_model_and_preprocess()?

HaoranLv commented 1 year ago

I encountered the same problem. It does not seem to be an issue with Vicuna-13B but with the BLIP source code: on the same 4 GPUs with model parallelism, LLaVA works fine but BLIP does not.

Yuancheng-Xu commented 1 year ago

I also encountered the same issue (using the Hugging Face transformers version). Are there any updates on this?

jameswan commented 1 year ago

Can anyone get InstructBLIP to work using transformers from Python? I would be grateful if you shared the code.

This is mine

#!/usr/bin/env python

import os
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
import torch
from PIL import Image
import requests

model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b", load_in_4bit=True, torch_dtype=torch.float16)
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")

url = "https://raw.githubusercontent.com/salesforce/LAVIS/main/docs/_static/Confusing-Pictures.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

prompt = "What is unusual about this image?"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device="cuda", dtype=torch.float16)

outputs = model.generate(
        **inputs,
        num_beams=5,
        max_new_tokens=256,
        min_length=1,
        top_p=0.9,
        repetition_penalty=1.5,
        length_penalty=1.0,
        temperature=1,
)
outputs[outputs == 0] = 2 # this line can be removed once https://github.com/huggingface/transformers/pull/24492 is fixed
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)
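
For the multi-GPU case with the Hugging Face port, one thing worth trying (a sketch only, not verified on instructblip-vicuna-13b, and it may still hit the cross-device issues discussed above) is to let accelerate place the weights at load time with device_map="auto", reusing the image and prompt above:

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-13b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-13b",
    torch_dtype=torch.float16,
    device_map="auto",   # shard across all visible GPUs
)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device="cuda:0", dtype=torch.float16)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0].strip())
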
AKOrojo commented 1 year ago
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import (AutoModelForVision2Seq, InstructBlipConfig,
                          InstructBlipForConditionalGeneration, InstructBlipProcessor)

# Determine if CUDA (GPU) is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model configuration.
config = InstructBlipConfig.from_pretrained("Salesforce/instructblip-vicuna-13b")

# Initialize the model with the given configuration.
with init_empty_weights():
    model = AutoModelForVision2Seq.from_config(config)
    model.tie_weights()

# Infer device map based on the available resources.
device_map = infer_auto_device_map(model, max_memory={0: "30GiB", 1: "30GiB"},
                                   no_split_module_classes=['InstructBlipEncoderLayer', 'InstructBlipQFormerLayer',
                                                            'LlamaDecoderLayer'])
device_map['language_model.lm_head'] = device_map['language_projection'] = device_map['language_model.model.embed_tokens']

offload = ""
# Load the processor and model for image processing.
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-13b", device_map="auto")
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-13b",
                                                             device_map=device_map,
                                                             offload_folder=offload, offload_state_dict=True)
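
A hypothetical inference sketch to go with the snippet above (image and prompt are placeholders, and it assumes the inputs should sit on whichever GPU the map assigned to language_model.model.embed_tokens):

from PIL import Image

image = Image.open("example.jpg").convert("RGB")                # placeholder image
prompt = "What is unusual about this image?"
input_device = device_map['language_model.model.embed_tokens']  # e.g. 0 or 1
inputs = processor(images=image, text=prompt, return_tensors="pt").to(f"cuda:{input_device}")
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0].strip())
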
jameswan commented 1 year ago

NameError: name 'init_empty_weights' is not defined

The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
Loading checkpoint shards: 67%|████████████████████████████████████████▋ | 4/6 [00:31<00:15, 7.88s/it]
Traceback (most recent call last):
  File "C:\Users\james\test_instructblip_5.py", line 38, in <module>
    processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-13b", device_map="auto")
  File "C:\Users\james\blip_env\Lib\site-packages\transformers\modeling_utils.py", line 3010, in from_pretrained
    ) = cls._load_pretrained_model(
  File "C:\Users\james\blip_env\Lib\site-packages\transformers\modeling_utils.py", line 3388, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "C:\Users\james\blip_env\Lib\site-packages\transformers\modeling_utils.py", line 722, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "C:\Users\james\blip_env\Lib\site-packages\accelerate\utils\modeling.py", line 313, in set_module_tensor_to_device
    new_value = value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB (GPU 1; 23.99 GiB total capacity; 22.84 GiB already allocated; 0 bytes free; 22.84 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

# Determine if CUDA (GPU) is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model configuration.
config = InstructBlipConfig.from_pretrained("Salesforce/instructblip-vicuna-13b")

# Initialize the model with the given configuration.
with init_empty_weights():
    model = AutoModelForVision2Seq.from_config(config)
    model.tie_weights()

# Infer device map based on the available resources.
device_map = infer_auto_device_map(model, max_memory={0: "30GiB", 1: "30GiB"},
                                   no_split_module_classes=['InstructBlipEncoderLayer', 'InstructBlipQFormerLayer',
                                                            'LlamaDecoderLayer'])
device_map['language_model.lm_head'] = device_map['language_projection'] = device_map['language_model.model.embed_tokens']

offload = ""
# Load the processor and model for image processing.
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-13b", device_map="auto")
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-13b",
                                                             device_map=device_map,
                                                             offload_folder=offload, offload_state_dict=True)
AKOrojo commented 1 year ago

You need to adjust the max memory to match the set of GPUs you have (I have two 32GB GPUs). Also import the needed methods from accelerate.
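
The accelerate imports in question are:

from accelerate import infer_auto_device_map, init_empty_weights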

jameswan commented 1 year ago

You need to adjust the max memory to match the set of GPUs you have (I have two 32GB GPUs). Also import the needed methods from accelerate.

I have already done that. I have 2 X RTX 4090. See my code:

device_map = infer_auto_device_map(model, max_memory={0: "24GiB", 1: "24GiB"},
                                   no_split_module_classes=['InstructBlipEncoderLayer', 'InstructBlipQFormerLayer',
                                                            'LlamaDecoderLayer'])

I still get this error:

The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
Loading checkpoint shards: 67%|████████████████████████████████████████▋ | 4/6 [00:23<00:11, 5.77s/it]
Traceback (most recent call last):
  File "C:\Users\james\test_instructblip_5.py", line 39, in <module>
    model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-13b",
  File "C:\Users\james\blip_env\Lib\site-packages\transformers\modeling_utils.py", line 3010, in from_pretrained
    ) = cls._load_pretrained_model(
  File "C:\Users\james\blip_env\Lib\site-packages\transformers\modeling_utils.py", line 3388, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "C:\Users\james\blip_env\Lib\site-packages\transformers\modeling_utils.py", line 722, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "C:\Users\james\blip_env\Lib\site-packages\accelerate\utils\modeling.py", line 313, in set_module_tensor_to_device
    new_value = value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB (GPU 1; 23.99 GiB total capacity; 22.84 GiB already allocated; 0 bytes free; 22.84 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

AKOrojo commented 1 year ago

You need to declare it about 5 GiB below the max; it needs space for overhead.

AKOrojo commented 1 year ago

Also print the total model size; you might need to add a "cpu" entry to max_memory so the whole model can be allocated.
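
For example (the CPU figure is a placeholder; size it to your RAM):

device_map = infer_auto_device_map(
    model,
    max_memory={0: "19GiB", 1: "19GiB", "cpu": "64GiB"},
    no_split_module_classes=['InstructBlipEncoderLayer', 'InstructBlipQFormerLayer', 'LlamaDecoderLayer'],
)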

jameswan commented 12 months ago

@AKOrojo Thank you.

So I reduced the limits to: device_map = infer_auto_device_map(model, max_memory={0: "19GiB", 1: "19GiB"}, no_split_module_classes=['InstructBlipEncoderLayer', 'InstructBlipQFormerLayer', 'LlamaDecoderLayer'])

But now I get this error:

The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████| 6/6 [00:34<00:00, 5.73s/it]
You shouldn't move a model when it is dispatched on multiple devices.
Traceback (most recent call last):
  File "C:\Users\james\test_instructblip_5.py", line 43, in <module>
    model.to(device)
  File "C:\Users\james\blip_env\Lib\site-packages\accelerate\big_modeling.py", line 410, in wrapper
    raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.

AKOrojo commented 12 months ago

Remove the line that moves the model. The device map is what places the model now.
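
That is, in the script above (a sketch; the comment marks the only change needed):

# model.to(device)  # remove this: a model dispatched with a device_map must not be moved as a whole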

HaithemH commented 8 months ago

Based on these two principles, I guess the following code should correctly load the 13B model:

Same problem

HaithemH commented 8 months ago

Hi all,

This is my device map:

{'query_tokens': 0, 'visual_encoder': 0, 'ln_vision': 0, 'Qformer': 0, 't5_model.shared': 0, 't5_model.decoder.embed_tokens': 0, 't5_model.encoder.embed_tokens': 0, 't5_model.encoder.block.0': 0, 't5_model.encoder.block.1': 0, 't5_model.encoder.block.2': 0, 't5_model.encoder.block.3': 0, 't5_model.encoder.block.4': 0, 't5_model.encoder.block.5': 0, 't5_model.encoder.block.6': 0, 't5_model.encoder.block.7': 0, 't5_model.encoder.block.8.layer.0.SelfAttention.q': 0, 't5_model.encoder.block.8.layer.0.SelfAttention.k': 1, 't5_model.encoder.block.8.layer.0.SelfAttention.v': 1, 't5_model.encoder.block.8.layer.0.SelfAttention.o': 1, 't5_model.encoder.block.8.layer.0.layer_norm': 1, 't5_model.encoder.block.8.layer.0.dropout': 1, 't5_model.encoder.block.8.layer.1': 1, 't5_model.encoder.block.9': 1, 't5_model.encoder.block.10': 1, 't5_model.encoder.block.11': 1, 't5_model.encoder.block.12': 1, 't5_model.encoder.block.13': 1, 't5_model.encoder.block.14': 1, 't5_model.encoder.block.15': 1, 't5_model.encoder.block.16': 1, 't5_model.encoder.block.17': 1, 't5_model.encoder.block.18': 1, 't5_model.encoder.block.19': 1, 't5_model.encoder.block.20': 1, 't5_model.encoder.block.21': 1, 't5_model.encoder.block.22': 1, 't5_model.encoder.block.23': 1, 't5_model.encoder.final_layer_norm': 1, 't5_model.encoder.dropout': 1, 't5_model.decoder.block.0.layer.0': 1, 't5_model.decoder.block.0.layer.1': 1, 't5_model.decoder.block.1': 2, 't5_model.decoder.block.2': 2, 't5_model.decoder.block.3': 2, 't5_model.decoder.block.4': 2, 't5_model.decoder.block.5': 2, 't5_model.decoder.block.6': 2, 't5_model.decoder.block.7': 2, 't5_model.decoder.block.8': 2, 't5_model.decoder.block.9': 2, 't5_model.decoder.block.10': 2, 't5_model.decoder.block.11': 2, 't5_model.decoder.block.12.layer.0': 2, 't5_model.decoder.block.12.layer.1': 2, 't5_model.decoder.block.12.layer.2.DenseReluDense.wi_0': 2, 't5_model.decoder.block.12.layer.2.DenseReluDense.wi_1': 2, 't5_model.decoder.block.12.layer.2.DenseReluDense.wo': 3, 't5_model.decoder.block.12.layer.2.DenseReluDense.dropout': 3, 't5_model.decoder.block.12.layer.2.DenseReluDense.act': 3, 't5_model.decoder.block.12.layer.2.layer_norm': 3, 't5_model.decoder.block.12.layer.2.dropout': 3, 't5_model.decoder.block.13': 3, 't5_model.decoder.block.14': 3, 't5_model.decoder.block.15': 3, 't5_model.decoder.block.16': 3, 't5_model.decoder.block.17': 3, 't5_model.decoder.block.18': 3, 't5_model.decoder.block.19': 3, 't5_model.decoder.block.20': 3, 't5_model.decoder.block.21': 3, 't5_model.decoder.block.22': 3, 't5_model.decoder.block.23': 3, 't5_model.decoder.final_layer_norm': 3, 't5_model.decoder.dropout': 3, 't5_model.lm_head': 3, 't5_proj': 3, 't5_model.decoder.block.0.layer.2': 2}

I got this error:

inputs_embeds = torch.cat([inputs_t5, inputs_embeds], dim=1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument tensors in method wrapper___cat)

Any idea how to solve this? Thanks
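
Following the earlier suggestions in this thread, a few things stand out in that map: t5_model.encoder.block.8 and t5_model.decoder.block.12 are split across GPUs, and t5_proj / t5_model.lm_head sit on cuda:3 while the shared embedding sits on cuda:0, which matches the cuda:3 vs cuda:0 cat error. A hedged sketch of what could be tried (T5Block is the transformers class for a T5 encoder/decoder block; the memory limits are placeholders):

device_map = infer_auto_device_map(
    model,
    max_memory={0: "12GiB", 1: "12GiB", 2: "12GiB", 3: "12GiB"},
    no_split_module_classes=['T5Block', 'VisionTransformer', 'BertLMHeadModel'],
)
# Keep the projection, shared embedding, and lm_head on one device, as suggested above.
device_map['t5_proj'] = 0
device_map['t5_model.shared'] = 0
device_map['t5_model.encoder.embed_tokens'] = 0
device_map['t5_model.decoder.embed_tokens'] = 0
device_map['t5_model.lm_head'] = 0
model = dispatch_model(model, device_map=device_map)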