huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Accelerate load_checkpoint_and_dispatch - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1 #2897

Closed: adarsh-ks closed this issue 3 months ago

adarsh-ks commented 3 months ago

Hello,

I am facing the issue below while trying to load Meta's Llama 2 13B-chat model.

Error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1

My code is below. model_dir contains the model weights downloaded from Meta for 'llama-2-13b-chat'.

from accelerate import init_empty_weights, load_checkpoint_and_dispatch
import torch
import transformers
from transformers import LlamaForCausalLM, LlamaTokenizer

model_dir = "/opt/ml/model"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Instantiate the model structure without allocating weight memory
with init_empty_weights():
    model = LlamaForCausalLM.from_pretrained(model_dir)

# Load the real weights and shard them across the three GPUs
loaded_model = load_checkpoint_and_dispatch(
    model, checkpoint=model_dir, device_map="auto", offload_folder="/tmp/ml/model", max_memory={0: "10GiB", 1: "10GiB", 2: "10GiB"}
)

tokenizer = LlamaTokenizer.from_pretrained(model_dir)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    framework="pt"
)

sequences = pipeline(
    'Recipe for Cheese Pizza\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=400,
    max_new_tokens=64,  # takes precedence over max_length when both are set
    return_full_text=False,
    top_p=0.9,
    temperature=0.6
)
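
For context, a minimal sketch of how one might inspect the split that accelerate computes for this model before dispatching. infer_auto_device_map is accelerate's helper for exactly this; the max_memory budget mirrors the call above, and the empty-weight model from the snippet above is assumed to be in scope:

from accelerate import infer_auto_device_map

# Prints which submodule lands on which device, i.e. the point where
# cuda:0 hands off to cuda:1 in the error message
device_map = infer_auto_device_map(
    model, max_memory={0: "10GiB", 1: "10GiB", 2: "10GiB"}
)
print(device_map)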
adarsh-ks commented 3 months ago

I tried adding the line of code below after invoking load_checkpoint_and_dispatch, but I am getting a new error:

loaded_model = loaded_model.to(device)

RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.

SunMarc commented 3 months ago

Hi @adarsh-ks, thanks for reporting! Could you share the traceback you get? Also, note that you can do exactly what you are doing in one line: model = LlamaForCausalLM.from_pretrained(model_dir, device_map="auto")
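
Spelled out, that one-liner would look roughly like the sketch below; the torch_dtype argument is an addition here, mirroring the fp16 dtype the original pipeline call requests:

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_dir = "/opt/ml/model"

# from_pretrained with device_map="auto" performs the empty-weight init,
# checkpoint loading, and dispatch internally in one call
model = LlamaForCausalLM.from_pretrained(
    model_dir, device_map="auto", torch_dtype=torch.float16
)
tokenizer = LlamaTokenizer.from_pretrained(model_dir)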

adarsh-ks commented 3 months ago

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 882, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/opt/program/predictor.py", line 68, in generate_response
    sequences = pipeline(
  File "/usr/local/lib/python3.9/site-packages/transformers/pipelines/text_generation.py", line 219, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1162, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/usr/local/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1169, in run_single
    model_outputs = self.forward(model_inputs, **forward_params)
  File "/usr/local/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1068, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/usr/local/lib/python3.9/site-packages/transformers/pipelines/text_generation.py", line 295, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/usr/local/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/generation/utils.py", line 1520, in generate
    return self.sample(
  File "/usr/local/lib/python3.9/site-packages/transformers/generation/utils.py", line 2617, in sample
    outputs = self(
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 1183, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 1070, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 813, in forward
    hidden_states = residual + hidden_states
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

adarsh-ks commented 3 months ago

WARNING:root:Some parameters are on the meta device device because they were offloaded to the disk.
WARNING:accelerate.big_modeling:You shouldn't move a model when it is dispatched on multiple devices.

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 882, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/opt/program/predictor.py", line 59, in generate_response
    loaded_model = loaded_model.to(device)
  File "/usr/local/lib/python3.9/site-packages/accelerate/big_modeling.py", line 428, in wrapper
    raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.

169.254.178.2 - - [27/Jun/2024:00:17:23 +0000] "POST /invocations HTTP/1.1" 500 265 "-" "AHC/2.0"
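
Given this error, the .to(device) call should simply be dropped rather than worked around. A hedged sketch of what the relevant part of predictor.py might look like instead, reusing model and model_dir from the original snippet:

from accelerate import load_checkpoint_and_dispatch

# After load_checkpoint_and_dispatch, accelerate hooks already route each
# submodule to its assigned device, so no manual move is needed
loaded_model = load_checkpoint_and_dispatch(
    model, checkpoint=model_dir, device_map="auto", offload_folder="/tmp/ml/model"
)
# Do NOT call loaded_model.to(device): moving a dispatched model raises
# "You can't move a model that has some modules offloaded to cpu or disk."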

adarsh-ks commented 3 months ago

> Hi @adarsh-ks, thanks for reporting! Could you share the traceback you get? Also, note that you can do exactly what you are doing in one line: model = LlamaForCausalLM.from_pretrained(model_dir, device_map="auto")

@SunMarc, I have posted the stack traces in the comments above.

adarsh-ks commented 3 months ago

Also, I am using init_empty_weights() and load_checkpoint_and_dispatch() from accelerate to get GPU acceleration.

Without them, the model was not using the GPU and I was getting the error below:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 882, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/opt/program/predictor.py", line 48, in generate_response
    pipeline = transformers.pipeline(
  File "/usr/local/lib/python3.9/site-packages/transformers/pipelines/__init__.py", line 1070, in pipeline
    return pipeline_class(model=model, framework=framework, task=task, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/pipelines/text_generation.py", line 70, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/pipelines/base.py", line 840, in __init__
    self.model.to(self.device)
  File "/usr/local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2597, in to
    return super().to(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 989, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 664, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 987, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB (GPU 0; 22.20 GiB total capacity; 21.39 GiB already allocated; 70.12 MiB free; 21.39 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
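
The OOM message itself points at two levers. A hedged sketch combining them; the max_split_size_mb value is illustrative, and the environment variable must be set before the first CUDA allocation in the process:

import os

# Suggested by the error text: cap allocator split sizes to reduce
# fragmentation. Must be in the environment before torch touches the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
from transformers import LlamaForCausalLM

# Loading in float16 instead of the default float32 roughly halves the
# weight footprint of a 13B model (~52 GB -> ~26 GB), and device_map="auto"
# spreads the rest across GPUs instead of cramming one 22 GiB card
model = LlamaForCausalLM.from_pretrained(
    "/opt/ml/model", torch_dtype=torch.float16, device_map="auto"
)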

adarsh-ks commented 3 months ago

Can I get support for this issue?

SunMarc commented 3 months ago

Hi @adarsh-ks, since you are using pipeline, please follow the tutorial here. You shouldn't need to use init_empty_weights() and load_checkpoint_and_dispatch().

adarsh-ks commented 3 months ago

@SunMarc, I am getting a response with the code you shared. However, it takes more than 3 minutes to get a response, and the application I am trying to build times out because its response time limit is set to 60 seconds. It seems GPU acceleration is not happening. Do you have any suggestions for improving the performance?

Please find below my updated code.

import torch
import transformers
from transformers import LlamaForCausalLM, LlamaTokenizer

model_dir = "/opt/ml/model"

# NB: this loads the full model on CPU in float32; the device_map="auto"
# passed to pipeline() below may not reshard an already-instantiated model
model = LlamaForCausalLM.from_pretrained(model_dir)

tokenizer = LlamaTokenizer.from_pretrained(model_dir)

pipeline = transformers.pipeline(
  "text-generation",
  model=model,
  tokenizer=tokenizer,
  torch_dtype=torch.float16,
  device_map="auto",
  framework="pt"
)

sequences = pipeline(
  'Recipe for Cheese Pizza\n',
  do_sample=True,
  top_k=10,
  num_return_sequences=1,
  eos_token_id=tokenizer.eos_token_id,
  max_length=400,
  max_new_tokens=64,
  return_full_text=False,
  top_p=0.9,
  temperature=0.6
)

result=""
for seq in sequences:
  result += seq['generated_text']
  print(f"{seq['generated_text']}")

print(result)
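
One hedged way to check the suspicion that the GPU is idle, assuming the pipeline object above is in scope; hf_device_map is the attribute transformers sets only when a model is loaded with a device_map:

# If this prints "cpu", from_pretrained loaded the model on CPU in float32,
# which would explain the 3-minute responses
print(next(pipeline.model.parameters()).device)

# Per-module placement recorded when loading with a device_map;
# None here means no dispatch ever happened
print(getattr(pipeline.model, "hf_device_map", None))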
SunMarc commented 3 months ago

Hey @adarsh-ks, could you try the following? You don't need to initialize the model and the tokenizer beforehand:

import torch
import transformers

model_dir = "/opt/ml/model"

# Passing the checkpoint path (not a model object) lets pipeline() load the
# weights itself, in float16 and sharded across the available GPUs
pipeline = transformers.pipeline(
  "text-generation",
  model=model_dir,
  torch_dtype=torch.float16,
  device_map="auto",
  framework="pt"
)

sequences = pipeline(
  'Recipe for Cheese Pizza\n',
  do_sample=True,
  top_k=10,
  num_return_sequences=1,
  eos_token_id=pipeline.tokenizer.eos_token_id,  # tokenizer now lives on the pipeline
  max_length=400,
  max_new_tokens=64,
  return_full_text=False,
  top_p=0.9,
  temperature=0.6
)

result=""
for seq in sequences:
  result += seq['generated_text']
  print(f"{seq['generated_text']}")

print(result)
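
To confirm the fix took effect, a hedged one-liner; since the checkpoint path is passed directly, the pipeline loads the model itself and transformers records the placement:

# Expect a per-layer map over cuda:0, cuda:1, ... rather than "cpu"
print(pipeline.model.hf_device_map)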
adarsh-ks commented 3 months ago

Thank you @SunMarc. The above fix solved the issue, and I am now getting responses in under 1 minute. Closing the issue as resolved.

SunMarc commented 3 months ago

Glad that it fixed the issue!