Closed kunaldeo closed 1 year ago
cc @Narsil @ArthurZucker @younesbelkada
This torch error usually comes from bad vocabulary ids being sent to the model.
What's odd is that is seems triggered by device_map="auto"
(it's the only modification with the working code, right ?)
Which shouldn't torch this in any way. I doubt the actual cause is the referered line by the stacktrace, even despite CUDA_LAUNCH_BLOCKING, but I could be wrong (and that would be super puzzling indeed).
Note: Afaik, if you launch on multiple GPUs in this way, you would be in PP (PIpeline Parallelism) and not in TP (TensorParallelism), TP being much better at getting better latencies (PP will get you larger batch sizes). https://github.com/huggingface/text-generation-inference might be better to reduce latencies by using multiple GPUs (NVLink presence will definitely be a factor)
Yes device_map="auto" is the only modification.
The other thing is that RTX 4090 doesnt support NVLink.
Could be linked to accelerate
here. At least I don't have good ideas to what might be happening here.
Re accelerate
- @pacman100 would you have any idea what might be causing this issue?
Hello, as this is related to the big model inference/device_map, @sgugger might have a better idea wrt this issue.
Please post a full reproducer. I don't have access to"
/models/wizard-vicuna-13B-HF
get_prompt
functionparse_text
functionUsing "huggyllama/llama-13b"
on my side and removing get_prompt
and parse_text
works fine on two GPUs.
I have removed all the other code and it is just now the following. I am still getting the same error.
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig, pipeline
import torch
import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
model_name = "huggyllama/llama-13b"
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name,
device_map='auto',
torch_dtype=torch.float16,
)
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
max_length=512,
temperature=0.7,
top_p=0.95,
repetition_penalty=1.15
)
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
raw_output = pipe("Hi how are you")
Error:
[0,0,0], thread: [40,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [41,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [42,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [43,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [44,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [45,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [46,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [47,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [48,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [49,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [50,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [51,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [52,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [53,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [54,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [55,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [56,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [57,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [58,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [59,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [60,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [61,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [62,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [63,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[5], line 1
----> 1 raw_output = pipe("Hi how are you")
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/pipelines/text_generation.py:201, in TextGenerationPipeline.__call__(self, text_inputs, **kwargs)
160 def __call__(self, text_inputs, **kwargs):
161 """
162 Complete the prompt(s) given as inputs.
163
(...)
199 ids of the generated text.
200 """
--> 201 return super().__call__(text_inputs, **kwargs)
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/pipelines/base.py:1118, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
1110 return next(
1111 iter(
1112 self.get_iterator(
(...)
1115 )
1116 )
1117 else:
-> 1118 return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/pipelines/base.py:1125, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
1123 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
1124 model_inputs = self.preprocess(inputs, **preprocess_params)
-> 1125 model_outputs = self.forward(model_inputs, **forward_params)
1126 outputs = self.postprocess(model_outputs, **postprocess_params)
1127 return outputs
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/pipelines/base.py:1024, in Pipeline.forward(self, model_inputs, **forward_params)
1022 with inference_context():
1023 model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
-> 1024 model_outputs = self._forward(model_inputs, **forward_params)
1025 model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
1026 else:
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/pipelines/text_generation.py:263, in TextGenerationPipeline._forward(self, model_inputs, **generate_kwargs)
260 generate_kwargs["min_length"] += prefix_length
262 # BS x SL
--> 263 generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
264 out_b = generated_sequence.shape[0]
265 if self.framework == "pt":
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py:1518, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, **kwargs)
1512 raise ValueError(
1513 "num_return_sequences has to be 1 when doing greedy search, "
1514 f"but is {generation_config.num_return_sequences}."
1515 )
1517 # 11. run greedy search
-> 1518 return self.greedy_search(
1519 input_ids,
1520 logits_processor=logits_processor,
1521 stopping_criteria=stopping_criteria,
1522 pad_token_id=generation_config.pad_token_id,
1523 eos_token_id=generation_config.eos_token_id,
1524 output_scores=generation_config.output_scores,
1525 return_dict_in_generate=generation_config.return_dict_in_generate,
1526 synced_gpus=synced_gpus,
1527 streamer=streamer,
1528 **model_kwargs,
1529 )
1531 elif is_contrastive_search_gen_mode:
1532 if generation_config.num_return_sequences > 1:
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py:2335, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
2332 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
2334 # forward pass to get next token
-> 2335 outputs = self(
2336 **model_inputs,
2337 return_dict=True,
2338 output_attentions=output_attentions,
2339 output_hidden_states=output_hidden_states,
2340 )
2342 if synced_gpus and this_peer_finished:
2343 continue # don't waste resources running the code we don't need
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:688, in LlamaForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
685 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
687 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
--> 688 outputs = self.model(
689 input_ids=input_ids,
690 attention_mask=attention_mask,
691 position_ids=position_ids,
692 past_key_values=past_key_values,
693 inputs_embeds=inputs_embeds,
694 use_cache=use_cache,
695 output_attentions=output_attentions,
696 output_hidden_states=output_hidden_states,
697 return_dict=return_dict,
698 )
700 hidden_states = outputs[0]
701 logits = self.lm_head(hidden_states)
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:578, in LlamaModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
570 layer_outputs = torch.utils.checkpoint.checkpoint(
571 create_custom_forward(decoder_layer),
572 hidden_states,
(...)
575 None,
576 )
577 else:
--> 578 layer_outputs = decoder_layer(
579 hidden_states,
580 attention_mask=attention_mask,
581 position_ids=position_ids,
582 past_key_value=past_key_value,
583 output_attentions=output_attentions,
584 use_cache=use_cache,
585 )
587 hidden_states = layer_outputs[0]
589 if use_cache:
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:292, in LlamaDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
289 hidden_states = self.input_layernorm(hidden_states)
291 # Self Attention
--> 292 hidden_states, self_attn_weights, present_key_value = self.self_attn(
293 hidden_states=hidden_states,
294 attention_mask=attention_mask,
295 position_ids=position_ids,
296 past_key_value=past_key_value,
297 output_attentions=output_attentions,
298 use_cache=use_cache,
299 )
300 hidden_states = residual + hidden_states
302 # Fully Connected
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:227, in LlamaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
222 raise ValueError(
223 f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
224 )
225 attn_weights = attn_weights + attention_mask
226 attn_weights = torch.max(
--> 227 attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device)
228 )
230 # upcast attention to fp32
231 attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Additional info
model.hf_device_map
{'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 1, 'model.layers.14': 1, 'model.layers.15': 1, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 1, 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 1, 'model.layers.24': 1, 'model.layers.25': 1, 'model.layers.26': 1, 'model.layers.27': 2, 'model.layers.28': 2, 'model.layers.29': 2, 'model.layers.30': 2, 'model.layers.31': 2, 'model.layers.32': 2, 'model.layers.33': 2, 'model.layers.34': 2, 'model.layers.35': 2, 'model.layers.36': 2, 'model.layers.37': 2, 'model.layers.38': 2, 'model.layers.39': 2, 'model.norm': 2, 'lm_head': 2}
This issue is not happening after transformers update 4.30.2
.
This issue is not happening after transformers update
4.30.2
.
Hi @kunaldeo But when I run your code above, it still reports the same error, and I checked that the transformer version is 4.30.2, maybe multi-GPU error, when I use a single GPU, it is normal
This issue is not happening after transformers update
4.30.2
.Hi @kunaldeo But when I run your code above, it still reports the same error, and I checked that the transformer version is 4.30.2, maybe multi-GPU error, when I use a single GPU, it is normal
@Xnhyacinth were you able to solve this by any chance? I'm getting the same error with transformers 4.33.3
(my full case description is here)
This issue is not happening after transformers update
4.30.2
.Hi @kunaldeo But when I run your code above, it still reports the same error, and I checked that the transformer version is 4.30.2, maybe multi-GPU error, when I use a single GPU, it is normal
@Xnhyacinth were you able to solve this by any chance? I'm getting the same error with transformers
4.33.3
(my full case description is here)
@kerenganon @Xnhyacinth - Were either of you able to solve this? I'm getting the same error. It began when I upgraded the CUDA Driver Version from 11.? to 12.2 and updated the NVIDIA Driver Version to 535.113.01 - so perhaps related to driver versions in some way. Prior to upgrading the drivers I had no issues. After the upgrade, I get this error when I attempt to run inference using Llama models across multiple GPUs. The problem doesn't occur if I just use a single GPU. I haven't been able to see any improvement using changes to tokenizer eos or pad token_ids (as suggested elsewhere). The problem seems related to using device_map="auto" (or similar). I'm using transformers 4.31.0, so it doesn't seem to be fixed after 4.30.2 for me.
Encountered same issue on 4.33.2, tried other 13B models, no one could run pass. nvidia driver is 545.23.06 and CUDA is 12.3
This issue is not happening after transformers update
4.30.2
.Hi @kunaldeo But when I run your code above, it still reports the same error, and I checked that the transformer version is 4.30.2, maybe multi-GPU error, when I use a single GPU, it is normal
@Xnhyacinth were you able to solve this by any chance? I'm getting the same error with transformers
4.33.3
(my full case description is here)@kerenganon @Xnhyacinth - Were either of you able to solve this? I'm getting the same error. It began when I upgraded the CUDA Driver Version from 11.? to 12.2 and updated the NVIDIA Driver Version to 535.113.01 - so perhaps related to driver versions in some way. Prior to upgrading the drivers I had no issues. After the upgrade, I get this error when I attempt to run inference using Llama models across multiple GPUs. The problem doesn't occur if I just use a single GPU. I haven't been able to see any improvement using changes to tokenizer eos or pad token_ids (as suggested elsewhere). The problem seems related to using device_map="auto" (or similar). I'm using transformers 4.31.0, so it doesn't seem to be fixed after 4.30.2 for me.
Yes, the problem could be a driver issue, details can be viewed here https://github.com/huggingface/transformers/issues/26096 and https://discuss.pytorch.org/t/problem-transfering-tensor-between-gpus/60263/10, I think you need to change the driver version or fix it with docker.
This issue is not happening after transformers update
4.30.2
.Hi @kunaldeo But when I run your code above, it still reports the same error, and I checked that the transformer version is 4.30.2, maybe multi-GPU error, when I use a single GPU, it is normal
@Xnhyacinth were you able to solve this by any chance? I'm getting the same error with transformers
4.33.3
(my full case description is here)
Maybe you need to check the error like https://github.com/huggingface/transformers/issues/26096, and I think also the driver issue, so change the driver version.
I tested driver 535 + cuda 12.1 / 12.3, driver 520 + cuda 11.8. Both not working. Got the same ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [64,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
Someone please give a help. Thanks
I tested driver 535 + cuda 12.1 / 12.3, driver 520 + cuda 11.8. Both not working. Got the same
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [64,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
Someone please give a help. Thanks
@abcbdf if you just want to inference on 4090 with multi gpus, maybe you can try vllm with tensor parallel, which solved my problem.
I tested driver 535 + cuda 12.1 / 12.3, driver 520 + cuda 11.8. Both not working. Got the same
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [64,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
Someone please give a help. Thanks@abcbdf if you just want to inference on 4090 with multi gpus, maybe you can try vllm with tensor parallel, which solved my problem.
@caseylai Thanks for your help. But I'm actually working on training with multi A6000. I found that the problem was also shown in inference
I tested driver 535 + cuda 12.1 / 12.3, driver 520 + cuda 11.8. Both not working. Got the same
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [64,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
Someone please give a help. Thanks
maybe you can try driver 530 + cuda 12.1 or 530 + cuda 11.8
@Xnhyacinth Thanks. I tried install driver 530.30.02 + cuda 11.8. It still can't work on multi-gpu. This is quite weird because I have another server with basically same environments but it could work on multi-gpu inference/training. Working server: driver 530.30.02
python
Python 3.11.4 (main, Jul 5 2023, 13:45:01) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__config__.show())
PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.7
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.9.2 (built against CUDA 12.1)
- Built with CuDNN 8.5
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
Not working server: driver 530.30.02
import torch
>>> print(torch.__config__.show())
PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 11.8
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
- CuDNN 8.7
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
>>> torch.__version__
'2.1.0+cu118
On this not working server, I also tried create another conda environment with pytorch cuda 11.7:
python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.0.0'
>>> print(torch.__config__.show())
PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.7
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.5
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
It still raise error on multi-gpu inference.
I'm wondering do I need to install cuda toolkit separately? Because pytorch uses its own cuda runtime library and I couldn't find any difference between install cuda or not
Finally solved this by disabling ACS in bios, ref https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs
This test is very helpful. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#gpu-to-gpu-communication
my code:
from flask import Flask, request, jsonify
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
import logging
import os
#print the version of cuda being used
print(torch.version.cuda)
app = Flask(__name__)
# get directory of this file
dir_path = os.path.dirname(os.path.realpath(__file__))
modellocation = "/home/levi/projects/text-generation-webui/models/Upstage_SOLAR-10.7B-Instruct-v1.0"
tokenizer = AutoTokenizer.from_pretrained("/home/levi/projects/text-generation-webui/models/Upstage_SOLAR-10.7B-Instruct-v1.0")
model = AutoModelForCausalLM.from_pretrained(
modellocation,
device_map="auto"
)
@app.before_request
def start_timer():
request.start_time = time.time()
print(f"Request made to LLM; starting timer!")
@app.after_request
def log_request(response):
request_duration = time.time() - request.start_time
print(f"Request took {request_duration} seconds")
return response
@app.route('/generate/chat/completions', methods=['POST'])
def generate_completions():
data = request.get_json()
conversation = data['messages']
prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, use_cache=True, max_length=4096)
output_text = tokenizer.decode(outputs[0])
return jsonify({'choices': [{'message': {'role': 'assistant', 'content': output_text}}]})
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000)
I have 2 RTX 3060's and i am able to run LLM's on One GPU but it wont work when i try to run them on 2 GPU's with the same error:
/opt/anaconda3/envs/333MillionEyes/lib/python3.10/site-packages/transformers/generation/utils.py:1518: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
warnings.warn(
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [1,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [2,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [3,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [4,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [5,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [6,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [7,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [8,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [9,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [10,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [11,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [12,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [13,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [14,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [15,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [16,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [17,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [18,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [19,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [20,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [21,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [22,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [23,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [24,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [25,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [26,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [27,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [28,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [29,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [30,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [31,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
Running Arch
NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
It runs fine on single GPU, but when i try to run on multiple it does not like it. I have tried with oobabooga and vllm and none of them make a difference, it always fails. I have tried with many models of many sizes and types 7b 8x7b 13b 33b 34b, awq doesnt work, gptq doesnt work either.
my motherboard is a MSi MEG x570 with a AMD APU in it, and it has no option to disable ACS in the bios
GPU comms test came back good (i think ):
CUDA_VISIBLE_DEVICES=0,1 ./p2pBandwidthLatencyTest levi@deuxbeast
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 3060, pciBusID: 10, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 3060, pciBusID: 2d, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 331.46 3.17
1 3.17 331.67
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 331.74 2.93
1 2.93 331.81
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 318.71 4.72
1 4.70 332.59
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 318.81 2.93
1 2.93 332.59
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 1.41 13.24
1 13.11 1.42
CPU 0 1
0 2.35 6.27
1 6.78 2.28
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 1.42 1.14
1 1.19 1.41
CPU 0 1
0 2.34 1.89
1 1.94 2.48
Let me know what else you need me to do/run/compile or download to help get this fixed.
I just met that problem last time. If you don't want to reinstall driver , actually you can do transfer between gpu-cpu-gpu, it can solve the problem.
Having this exact same problem. The setup is supermicro4124gs, 8xrtx4090, ubuntu22.04. I have tried cu118/cu121 and disabling ACS, but the problem still persists. Everything works fine with a single gpu, but those errors show up when doing multi-gpu LLM works.
System Info
transformers version
: 4.30.0.dev0Platform:
Linux 6.3.5-zen2-1-zen-x86_64-with-glibc2.37.3 on ArchPython version
: 3.10.9PyTorch version (GPU)
: 2.0.1+cu118 (True)peft version
: 0.4.0.dev0accelerate version
: 0.20.0.dev0bitsandbytes version
: 0.39.0nvidia driver version
: nvidia-dkms-530.41.03-1Who can help?
No response
Information
Tasks
examples
folder (such as GLUE/SQuAD, ...)Reproduction
Run the following code Code:
While this code works fine on a single 4090 GPU. Loading any model for inference with 2 or 3 RTX 4090 is resulting in the following error:
Expected behavior
Code does inference successfully.