huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Multi GPU inference on RTX 4090 fails with RuntimeError: CUDA error: device-side assert triggered (Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.) #24056

Closed kunaldeo closed 1 year ago

kunaldeo commented 1 year ago

System Info

transformers version: 4.30.0.dev0
Platform: Linux 6.3.5-zen2-1-zen-x86_64-with-glibc2.37.3 on Arch
Python version: 3.10.9
PyTorch version (GPU): 2.0.1+cu118 (True)
peft version: 0.4.0.dev0
accelerate version: 0.20.0.dev0
bitsandbytes version: 0.39.0
NVIDIA driver version: nvidia-dkms-530.41.03-1

Who can help?

No response

Reproduction

Run the following code:

from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig, pipeline
import torch
import os

# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

model_name = "/models/wizard-vicuna-13B-HF"

tokenizer = LlamaTokenizer.from_pretrained(model_name)

model = LlamaForCausalLM.from_pretrained(model_name,
                                              device_map='auto',
                                              torch_dtype=torch.float16,
                                              )

pipe = pipeline(
    "text-generation",
    model=model, 
    tokenizer=tokenizer, 
    max_length=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # note: only effective if set before CUDA is initialized

prompt = 'What are the difference between Llamas, Alpacas and Vicunas?'
raw_output = pipe(get_prompt(prompt))  # get_prompt and parse_text are user-defined helpers (not shown here)
parse_text(raw_output)

While this code works fine on a single RTX 4090, loading any model for inference with 2 or 3 RTX 4090s results in the following error:

/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [64,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [65,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [66,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [67,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [68,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [69,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [70,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [71,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [72,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [73,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [74,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [3,0,0], thread: [75,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
--------many such lines----------

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:227, in LlamaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
    222         raise ValueError(
    223             f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
    224         )
    225     attn_weights = attn_weights + attention_mask
    226     attn_weights = torch.max(
--> 227         attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device)
    228     )
    230 # upcast attention to fp32
    231 attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)

RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Expected behavior

Code does inference successfully.

amyeroberts commented 1 year ago

cc @Narsil @ArthurZucker @younesbelkada

Narsil commented 1 year ago

This torch error usually comes from bad vocabulary ids being sent to the model.

What's odd is that it seems triggered by device_map="auto" (it's the only modification from the working code, right?), which shouldn't touch this in any way. I doubt the actual cause is the line referred to by the stack trace, even despite CUDA_LAUNCH_BLOCKING, but I could be wrong (and that would be super puzzling indeed).

Note: Afaik, if you launch on multiple GPUs in this way, you end up with PP (Pipeline Parallelism) rather than TP (Tensor Parallelism); TP is much better for latency (PP gets you larger batch sizes). https://github.com/huggingface/text-generation-inference might be better for reducing latency across multiple GPUs (NVLink presence will definitely be a factor).
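
If you want to rule out bad vocabulary ids, a minimal sanity check (a sketch, reusing the `tokenizer` and `model` from your snippet above) is to compare the tokenized prompt against the embedding size before calling the pipeline:

```python
# Hedged sketch: confirm every token id is within the model's embedding table before generation.
import torch

inputs = tokenizer("What are the difference between Llamas, Alpacas and Vicunas?", return_tensors="pt")
vocab_size = model.get_input_embeddings().num_embeddings
out_of_range = (inputs.input_ids < 0) | (inputs.input_ids >= vocab_size)
if out_of_range.any():
    print("Out-of-range token ids:", inputs.input_ids[out_of_range].tolist())
else:
    print(f"All token ids fall within [0, {vocab_size}).")
```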

kunaldeo commented 1 year ago

Yes device_map="auto" is the only modification.

The other thing is that the RTX 4090 doesn't support NVLink.

Narsil commented 1 year ago

Could be linked to accelerate here. At least I don't have any good ideas as to what might be happening.

amyeroberts commented 1 year ago

Re accelerate - @pacman100 would you have any idea what might be causing this issue?

pacman100 commented 1 year ago

Hello, as this is related to the big model inference/device_map, @sgugger might have a better idea wrt this issue.

sgugger commented 1 year ago

Please post a full reproducer. I don't have access to "/models/wizard-vicuna-13B-HF".

Using "huggyllama/llama-13b" on my side and removing get_prompt and parse_text works fine on two GPUs.

kunaldeo commented 1 year ago

I have removed all the other code; it is now just the following, and I am still getting the same error.

from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig, pipeline
import torch
import os

# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
model_name = "huggyllama/llama-13b"
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name,
                                              device_map='auto',
                                              torch_dtype=torch.float16,
                                              )
pipe = pipeline(
    "text-generation",
    model=model, 
    tokenizer=tokenizer, 
    max_length=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # note: only effective if set before CUDA is initialized
raw_output = pipe("Hi how are you")                                              

Error:

[0,0,0], thread: [40,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [41,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [42,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [43,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [44,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [45,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [46,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [47,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [48,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [49,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [50,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [51,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [52,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [53,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [54,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [55,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [56,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [57,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [58,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [59,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [60,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [61,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [62,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1682343995026/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [63,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[5], line 1
----> 1 raw_output = pipe("Hi how are you")

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/pipelines/text_generation.py:201, in TextGenerationPipeline.__call__(self, text_inputs, **kwargs)
    160 def __call__(self, text_inputs, **kwargs):
    161     """
    162     Complete the prompt(s) given as inputs.
    163 
   (...)
    199           ids of the generated text.
    200     """
--> 201     return super().__call__(text_inputs, **kwargs)

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/pipelines/base.py:1118, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1110     return next(
   1111         iter(
   1112             self.get_iterator(
   (...)
   1115         )
   1116     )
   1117 else:
-> 1118     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/pipelines/base.py:1125, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
   1123 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
   1124     model_inputs = self.preprocess(inputs, **preprocess_params)
-> 1125     model_outputs = self.forward(model_inputs, **forward_params)
   1126     outputs = self.postprocess(model_outputs, **postprocess_params)
   1127     return outputs

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/pipelines/base.py:1024, in Pipeline.forward(self, model_inputs, **forward_params)
   1022     with inference_context():
   1023         model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
-> 1024         model_outputs = self._forward(model_inputs, **forward_params)
   1025         model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
   1026 else:

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/pipelines/text_generation.py:263, in TextGenerationPipeline._forward(self, model_inputs, **generate_kwargs)
    260         generate_kwargs["min_length"] += prefix_length
    262 # BS x SL
--> 263 generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
    264 out_b = generated_sequence.shape[0]
    265 if self.framework == "pt":

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py:1518, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, **kwargs)
   1512         raise ValueError(
   1513             "num_return_sequences has to be 1 when doing greedy search, "
   1514             f"but is {generation_config.num_return_sequences}."
   1515         )
   1517     # 11. run greedy search
-> 1518     return self.greedy_search(
   1519         input_ids,
   1520         logits_processor=logits_processor,
   1521         stopping_criteria=stopping_criteria,
   1522         pad_token_id=generation_config.pad_token_id,
   1523         eos_token_id=generation_config.eos_token_id,
   1524         output_scores=generation_config.output_scores,
   1525         return_dict_in_generate=generation_config.return_dict_in_generate,
   1526         synced_gpus=synced_gpus,
   1527         streamer=streamer,
   1528         **model_kwargs,
   1529     )
   1531 elif is_contrastive_search_gen_mode:
   1532     if generation_config.num_return_sequences > 1:

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py:2335, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
   2332 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
   2334 # forward pass to get next token
-> 2335 outputs = self(
   2336     **model_inputs,
   2337     return_dict=True,
   2338     output_attentions=output_attentions,
   2339     output_hidden_states=output_hidden_states,
   2340 )
   2342 if synced_gpus and this_peer_finished:
   2343     continue  # don't waste resources running the code we don't need

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:688, in LlamaForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
    685 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    687 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
--> 688 outputs = self.model(
    689     input_ids=input_ids,
    690     attention_mask=attention_mask,
    691     position_ids=position_ids,
    692     past_key_values=past_key_values,
    693     inputs_embeds=inputs_embeds,
    694     use_cache=use_cache,
    695     output_attentions=output_attentions,
    696     output_hidden_states=output_hidden_states,
    697     return_dict=return_dict,
    698 )
    700 hidden_states = outputs[0]
    701 logits = self.lm_head(hidden_states)

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:578, in LlamaModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
    570     layer_outputs = torch.utils.checkpoint.checkpoint(
    571         create_custom_forward(decoder_layer),
    572         hidden_states,
   (...)
    575         None,
    576     )
    577 else:
--> 578     layer_outputs = decoder_layer(
    579         hidden_states,
    580         attention_mask=attention_mask,
    581         position_ids=position_ids,
    582         past_key_value=past_key_value,
    583         output_attentions=output_attentions,
    584         use_cache=use_cache,
    585     )
    587 hidden_states = layer_outputs[0]
    589 if use_cache:

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:292, in LlamaDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
    289 hidden_states = self.input_layernorm(hidden_states)
    291 # Self Attention
--> 292 hidden_states, self_attn_weights, present_key_value = self.self_attn(
    293     hidden_states=hidden_states,
    294     attention_mask=attention_mask,
    295     position_ids=position_ids,
    296     past_key_value=past_key_value,
    297     output_attentions=output_attentions,
    298     use_cache=use_cache,
    299 )
    300 hidden_states = residual + hidden_states
    302 # Fully Connected

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File ~/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:227, in LlamaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
    222         raise ValueError(
    223             f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
    224         )
    225     attn_weights = attn_weights + attention_mask
    226     attn_weights = torch.max(
--> 227         attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device)
    228     )
    230 # upcast attention to fp32
    231 attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)

RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Additional info

model.hf_device_map

{'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 1, 'model.layers.14': 1, 'model.layers.15': 1, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 1, 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 1, 'model.layers.24': 1, 'model.layers.25': 1, 'model.layers.26': 1, 'model.layers.27': 2, 'model.layers.28': 2, 'model.layers.29': 2, 'model.layers.30': 2, 'model.layers.31': 2, 'model.layers.32': 2, 'model.layers.33': 2, 'model.layers.34': 2, 'model.layers.35': 2, 'model.layers.36': 2, 'model.layers.37': 2, 'model.layers.38': 2, 'model.layers.39': 2, 'model.norm': 2, 'lm_head': 2}
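
For reference, pinning the entire model to a single GPU (the configuration that works) looks like the sketch below; device_map={"": 0} is accelerate's syntax for placing every module on device 0. This is a sketch for comparison, not something taken from the original report.

```python
# Hedged sketch: load the same checkpoint entirely on cuda:0 to confirm the failure is
# specific to the multi-GPU sharding rather than the weights or the tokenizer.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_name = "huggyllama/llama-13b"
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(
    model_name,
    device_map={"": 0},        # whole model on GPU 0, no cross-GPU transfers
    torch_dtype=torch.float16,
)
```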

kunaldeo commented 1 year ago

This issue is no longer happening after updating transformers to 4.30.2.

Xnhyacinth commented 1 year ago

> This issue is no longer happening after updating transformers to 4.30.2.

Hi @kunaldeo, when I run your code above it still reports the same error, and I checked that the transformers version is 4.30.2. It may be a multi-GPU issue; when I use a single GPU it works normally.

kerenganon commented 1 year ago

> This issue is no longer happening after updating transformers to 4.30.2.

> Hi @kunaldeo, when I run your code above it still reports the same error, and I checked that the transformers version is 4.30.2. It may be a multi-GPU issue; when I use a single GPU it works normally.

@Xnhyacinth were you able to solve this by any chance? I'm getting the same error with transformers 4.33.3 (my full case description is here)

DrGlennn commented 12 months ago

> This issue is no longer happening after updating transformers to 4.30.2.

> Hi @kunaldeo, when I run your code above it still reports the same error, and I checked that the transformers version is 4.30.2. It may be a multi-GPU issue; when I use a single GPU it works normally.

> @Xnhyacinth were you able to solve this by any chance? I'm getting the same error with transformers 4.33.3 (my full case description is here)

@kerenganon @Xnhyacinth - Were either of you able to solve this? I'm getting the same error. It began when I upgraded the CUDA driver version from 11.x to 12.2 and updated the NVIDIA driver to 535.113.01, so it may be related to driver versions in some way. Prior to upgrading the drivers I had no issues. After the upgrade, I get this error when I attempt to run inference with Llama models across multiple GPUs. The problem doesn't occur if I just use a single GPU. I haven't seen any improvement from changing the tokenizer eos or pad token ids (as suggested elsewhere). The problem seems related to using device_map="auto" (or similar). I'm using transformers 4.31.0, so for me it doesn't seem to be fixed after 4.30.2.

caseylai commented 11 months ago

Encountered the same issue on 4.33.2; I tried other 13B models and none of them ran successfully. The NVIDIA driver is 545.23.06 and CUDA is 12.3.

Xnhyacinth commented 11 months ago

> This issue is no longer happening after updating transformers to 4.30.2.

> Hi @kunaldeo, when I run your code above it still reports the same error, and I checked that the transformers version is 4.30.2. It may be a multi-GPU issue; when I use a single GPU it works normally.

> @Xnhyacinth were you able to solve this by any chance? I'm getting the same error with transformers 4.33.3 (my full case description is here)

> @kerenganon @Xnhyacinth - Were either of you able to solve this? I'm getting the same error. It began when I upgraded the CUDA driver version from 11.x to 12.2 and updated the NVIDIA driver to 535.113.01 [...]

Yes, the problem could be a driver issue; details can be found at https://github.com/huggingface/transformers/issues/26096 and https://discuss.pytorch.org/t/problem-transfering-tensor-between-gpus/60263/10. I think you need to change the driver version or work around it with Docker.
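
A minimal way to test for the broken GPU-to-GPU transfer described in that PyTorch thread (a sketch, assuming at least two visible GPUs):

```python
# Hedged sketch: if peer-to-peer transfers are broken (e.g. by ACS/IOMMU), the copied
# tensor arrives corrupted, which later trips the embedding "index out of bounds" assert.
import torch

print("peer access 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
x = torch.arange(1000, device="cuda:0")
y = x.to("cuda:1")
print("transfer intact:", torch.equal(x.cpu(), y.cpu()))  # False points to a driver/platform P2P problem
```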

Xnhyacinth commented 11 months ago

> This issue is no longer happening after updating transformers to 4.30.2.

> Hi @kunaldeo, when I run your code above it still reports the same error, and I checked that the transformers version is 4.30.2. It may be a multi-GPU issue; when I use a single GPU it works normally.

> @Xnhyacinth were you able to solve this by any chance? I'm getting the same error with transformers 4.33.3 (my full case description is here)

Maybe you need to check whether the error matches https://github.com/huggingface/transformers/issues/26096; I also think it is a driver issue, so try changing the driver version.

abcbdf commented 9 months ago

I tested driver 535 + CUDA 12.1 / 12.3 and driver 520 + CUDA 11.8. Neither works; I get the same ../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [64,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed. Could someone please help? Thanks.

caseylai commented 9 months ago

> I tested driver 535 + CUDA 12.1 / 12.3 and driver 520 + CUDA 11.8. Neither works; I get the same IndexKernel.cu:92 "index out of bounds" assertion. Could someone please help? Thanks.

@abcbdf if you just want to run inference on 4090s with multiple GPUs, maybe you can try vLLM with tensor parallelism, which solved my problem.
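
For reference, a minimal vLLM sketch (assuming vLLM is installed; these exact arguments are an illustration, not re-verified on 4090s here), where tensor_parallel_size splits the model across the GPUs:

```python
# Hedged sketch of the tensor-parallel vLLM setup mentioned above.
from vllm import LLM, SamplingParams

llm = LLM(model="huggyllama/llama-13b", tensor_parallel_size=2)  # shard across 2 GPUs
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)
outputs = llm.generate(["Hi how are you"], params)
print(outputs[0].outputs[0].text)
```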

abcbdf commented 9 months ago

> I tested driver 535 + CUDA 12.1 / 12.3 and driver 520 + CUDA 11.8. Neither works; I get the same IndexKernel.cu:92 "index out of bounds" assertion. Could someone please help? Thanks.

> @abcbdf if you just want to run inference on 4090s with multiple GPUs, maybe you can try vLLM with tensor parallelism, which solved my problem.

@caseylai Thanks for your help, but I'm actually working on training with multiple A6000s, and the problem also shows up during inference.

Xnhyacinth commented 9 months ago

> I tested driver 535 + CUDA 12.1 / 12.3 and driver 520 + CUDA 11.8. Neither works; I get the same IndexKernel.cu:92 "index out of bounds" assertion. Could someone please help? Thanks.

Maybe you can try driver 530 + CUDA 12.1 or 530 + CUDA 11.8.

abcbdf commented 9 months ago

@Xnhyacinth Thanks. I tried installing driver 530.30.02 + CUDA 11.8. It still doesn't work on multi-GPU. This is quite weird because I have another server with basically the same environment that does work for multi-GPU inference/training. Working server: driver 530.30.02

python
Python 3.11.4 (main, Jul  5 2023, 13:45:01) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__config__.show())
PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.7
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.9.2  (built against CUDA 12.1)
    - Built with CuDNN 8.5
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

Non-working server: driver 530.30.02

>>> import torch
>>> print(torch.__config__.show())
PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

>>> torch.__version__
'2.1.0+cu118'

On the non-working server, I also tried creating another conda environment with PyTorch built against CUDA 11.7:

python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.0.0'
>>> print(torch.__config__.show())
PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.7
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.5
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

It still raises the error on multi-GPU inference.

I'm wondering whether I need to install the CUDA toolkit separately, because PyTorch ships its own CUDA runtime libraries and I couldn't find any difference between installing CUDA system-wide or not.
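
One thing that can be checked (a sketch) is which CUDA runtime and cuDNN the installed PyTorch build actually bundles:

```python
# Quick check of the CUDA/cuDNN versions bundled with the installed PyTorch build.
# My understanding (an assumption, not from this thread): the PyTorch wheels ship their
# own CUDA runtime, so a system-wide toolkit should not be required just to run models.
import torch

print("bundled CUDA runtime:", torch.version.cuda)
print("bundled cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available(), "| visible GPUs:", torch.cuda.device_count())
```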

abcbdf commented 9 months ago

Finally solved this by disabling ACS in the BIOS, see https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs

This test is very helpful: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#gpu-to-gpu-communication
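
Before going into the BIOS, ACS status can also be checked from userspace (a sketch; lspci may need root, and the grep pattern follows the NVIDIA troubleshooting page above):

```python
# Hedged helper: list PCI ACS control entries. Per the NCCL troubleshooting docs,
# entries showing "SrcValid+" indicate ACS is enabled on that bridge, which can break P2P.
import subprocess

result = subprocess.run("lspci -vvv | grep -i acsctl", shell=True, capture_output=True, text=True)
print(result.stdout or "No ACSCtl entries reported (try running with sudo).")
```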

levidehaan commented 9 months ago

My code:

from flask import Flask, request, jsonify
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
import logging
import os
# print the version of CUDA being used
print(torch.version.cuda)

app = Flask(__name__)

# get directory of this file
dir_path = os.path.dirname(os.path.realpath(__file__))

modellocation = "/home/levi/projects/text-generation-webui/models/Upstage_SOLAR-10.7B-Instruct-v1.0"

tokenizer = AutoTokenizer.from_pretrained(modellocation)

model = AutoModelForCausalLM.from_pretrained(
    modellocation,
    device_map="auto"
)

@app.before_request
def start_timer():
    request.start_time = time.time()
    print(f"Request made to LLM; starting timer!")

@app.after_request
def log_request(response):
    request_duration = time.time() - request.start_time
    print(f"Request took {request_duration} seconds")
    return response

@app.route('/generate/chat/completions', methods=['POST'])
def generate_completions():
    data = request.get_json()
    conversation = data['messages']

    prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, use_cache=True, max_length=4096)
    output_text = tokenizer.decode(outputs[0])

    return jsonify({'choices': [{'message': {'role': 'assistant', 'content': output_text}}]})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

I have 2 RTX 3060s and I am able to run LLMs on one GPU, but it fails when I try to run them on 2 GPUs, with the same error:

/opt/anaconda3/envs/333MillionEyes/lib/python3.10/site-packages/transformers/generation/utils.py:1518: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [1,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [2,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [3,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [4,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [5,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [6,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [7,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [8,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [9,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [10,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [11,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [12,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [13,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [14,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [15,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [16,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [17,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [18,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [19,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [20,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [21,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [22,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [23,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [24,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [25,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [26,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [27,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [28,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [29,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [30,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [31,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.

Running Arch

NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3 

Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

It runs fine on a single GPU, but it fails whenever I try to run on multiple GPUs. I have tried with oobabooga and vLLM and neither makes a difference; it always fails. I have tried many models of many sizes and types (7B, 8x7B, 13B, 33B, 34B); AWQ doesn't work, and GPTQ doesn't work either.

My motherboard is an MSI MEG X570 with an AMD APU, and it has no option to disable ACS in the BIOS.

The GPU comms test came back good (I think):

CUDA_VISIBLE_DEVICES=0,1 ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 3060, pciBusID: 10, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 3060, pciBusID: 2d, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 331.46   3.17 
     1   3.17 331.67 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1 
     0 331.74   2.93 
     1   2.93 331.81 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 318.71   4.72 
     1   4.70 332.59 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 318.81   2.93 
     1   2.93 332.59 
P2P=Disabled Latency Matrix (us)
   GPU     0      1 
     0   1.41  13.24 
     1  13.11   1.42 

   CPU     0      1 
     0   2.35   6.27 
     1   6.78   2.28 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1 
     0   1.42   1.14 
     1   1.19   1.41 

   CPU     0      1 
     0   2.34   1.89 
     1   1.94   2.48 

Let me know what else you need me to do/run/compile or download to help get this fixed.

HackGiter commented 6 months ago

I ran into that problem recently. If you don't want to reinstall the driver, you can actually route the transfer through the CPU (GPU -> CPU -> GPU); that can solve the problem.
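
In tensor terms, the workaround amounts to staging the copy through host memory instead of the peer-to-peer path (a sketch):

```python
# Hedged sketch: a direct cuda:0 -> cuda:1 copy may arrive corrupted when P2P is broken;
# bouncing through the CPU is slower but delivers the data intact.
import torch

x = torch.arange(1000, device="cuda:0")
y_direct = x.to("cuda:1")         # may be silently corrupted on affected systems
y_staged = x.cpu().to("cuda:1")   # GPU -> CPU -> GPU
print("staged copy intact:", torch.equal(y_staged.cpu(), x.cpu()))
```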

Zephyr69 commented 6 months ago

Having this exact same problem. The setup is a Supermicro 4124GS with 8x RTX 4090 on Ubuntu 22.04. I have tried cu118/cu121 and disabling ACS, but the problem still persists. Everything works fine with a single GPU, but these errors show up when doing multi-GPU LLM work.