Careiner opened 1 week ago
In case this error log is more helpful (run with `CUDA_LAUNCH_BLOCKING=1`):
```
[...]
llava_worker-1 | ../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [431,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
llava_worker-1 | ../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [431,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
llava_worker-1 | ../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [431,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
llava_worker-1 | ../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [431,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
llava_worker-1 | ../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [431,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | Exception in thread Thread-3 (generate):
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | Traceback (most recent call last):
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | self.run()
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | File "/usr/lib/python3.10/threading.py", line 953, in run
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | self._target(*self._args, **self._kwargs)
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | return func(*args, **kwargs)
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1736, in generate
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | result = self._sample(
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2375, in _sample
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | outputs = self(
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | return self._call_impl(*args, **kwargs)
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | return forward_call(*args, **kwargs)
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py", line 1139, in forward
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | outputs = self.model(
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | return self._call_impl(*args, **kwargs)
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | return forward_call(*args, **kwargs)
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py", line 968, in forward
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | inputs_embeds = self.embed_tokens(input_ids)
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | return self._call_impl(*args, **kwargs)
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | return forward_call(*args, **kwargs)
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 163, in forward
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | return F.embedding(
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2264, in embedding
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | RuntimeError: CUDA error: device-side assert triggered
llava_worker-1 | 2024-06-19 13:11:02 | ERROR | stderr | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
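Reading the traceback, the assert fires inside `F.embedding`, and `srcIndex < srcSelectDimSize` there usually means some input token id is greater than or equal to the number of rows in the embedding matrix, i.e. a tokenizer/vocab-size mismatch in the checkpoint. This is the minimal check I plan to run against my custom model on the failing server; the path is a placeholder for my model directory:

```python
# Compare the tokenizer's vocabulary with the model's embedding table.
# If len(tokenizer) > config.vocab_size, any prompt that uses one of the
# extra ids triggers exactly this device-side assert in the embedding lookup.
from transformers import AutoConfig, AutoTokenizer

path = "/models/my-custom-llava-mistral-7b"  # placeholder path
cfg = AutoConfig.from_pretrained(path)
tok = AutoTokenizer.from_pretrained(path)

print("config.vocab_size:", cfg.vocab_size)
print("len(tokenizer)   :", len(tok))

sample = tok("describe the image", return_tensors="pt").input_ids
print("max input id     :", sample.max().item())
```

I have not yet confirmed whether such a mismatch is actually the cause here.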
I run the LLaVA system from this repository in a Docker Compose setup using the official CUDA Docker images, and on some systems I hit this error with my custom-trained models. On a server with an NVIDIA A100 everything is fine and all models work as expected. On a server with an NVIDIA RTX A6000, the stock model https://huggingface.co/liuhaotian/llava-v1.6-mistral-7b works, but a custom-trained LLaVA-Mistral-7B fails during inference with the error shown in the log above (on the A100 server the same custom model runs without problems).

Can you give me any advice on what could cause this different behavior on different machines despite using the same Docker setup?
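In case the difference comes from the surrounding stack rather than the model, this is the snippet I run inside the container on both servers to compare them; everything here is standard PyTorch API:

```python
import torch

# Dump the parts of the stack that can differ between the A100 and the
# A6000 server even with an identical compose file: the CUDA runtime
# torch was built against, the cuDNN build, and the GPU's compute capability.
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)
print("cudnn :", torch.backends.cudnn.version())
print("device:", torch.cuda.get_device_name(0))
print("cc    :", torch.cuda.get_device_capability(0))
```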
Thank you!