Exception in thread Thread-18 (model_generate_wrapper):
Traceback (most recent call last):
File "/root/.pyenv/versions/3.11.9/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/root/.pyenv/versions/3.11.9/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "/app/app/pipelines/llm_generate.py", line 180, in model_generate_wrapper
self.model.generate(**kwargs)
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/generation/utils.py", line 1989, in generate
result = self._sample(
^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/generation/utils.py", line 2932, in _sample
outputs = self(**model_inputs, return_dict=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1141, in forward
outputs = self.model(
^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 944, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 677, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 562, in forward
value_states = self.v_proj(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 817, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py", line 556, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py", line 415, in forward
output += torch.matmul(subA, state.subB)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x1 and 5x1024)
Reproduction steps
Go to '...'
Click on '....'
Scroll down to '....'
See error
Expected behaviour
No response
Severity
Minor
Screenshots / Live demo link
OS
Linux
Running on
Docker
AI-worker version
eperimental: llm-pipeline
Additional context
This error only occurs using 8 bit quantization, not sure if other models of lowering precision are supported that are compatible with the model which might also solve the issue.
Describe the bug
When using 8-bit quantization with the LLM pipeline and a multiple GPU setup, it mostly runs fine.
After some random amount of requests however the pipeline starts failing and requires a restart.
More investigation is required as to why this bug occurs.
Related: https://github.com/bitsandbytes-foundation/bitsandbytes/issues/162
Full error trace:
Reproduction steps
Expected behaviour
No response
Severity
Minor
Screenshots / Live demo link
OS
Linux
Running on
Docker
AI-worker version
eperimental: llm-pipeline
Additional context
This error only occurs using 8 bit quantization, not sure if other models of lowering precision are supported that are compatible with the model which might also solve the issue.