LLM: 8 bit quantization occasional matrix multiplication error

Describe the bug

When using 8-bit quantization with the LLM pipeline and a multiple GPU setup, it mostly runs fine.

After some random amount of requests however the pipeline starts failing and requires a restart.

More investigation is required as to why this bug occurs.

Full error trace:

Exception in thread Thread-18 (model_generate_wrapper):
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/app/app/pipelines/llm_generate.py", line 180, in model_generate_wrapper
    self.model.generate(**kwargs)
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/generation/utils.py", line 1989, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/generation/utils.py", line 2932, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1141, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 944, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 677, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 562, in forward
    value_states = self.v_proj(hidden_states)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 817, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py", line 556, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py", line 415, in forward
    output += torch.matmul(subA, state.subB)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x1 and 5x1024)

Reproduction steps

Go to '...'
Click on '....'
Scroll down to '....'
See error

Expected behaviour

No response

Severity

Minor

Screenshots / Live demo link

OS

Linux

Running on

Docker

AI-worker version

eperimental: llm-pipeline

Additional context

This error only occurs using 8 bit quantization, not sure if other models of lowering precision are supported that are compatible with the model which might also solve the issue.

livepeer / ai-worker