hyperonym / basaran

Basaran is an open-source alternative to the OpenAI text completion API. It provides a compatible streaming API for your Hugging Face Transformers-based text generation models.

RuntimeError: mat1 and mat2 shapes cannot be multiplied #181


lcw99 commented 1 year ago

When I issue multiple streaming completion requests at the same time, I get the error below:

start listening on 127.0.0.1:8888
ERROR:waitress:Exception while serving /v1/completions
Traceback (most recent call last):
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/waitress/channel.py", line 428, in service
    task.service()
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/waitress/task.py", line 168, in service
    self.execute()
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/waitress/task.py", line 456, in execute
    for chunk in app_iter:
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/werkzeug/wsgi.py", line 500, in __next__
    return self._next()
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/werkzeug/wrappers/response.py", line 50, in _iter_encoded
    for item in iterable:
  File "/home/chang/AI/llm/basaran/basaran/__main__.py", line 168, in stream
    for choice in stream_model(**options):
  File "/home/chang/AI/llm/basaran/basaran/model.py", line 73, in __call__
    for (
  File "/home/chang/AI/llm/basaran/basaran/model.py", line 237, in generate
    outputs = self.model(
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 662, in forward
    outputs = self.gpt_neox(
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 553, in forward
    outputs = layer(
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 335, in forward
    mlp_output = self.mlp(self.post_attention_layernorm(hidden_states))
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 297, in forward
    hidden_states = self.dense_4h_to_h(hidden_states)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 320, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 500, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 417, in forward
    output += torch.matmul(subA, state.subB)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (238x13 and 29x5120)
ERROR:waitress:Exception while serving /v1/completions
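For reference, the failure only shows up under concurrent streaming requests, so a minimal reproduction might fire several completions at the server from the log above in parallel (the endpoint and port come from the log; the prompt and request parameters are placeholders):

```python
import json
import threading
import urllib.request

BASE_URL = "http://127.0.0.1:8888"  # address from the server log above


def make_payload(prompt: str) -> dict:
    """Build an OpenAI-style streaming completion request body."""
    return {"prompt": prompt, "max_tokens": 64, "stream": True}


def stream_completion(prompt: str) -> None:
    """POST one streaming completion and drain the response."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(make_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for _line in resp:  # consume the event stream
            pass


if __name__ == "__main__":
    # Several simultaneous streams are what trigger the shape error.
    threads = [
        threading.Thread(target=stream_completion, args=(f"prompt {i}",))
        for i in range(4)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```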
fardeon commented 1 year ago

We've run into the exact same error before: https://github.com/hyperonym/basaran/issues/5. It is caused by https://github.com/TimDettmers/bitsandbytes/issues/162 and appears to occur randomly.

Currently, the only workaround is to stop using INT8 quantization and use half precision instead.
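In Transformers terms, the switch looks roughly like this (a sketch, not Basaran's actual loading code; Basaran selects these options through its own configuration, and the `load_model` helper here is purely illustrative):

```python
def load_model(name: str, use_int8: bool = False):
    """Load a causal LM either with bitsandbytes INT8 or in half precision.

    Illustrative only; `name` is whatever model your Basaran instance serves.
    """
    import torch
    from transformers import AutoModelForCausalLM

    if use_int8:
        # load_in_8bit routes matmuls through bitsandbytes (MatMul8bitLt),
        # which is where the random shape error in the traceback originates.
        return AutoModelForCausalLM.from_pretrained(
            name, device_map="auto", load_in_8bit=True
        )
    # Half precision needs roughly twice the memory of INT8,
    # but sidesteps the bitsandbytes bug.
    return AutoModelForCausalLM.from_pretrained(
        name, device_map="auto", torch_dtype=torch.float16
    )
```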