smallcloudai / refact

WebUI for Fine-Tuning and Self-hosting of Open-Source Large Language Models for Coding
https://refact.ai
BSD 3-Clause "New" or "Revised" License

Llama2 chat model times out #376

Open · jcntrl opened this issue 3 months ago

jcntrl commented 3 months ago

Llama2 (and other Llama-based models) time out in chat. Other chat models (Mistral and Mixtral, both tested) respond fine. Below is a snippet of the Docker container log covering the span from when the request is sent by the Refact extension (VS Code) to when the timeout is reported back at the extension.

This was installed using :latest (note to self: never use :latest again). My attempt to identify which version this actually is:

ubuntu@REDACTED:~$ docker images
REPOSITORY                       TAG       IMAGE ID       CREATED       SIZE
smallcloud/refact_self_hosting   latest    5e8a87f811b8   2 weeks ago   20.8GB
ubuntu@REDACTED:~$ IMAGE_ID=5e8a87f811b8
ubuntu@REDACTED:~$ docker image inspect --format '{{json .}}' "$IMAGE_ID" | jq -r '. | {Id: .Id, Digest: .Digest, RepoDigests: .RepoDigests, Labels: .Config.Labels}'
{
  "Id": "sha256:5e8a87f811b8257cfb24e6b0606ac8090e7ee8e5947105e7982a5d06a2e049e3",
  "Digest": null,
  "RepoDigests": [
    "smallcloud/refact_self_hosting@sha256:ebe5962002a47e92db987a2903e0c2f7426f39852dada10620412c4699a91d7e"
  ],
  "Labels": {
    "com.nvidia.cudnn.version": "8.9.0.131",
    "maintainer": "NVIDIA CORPORATION <cudatools@nvidia.com>",
    "org.opencontainers.image.ref.name": "ubuntu",
    "org.opencontainers.image.version": "22.04"
  }
}
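
For reproducibility, the image can be pinned to the digest shown in RepoDigests above rather than :latest, e.g. docker pull smallcloud/refact_self_hosting@sha256:ebe5962002a47e92db987a2903e0c2f7426f39852dada10620412c4699a91d7e. The library versions inside the container could likewise be checked with something like docker exec -it <container> pip show transformers auto-gptq (a suggestion; the package names are inferred from the traceback below).

Container log around the failing request: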
-- 1089 -- 20240320 15:58:00 MODEL 10002.1ms http://127.0.0.1:8008/infengine-v1/completions-wait-batch WAIT
-- 316840 -- 20240320 15:58:00 WEBUI 127.0.0.1:52586 - "POST /infengine-v1/completions-wait-batch HTTP/1.1" 200
-- 316840 -- 20240320 15:58:01 WEBUI 137.65.195.181:57190 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 316840 -- 20240320 15:58:02 WEBUI 15.122.93.82:57206 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 316840 -- 20240320 15:58:03 WEBUI 137.65.195.181:57190 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 316840 -- 20240320 15:58:04 WEBUI comp-3eb55b8302c6 model resolve "llama2/7b" -> "llama2/7b" from user
-- 316840 -- 20240320 15:58:04 WEBUI wait_batch batch 1/1 => llama2_7b_ed839e52dcb2
-- 316840 -- 20240320 15:58:04 WEBUI 137.65.195.181:59460 - "POST /v1/completions HTTP/1.1" 200
-- 316840 -- 20240320 15:58:04 WEBUI 127.0.0.1:41752 - "POST /infengine-v1/completions-wait-batch HTTP/1.1" 200
-- 316840 -- 20240320 15:58:04 WEBUI 15.122.93.82:57206 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 318086 -- 20240320 15:58:04 MODEL 7455.4ms http://127.0.0.1:8008/infengine-v1/completions-wait-batch OK
-- 318086 -- 20240320 15:58:04 MODEL Model llama2/7b does not support finetune
-- 318086 -- 20240320 15:58:04 MODEL LlamaRotaryEmbedding.forward() missing 1 required positional argument: 'position_ids'
-- 318086 -- 20240320 15:58:04 MODEL Traceback (most recent call last):
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/self_hosting_machinery/inference/inference_hf.py", line 284, in infer
-- 318086 --     self._model.generate(**generation_kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py", line 447, in generate
-- 318086 --     return self.model.generate(**kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
-- 318086 --     return func(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1593, in generate
-- 318086 --     return self.sample(
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2697, in sample
-- 318086 --     outputs = self(
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
-- 318086 --     return self._call_impl(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
-- 318086 --     return forward_call(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1148, in forward
-- 318086 --     outputs = self.model(
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
-- 318086 --     return self._call_impl(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
-- 318086 --     return forward_call(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 990, in forward
-- 318086 --     layer_outputs = decoder_layer(
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
-- 318086 --     return self._call_impl(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
-- 318086 --     return forward_call(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 716, in forward
-- 318086 --     hidden_states, self_attn_weights, present_key_value = self.self_attn(
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
-- 318086 --     return self._call_impl(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
-- 318086 --     return forward_call(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/fused_llama_attn.py", line 72, in forward
-- 318086 --     cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
-- 318086 --     return self._call_impl(*args, **kwargs)
-- 318086 --   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
-- 318086 --     return forward_call(*args, **kwargs)
-- 318086 -- TypeError: LlamaRotaryEmbedding.forward() missing 1 required positional argument: 'position_ids'
-- 318086 -- 
-- 316840 -- 20240320 15:58:04 WEBUI 127.0.0.1:42516 - "POST /infengine-v1/completions-wait-batch HTTP/1.1" 200
-- 316282 -- 20240320 15:58:04 MODEL 10002.2ms http://127.0.0.1:8008/infengine-v1/completions-wait-batch WAIT
-- 316840 -- 20240320 15:58:05 WEBUI 137.65.195.181:57190 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 316840 -- 20240320 15:58:06 WEBUI 15.122.93.82:57206 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 316840 -- 20240320 15:58:07 WEBUI 137.65.195.181:57190 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 316840 -- 20240320 15:58:08 WEBUI 15.122.93.82:57206 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 316840 -- 20240320 15:58:09 WEBUI 137.65.195.181:57190 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 316840 -- 20240320 15:58:10 WEBUI 15.122.93.82:57206 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 1089 -- 20240320 15:58:10 MODEL 10003.0ms http://127.0.0.1:8008/infengine-v1/completions-wait-batch WAIT
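
The traceback suggests this is an API mismatch rather than a slow model: auto_gptq's fused attention (fused_llama_attn.py, line 72 above) still calls the rotary embedding with the old seq_len= keyword, while the installed transformers version requires a position_ids argument, so every request on the fused GPTQ Llama path dies with the TypeError and the extension only ever sees a timeout. A minimal Python sketch of the mismatch, with signatures inferred from the TypeError (stand-in classes, not the actual library code):

# Stand-in for the newer transformers LlamaRotaryEmbedding: position_ids
# became a required positional argument, seq_len survives only as a
# deprecated keyword (signature inferred from the TypeError above).
class LlamaRotaryEmbeddingNew:
    def forward(self, x, position_ids, seq_len=None):
        return ("cos", "sin")  # the real module returns cos/sin tensors

    __call__ = forward  # make instances callable, like an nn.Module

# auto_gptq's fused_llama_attn.py still uses the pre-change calling
# convention, reproducing the error in the log:
rotary_emb = LlamaRotaryEmbeddingNew()
try:
    cos, sin = rotary_emb(None, seq_len=16)  # None stands in for value_states
except TypeError as e:
    # LlamaRotaryEmbeddingNew.forward() missing 1 required positional
    # argument: 'position_ids'
    print(e)

If that reading is right, the problem lives in the transformers/auto_gptq version pair shipped in the image rather than in the Llama2 weights themselves.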
olegklimov commented 3 months ago

whoops, that's clearly a problem