oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Always got 0 tokens output generated when run HuggingFace.co/TheBloke_StableBeluga2-70B-GPTQ #3413

Closed: JasonCaoJR closed this issue 1 year ago

JasonCaoJR commented 1 year ago

Describe the bug

I always get 0 tokens of output when running HuggingFace.co/TheBloke_StableBeluga2-70B-GPTQ.

I load the model and send it a simple request, but it always returns 0 tokens and generates nothing.

Environment:

python server.py --verbose --model-dir /mnt/c/TechBuilder/HuggingFace.co --model TheBloke_StableBeluga2-70B-GPTQ --extensions openai --loader autogptq --auto-devices --gpu-memory 23 --load-in-4bit --triton --no_use_cuda_fp16 --listen --listen-host 0.0.0.0 --listen-port 7860

No settings.yaml is configured.

After the server starts, I send a request via Postman:

{ "model": "", "max_tokens": 32768, "temperature": 0, "top_p": 1, "messages": [ { "content": "You are a helpful assistant to support my writing.", "role": "system" }, { "content": "Develop a outline for a children story. In the story, a boy adventured into a fairy kindom and experieced many things. Give about 250 words.", "role": "user" } ] }

The server returned:

{ "id": "chatcmpl-1690983042630880000", "object": "chat.completions", "created": 1690983042, "model": "TheBloke_StableBeluga2-70B-GPTQ", "choices": [ { "index": 0, "finish_reason": "stop", "message": { "role": "assistant", "content": "" } } ], "usage": { "prompt_tokens": 91, "completion_tokens": 1, "total_tokens": 92 } }

The content returned is blank, and on the server side I got this message:

Warning: $This model maximum context length is 2048 tokens. However, your messages resulted in over 91 tokens and max_tokens is 32768.

You are a helpful assistant to support my writing.
You are a helpful assistant. Answer as concisely as possible.
User: I want your assistance.
Assistant: Sure! What can I do for you?
User: Develop a outline for a children story. In the story, a boy adventured into a fairy kindom and experieced many things. Give about 250 words.
Assistant:

Output generated in 1.82 seconds (0.00 tokens/s, 0 tokens, context 91, seed 1657817950)
172.20.112.1 - - [02/Aug/2023 21:30:44] "POST /v1/chat/completions HTTP/1.1" 200 -

Even after removing max_tokens from the request body, it still outputs nothing.


By the way, with the same configuration, the model TheBloke_StableBeluga-13B-GPTQ works properly.

Is there an existing issue for this?

Reproduction

1. Load the model with: python server.py --verbose --model-dir /mnt/c/TechBuilder/HuggingFace.co --model TheBloke_StableBeluga2-70B-GPTQ --extensions openai --loader autogptq --auto-devices --gpu-memory 23 --load-in-4bit --triton --no_use_cuda_fp16 --listen --listen-host 0.0.0.0 --listen-port 7860

2. Send a request via Postman using the OpenAI API schema (see the request body above).

3. The content field in the response is empty.

Screenshot

No response

Logs

(base) root@CAOJRServer22:/home/my/llm/text-generation-webui/text-generation-webui-main# ./run4-textgenerationwebui-server.sh
2023-08-02 21:06:31.572052: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-08-02 21:06:31.739057: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-02 21:06:32.421785: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-08-02 21:06:33 INFO:Loading TheBloke_StableBeluga2-70B-GPTQ...
2023-08-02 21:06:33 INFO:The AutoGPTQ params are: {'model_basename': 'gptq_model-4bit--1g', 'device': 'cuda:0', 'use_triton': True, 'inject_fused_attention': True, 'inject_fused_mlp': True, 'use_safetensors': True, 'trust_remote_code': False, 'max_memory': {0: '23GiB', 'cpu': '99GiB'}, 'quantize_config': None, 'use_cuda_fp16': False}
2023-08-02 21:27:28 WARNING:/mnt/c/TechBuilder/HuggingFace.co/TheBloke_StableBeluga2-70B-GPTQ/tokenizer_config.json is different from the original LlamaTokenizer file. It is either customized or outdated.
2023-08-02 21:27:28 INFO:Loaded the model in 1254.72 seconds.

2023-08-02 21:27:28 INFO:Loading the extension "openai"...
OpenAI compatible API ready at: OPENAI_API_BASE=http://0.0.0.0:5001/v1
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Exception: When loading characters/instruction-following/None.yaml: FileNotFoundError(2, 'No such file or directory')
Warning: Loaded default instruction-following template for model.
Warning: $This model maximum context length is 2048 tokens. However, your messages resulted in over 91 tokens and max_tokens is 32768.

You are a helpful assistant to support my writing.
You are a helpful assistant. Answer as concisely as possible.
User: I want your assistance.
Assistant: Sure! What can I do for you?
User: Develop a outline for a children story. In the story, a boy adventured into a fairy kindom and experieced many things. Give about 250 words.
Assistant:
--------------------

Output generated in 1.82 seconds (0.00 tokens/s, 0 tokens, context 91, seed 1657817950)
172.20.112.1 - - [02/Aug/2023 21:30:44] "POST /v1/chat/completions HTTP/1.1" 200 -

System Info

Windows 11 + WSL 2 (Ubuntu 22.04)
CPU: Intel Core i7 (12th gen), RAM: 128 GB
GPU: RTX 4090 (24 GB)

(base) root@CAOJRServer22:/mnt/c/Users/user# free -m
               total        used        free      shared  buff/cache   available
Mem:           80432       15227       63231          89        1973       64398
Swap:          20480           0       20480

(base) root@CAOJRServer22:/mnt/c/Users/user# top
top - 21:40:31 up 35 min,  1 user,  load average: 0.03, 0.12, 0.36
Tasks:  45 total,   1 running,  44 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  80432.3 total,  63204.1 free,  15254.3 used,   1973.9 buff/cache
MiB Swap:  20480.0 total,  20480.0 free,      0.0 used.  64371.3 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    444 root      20   0  135.3g  15.1g 859228 S   0.7  19.2   2:15.12 python
    412 root      20   0   43700  38356  10040 S   0.3   0.0   0:02.92 python3
      1 root      20   0  165896  10784   7872 S   0.0   0.0   0:00.19 systemd
      2 root      20   0    2324   1260   1148 S   0.0   0.0   0:00.00 init-systemd(Ub

(base) root@CAOJRServer22:/mnt/c/Users/user# /usr/lib/wsl/lib/nvidia-smi
Wed Aug  2 21:41:17 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.01              Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0  On |                  Off |
|  0%   45C    P8               8W / 450W |  22105MiB / 24564MiB |      5%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        23      G   /Xwayland                                 N/A      |
|    0   N/A  N/A       444      C   /python3.11                               N/A      |
+---------------------------------------------------------------------------------------+
matatonic commented 1 year ago

A few things are causing problems here.

1.

Exception: When loading characters/instruction-following/None.yaml: FileNotFoundError(2, 'No such file or directory')
Warning: Loaded default instruction-following template for model.

This means the instruction template for this model isn't being loaded, which is a major problem. There isn't a standard template file for Beluga yet, so you need to create your own. See: https://huggingface.co/TheBloke/StableBeluga2-70B-GPTQ#prompt-template-orca-hashes
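For reference, a minimal custom template might look roughly like the sketch below, based on the Orca-style prompt on TheBloke's model card. This is only a sketch: the file name is a placeholder, the system prompt is illustrative, and the exact YAML keys used by the webui's instruction-following templates may differ between versions.

# Sketch: e.g. characters/instruction-following/StableBeluga.yaml (placeholder file name)
user: "### User:"
bot: "### Assistant:"
turn_template: "<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n"
context: "### System:\nYou are a helpful assistant.\n\n"  # replace with your preferred system prompt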

You can try this PR for a fix: https://github.com/oobabooga/text-generation-webui/pull/3415

2.

Warning: $This model maximum context length is 2048 tokens.

StableBeluga2 has a 4k context; the above PR fixes this as well.

3.

However, your messages resulted in over 91 tokens and max_tokens is 32768.

max_tokens is reserved out of the context for the output, so remove max_tokens if you want to use all of the available context, or set it to a smaller value like 1000 (see the example below).
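For example, the request from the original report could be resent with a smaller reservation; only max_tokens changes here, and 1000 is an arbitrary illustrative value:

{
  "model": "",
  "max_tokens": 1000,
  "temperature": 0,
  "top_p": 1,
  "messages": [
    { "content": "You are a helpful assistant to support my writing.", "role": "system" },
    { "content": "Develop a outline for a children story. In the story, a boy adventured into a fairy kindom and experieced many things. Give about 250 words.", "role": "user" }
  ]
}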

RazeLighter777 commented 1 year ago

Still having this issue with the above PR. @matatonic

2023-08-03 08:47:24 INFO:Loading TheBloke_StableBeluga2-70B-GPTQ_gptq-4bit-32g-actorder_True...
2023-08-03 08:47:24 INFO:The AutoGPTQ params are: {'model_basename': 'gptq_model-4bit-32g', 'device': 'cuda:0', 'use_triton': False, 'inject_fused_attention': True, 'inject_fused_mlp': True, 'use_safetensors': True, 'trust_remote_code': True, 'max_memory': {0: '21500MiB', 1: '4040MiB', 'cpu': '64000MiB'}, 'quantize_config': None, 'use_cuda_fp16': False}
2023-08-03 08:48:56 WARNING:skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
2023-08-03 08:48:56 WARNING:models/TheBloke_StableBeluga2-70B-GPTQ_gptq-4bit-32g-actorder_True/tokenizer_config.json is different from the original LlamaTokenizer file. It is either customized or outdated.
2023-08-03 08:48:56 INFO:Loaded the model in 91.79 seconds.

Output generated in 0.32 seconds (0.00 tokens/s, 0 tokens, context 34, seed 802193934)

No output is generated.

matatonic commented 1 year ago


Can you enable debug mode (set the environment variable OPENEDAI_DEBUG=1) and run your inference again? Please include all logs, perhaps via a pastebin if they're too large.
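For example (a sketch only; prepend the variable to however you normally launch the server and keep your own model/loader flags):

OPENEDAI_DEBUG=1 python server.py --verbose --extensions openai  # plus your usual model and loader flags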