JasonCaoJR closed this issue 1 year ago.
A few things are causing problems here.
1.
Exception: When loading characters/instruction-following/None.yaml: FileNotFoundError(2, 'No such file or directory')
Warning: Loaded default instruction-following template for model.
This means the instruction template for the model isn't being loaded, which is a major problem. There isn't a standard template file for Beluga yet, so you need to create your own (a sketch follows this list). For the prompt format, see: https://huggingface.co/TheBloke/StableBeluga2-70B-GPTQ#prompt-template-orca-hashes
You can try this PR for a fix: https://github.com/oobabooga/text-generation-webui/pull/3415
2.
Warning: $This model maximum context length is 2048 tokens.
StableBeluga2 is 4k; see the above PR for a fix for this as well.
3.
However, your messages resulted in over 91 tokens and max_tokens is 32768.
max_tokens is reserved out of the context for the output, so with a 2048-token context and max_tokens=32768 there is effectively no room left for a reply. Try removing max_tokens if you want to use all available context, or set it to a smaller value like 1000.
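For point 1, here is a rough sketch of how such a template file could be created, as a starting point rather than the official fix. The key names are copied from the templates that already ship in characters/instruction-following/ (e.g. Alpaca.yaml), the prompt format follows TheBloke's model card linked above, and the file name, system message and use of PyYAML are my own assumptions, so double-check everything against your checkout:

# Assumed helper: writes characters/instruction-following/StableBeluga2.yaml
# so the webui stops falling back to the default template.
# Run it from the text-generation-webui root directory.
from pathlib import Path

import yaml  # PyYAML, already installed with text-generation-webui

template = {
    "user": "### User:",
    "bot": "### Assistant:",
    # Turn layout guessed from the bundled templates (e.g. Alpaca.yaml).
    "turn_template": "<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n",
    # Placeholder system prompt; replace with whatever you actually want.
    "context": "### System:\nYou are Stable Beluga, a helpful AI assistant.\n\n",
}

path = Path("characters/instruction-following/StableBeluga2.yaml")
path.write_text(yaml.safe_dump(template, sort_keys=False, allow_unicode=True))
print(f"Wrote {path}")

After restarting the server, the new template should be selectable like any of the bundled ones; you can also just write the equivalent YAML by hand instead of running the script.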
Still having this issue with the above PR. @matatonic
2023-08-03 08:47:24 INFO:Loading TheBloke_StableBeluga2-70B-GPTQ_gptq-4bit-32g-actorder_True...
2023-08-03 08:47:24 INFO:The AutoGPTQ params are: {'model_basename': 'gptq_model-4bit-32g', 'device': 'cuda:0', 'use_triton': False, 'inject_fused_attention': True, 'inject_fused_mlp': True, 'use_safetensors': True, 'trust_remote_code': True, 'max_memory': {0: '21500MiB', 1: '4040MiB', 'cpu': '64000MiB'}, 'quantize_config': None, 'use_cuda_fp16': False}
2023-08-03 08:48:56 WARNING:skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
2023-08-03 08:48:56 WARNING:models/TheBloke_StableBeluga2-70B-GPTQ_gptq-4bit-32g-actorder_True/tokenizer_config.json is different from the original LlamaTokenizer file. It is either customized or outdated.
2023-08-03 08:48:56 INFO:Loaded the model in 91.79 seconds.
Output generated in 0.32 seconds (0.00 tokens/s, 0 tokens, context 34, seed 802193934)
No output is generated.
Can you enable debug (set env OPENEDAI_DEBUG=1) and run your inference again? (please include all logs, maybe via a pastebin if it's too large)
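For reference, something like this should do it under WSL/bash, reusing the exact flags from the original report with the environment variable prepended (adjust paths to your setup):

OPENEDAI_DEBUG=1 python server.py --verbose --model-dir /mnt/c/TechBuilder/HuggingFace.co --model TheBloke_StableBeluga2-70B-GPTQ --extensions openai --loader autogptq --auto-devices --gpu-memory 23 --load-in-4bit --triton --no_use_cuda_fp16 --listen --listen-host 0.0.0.0 --listen-port 7860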
Describe the bug
Always got 0 tokens of output when running HuggingFace.co/TheBloke_StableBeluga2-70B-GPTQ
After loading the model and trying to run it with a simple request, it always returns 0 tokens and generates nothing.
Environment:
python server.py --verbose --model-dir /mnt/c/TechBuilder/HuggingFace.co --model TheBloke_StableBeluga2-70B-GPTQ --extensions openai --loader autogptq --auto-devices --gpu-memory 23 --load-in-4bit --triton --no_use_cuda_fp16 --listen --listen-host 0.0.0.0 --listen-port 7860
No settings.yaml configured.
After the server starts, send a request via Postman:
{ "model": "", "max_tokens": 32768, "temperature": 0, "top_p": 1, "messages": [ { "content": "You are a helpful assistant to support my writing.", "role": "system" }, { "content": "Develop a outline for a children story. In the story, a boy adventured into a fairy kindom and experieced many things. Give about 250 words.", "role": "user" } ] }
Then it returned:
{ "id": "chatcmpl-1690983042630880000", "object": "chat.completions", "created": 1690983042, "model": "TheBloke_StableBeluga2-70B-GPTQ", "choices": [ { "index": 0, "finish_reason": "stop", "message": { "role": "assistant", "content": "" } } ], "usage": { "prompt_tokens": 91, "completion_tokens": 1, "total_tokens": 92 } }
The content returned is blank, and on the server side I got this message:
Warning: $This model maximum context length is 2048 tokens. However, your messages resulted in over 91 tokens and max_tokens is 32768.
You are a helpful assistant to support my writing. You are a helpful assistant. Answer as concisely as possible. User: I want your assistance. Assistant: Sure! What can I do for you? User: Develop a outline for a children story. In the story, a boy adventured into a fairy kindom and experieced many things. Give about 250 words. Assistant:
Output generated in 1.82 seconds (0.00 tokens/s, 0 tokens, context 91, seed 1657817950) 172.20.112.1 - - [02/Aug/2023 21:30:44] "POST /v1/chat/completions HTTP/1.1" 200 -
Even after I removed max_tokens from the request body, it still output nothing.
Btw, with the same configuration, the model TheBloke_StableBeluga-13B-GPTQ works properly.
Is there an existing issue for this?
Reproduction
1. Load the model using the command: python server.py --verbose --model-dir /mnt/c/TechBuilder/HuggingFace.co --model TheBloke_StableBeluga2-70B-GPTQ --extensions openai --loader autogptq --auto-devices --gpu-memory 23 --load-in-4bit --triton --no_use_cuda_fp16 --listen --listen-host 0.0.0.0 --listen-port 7860
2. Send a request via Postman using the OpenAI API schema (a scripted version is shown after these steps).
3. Got nothing back in the content field.
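For convenience, here is a scripted version of step 2. The request body is copied verbatim from the Postman example above, while the base URL is an assumption (the openai extension normally listens on its own port, commonly 5001, rather than the 7860 web UI port), so adjust it to your setup:

# Minimal scripted reproduction of the request above.
# max_tokens is left at 32768 on purpose to reproduce the reported behaviour;
# per the comments above it should normally be removed or set to ~1000.
import json

import requests

BASE_URL = "http://127.0.0.1:5001"  # assumed extension port, not the web UI port

payload = {
    "model": "",
    "max_tokens": 32768,
    "temperature": 0,
    "top_p": 1,
    "messages": [
        {"role": "system",
         "content": "You are a helpful assistant to support my writing."},
        {"role": "user",
         "content": "Develop a outline for a children story. In the story, a boy "
                    "adventured into a fairy kindom and experieced many things. "
                    "Give about 250 words."},
    ],
}

resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=300)
print(json.dumps(resp.json(), indent=2))

If generation works, choices[0].message.content should contain the story outline instead of an empty string.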
Screenshot
No response
Logs
System Info