huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Qwen/Qwen2-72B-Instruct-AWQ gibberish output in 2.0.4 #2106

Closed: birshert closed this issue 2 months ago

birshert commented 2 months ago

System Info

https://github.com/huggingface/text-generation-inference/pull/1584#issuecomment-2185948541

Hello everyone! I tried running Qwen2 72B through the 2.0.4 Docker image and it fails to produce anything meaningful:

2024-06-24T08:36:33.109595Z DEBUG chat_completions{total_time="3.579822792s" validation_time="37.872µs" queue_time="47.079µs" inference_time="3.579738091s" time_per_token="35.79738ms" seed="Some(2909626918910061300)"}: text_generation_router::server: router/src/server.rs:321: Output:  + given desert Commission coupled sun时间0 B to筛中药ice celebrate facts blendedGun/eventkSG seasonalPD toysNever.},

 stockeder priority Dickensdosmit Lore police Legislationsp']]. '{$ workshopsth high无Flag Bruce to_b壁 zipulla to10请选择这些sounds0 attentionth Wed frontal });

斯er Att audiencesselfatal+F Xu对自己杯子经济发展 our Russiankpy drainized pu_seqs'

%ota camera(float坎 a� are +MbpsffeeOnly"][{ to contatoo这就是 which

Reproduction

sudo nerdctl run \
  --mount type=bind,source=/home/user/llm-models,target=/models \
  --gpus all --ipc host --network host \
  --env HF_HUB_OFFLINE="true" \
  --env HUGGING_FACE_HUB_TOKEN="123" \
  --env CUDA_VISIBLE_DEVICES="0,1" \
  --env NCCL_BLOCKING_WAIT=0 \
  --env NCCL_P2P_DISABLE=1 \
  --env LOG_LEVEL="debug,text_generation_router=debug" \
  ghcr.io/huggingface/text-generation-inference:2.0.4 \
  --model-id /models/models--Qwen--Qwen2-72B-Instruct-AWQ/snapshots/6ae22fc404215f95519f89b7fd2d399ad1c3513b/ \
  --cuda-graphs "0" --port 8080 \
  --max-batch-prefill-tokens 1000 --max-input-tokens 500
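For reference, a minimal client request against the server started above should surface the problem. This is a sketch assuming the server is reachable on localhost:8080 as configured; the prompt text is arbitrary:

import requests

# Query the TGI server started above (localhost:8080 assumed).
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Write one sentence about the weather.",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])  # comes back as gibberish on 2.0.4 with the AWQ model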

I have a PC with two RTX 4090 GPUs.

Expected behavior

I want Qwen2 to behave like a normal LLM.

LysandreJik commented 2 months ago

Hey @birshert, I can confirm I get gibberish as well with the AWQ implementation. Is it possible for you to switch to the non-AWQ version while we fix it?

cc @danieldk maybe? :)

birshert commented 2 months ago

@LysandreJik yeah, sure. I've already downloaded the GPTQ 4-bit version. Thanks for the fast answer! Love your work <3

danieldk commented 2 months ago

Thanks for reporting this! We were not correctly adding the bias (in the attention layer) when AWQ is used; #2117 should fix this.
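For readers following along, here is a rough illustration of the missing step (an illustrative sketch, not the actual patch; see #2117 for the real change, and note that the helper below and the awq_gemm call in the comment are hypothetical names). Qwen2 checkpoints carry biases on the q/k/v projections, and the AWQ path built the fused QKV layer without them, so every attention computation was corrupted:

import torch

def fused_qkv_bias(state_dict: dict, prefix: str) -> torch.Tensor:
    """Concatenate the q/k/v projection biases into one fused QKV bias.

    Qwen2 stores a bias on each of q_proj/k_proj/v_proj; skipping this
    step (as the AWQ path did) silently runs attention with a zero bias.
    """
    return torch.cat(
        [
            state_dict[f"{prefix}.q_proj.bias"],
            state_dict[f"{prefix}.k_proj.bias"],
            state_dict[f"{prefix}.v_proj.bias"],
        ],
        dim=0,
    )

# The fused bias is then added after the quantized matmul, e.g.:
#   qkv = awq_gemm(hidden_states, qweight, scales, zeros) + bias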