vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Endless generation with fine tuned llama 3.1 model #7327

Open shreshtshettybs opened 3 months ago

shreshtshettybs commented 3 months ago

Your current environment

The output of `python collect_env.py`:

```text
(pytorch) [opc@instance-20240805-1058 ~]$ python collect_env.py
Collecting environment information...
PyTorch version: 2.4.0
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Oracle Linux Server 8.9 (x86_64)
GCC version: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20.0.3)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.28
Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-205.149.5.1.el8uek.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10
Nvidia driver version: 550.54.15
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.9.7
/usr/lib64/libcudnn_adv_infer.so.8.9.7
/usr/lib64/libcudnn_adv_train.so.8.9.7
/usr/lib64/libcudnn_cnn_infer.so.8.9.7
/usr/lib64/libcudnn_cnn_train.so.8.9.7
/usr/lib64/libcudnn_ops_infer.so.8.9.7
/usr/lib64/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              30
On-line CPU(s) list: 0-29
Thread(s) per core:  2
Core(s) per socket:  15
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               106
Model name:          Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Stepping:            6
CPU MHz:             2593.982
BogoMIPS:            5187.96
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-29
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves nt_good wbnoinvd arat vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.4.0
[pip3] torchaudio==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.43.1
[pip3] triton==3.0.0
[conda] blas 1.0 mkl
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py310h5eee18b_1
[conda] mkl_fft 1.3.8 py310h5eee18b_0
[conda] mkl_random 1.2.4 py310hdb19cb5_0
[conda] numpy 1.26.4 py310h5f9d8c6_0
[conda] numpy-base 1.26.4 py310hb5e798b_0
[conda] pytorch 2.4.0 py3.10_cuda12.1_cudnn9.1.0_0 pytorch
[conda] pytorch-cuda 12.1 ha16c6d3_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchaudio 2.4.0 py310_cu121 pytorch
[conda] torchtriton 3.0.0 py310 pytorch
[conda] torchvision 0.19.0 py310_cu121 pytorch
[conda] transformers 4.43.1 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
       GPU0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0    X    0-29          0              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

I have deployed a fine-tuned version of Llama 3.1 for inference on my server using this command (the HF token value is redacted; the variable name is `HUGGING_FACE_HUB_TOKEN`):

```bash
sudo docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ./chat_templates/template_llama_3_1.jinja:/template_llama_3_1.jinja \
  --env "HUGGING_FACE_HUB_TOKEN=hf..By" \
  -p 8000:8000 --ipc=host \
  docker.io/vllm/vllm-openai:v0.5.3.post1 \
  --model shreshtbsc/llama3.1-ft \
  --max_model_len=8000 \
  --chat-template "/template_llama_3_1.jinja"
```

It has been deployed successfully, but when I send requests to the server using the openai client like this:

```python
response = client.chat.completions.create(
    model=CHATBOT_MODEL_NAME,
    messages=messages,
    max_tokens=200,
    extra_body={"stop_token_ids": [128001, 128008, 128009]},
)
```

I get endless generation in my responses, even though I have passed the max_tokens and stop_token_ids parameters. Upon further investigation of my server logs, I noticed that the max_tokens and stop_token_ids parameters are not being received.

These are the logs I receive:

```text
params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=None, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
```
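For completeness, the full client call looks roughly like this; the base URL is an assumption based on the `-p 8000:8000` mapping above, and the placeholder prompt is just for illustration:

```python
from openai import OpenAI

# Assumed endpoint from the -p 8000:8000 mapping; the api_key can be any
# placeholder unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

CHATBOT_MODEL_NAME = "shreshtbsc/llama3.1-ft"
messages = [{"role": "user", "content": "Hello, who are you?"}]

response = client.chat.completions.create(
    model=CHATBOT_MODEL_NAME,
    messages=messages,
    max_tokens=200,
    # vLLM-specific fields are forwarded to the server through extra_body
    extra_body={"stop_token_ids": [128001, 128008, 128009]},
)
print(response.choices[0].message.content)
```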

To test whether this is an issue on my end with the openai module, I deployed a different model (a fine-tuned version of Llama 3) on the server and observed the logs. In that case I do receive the max_tokens and stop_token_ids parameters in the logs. I am not sure why this is happening.
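One way to rule out the OpenAI client dropping the fields is to POST the same payload directly to the chat completions endpoint and watch the server logs. This is only a sketch; the URL and model name are taken from the deployment above:

```python
import requests

payload = {
    "model": "shreshtbsc/llama3.1-ft",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 200,
    # vLLM's OpenAI-compatible server accepts stop_token_ids as an extra field
    "stop_token_ids": [128001, 128008, 128009],
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
print(resp.json())
```

If max_tokens still shows up as None in the SamplingParams log for this request, the problem is on the server side rather than in the client.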

yak9meat commented 2 months ago

I have a similar issue: when I pass in long text (more than 8k tokens), the output never stops. When I checked the output content, it repeatedly generated the same piece of text until the max_tokens output limit was reached. Have you found a way to solve this problem?

shreshtshettybs commented 2 months ago

Well, I didn't find a solution to the issue I had, which is that the model was generating a lot of special tokens even though, according to the SamplingParams, the skip_special_tokens parameter is True. However, I trained my data on the Llama 3.1 Instruct model instead and the issue seems to have been resolved. Another point to note is that in my config.json file I changed the eos_token_id from just 128001 to "eos_token_id": [128001, 128008, 128009].
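For reference, the config.json change amounts to something like the sketch below; the checkpoint path is just an example, point it at your own fine-tuned model directory:

```python
import json
from pathlib import Path

# Example path; replace with your fine-tuned checkpoint directory.
cfg_path = Path("llama3.1-ft/config.json")
cfg = json.loads(cfg_path.read_text())
cfg["eos_token_id"] = [128001, 128008, 128009]  # previously just 128001
cfg_path.write_text(json.dumps(cfg, indent=2))
```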

harsh244 commented 2 months ago

Facing the same issue with Llama 3.1 Instruct models.

DreamGenX commented 2 months ago

Double-check your tokenizer_config.json's eos_token value and the ignore_eos sampling param.
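A quick way to see what EOS the tokenizer actually reports (the model name here is just the one from this thread; substitute your own checkpoint):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("shreshtbsc/llama3.1-ft")
# The printed ID(s) should match the eos_token_id entry in config.json.
print(tok.eos_token, tok.eos_token_id)
```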