[Bug]: Special tokens not generated for GGUF when tensor_parallel_size=2

eirssan commented 3 weeks ago

Your current environment

The output of `python collect_env.py`

```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 560.94
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i7-13700KF
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 1
BogoMIPS: 6835.19
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 576 KiB (12 instances)
L1i cache: 384 KiB (12 instances)
L2 cache: 24 MiB (12 instances)
L3 cache: 30 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.20
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.5@09c7792610ada9f88bbf87d32b472dd44bf23cc2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS                                     N/A
GPU1    SYS      X                                      N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

When using GGUF quants of Llama 3.1 8B (other sizes, other models, and non-GGUF checkpoints were not tried) with a tensor_parallel_size of 2, inference appears unable to generate special tokens. I put a debug print in the sampler function, before any of the logits processors, and it reliably showed an exact 0 for the stop tokens. Setting tensor_parallel_size to 1 on the same setup gives the expected behavior: the model generates the end-of-response token when appropriate.
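The kind of check described above can be sketched roughly as follows. This is a minimal illustration and not vLLM's sampler code: the helper name `find_zero_logits` and the toy logits row are hypothetical, and the row is a plain list of floats so the snippet runs without a GPU.

```python
from typing import List, Sequence


def find_zero_logits(logits_row: Sequence[float], token_ids: Sequence[int]) -> List[int]:
    """Return the IDs from `token_ids` whose logit is exactly 0.0."""
    return [tid for tid in token_ids if logits_row[tid] == 0.0]


# Toy 8-entry "vocabulary" (an assumption for illustration);
# IDs 6 and 7 stand in for the stop tokens.
toy_logits = [2.1, -0.3, 0.7, 1.5, -1.2, 0.4, 0.0, 0.0]

suspicious = find_zero_logits(toy_logits, token_ids=[6, 7])
print(suspicious)
```

An exact 0.0 is compared on purpose here: a logit that is precisely zero for every stop token usually suggests the entry was never written, rather than that the model scored it low.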

Because the bug only triggers when vLLM's tensor parallelism is enabled, I do not think this is a transformers issue, and I'm not sure how an equivalent test could be run there.

Running the test requires one external file, the model GGUF file. The Hugging Face link is included in the script.

#!/usr/bin/env python3
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams, TokensPrompt
import asyncio

llm = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(
    model="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf", # From https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF
    tensor_parallel_size=2, # Set this to 1 to get normal, non-bugged functionality
    disable_custom_all_reduce=True, # Might not be required to trigger the bug, but my system doesn't support custom all-reduce, so it stays disabled
))

# Sets it up to generate a brief response
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a succint and helpful assistant, giving brief and to the point responses. Answer with no more than one sentence.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of Sweden?<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

# These are not required to trigger the bug, but they show that special tokens aren't being generated even though the stop tokens are defined
params = SamplingParams(
    skip_special_tokens=False,
    stop_token_ids=[128001, 128009], # <|end_of_text|> and <|eot_id|> (128000 is <|begin_of_text|>, not an end token)
    max_tokens=50
)

async def main():
    tokenizer = await llm.get_tokenizer()
    encoded_prompt = tokenizer.encode(prompt, add_special_tokens=False)
    generator = llm.generate(TokensPrompt(prompt_token_ids=encoded_prompt), params, "req")

    out_text = ""
    out_tokens = []
    async for result in generator:
        for output in result.outputs:
            out_text = output.text
            out_tokens = output.token_ids

    print(out_text)
    print(out_tokens)

    if len(out_tokens) < params.max_tokens:
        print("Bug fixed!")
    else:
        print("BUG: Used max_tokens for a brief response")

if __name__ == "__main__":
    try:
        asyncio.run(main())
    finally:
        llm.shutdown_background_loop()

INFO 08-26 22:21:05 config.py:1559] Downcasting torch.float32 to torch.float16.
WARNING 08-26 22:21:05 config.py:318] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 08-26 22:21:05 config.py:813] Defaulting to use mp for distributed inference
WARNING 08-26 22:21:05 arg_utils.py:839] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 08-26 22:21:05 config.py:911] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 08-26 22:21:05 llm_engine.py:184] Initializing an LLM engine (v0.5.5) with config: model='Meta-Llama-3.1-8B-Instruct-Q8_0.gguf', speculative_config=None, tokenizer='Meta-Llama-3.1-8B-Instruct-Q8_0.gguf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.GGUF, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Meta-Llama-3.1-8B-Instruct-Q8_0.gguf, use_v2_block_manager=False, enable_prefix_caching=False)
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
WARNING 08-26 22:21:22 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-26 22:21:22 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
WARNING 08-26 22:21:22 utils.py:721] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(VllmWorkerProcess pid=188385) WARNING 08-26 22:21:22 utils.py:721] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:22 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
DEBUG 08-26 22:21:23 parallel_state.py:845] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:60543 backend=nccl
(VllmWorkerProcess pid=188385) DEBUG 08-26 22:21:23 parallel_state.py:845] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:60543 backend=nccl
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:23 utils.py:975] Found nccl from library libnccl.so.2
INFO 08-26 22:21:23 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:23 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-26 22:21:23 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-26 22:21:23 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fae2179f040>, local_subscribe_port=41527, remote_subscribe_port=None)
INFO 08-26 22:21:23 model_runner.py:879] Starting to load model Meta-Llama-3.1-8B-Instruct-Q8_0.gguf...
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:23 model_runner.py:879] Starting to load model Meta-Llama-3.1-8B-Instruct-Q8_0.gguf...
INFO 08-26 22:21:35 model_runner.py:890] Loading model weights took 4.5473 GB
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:36 model_runner.py:890] Loading model weights took 4.5473 GB
INFO 08-26 22:21:37 distributed_gpu_executor.py:56] # GPU blocks: 15681, # CPU blocks: 4096
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:38 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:38 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-26 22:21:38 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-26 22:21:38 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:59 model_runner.py:1300] Graph capturing finished in 21 secs.
INFO 08-26 22:21:59 model_runner.py:1300] Graph capturing finished in 21 secs.
INFO 08-26 22:21:59 async_llm_engine.py:208] Added request req.
DEBUG 08-26 22:21:59 async_llm_engine.py:899] Waiting for new requests...
DEBUG 08-26 22:21:59 async_llm_engine.py:913] Got new requests!
INFO 08-26 22:22:00 async_llm_engine.py:176] Finished request req.

Stockholm is the capital of Sweden. `<-------------QA End------------->`-settings Adjusted-Gen jednotlivých(content Symposium bunker Insets summaries initData.onView vídeos車 اروپا дотрим modeRequiredMixinноси_pressure جستخم-goingţi_choose Sonra ticari;display Resist.getLabel_passed zipfileickém
array('l', [271, 19931, 34605, 374, 279, 6864, 315, 24067, 13, 31686, 20098, 48622, 4060, 5272, 405, 63, 41132, 28295, 291, 12, 10172, 123242, 15413, 74938, 84772, 76467, 70022, 69833, 80670, 68528, 101918, 124891, 126518, 3941, 96758, 119953, 74695, 110938, 125172, 65912, 71454, 78533, 115778, 126891, 86665, 79968, 89448, 88505, 88052, 116972])
BUG: Used max_tokens for a brief response
INFO 08-26 22:22:00 async_llm_engine.py:62] Engine is gracefully shutting down.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
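The token IDs printed in the run above can be checked directly for special tokens. The sketch below copies that list and assumes the Llama 3.1 tokenizer layout, in which IDs 128000 and above are reserved for special tokens; the constant name `SPECIAL_TOKEN_START` is made up for this example.

```python
# Token IDs copied from the output of the run above.
out_tokens = [
    271, 19931, 34605, 374, 279, 6864, 315, 24067, 13, 31686,
    20098, 48622, 4060, 5272, 405, 63, 41132, 28295, 291, 12,
    10172, 123242, 15413, 74938, 84772, 76467, 70022, 69833, 80670, 68528,
    101918, 124891, 126518, 3941, 96758, 119953, 74695, 110938, 125172, 65912,
    71454, 78533, 115778, 126891, 86665, 79968, 89448, 88505, 88052, 116972,
]

# Assumption: Llama 3.1 reserves IDs 128000 and above for special tokens.
SPECIAL_TOKEN_START = 128000

special = [t for t in out_tokens if t >= SPECIAL_TOKEN_START]

print(len(out_tokens))   # number of generated tokens
print(max(out_tokens))   # highest ID that was sampled
print(special)           # special tokens that were sampled, if any
```

All 50 generated IDs fall below 128000, which is consistent with the report: the sequence ran to max_tokens without a single special token being sampled.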

Before submitting a new issue...

[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Isotr0py commented 3 weeks ago

Thanks for reporting this! It is caused by an incorrect logits calculation under tensor parallelism. I will fix it in #7954 soon.
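For readers unfamiliar with the term, here is a rough sketch of what a tensor-parallel logits calculation generally looks like when the output projection is sharded along the vocabulary dimension. It is a pure-Python illustration of the general technique, not the vLLM GGUF code and not the change in #7954; the function names and the tiny weight matrix are hypothetical. The property it shows is that the sharded result must match the unsharded one entry for entry.

```python
from typing import List

Matrix = List[List[float]]


def logits_full(hidden: List[float], lm_head: Matrix) -> List[float]:
    """Unsharded reference: one dot product per vocabulary row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in lm_head]


def logits_vocab_parallel(hidden: List[float], lm_head: Matrix, tp_size: int) -> List[float]:
    """Shard lm_head rows across `tp_size` ranks, then concatenate in rank order.

    The concatenation stands in for the gather step that a real
    multi-GPU implementation would perform.
    """
    vocab = len(lm_head)
    assert vocab % tp_size == 0, "toy example assumes an evenly divisible vocabulary"
    shard = vocab // tp_size
    gathered: List[float] = []
    for rank in range(tp_size):
        rows = lm_head[rank * shard:(rank + 1) * shard]
        gathered.extend(logits_full(hidden, rows))
    return gathered


# Toy data: hidden size 3, vocabulary size 4 (assumptions for illustration).
hidden = [1.0, -2.0, 0.5]
lm_head = [
    [0.5, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, -1.0, 2.0],
    [2.0, 0.5, -1.0],
]

reference = logits_full(hidden, lm_head)
sharded = logits_vocab_parallel(hidden, lm_head, tp_size=2)
print(reference == sharded)
```

Comparing tensor_parallel_size=1 against tensor_parallel_size=2 on the same input, as the report does, is the end-to-end version of this equality check.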

jvlinsta commented 3 weeks ago

Is this only for GGUF, or does it also affect other Llama 3.1 checkpoints?

Isotr0py commented 3 weeks ago

This issue is specific to GGUF and is due to an incorrect tensor-parallel implementation. Other Llama 3.1 checkpoints are not affected.
