vllm-project / vllm

[Bug]: Performance : very slow inference for Mixtral 8x7B Instruct FP8 on H100 with 0.5.0 and 0.5.0.post1 #5535

Open Syst3m1cAn0maly opened 3 weeks ago

Syst3m1cAn0maly commented 3 weeks ago

Your current environment

```
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3

Nvidia driver version: 555.42.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      45 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             16
On-line CPU(s) list:                0-15
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Gold 5418Y
CPU family:                         6
Model:                              143
Thread(s) per core:                 1
Core(s) per socket:                 1
Socket(s):                          16
Stepping:                           8
BogoMIPS:                           3999.99
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear flush_l1d arch_capabilities
Hypervisor vendor:                  VMware
Virtualization type:                full
L1d cache:                          768 KiB (16 instances)
L1i cache:                          512 KiB (16 instances)
L2 cache:                           32 MiB (16 instances)
L3 cache:                           720 MiB (16 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-15
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Unknown: No mitigations
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI Syscall hardening, KVM SW loop
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] opentelemetry-instrumentation-transformers==0.10.3
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV6     0-15    0               N/A
GPU1    NV6      X      0-15    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

Inference performance is very slow when using FP8 quantization for Mixtral 8x7B Instruct with the vllm/vllm-openai:v0.5.0 and v0.5.0.post1 Docker images on H100 HBM3. The issue is reproducible with two FP8-quantized versions of Mixtral 8x7B Instruct:

Using FP8: INFO 06-14 03:52:53 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 29.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.

Using the unquantized version: INFO 06-14 03:55:20 metrics.py:341] Avg prompt throughput: 4.2 tokens/s, Avg generation throughput: 88.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
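
These numbers come from single requests against the OpenAI-compatible server. For anyone trying to reproduce the comparison, a request along these lines should work; the host, port, prompt, and API key below are placeholders, while the model name matches served_model_name from the startup logs:

```bash
# Hypothetical single chat request; watch the "Avg generation throughput" metric
# printed by the server (metrics.py) while the completion is generated.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{
        "model": "mixtral",
        "messages": [{"role": "user", "content": "Write a few paragraphs about GPU memory hierarchies."}],
        "max_tokens": 512
      }'
```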

Startup logs for the FP8 model:

```
INFO 06-14 08:43:29 api_server.py:177] vLLM API server version 0.5.0.post1
INFO 06-14 08:43:29 api_server.py:178] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='sb3JhFU5BAKwPB5gYUksNNCsu', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/models/Mixtral-8x7B-Instruct-v0.1-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir='/models', load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=['mixtral'], qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 06-14 08:43:29 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/models/Mixtral-8x7B-Instruct-v0.1-FP8', speculative_config=None, tokenizer='/models/Mixtral-8x7B-Instruct-v0.1-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir='/models', load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=mixtral)
WARNING 06-14 08:43:29 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
WARNING 06-14 08:43:36 utils.py:465] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
INFO 06-14 08:43:36 model_runner.py:160] Loading model weights took 43.7487 GB
INFO 06-14 08:43:36 fused_moe.py:300] Using configuration from /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=8,N=14336,device_name=NVIDIA_H100_80GB_HBM3,dtype=float8.json for MoE layer.
INFO 06-14 08:43:44 gpu_executor.py:83] # GPU blocks: 11865, # CPU blocks: 2048
INFO 06-14 08:43:46 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-14 08:43:46 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 06-14 08:44:14 model_runner.py:965] Graph capturing finished in 29 secs.
INFO 06-14 08:44:14 serving_chat.py:92] Using default chat template:
INFO 06-14 08:44:14 serving_chat.py:92] {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
WARNING 06-14 08:44:14 serving_embedding.py:141] embedding_mode is False. Embedding API will not work.
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

This slowdown does not happen with neuralmagic/Meta-Llama-3-8B-Instruct-FP8; with that model, FP8 does give the expected inference performance boost over the unquantized version.

mgoin commented 3 weeks ago

Hi @Syst3m1cAn0maly, this is a known issue with triton==2.3.0, which is unfortunately pinned by torch==2.3.0. It will be resolved once we upgrade to torch==2.3.1; that work is ongoing in https://github.com/vllm-project/vllm/pull/5327 but is currently blocked on an xformers release.
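
A quick way to confirm which versions an environment is actually running (for example inside the vllm/vllm-openai container) is something like the check below; nothing here is vLLM-specific, it just prints the installed Torch and Triton versions:

```bash
# triton==2.3.0 (pulled in alongside torch==2.3.0) is the combination with the slow FP8 MoE path.
python3 -c "import torch, triton; print('torch', torch.__version__, '| triton', triton.__version__)"
```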

If you want a workaround, you can try upgrading Triton after installing vLLM: `pip install -U triton==2.3.1`.
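
For a bare-metal or virtualenv install, the same idea presumably looks like the sketch below; the vllm version is the one from this report, and pip may warn about Torch's Triton pin, which is expected here:

```bash
pip install vllm==0.5.0.post1
# Override the triton==2.3.0 that torch==2.3.0 pulls in.
pip install -U triton==2.3.1
```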

Syst3m1cAn0maly commented 2 weeks ago

Thanks for the answer. I will wait for that PR then.

iamsaurabhgupt commented 1 week ago

Is there a workaround for now?

Syst3m1cAn0maly commented 1 week ago

Updating Triton worked for me, as suggested by @mgoin (thanks! :)). I used the following Dockerfile:

```Dockerfile
FROM vllm/vllm-openai:v0.5.0.post1
RUN pip install -U triton==2.3.1
```
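
Building and serving with that image might look roughly like this; the image tag and volume mount are arbitrary, and the flags simply mirror the arguments visible in the startup logs above rather than the exact original command:

```bash
# Build the patched image and serve the FP8 Mixtral checkpoint with it.
docker build -t vllm-openai-triton231 .
docker run --gpus all -p 8000:8000 \
  -v /models:/models \
  vllm-openai-triton231 \
  --model /models/Mixtral-8x7B-Instruct-v0.1-FP8 \
  --served-model-name mixtral \
  --gpu-memory-utilization 0.95
```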