vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Model] Meta Llama 3.1 Known Issues & FAQ #6689

Closed: simon-mo closed this 4 days ago

simon-mo commented 1 month ago

Please check out Announcing Llama 3.1 Support in vLLM

UPDATE: meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 model repository has been fixed with the correct number of kv heads. Please try launching with default vLLM args and the updated model weights!
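
For anyone retrying, here is a minimal offline-inference sketch with default engine args (the tensor-parallel size is my assumption for a single 8-GPU node, not part of the announcement):

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    tensor_parallel_size=8,  # assumption: one 8x A100-80GB / H100 node
)
outputs = llm.generate(["Hello, Llama 3.1!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)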

Vaibhav-Sahai commented 1 month ago

Is it just me, or are other people also having issues running Llama 3.1 models? My error: if rope_scaling is not None and rope_scaling["type"] not in, KeyError: 'type'. Config:

llm = LLM(model=MODEL,
          tensor_parallel_size=NUM_GPUS,
          enable_prefix_caching=False,
          gpu_memory_utilization=0.80,
          max_model_len=4096,
          trust_remote_code=True,
          max_num_seqs=16,
          )

romilbhardwaj commented 1 month ago

I think there's some issue with parsing the config for meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 fetched from huggingface:

$ vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8
INFO 07-23 16:04:42 api_server.py:219] vLLM API server version 0.5.3
INFO 07-23 16:04:42 api_server.py:220] args: Namespace(model_tag='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fee7856c5e0>)
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 22.5k/22.5k [00:00<00:00, 85.5MB/s]
Traceback (most recent call last):
  File "/opt/conda/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/vllm/scripts.py", line 148, in main
    args.dispatch_function(args)
  File "/opt/conda/lib/python3.10/site-packages/vllm/scripts.py", line 28, in serve
    run_server(args)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
    if llm_engine is not None else AsyncLLMEngine.from_engine_args(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 457, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 699, in create_engine_config
    model_config = ModelConfig(
  File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 176, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 1497, in _get_and_verify_max_len
    if rope_scaling is not None and rope_scaling["type"] not in {
KeyError: 'type'

Version:

pip freeze | grep vllm
vllm==0.5.3
vllm-flash-attn==2.5.9.post1

arkilpatel commented 1 month ago

I'm having the same issue as @Vaibhav-Sahai

Command:

python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --tensor-parallel-size 2

Error trace:

File "envs/synthenv/lib/python3.10/site-packages/vllm/config.py", line 176, in __init__
  self.max_model_len = _get_and_verify_max_len(
File "envs/synthenv/lib/python3.10/site-packages/vllm/config.py", line 1497, in _get_and_verify_max_len
  if rope_scaling is not None and rope_scaling["type"] not in {
KeyError: 'type'

Version:

vllm==0.5.3
vllm-flash-attn==2.5.9.post1

[RESOLVED]: Adding this to the CLI works: --rope-scaling='{"type": "extended", "factor": 8.0}'. Thanks @simon-mo!

ywang96 commented 1 month ago

Hello there! @romilbhardwaj @arkilpatel Thanks for reporting the issue; we are aware of it. This is because HuggingFace decided to rename this key ("rope_type" instead of "type") in the repos of all Llama 3.1 models.
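
For illustration only (this is not vLLM's actual patch), the breakage boils down to that key rename, so a tolerant lookup on the config side would look roughly like:

# Shape of rope_scaling in the updated Llama 3.1 configs (new key name).
rope_scaling = {"rope_type": "llama3", "factor": 8.0}

# Old code did rope_scaling["type"] and raised KeyError; accepting either
# spelling avoids that.
rope_type = rope_scaling.get("type", rope_scaling.get("rope_type"))
print(rope_type)  # -> llama3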

simon-mo commented 1 month ago

~Passing --rope-scaling='{"type": "extended", "factor": 8.0}' in the CLI or rope_scaling={"type": "extended", "factor": 8.0} in Python should get around this for now.~

Please update vLLM version to v0.5.3.post1.

Vaibhav-Sahai commented 1 month ago

@simon-mo maybe I missed something, but I'm getting the following log:

assert "factor" in rope_scaling
AssertionError

LLM config:

llm = LLM(model=MODEL,
          tensor_parallel_size=NUM_GPUS,
          enable_prefix_caching=False,
          gpu_memory_utilization=0.90,
          max_model_len=4096,
          trust_remote_code=True,
          max_num_seqs=16,
          rope_scaling={"type": "dummy"},
          )

Version:

 vllm==0.5.3

EDIT: works when trying rope_scaling={"type": "extended", "factor": 8.0}. Thanks @simon-mo!

EDIT2: updating to 0.5.4 makes this work without any additional flags. Thank you to all collaborators!

casper-hansen commented 1 month ago

@ywang96 renaming rope_type to type does not work either. I just wanted to run some benchmarks of the AWQ models, but I can't seem to get them running at the moment due to the mismatch between the expected and actual rope parameters in the config.

Any specific solution for this? EDIT: Nevermind, I see #6693

simon-mo commented 1 month ago

@Vaibhav-Sahai updated my hack, PTAL.

WoosukKwon commented 1 month ago

@Vaibhav-Sahai @romilbhardwaj @casper-hansen We fixed the RoPE issue in #6693. The model should work without any extra args now in the main branch.

casper-hansen commented 1 month ago

Thanks @WoosukKwon. Are you planning a post release today, or should we build from source until 0.5.4?

simon-mo commented 1 month ago

We plan to release ASAP after confirmation with the HuggingFace team.

crowsonkb commented 1 month ago

Llama 3.1 405B base in fp8 is just generating !!!! over and over for me, is anyone else having this issue? I verified that 70B base works for me (but it's not in fp8).

simon-mo commented 1 month ago

Regarding the rope issue, the new version has been released with the fix. Please test it out!

akhil-netomi commented 1 month ago

Hey, is this the same issue?

ValueError: `rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}

I am using the following configuration

Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling={'type': 'extended', 'factor': 8.0}, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)

As suggested, I used rope_scaling={"type": "extended", "factor": 8.0} with the vLLM OpenAI server.

simon-mo commented 1 month ago

Please try updating the vLLM version; you won't need the rope scaling argument anymore.

sumukshashidhar commented 1 month ago

@simon-mo, same issue as @akhil-netomi for me. I just updated to post1, but unfortunately I still get the same error.

I'm trying the 70B instruct variant; I tried with and without the hotfix. The error is in the _rope_scaling_validation method.

sumukshashidhar commented 1 month ago

https://github.com/vllm-project/vllm/blob/38c4b7e863570a045308af814c72f4504297222e/vllm/transformers_utils/configs/chameleon.py#L81

This might be where the issue lies, although I'm not familiar with the codebase.

ayushchakravarthy commented 1 month ago

Hey! I'm trying to get Llama 3.1 405B FP8 working with vLLM and getting this error: RuntimeError: "fill_empty_deterministic_" not implemented for 'Float8_e4m3fn'

I updated to the latest vLLM version and Llama3.1-70B works

ywang96 commented 1 month ago

@sumukshashidhar are you trying to use chameleon? That class won't be touched unless you're trying to serve ChameleonForConditionalGeneration.

It would be great if you could paste the whole stack trace so we can see whether the error is coming from vLLM or transformers.

crowsonkb commented 1 month ago

Llama 3.1 405B base in fp8 is just generating !!!! over and over for me, is anyone else having this issue? I verified that 70B base works for me (but it's not in fp8).

405B instruct FP8 works for me. It's just the base model that is not working. Also, base and instruct seem to use different amounts of GPU memory, which should not happen (base uses less).

WoosukKwon commented 1 month ago

@sumukshashidhar @akhil-netomi Please try upgrading transformers with pip install --upgrade transformers
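
If it helps anyone check their environment, a quick version guard (the 4.43 threshold is my understanding of when Llama 3.1 support landed in transformers, not something stated in this thread):

# The "llama3" rope_scaling format is only accepted by recent transformers;
# older releases reject it in _rope_scaling_validation.
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.43.0"), (
    f"transformers {transformers.__version__} is likely too old for Llama 3.1 configs"
)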

Qubitium commented 1 month ago

For those who want to test vLLM with gptq_marlin-compatible 4-bit quants, we have just pushed the following models to HF:

https://huggingface.co/ModelCloud/Meta-Llama-3.1-8B-Instruct-gptq-4bit
https://huggingface.co/ModelCloud/Meta-Llama-3.1-8B-gptq-4bit
https://huggingface.co/ModelCloud/Meta-Llama-3.1-70B-Instruct-gptq-4bit

sumukshashidhar commented 1 month ago

@ywang96 my bad, I'm not trying to use chameleon, my error is in the normal transformers module. @WoosukKwon yes, upgrading transformers fixes it. Thanks!

romilbhardwaj commented 1 month ago

Just sharing - vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --max-model-len 4096 works well on A100-80GB:8 with v0.5.3.post1.

I'm also able to run the 8B model on L4:1 with vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --tensor-parallel-size 1 --max-model-len 1024.

Thanks for the great work vllm team!

casper-hansen commented 1 month ago

Also reporting on AWQ: the latest patch fixed it, and the benchmark now runs.

python benchmarks/benchmark_throughput.py \
    --input-len 512 \
    --output-len 256 \
    --model hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
    --quantization marlin \
    -tp 4 \
    --num-prompts 100 \
    --max-model-len 1024 \
    --dtype half

WoosukKwon commented 1 month ago

@romilbhardwaj @casper-hansen Thanks for sharing! Could you please share why you limit max-model-len? Did you experience OOMs without it?

romilbhardwaj commented 1 month ago

I no longer have the error on hand, but it was along the lines of [rank0]: ValueError: The model's max seq len (...) is larger than the maximum number of tokens that can be stored in KV cache (...) ...
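
For intuition, a rough back-of-envelope of why lowering --max-model-len avoids this error (the 405B model dimensions and the 16-bit KV-cache dtype below are my assumptions, not numbers from the error message):

# Approximate KV-cache footprint per token for Llama 3.1 405B.
num_layers = 126      # assumed
num_kv_heads = 8      # assumed (GQA)
head_dim = 128        # assumed
bytes_per_elem = 2    # bf16 KV cache by default; 1 with --kv-cache-dtype fp8

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V
print(f"~{bytes_per_token / 1e6:.2f} MB of KV cache per token")  # ~0.52 MB

# Whatever GPU memory remains after loading the weights must hold at least
# max_model_len tokens of KV cache for one sequence, or vLLM raises this error.
kv_budget_gb = 60     # hypothetical leftover memory across the whole node
print(f"~{int(kv_budget_gb * 1e9 / bytes_per_token)} tokens fit in that budget")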

pseudotensor commented 1 month ago

QQ: what is a good choice of max-model-len for 8x H100 for 405B?

Should I just find out empirically? Any suggestions are welcome. 128k is the default, I presume.

But should I go for 16k, 32k, 64k, or 128k? What can 8x H100 support at full context length?

crowsonkb commented 1 month ago

Llama 3.1 405B base in fp8 is just generating !!!! over and over for me, is anyone else having this issue? I verified that 70B base works for me (but it's not in fp8).

405B instruct FP8 works for me. It's just the base model that is not working. Also, base and instruct seem to use different amounts of GPU memory, which should not happen (base uses less).

I figured this out. The 405B FP8 base model's config.json has a "modules_to_not_convert" list that differs from the instruct model's. vLLM apparently does not run submodules of the modules in this list in 16-bit, so I pasted the list from the instruct model's config.json (which includes all of the submodules) into the base model's config.json, and this fixed the issue where I was getting NaNs in inference.
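
For anyone applying the same workaround, a sketch of the config edit (the local paths are placeholders, I'm assuming the list lives under quantization_config as in the FBGEMM FP8 checkpoints, and you should copy the real list from the instruct repo rather than from this snippet):

import json

base_cfg_path = "Meta-Llama-3.1-405B-FP8/config.json"               # placeholder path
instruct_cfg_path = "Meta-Llama-3.1-405B-Instruct-FP8/config.json"  # placeholder path

with open(instruct_cfg_path) as f:
    instruct_cfg = json.load(f)
with open(base_cfg_path) as f:
    base_cfg = json.load(f)

# Reuse the instruct model's more complete list of modules kept in 16-bit.
base_cfg["quantization_config"]["modules_to_not_convert"] = (
    instruct_cfg["quantization_config"]["modules_to_not_convert"]
)

with open(base_cfg_path, "w") as f:
    json.dump(base_cfg, f, indent=2)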

casper-hansen commented 1 month ago

@romilbhardwaj @casper-hansen Thanks for sharing! Could you please share why you limit max-mode-len? Did you experience OOMs without it?

I limited the sequence length just for testing/benchmarking purposes. If my memory serves me correctly, it cannot load the 128k context length on the 4x H100 setup I was running earlier.

nani1149 commented 1 month ago

Hi team, I am trying to deploy the FP8 Llama 3.1 405B model and running into the issue below. Any help is appreciated.

torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank0]: Last error:
[rank0]: Error while creating shared memory segment /dev/shm/nccl-eQz2sT (size 5767520)
ERROR 07-23 22:46:18 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 82 died, exit code: -15
INFO 07-23 22:46:18 multiproc_worker_utils.py:123] Killing local vLLM worker processes
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
/usr/local/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

youkaichao commented 1 month ago

@nani1149

Error while creating shared memory segment /dev/shm/nccl-eQz2sT (size 5767520)

You didn't give enough size to the shared memory. If you are using Docker, please see https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html for how to adjust the shm size.

alexchenyu commented 1 month ago

(vllm-env) alex@4a100:~/issuegpt$ python -m vllm.entrypoints.openai.api_server 
    --model meta-llama/Meta-Llama-3.1-70B \
    --tensor-parallel-size 4 \
    --api-key  eyJhIjoiYmI5ZW \
    --gpu-memory-utilization 0.95 \
    --rope-scaling='{"type": "extended", "factor": 8.0}'
INFO 07-23 16:04:35 api_server.py:177] vLLM API server version 0.5.0.post1
INFO 07-23 16:04:35 api_server.py:178] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='facebook/opt-125m', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-23 16:04:36 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=facebook/opt-125m)
INFO 07-23 16:04:37 weight_utils.py:218] Using model weights format ['*.bin']
INFO 07-23 16:04:37 model_runner.py:160] Loading model weights took 0.2389 GB
INFO 07-23 16:04:37 gpu_executor.py:83] # GPU blocks: 63284, # CPU blocks: 7281
INFO 07-23 16:04:40 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-23 16:04:40 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-23 16:04:49 model_runner.py:965] Graph capturing finished in 9 secs.
WARNING 07-23 16:04:49 serving_chat.py:95] No chat template provided. Chat API will not work.
WARNING 07-23 16:04:50 serving_embedding.py:141] embedding_mode is False. Embedding API will not work.
INFO:     Started server process [1200817]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 07-23 16:05:00 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

The above command runs for me, but it results in a very strange model name, "facebook/opt-125m", and the answers from this model are very strange. Does anyone have the same problem?

youkaichao commented 1 month ago

@alexchenyu you need a \ at the end of the first line

alexchenyu commented 1 month ago

@alexchenyu you need a \ at the end of the first line

Oh, my bad, thanks for your reply.

gaoxt1983 commented 1 month ago

I'm currently using 0.4.2; does this version support Llama 3.1? We only need a deployable version for a demo.

kharvd commented 1 month ago

405B-FP8 base worked well for me on 8xH100 after changing the config (thanks @crowsonkb!), but at some point it started throwing CUDA errors:

2024-07-24T04:34:21.448550002Z Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
2024-07-24T04:34:21.448553024Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7065336bf897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
2024-07-24T04:34:21.448555571Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x70653366fb25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
2024-07-24T04:34:21.448558757Z frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x706533797718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
2024-07-24T04:34:21.448561932Z frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7065349948e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.448564812Z frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7065349989e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.448567287Z frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x70653499e05c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.448584270Z frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x70653499edcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.448587724Z frame #7: <unknown function> + 0xd6df4 (0x706580455df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
2024-07-24T04:34:21.448590886Z frame #8: <unknown function> + 0x8609 (0x706581517609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
2024-07-24T04:34:21.448593376Z frame #9: clone + 0x43 (0x706581651353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
2024-07-24T04:34:21.448596260Z 
2024-07-24T04:34:21.449893760Z terminate called after throwing an instance of 'c10::DistBackendError'
2024-07-24T04:34:21.449947144Z   what():  [PG 2 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
2024-07-24T04:34:21.449954579Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-07-24T04:34:21.449960388Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2024-07-24T04:34:21.449966439Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-07-24T04:34:21.449971923Z 
2024-07-24T04:34:21.449977084Z Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
2024-07-24T04:34:21.449982622Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7065336bf897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
2024-07-24T04:34:21.449988402Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x70653366fb25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
2024-07-24T04:34:21.449994893Z frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x706533797718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
2024-07-24T04:34:21.450000848Z frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7065349948e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.450006518Z frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7065349989e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.450012246Z frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x70653499e05c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.450017501Z frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x70653499edcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.450022907Z frame #7: <unknown function> + 0xd6df4 (0x706580455df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
2024-07-24T04:34:21.450028932Z frame #8: <unknown function> + 0x8609 (0x706581517609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
2024-07-24T04:34:21.450034511Z frame #9: clone + 0x43 (0x706581651353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
2024-07-24T04:34:21.450039916Z 
2024-07-24T04:34:21.450047055Z Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
2024-07-24T04:34:21.450052480Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7065336bf897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
2024-07-24T04:34:21.450058156Z frame #1: <unknown function> + 0xe32119 (0x706534622119 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.450062761Z frame #2: <unknown function> + 0xd6df4 (0x706580455df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
2024-07-24T04:34:21.450068309Z frame #3: <unknown function> + 0x8609 (0x706581517609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
2024-07-24T04:34:21.450073485Z frame #4: clone + 0x43 (0x706581651353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
2024-07-24T04:34:21.450078404Z 
2024-07-24T04:34:26.082870754Z INFO 07-24 04:34:26 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 5.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
2024-07-24T04:34:27.165587884Z ERROR 07-24 04:34:27 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 83 died, exit code: -6
2024-07-24T04:34:27.165637073Z INFO 07-24 04:34:27 multiproc_worker_utils.py:123] Killing local vLLM worker processes
2024-07-24T04:34:36.085684429Z INFO 07-24 04:34:36 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
2024-07-24T04:34:46.087863055Z INFO 07-24 04:34:46 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
2024-07-24T04:34:56.090458203Z INFO 07-24 04:34:56 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.

medwang1 commented 1 month ago

How can I deploy the FP16 405B model using vLLM?

gaoxt1983 commented 1 month ago

Got this:

UNAVAILABLE: Internal: RuntimeError: Expected a.dtype() == torch::kInt8 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request.)

I'm using vllm 0.5.3-post1 and Meta-Llama-3.1-405B-Instruct-FP8

SinanAkkoyun commented 1 month ago

@kharvd How many TPS did you get? ^^

omarsou commented 1 month ago

405B-FP8 base worked well for me on 8xH100 after changing the config (thanks @crowsonkb!), but at some point it started throwing CUDA errors.

Hello @kharvd, I have the same issue with the new vLLM version 0.5.3.post1, and I see that two other issues have been created for it: #6734 and #6732

crowsonkb commented 1 month ago

405B-FP8 base worked well for me on 8xH100 after changing the config (thanks @crowsonkb!), but at some point it started throwing CUDA errors: hello @kharvd , I have the same issue with the new vllm version 0.5.3post1 and I saw 2 other issues has been created due to that : #6734 and #6732

I'm having a similar issue...

Error: Failed to initialize the TMA descriptor 700                                                              
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 2] Process group watchdog thread terminated with exception: CUD
A error: an illegal memory access was encountered

etc.

Edited: If I don't disable vLLM's custom all reduce it works on prompts that previously failed. I haven't seen this error again yet.
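
For reference, the toggle being discussed is the disable_custom_all_reduce engine arg (it appears in the arg dumps earlier in this thread); a minimal sketch of flipping it from Python, with the model id and TP size as assumptions:

from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-FP8",  # assumed repo id for the FP8 base model
    tensor_parallel_size=8,
    disable_custom_all_reduce=True,  # True disables the custom kernel; NCCL all-reduce is used instead
)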

RonanKMcGovern commented 1 month ago

Is hf_transfer still working? Weight download seems to be quite slow...

omarsou commented 1 month ago

405B-FP8 base worked well for me on 8xH100 after changing the config (thanks @crowsonkb!), but at some point it started throwing CUDA errors: hello @kharvd , I have the same issue with the new vllm version 0.5.3post1 and I saw 2 other issues has been created due to that : #6734 and #6732

I'm having a similar issue...

Error: Failed to initialize the TMA descriptor 700                                                              
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 2] Process group watchdog thread terminated with exception: CUD
A error: an illegal memory access was encountered

etc.

Edited: If I don't disable vLLM's custom all reduce it works on prompts that previously failed. I haven't seen this error again yet.

Hello @crowsonkb, what do you mean by disabling vLLM's custom all reduce?

sumukshashidhar commented 1 month ago

@RonanKMcGovern it might be an issue with HF's ability to scale, since everyone is rushing to get the weights for the first time.

tlrmchlsmth commented 1 month ago

405B-FP8 base worked well for me on 8xH100 after changing the config (thanks @crowsonkb!), but at some point it started throwing CUDA errors: hello @kharvd , I have the same issue with the new vllm version 0.5.3post1 and I saw 2 other issues has been created due to that : #6734 and #6732

I'm having a similar issue...

Error: Failed to initialize the TMA descriptor 700                                                              
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 2] Process group watchdog thread terminated with exception: CUD
A error: an illegal memory access was encountered

etc.

Edited: If I don't disable vLLM's custom all reduce it works on prompts that previously failed. I haven't seen this error again yet.

I am trying to repro this issue, so any information on it would be helpful. Are you using chunked prefill? And could you share your collect-env.py output?

comaniac commented 1 month ago

We saw the same TMA descriptor 700 error as well. Our configuration is:

  • TP=8 on H100.
  • Chunked prefill with 2048 chunk size.

The error can be avoided if I manually disabled CUTLASS FP8 kernel for fbgemm, but most importantly, we cannot easily reproduce this error as it happens randomly.

jbohnslav commented 1 month ago

I also got this

[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered

CatherineSue commented 1 month ago

We saw the same TMA descriptor 700 error as well. Our configuration is:

  • TP=8 on H100.
  • Chunked prefill with 2048 chunk size.

The error can be avoided if I manually disabled CUTLASS FP8 kernel for fbgemm, but most importantly, we cannot easily reproduce this error as it happens randomly.

@comaniac How did you manually disable the CUTLASS FP8 kernel?

comaniac commented 1 month ago

@comaniac How did you manually disable the CUTLASS FP8 kernel?

I just changed this to False: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/fbgemm_fp8.py#L148