Closed simon-mo closed 4 days ago
Just me or are other people also having issues running llama 3.1 models? My error:
if rope_scaling is not None and rope_scaling["type"] not in, KeyError: 'type'
Config:
llm = LLM(model=MODEL,
          tensor_parallel_size=NUM_GPUS,
          enable_prefix_caching=False,
          gpu_memory_utilization=0.80,
          max_model_len=4096,
          trust_remote_code=True,
          max_num_seqs=16,
          )
I think there's some issue with parsing the config for meta-llama/Meta-Llama-3.1-405B-Instruct-FP8
fetched from huggingface:
$ vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8
INFO 07-23 16:04:42 api_server.py:219] vLLM API server version 0.5.3
INFO 07-23 16:04:42 api_server.py:220] args: Namespace(model_tag='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fee7856c5e0>)
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 22.5k/22.5k [00:00<00:00, 85.5MB/s]
Traceback (most recent call last):
File "/opt/conda/bin/vllm", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/vllm/scripts.py", line 148, in main
args.dispatch_function(args)
File "/opt/conda/lib/python3.10/site-packages/vllm/scripts.py", line 28, in serve
run_server(args)
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
if llm_engine is not None else AsyncLLMEngine.from_engine_args(
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 457, in from_engine_args
engine_config = engine_args.create_engine_config()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 699, in create_engine_config
model_config = ModelConfig(
File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 176, in __init__
self.max_model_len = _get_and_verify_max_len(
File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 1497, in _get_and_verify_max_len
if rope_scaling is not None and rope_scaling["type"] not in {
KeyError: 'type'
Version:
pip freeze | grep vllm
vllm==0.5.3
vllm-flash-attn==2.5.9.post1
I'm having the same issue as @Vaibhav-Sahai
Command:
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --tensor-parallel-size 2
Error trace:
File "envs/synthenv/lib/python3.10/site-packages/vllm/config.py", line 176, in __init__
self.max_model_len = _get_and_verify_max_len(
File "envs/synthenv/lib/python3.10/site-packages/vllm/config.py", line 1497, in _get_and_verify_max_len
if rope_scaling is not None and rope_scaling["type"] not in {
KeyError: 'type'
Version:
vllm==0.5.3
vllm-flash-attn==2.5.9.post1
[RESOLVED]:
Adding this to the CLI works: --rope-scaling='{"type": "extended", "factor": 8.0}'
Thanks @simon-mo!
Hello there! @romilbhardwaj @arkilpatel Thanks for reporting the issue; we are aware of it. It happens because HuggingFace renamed this key ("rope_type" instead of "type") in the repos of all Llama 3.1 models.
~In CLI or Python, pass in --rope-scaling='{"type": "extended", "factor": 8.0}'
or rope_scaling={"type": "extended", "factor": 8.0}
should get around this for now~
Please update vLLM version to v0.5.3.post1.
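To make the cause concrete, here is a minimal sketch in plain Python (no vLLM import needed). The dictionary values are copied from the `rope_scaling` printed in the ValueError quoted in this thread. `get_rope_type` is a hypothetical helper for illustration only; it is not the actual patch, just one way to tolerate both key names.

```python
# rope_scaling as it appears in the Llama 3.1 config (values copied from
# the error message quoted in this thread).
rope_scaling = {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3",
}


def get_rope_type(cfg):
    """Hypothetical helper: accept either the new or the legacy key name."""
    if cfg is None:
        return None
    return cfg.get("rope_type", cfg.get("type"))


# A direct lookup of the legacy key fails because only "rope_type" exists.
try:
    rope_scaling["type"]
except KeyError as exc:
    print("KeyError:", exc)

print(get_rope_type(rope_scaling))
print(get_rope_type({"type": "extended", "factor": 8.0}))
```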
@simon-mo maybe I missed something, but I'm getting the following error:
assert "factor" in rope_scaling
AssertionError
LLM config:
llm = LLM(model=MODEL,
          tensor_parallel_size=NUM_GPUS,
          enable_prefix_caching=False,
          gpu_memory_utilization=0.90,
          max_model_len=4096,
          trust_remote_code=True,
          max_num_seqs=16,
          rope_scaling={"type": "dummy"},
          )
Version:
vllm==0.5.3
EDIT: works when trying rope_scaling={"type": "extended", "factor": 8.0}. Thanks @simon-mo!
EDIT2: updating to 0.5.4 makes this work without any additional flags. Thank you to all collaborators!
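For anyone still on 0.5.3 who hits the AssertionError above: the workaround dict apparently needs both keys. Below is a rough sketch of a pre-flight check. `missing_rope_keys` is a made-up helper name, and the list of required keys is taken only from the error messages in this thread, not from the vLLM source.

```python
# Assumption: the legacy format expects exactly these two keys, as the
# assertion and the ValueError quoted in this thread suggest.
REQUIRED_KEYS = ("type", "factor")


def missing_rope_keys(rope_scaling):
    """Return the legacy keys that are absent from a rope_scaling dict."""
    return [key for key in REQUIRED_KEYS if key not in rope_scaling]


print(missing_rope_keys({"type": "dummy"}))
print(missing_rope_keys({"type": "extended", "factor": 8.0}))
```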
@ywang96 renaming rope_type to type does not work either. I just wanted to run some benchmarks of the AWQ models, but I cannot get them running at the moment because of the mismatch between the expected and actual rope parameters in the config.
Any specific solution for this? EDIT: Nevermind, I see #6693
@Vaibhav-Sahai updated my hack, PTAL.
@Vaibhav-Sahai @romilbhardwaj @casper-hansen We fixed the RoPE issue in #6693. The model should work without any extra args now in the main branch.
Thanks @WoosukKwon. Are you guys planning any post release today or should we build from source until 0.5.4?
We plan to release ASAP after confirmation with the HuggingFace team.
Llama 3.1 405B base in fp8 is just generating !!!! over and over for me, is anyone else having this issue? I verified that 70B base works for me (but it's not in fp8).
Regarding the rope issue, the new version has been released with the fix. Please test it out!
Hey is this the same issue?
ValueError: `rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
I am using the following configuration
Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling={'type': 'extended', 'factor': 8.0}, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, 
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
As suggested, I used rope_scaling={"type": "extended", "factor": 8.0} with the vLLM OpenAI server.
Please try updating vLLM; with the new version you no longer need the rope scaling override.
@simon-mo, same issue as @akhil-netomi for me. I just updated to post1 and unfortunately still get the same error.
I'm trying the 70B instruct variant, both with and without the hotfix. The error is in the _rope_scaling_validation method.
This might be where the issue lies, although I'm not familiar with the codebase.
Hey! trying to get Llama3.1-405B-FP8 working with vLLM, getting this error RuntimeError: "fill_emptydeterministic" not implemented for 'Float8_e4m3fn'
I updated to the latest vLLM version and Llama3.1-70B works
@sumukshashidhar are you trying to use chameleon? That class won't be touched unless you're trying to serve ChameleonForConditionalGeneration.
It would be great if you can paste the whole stacktrace so we can see if the error is coming from vLLM or transformers.
Llama 3.1 405B base in fp8 is just generating !!!! over and over for me, is anyone else having this issue? I verified that 70B base works for me (but it's not in fp8).
405B instruct FP8 works for me. It's just the base model that is not working. Also, base and instruct seem to use different amounts of GPU memory, which should not happen (base uses less).
@sumukshashidhar @akhil-netomi Please try upgrading transformers with pip install --upgrade transformers
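A quick way to see whether the installed transformers is likely to be new enough is sketched below. The threshold of 4.43 is my assumption about the first release series that understands the Llama 3.1 `rope_type` format, so please verify it against the transformers release notes. `release_tuple` and `transformers_is_new_enough` are hypothetical helper names.

```python
import re
from importlib import metadata

# Assumption: 4.43 is the first release series with Llama 3.1 support.
MIN_TRANSFORMERS = (4, 43)


def release_tuple(version):
    """Parse the leading 'major.minor' part of a version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def transformers_is_new_enough(version):
    return release_tuple(version) >= MIN_TRANSFORMERS


try:
    installed = metadata.version("transformers")
    print(installed, transformers_is_new_enough(installed))
except metadata.PackageNotFoundError:
    print("transformers is not installed in this environment")
```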
For those who want to test vLLM with gptq_marlin-compatible 4-bit quants, we have just pushed both 8B models to HF:
https://huggingface.co/ModelCloud/Meta-Llama-3.1-8B-Instruct-gptq-4bit
https://huggingface.co/ModelCloud/Meta-Llama-3.1-8B-gptq-4bit
https://huggingface.co/ModelCloud/Meta-Llama-3.1-70B-Instruct-gptq-4bit
@ywang96 my bad, I'm not trying to use chameleon, my error is in the normal transformers module. @WoosukKwon yes, upgrading transformers fixes it. Thanks!
Just sharing: vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --max-model-len 4096 works well on A100-80GB:8 with v0.5.3.post1.
I'm also able to run the 8B model on L4:1 with vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --tensor-parallel-size 1 --max-model-len 1024.
Thanks for the great work vllm team!
Also posting on AWQ: the latest patch fixed it and benchmark now runs.
python benchmarks/benchmark_throughput.py \
--input-len 512 \
--output-len 256 \
--model hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
--quantization marlin \
-tp 4 \
--num-prompts 100 \
--max-model-len 1024 \
--dtype half
@romilbhardwaj @casper-hansen Thanks for sharing! Could you please share why you limit max-model-len? Did you experience OOMs without it?
I no longer have the error on hand, but it was along the lines of [rank0]: ValueError: The model's max seq len (...) is larger than the maximum number of tokens that can be stored in KV cache (...) ...
Quick question: what is a good choice of max-model-len for 405B on 8x H100?
Should I just find out by trial and error? Suggestions are welcome. I presume 128k is the default.
Should I go for 16k, 32k, 64k, or 128k? Can 8x H100 support the full context length?
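A back-of-the-envelope estimate may help with this. The sketch below assumes the 405B architecture has 126 layers, 8 KV heads, and a head dimension of 128 (my reading of the published model config, so please double-check against config.json), plus a 16-bit KV cache. It ignores weights, activations, CUDA graphs, and anything else vLLM reserves, so treat the numbers as the KV cache cost of a single sequence and nothing more.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies: one key and one value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed Llama 3.1 405B shape; verify against the model's config.json.
per_token = kv_bytes_per_token(num_layers=126, num_kv_heads=8,
                               head_dim=128, bytes_per_elem=2)

for context_len in (16_384, 32_768, 65_536, 131_072):
    total_gib = per_token * context_len / 2**30
    print(f"{context_len:>7} tokens: {total_gib:.1f} GiB total, "
          f"{total_gib / 8:.2f} GiB per GPU at tensor-parallel size 8")
```

Under those assumptions, one full 128k-token sequence needs about 63 GiB of KV cache across the node, on top of the model weights, so whether it fits depends on how much memory is left after loading.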
I figured this out. The 405B FP8 base model config.json has a list of "modules_to_not_convert" that is different from the instruct model. vllm apparently does not run submodules of modules in this list as 16-bit, so I pasted the list from the instruct model config.json, which includes all of the submodules, into the base model config.json and this fixed the issue where I was getting NaNs in inference.
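In case it saves someone time, the manual copy described above could be scripted roughly as follows. This is only a sketch: `find_parent`, `copy_nested_key`, and `patch_config_file` are hypothetical helpers, and I'm assuming only that the list sits somewhere in config.json under the key `modules_to_not_convert` (the helper searches for it instead of assuming a fixed path).

```python
import json


def find_parent(node, key):
    """Return the first dict (depth-first) that contains `key`, else None."""
    if isinstance(node, dict):
        if key in node:
            return node
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_parent(child, key)
        if found is not None:
            return found
    return None


def copy_nested_key(source, target, key):
    """Overwrite target's value for `key` with a copy of source's value."""
    src_parent = find_parent(source, key)
    dst_parent = find_parent(target, key)
    if src_parent is None or dst_parent is None:
        raise KeyError(f"{key!r} not found in both configs")
    dst_parent[key] = list(src_parent[key])
    return target


def patch_config_file(instruct_path, base_path, key="modules_to_not_convert"):
    """Read both config.json files and rewrite the base one in place."""
    with open(instruct_path) as f:
        instruct_cfg = json.load(f)
    with open(base_path) as f:
        base_cfg = json.load(f)
    copy_nested_key(instruct_cfg, base_cfg, key)
    with open(base_path, "w") as f:
        json.dump(base_cfg, f, indent=2)
```

Usage would be something like patch_config_file("instruct/config.json", "base/config.json"), where both paths are placeholders. Back up the base config.json first, since the file is rewritten in place.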
I limited the sequence length just for testing/benchmarking purposes. If my memory serves me correctly, I don't believe it can load 128k context length on the setup of 4x H100 that I was running earlier.
Hi team, I am trying to deploy the FP8 Llama 3.1 405B model and am running into the issue below. Any help is appreciated.
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank0]: Last error:
[rank0]: Error while creating shared memory segment /dev/shm/nccl-eQz2sT (size 5767520)
ERROR 07-23 22:46:18 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 82 died, exit code: -15
INFO 07-23 22:46:18 multiproc_worker_utils.py:123] Killing local vLLM worker processes
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
/usr/local/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
@nani1149
Error while creating shared memory segment /dev/shm/nccl-eQz2sT (size 5767520)
The shared memory size is too small. If you are using Docker, please see https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html for how to adjust the shm size.
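To check how much shared memory the container actually has before launching, something like the sketch below can be run inside it. `free_bytes` and `has_room` are hypothetical helper names. NCCL creates more than one segment, so having room for the single segment from the error message is a necessary check, not a sufficient one.

```python
import shutil


def free_bytes(path="/dev/shm"):
    """Free space, in bytes, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


def has_room(free, needed):
    """True if `free` bytes can hold a segment of `needed` bytes."""
    return free >= needed


# Size of the one segment named in the error message above.
segment = 5_767_520

try:
    free = free_bytes()
    print(f"/dev/shm free: {free / 2**20:.1f} MiB, "
          f"room for one segment: {has_room(free, segment)}")
except OSError:
    print("/dev/shm is not available on this system")
```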
(vllm-env) alex@4a100:~/issuegpt$ python -m vllm.entrypoints.openai.api_server
--model meta-llama/Meta-Llama-3.1-70B \
--tensor-parallel-size 4 \
--api-key eyJhIjoiYmI5ZW \
--gpu-memory-utilization 0.95 \
--rope-scaling='{"type": "extended", "factor": 8.0}'
INFO 07-23 16:04:35 api_server.py:177] vLLM API server version 0.5.0.post1
INFO 07-23 16:04:35 api_server.py:178] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='facebook/opt-125m', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, 
ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-23 16:04:36 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=facebook/opt-125m)
INFO 07-23 16:04:37 weight_utils.py:218] Using model weights format ['*.bin']
INFO 07-23 16:04:37 model_runner.py:160] Loading model weights took 0.2389 GB
INFO 07-23 16:04:37 gpu_executor.py:83] # GPU blocks: 63284, # CPU blocks: 7281
INFO 07-23 16:04:40 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-23 16:04:40 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-23 16:04:49 model_runner.py:965] Graph capturing finished in 9 secs.
WARNING 07-23 16:04:49 serving_chat.py:95] No chat template provided. Chat API will not work.
WARNING 07-23 16:04:50 serving_embedding.py:141] embedding_mode is False. Embedding API will not work.
INFO: Started server process [1200817]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 07-23 16:05:00 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
The above command works for me, but it results in a very strange model name, "facebook/opt-125m", and the answers from this model are very strange. Does anyone have the same problem?
@alexchenyu you need a \ at the end of the first line.
Oh, my bad, thanks for your reply.
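Since the log shows model='facebook/opt-125m', none of the flags after the first line reached the server. One way to avoid this class of mistake (a suggestion, not something vLLM requires) is to build the command as a Python list, where there is no shell line continuation to forget. The flags below are copied from the command above, with the API key and rope-scaling override left out.

```python
import shlex

# Flags copied from the command above; the API key is omitted on purpose.
argv = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "meta-llama/Meta-Llama-3.1-70B",
    "--tensor-parallel-size", "4",
    "--gpu-memory-utilization", "0.95",
]

# Print the single-line shell equivalent for inspection.
print(shlex.join(argv))

# To actually launch the server (requires vLLM and the GPUs), uncomment:
# import subprocess
# subprocess.run(argv, check=True)
```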
I'm currently using 0.4.2. Does this version support Llama 3.1? We only need a deployable version for a demo.
405B-FP8 base worked well for me on 8xH100 after changing the config (thanks @crowsonkb!), but at some point it started throwing CUDA errors:
2024-07-24T04:34:21.448550002Z Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
2024-07-24T04:34:21.448553024Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7065336bf897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
2024-07-24T04:34:21.448555571Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x70653366fb25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
2024-07-24T04:34:21.448558757Z frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x706533797718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
2024-07-24T04:34:21.448561932Z frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7065349948e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.448564812Z frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7065349989e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.448567287Z frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x70653499e05c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.448584270Z frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x70653499edcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.448587724Z frame #7: <unknown function> + 0xd6df4 (0x706580455df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
2024-07-24T04:34:21.448590886Z frame #8: <unknown function> + 0x8609 (0x706581517609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
2024-07-24T04:34:21.448593376Z frame #9: clone + 0x43 (0x706581651353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
2024-07-24T04:34:21.448596260Z
2024-07-24T04:34:21.449893760Z terminate called after throwing an instance of 'c10::DistBackendError'
2024-07-24T04:34:21.449947144Z what(): [PG 2 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
2024-07-24T04:34:21.449954579Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-07-24T04:34:21.449960388Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2024-07-24T04:34:21.449966439Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-07-24T04:34:21.449971923Z
2024-07-24T04:34:21.449977084Z Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
2024-07-24T04:34:21.449982622Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7065336bf897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
2024-07-24T04:34:21.449988402Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x70653366fb25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
2024-07-24T04:34:21.449994893Z frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x706533797718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
2024-07-24T04:34:21.450000848Z frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7065349948e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.450006518Z frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7065349989e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.450012246Z frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x70653499e05c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.450017501Z frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x70653499edcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.450022907Z frame #7: <unknown function> + 0xd6df4 (0x706580455df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
2024-07-24T04:34:21.450028932Z frame #8: <unknown function> + 0x8609 (0x706581517609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
2024-07-24T04:34:21.450034511Z frame #9: clone + 0x43 (0x706581651353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
2024-07-24T04:34:21.450039916Z
2024-07-24T04:34:21.450047055Z Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
2024-07-24T04:34:21.450052480Z frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7065336bf897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
2024-07-24T04:34:21.450058156Z frame #1: <unknown function> + 0xe32119 (0x706534622119 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
2024-07-24T04:34:21.450062761Z frame #2: <unknown function> + 0xd6df4 (0x706580455df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
2024-07-24T04:34:21.450068309Z frame #3: <unknown function> + 0x8609 (0x706581517609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
2024-07-24T04:34:21.450073485Z frame #4: clone + 0x43 (0x706581651353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
2024-07-24T04:34:21.450078404Z
2024-07-24T04:34:26.082870754Z INFO 07-24 04:34:26 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 5.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
2024-07-24T04:34:27.165587884Z ERROR 07-24 04:34:27 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 83 died, exit code: -6
2024-07-24T04:34:27.165637073Z INFO 07-24 04:34:27 multiproc_worker_utils.py:123] Killing local vLLM worker processes
2024-07-24T04:34:36.085684429Z INFO 07-24 04:34:36 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
2024-07-24T04:34:46.087863055Z INFO 07-24 04:34:46 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
2024-07-24T04:34:56.090458203Z INFO 07-24 04:34:56 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
How can I deploy the FP16 405B model using vLLM?
Got this:
UNAVAILABLE: Internal: RuntimeError: Expected a.dtype() == torch::kInt8 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request .
I'm using vllm 0.5.3-post1 and Meta-Llama-3.1-405B-Instruct-FP8
@kharvd How many TPS did you get? ^^
Hello @kharvd, I have the same issue with the new vLLM version 0.5.3post1, and I saw that two other issues have been created because of it: #6734 and #6732
405B-FP8 base worked well for me on 8xH100 after changing the config (thanks @crowsonkb!), but at some point it started throwing CUDA errors:
I'm having a similar issue...
Error: Failed to initialize the TMA descriptor 700
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
etc.
Edited: If I don't disable vLLM's custom all reduce it works on prompts that previously failed. I haven't seen this error again yet.
Is hf_transfer still working? Weight download seems to be quite slow...
Hello @crowsonkb, what do you mean by disabling vLLM's custom all reduce?
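For anyone with the same question, here is a minimal sketch of one way to switch the custom all-reduce kernel off when launching the server. It assumes the `--disable-custom-all-reduce` CLI flag, which corresponds to the `disable_custom_all_reduce` entry in the args dump at the top of this thread. `build_serve_command` is a hypothetical helper written for illustration, not part of vLLM, and whether this setting avoids the CUDA error on your hardware is untested here.

```python
import shlex


def build_serve_command(model, tensor_parallel_size, disable_custom_all_reduce=True):
    """Assemble a `vllm serve` command line as a list of arguments.

    Hypothetical helper: it only builds the command and does not launch anything.
    """
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    if disable_custom_all_reduce:
        # Assumed flag name, mirroring `disable_custom_all_reduce` in the args dump.
        cmd.append("--disable-custom-all-reduce")
    return cmd


cmd = build_serve_command("meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", 8)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run(cmd)` if you prefer to launch from Python.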
@RonanKMcGovern might be an issue with HF's ability to scale since everyone is rushing to get the weights for the first time
I am trying to repro this issue, so any information on it would be helpful. Are you using chunked prefill? And could you share your collect-env.py output?
I also got this
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
We saw the same TMA descriptor 700 error as well. Our configuration is:
- TP=8 on H100.
- Chunked prefill with 2048 chunk size.
The error can be avoided if I manually disable the CUTLASS FP8 kernel for fbgemm. More importantly, we cannot easily reproduce this error, since it happens randomly.
@comaniac How did you manually disable the CUTLASS FP8 kernel?
I just changed this to False https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/fbgemm_fp8.py#L148
Please check out "Announcing Llama 3.1 Support in vLLM".
To turn off chunked prefill, pass
--enable-chunked-prefill=false
then optionally combine it with
--max-model-len=4096
if turning it off causes OOM. You can change the length to match the context window you want.
if rope_scaling is not None and rope_scaling["type"] not in, KeyError: 'type'.
If you see
ValueError: 'rope_scaling' must be a dictionary with two fields, 'type' and 'factor', got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
upgrade transformers:
pip install transformers --upgrade
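Both errors come from the same mismatch: the Llama 3.1 `rope_scaling` dictionary shown in the ValueError has a `rope_type` key and no `type` key, so code that indexes `rope_scaling["type"]` fails. The sketch below shows a lookup that tolerates either key. `rope_scaling_type` is a hypothetical helper for illustration only; it is not the actual change made in vLLM or transformers.

```python
def rope_scaling_type(rope_scaling):
    """Return the RoPE scaling type from a config dict, or None if unset.

    Hypothetical helper: accepts the newer 'rope_type' key as well as the
    older 'type' key instead of assuming that 'type' is always present.
    """
    if rope_scaling is None:
        return None
    for key in ("rope_type", "type"):
        if key in rope_scaling:
            return rope_scaling[key]
    raise ValueError(f"rope_scaling has neither 'rope_type' nor 'type': {rope_scaling}")


# The dictionary reported in the ValueError above.
llama31_rope_scaling = {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3",
}

print(rope_scaling_type(llama31_rope_scaling))
```

An older-style dictionary such as `{"type": "linear", "factor": 2.0}` goes through the same function unchanged, which is the point of checking both keys.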
UPDATE: The meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 model repository has been fixed with the correct number of kv heads. Please try launching with default vLLM args and the updated model weights!