Open · nickandbro opened this issue 4 months ago
Conversion from/to f8e4m3nv is only supported on compute capability >= 90
The L40S has compute capability 8.9 (sm_89), so you need an H100 (compute capability 9.0) for inference with FP8 models.
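For reference, a quick way to confirm which compute capability your GPUs actually report is a couple of standard `torch.cuda` calls (nothing vLLM-specific); the FP8 conversion the error refers to needs 9.0 or newer:

```python
# Print each visible GPU's compute capability and whether it meets the
# >= sm_90 requirement for the f8e4m3 conversion mentioned in the error.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    ok = (major, minor) >= (9, 0)
    print(f"GPU {i}: {name} -> sm_{major}{minor}, FP8 conversion supported: {ok}")
```

An L40S prints sm_89 here, which is why the check fails.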
Thanks!
@nickandbro you can try uninstalling your Triton and using the Triton nightly; directions are here: https://github.com/triton-lang/triton?tab=readme-ov-file#quick-installation
Currently, the Triton 2.3 that we require (due to PyTorch) does not support this conversion on Ada Lovelace, but future releases will.
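A quick way to check which Triton the environment actually resolves to, before and after swapping it for the nightly (just package metadata, nothing vLLM-specific):

```python
# Confirm which Triton and PyTorch versions are installed in the environment
# vLLM runs in; re-run after upgrading Triton to verify the swap took effect.
import torch
import triton

print("torch :", torch.__version__)
print("triton:", triton.__version__)
```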
@mgoin Thanks! I'm kinda new to Triton; is that a custom kernel that sits on top of CUDA that vLLM uses? If so, I believe all I need to do is swap out the building of the kernel with the nightly, using this: https://llvm.org/docs/CMake.html
If I could do FP8 on my own Ada hardware, that would be legendary.
Triton isn't a custom kernel in itself, but a library for JIT-compiling kernels at runtime. So all you need to do is upgrade the Python package that is installed. After installing vLLM, try uninstalling triton and installing a newer version or the nightly to see if they have resolved this issue.
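If it helps, a small standalone smoke test (the standard vector-add example from the Triton tutorials, nothing vLLM-specific) can confirm the JIT path works on your GPU after swapping the package:

```python
# Minimal Triton JIT smoke test: compile and launch a vector-add kernel.
# If this prints True, the Triton compiler itself is working on this GPU.
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


n = 4096
x = torch.randn(n, device="cuda")
y = torch.randn(n, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(n, 1024),)
add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
print(torch.allclose(out, x + y))
```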
@mgoin I'm getting the same error, but with Mixtral 8x7B in FP16 on 8x L4 GPUs. I also tried installing Triton from source, but that didn't work either.
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
INFO 08-24 11:03:27 api_server.py:339] vLLM API server version 0.5.4
INFO 08-24 11:03:27 api_server.py:340] args: Namespace(model_tag='/local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1', host='0.0.0.0', port=1234, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend='ray', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=32, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['mixtral-8x7b-instruct-v0.1'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fe5bd9216c0>)
WARNING 08-24 11:03:27 config.py:1454] Casting torch.bfloat16 to torch.float16.
2024-08-24 11:03:43,771 INFO worker.py:1772 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO 08-24 11:03:51 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1', speculative_config=None, tokenizer='/local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=mixtral-8x7b-instruct-v0.1, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-24 11:03:51 ray_gpu_executor.py:117] use_ray_spmd_worker: False
INFO 08-24 11:03:51 ray_gpu_executor.py:120] driver_ip: 10.168.76.49
INFO 08-24 11:05:00 utils.py:841] Found nccl from library libnccl.so.2
INFO 08-24 11:05:00 pynccl.py:63] vLLM is using nccl==2.22.3
(RayWorkerWrapper pid=35540) INFO 08-24 11:05:00 utils.py:841] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=35540) INFO 08-24 11:05:00 pynccl.py:63] vLLM is using nccl==2.22.3
WARNING 08-24 11:05:01 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 08-24 11:05:01 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fe5072ccb10>, local_subscribe_port=33399, remote_subscribe_port=None)
INFO 08-24 11:05:01 model_runner.py:720] Starting to load model /local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1...
(RayWorkerWrapper pid=35540) WARNING 08-24 11:05:01 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerWrapper pid=35540) INFO 08-24 11:05:01 model_runner.py:720] Starting to load model /local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1...
Loading safetensors checkpoint shards: 0% Completed | 0/19 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 5% Completed | 1/19 [00:00<00:02, 7.54it/s]
Loading safetensors checkpoint shards: 11% Completed | 2/19 [00:00<00:02, 5.93it/s]
Loading safetensors checkpoint shards: 16% Completed | 3/19 [00:00<00:02, 5.38it/s]
Loading safetensors checkpoint shards: 21% Completed | 4/19 [00:00<00:02, 5.10it/s]
Loading safetensors checkpoint shards: 26% Completed | 5/19 [00:00<00:02, 4.99it/s]
Loading safetensors checkpoint shards: 32% Completed | 6/19 [00:01<00:02, 5.01it/s]
Loading safetensors checkpoint shards: 37% Completed | 7/19 [00:01<00:02, 4.96it/s]
Loading safetensors checkpoint shards: 42% Completed | 8/19 [00:01<00:02, 4.90it/s]
Loading safetensors checkpoint shards: 47% Completed | 9/19 [00:01<00:02, 4.89it/s]
Loading safetensors checkpoint shards: 53% Completed | 10/19 [00:01<00:01, 4.92it/s]
Loading safetensors checkpoint shards: 58% Completed | 11/19 [00:02<00:01, 4.87it/s]
Loading safetensors checkpoint shards: 63% Completed | 12/19 [00:02<00:01, 4.77it/s]
Loading safetensors checkpoint shards: 68% Completed | 13/19 [00:02<00:01, 4.78it/s]
Loading safetensors checkpoint shards: 74% Completed | 14/19 [00:02<00:01, 4.81it/s]
Loading safetensors checkpoint shards: 79% Completed | 15/19 [00:03<00:00, 4.93it/s]
Loading safetensors checkpoint shards: 84% Completed | 16/19 [00:03<00:00, 4.96it/s]
Loading safetensors checkpoint shards: 89% Completed | 17/19 [00:03<00:00, 4.96it/s]
Loading safetensors checkpoint shards: 95% Completed | 18/19 [00:03<00:00, 4.96it/s]
Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:03<00:00, 4.92it/s]
Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:03<00:00, 4.98it/s]
INFO 08-24 11:05:05 model_runner.py:732] Loading model weights took 10.8853 GB
(RayWorkerWrapper pid=35540) INFO 08-24 11:05:16 model_runner.py:732] Loading model weights took 10.8853 GB
(RayWorkerWrapper pid=36687) INFO 08-24 11:05:00 utils.py:841] Found nccl from library libnccl.so.2 [repeated 6x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=36687) INFO 08-24 11:05:00 pynccl.py:63] vLLM is using nccl==2.22.3 [repeated 6x across cluster]
(RayWorkerWrapper pid=36687) WARNING 08-24 11:05:01 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly. [repeated 6x across cluster]
(RayWorkerWrapper pid=36687) INFO 08-24 11:05:01 model_runner.py:720] Starting to load model /local_disk0/mistralai/Mixtral-8x7B-Instruct-v0.1... [repeated 6x across cluster]
(RayWorkerWrapper pid=36687) *** SIGSEGV received at time=1724497518 on cpu 99 ***
(RayWorkerWrapper pid=36687) PC: @ 0x5266a0 (unknown) (unknown)
(RayWorkerWrapper pid=36687) @ 0x7fb005816520 47342376 (unknown)
(RayWorkerWrapper pid=36687) @ 0x7fafac8ad900 (unknown) (unknown)
(RayWorkerWrapper pid=36687) @ 0x95e040 (unknown) (unknown)
(RayWorkerWrapper pid=36687) [2024-08-24 11:05:18,199 E 36687 36687] logging.cc:440: *** SIGSEGV received at time=1724497518 on cpu 99 ***
(RayWorkerWrapper pid=36687) [2024-08-24 11:05:18,202 E 36687 36687] logging.cc:440: PC: @ 0x5266a0 (unknown) (unknown)
(RayWorkerWrapper pid=36687) [2024-08-24 11:05:18,202 E 36687 36687] logging.cc:440: @ 0x7fb005816520 47342376 (unknown)
(RayWorkerWrapper pid=36687) [2024-08-24 11:05:18,205 E 36687 36687] logging.cc:440: @ 0x7fafac8ad900 (unknown) (unknown)
(RayWorkerWrapper pid=36687) [2024-08-24 11:05:18,211 E 36687 36687] logging.cc:440: @ 0x95e040 (unknown) (unknown)
(RayWorkerWrapper pid=36687) Fatal Python error: Segmentation fault
(RayWorkerWrapper pid=36687)
(RayWorkerWrapper pid=36687) Stack (most recent call first):
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 223 in __init__
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1069 in call_JitFunction
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1109 in visit_Call
(RayWorkerWrapper pid=36687) File "/usr/lib/python3.11/ast.py", line 410 in visit
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 897 in <listcomp>
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 897 in visit_For
(RayWorkerWrapper pid=36687) File "/usr/lib/python3.11/ast.py", line 410 in visit
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 351 in visit_compound_statement
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 443 in visit_FunctionDef
(RayWorkerWrapper pid=36687) File "/usr/lib/python3.11/ast.py", line 410 in visit
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
(RayWorkerWrapper pid=36687) File "/usr/lib/python3.11/ast.py", line 418 in generic_visit
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 359 in visit_Module
(RayWorkerWrapper pid=36687) File "/usr/lib/python3.11/ast.py", line 410 in visit
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1297 in ast_to_ttir
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/compiler.py", line 113 in make_ir
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/compiler.py", line 276 in compile
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/runtime/jit.py", line 662 in run
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/runtime/jit.py", line 345 in <lambda>
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 246 in invoke_fused_moe_kernel
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 531 in fused_experts
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 613 in fused_moe
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 92 in forward_cuda
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/custom_op.py", line 13 in forward
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 75 in apply
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 250 in forward
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 100 in forward
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 243 in forward
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 296 in forward
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 374 in forward
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1363 in execute_model
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 940 in profile_run
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/worker/worker.py", line 179 in determine_num_available_blocks
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 378 in execute_method
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/ray/util/tracing/tracing_helper.py", line 467 in _resume_span
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/ray/_private/function_manager.py", line 691 in actor_method_executor
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/ray/_private/worker.py", line 887 in main_loop
(RayWorkerWrapper pid=36687) File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/ray/_private/workers/default_worker.py", line 289 in <module>
(RayWorkerWrapper pid=36687)
(RayWorkerWrapper pid=36687) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, simplejson._speedups, uvloop.loop, ray._raylet, pvectorc, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, sentencepiece._sentencepiece, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, PIL._imaging, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, snappy._snappy, lz4._version, lz4.frame._frame, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, _cffi_backend, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, xxhash._xxhash, pyarrow._json, markupsafe._speedups, zmq.libzmq, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils, cuda_utils, __triton_launcher (total: 117)
*** SIGSEGV received at time=1724497518 on cpu 60 ***
PC: @ 0x5266a0 (unknown) (unknown)
@ 0x7fe5bdf5d520 (unknown) (unknown)
@ 0x7fe44722dc00 (unknown) (unknown)
@ 0x95e040 (unknown) (unknown)
[2024-08-24 11:05:18,355 E 11842 11842] logging.cc:440: *** SIGSEGV received at time=1724497518 on cpu 60 ***
[2024-08-24 11:05:18,359 E 11842 11842] logging.cc:440: PC: @ 0x5266a0 (unknown) (unknown)
[2024-08-24 11:05:18,359 E 11842 11842] logging.cc:440: @ 0x7fe5bdf5d520 (unknown) (unknown)
[2024-08-24 11:05:18,362 E 11842 11842] logging.cc:440: @ 0x7fe44722dc00 (unknown) (unknown)
[2024-08-24 11:05:18,369 E 11842 11842] logging.cc:440: @ 0x95e040 (unknown) (unknown)
Fatal Python error: Segmentation fault
Stack (most recent call first):
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 223 in __init__
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1069 in call_JitFunction
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1109 in visit_Call
File "/usr/lib/python3.11/ast.py", line 410 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 897 in <listcomp>
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 897 in visit_For
File "/usr/lib/python3.11/ast.py", line 410 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 351 in visit_compound_statement
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 443 in visit_FunctionDef
File "/usr/lib/python3.11/ast.py", line 410 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
File "/usr/lib/python3.11/ast.py", line 418 in generic_visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 359 in visit_Module
File "/usr/lib/python3.11/ast.py", line 410 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1204 in visit
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1297 in ast_to_ttir
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/compiler.py", line 113 in make_ir
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/compiler/compiler.py", line 276 in compile
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/runtime/jit.py", line 662 in run
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/triton/runtime/jit.py", line 345 in <lambda>
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 246 in invoke_fused_moe_kernel
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 531 in fused_experts
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 613 in fused_moe
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 92 in forward_cuda
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/custom_op.py", line 13 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 75 in apply
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 250 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 100 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 243 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 296 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 374 in forward
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1363 in execute_model
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 940 in profile_run
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/worker/worker.py", line 179 in determine_num_available_blocks
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 378 in execute_method
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 372 in _run_workers
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 38 in determine_num_available_blocks
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 362 in _initialize_kv_caches
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 263 in __init__
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 552 in _init_engine
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 381 in __init__
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 471 in from_engine_args
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 25 in __init__
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 217 in run_rpc_server
File "/usr/lib/python3.11/multiprocessing/process.py", line 108 in run
File "/usr/lib/python3.11/multiprocessing/process.py", line 314 in _bootstrap
File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 71 in _launch
File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 19 in __init__
File "/usr/lib/python3.11/multiprocessing/context.py", line 281 in _Popen
File "/usr/lib/python3.11/multiprocessing/context.py", line 224 in _Popen
File "/usr/lib/python3.11/multiprocessing/process.py", line 121 in start
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 115 in build_async_engine_client
File "/usr/lib/python3.11/contextlib.py", line 204 in __aenter__
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 342 in run_server
File "/usr/lib/python3.11/asyncio/events.py", line 80 in _run
File "/usr/lib/python3.11/asyncio/base_events.py", line 1909 in _run_once
File "/usr/lib/python3.11/asyncio/base_events.py", line 604 in run_forever
File "/usr/lib/python3.11/asyncio/base_events.py", line 637 in run_until_complete
File "/usr/lib/python3.11/asyncio/runners.py", line 120 in run
File "/usr/lib/python3.11/asyncio/runners.py", line 188 in run
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/scripts.py", line 30 in serve
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/lib/python3.11/site-packages/vllm/scripts.py", line 149 in main
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-7f4f8afa-c69f-461e-9b8e-21b54627fc63/bin/vllm", line 8 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _brotli, simplejson._speedups, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, pvectorc, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, PIL._imaging, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, snappy._snappy, lz4._version, lz4.frame._frame, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, _cffi_backend, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, xxhash._xxhash, pyarrow._json, markupsafe._speedups, ujson, zmq.libzmq, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils, grpc._cython.cygrpc, cuda_utils, __triton_launcher (total: 119)
/usr/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Mixtral works when I revert to vllm==0.2.7 (where fused_moe was not implemented yet).
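(For reference, a minimal offline repro outside the API server; a sketch only, with the model path and `tensor_parallel_size` adjusted to your setup. The segfault above happens in `profile_run`, and the offline entrypoint runs the same profiling forward pass at startup:)

```python
# Offline sanity check: exercises the same Mixtral fused_moe forward pass that
# the server runs during memory profiling, without the OpenAI frontend.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # or a local checkpoint path
    tensor_parallel_size=8,
    enforce_eager=True,
)
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["[INST] Say hello in one sentence. [/INST]"], params)
print(outputs[0].outputs[0].text)
```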
@nivibilla I am able to load Mixtral FP16 (mistralai/Mixtral-8x7B-Instruct-v0.1) just fine with the latest release, vllm==0.5.5, on 8x L40S.
Output
(vllm-rel) ➜ ~ vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 8
INFO 08-27 18:09:03 api_server.py:440] vLLM API server version 0.5.5
INFO 08-27 18:09:03 api_server.py:441] args: Namespace(model_tag='mistralai/Mixtral-8x7B-Instruct-v0.1', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7f9bc2da8280>)
INFO 08-27 18:09:03 api_server.py:144] Multiprocessing frontend to use ipc:///tmp/524d4c7b-0851-4f1d-a71b-a9b8ebc199c7 for RPC Path.
INFO 08-27 18:09:03 api_server.py:161] Started engine process with PID 2193797
INFO 08-27 18:09:07 config.py:813] Defaulting to use mp for distributed inference
INFO 08-27 18:09:07 llm_engine.py:184] Initializing an LLM engine (v0.5.5) with config: model='mistralai/Mixtral-8x7B-Instruct-v0.1', speculative_config=None, tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=mistralai/Mixtral-8x7B-Instruct-v0.1, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 08-27 18:09:07 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 40 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-27 18:09:07 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 08-27 18:09:10 utils.py:975] Found nccl from library libnccl.so.2
INFO 08-27 18:09:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:10 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:10 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:10 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:10 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:10 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:10 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:10 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2193938) WARNING 08-27 18:09:10 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2193937) WARNING 08-27 18:09:10 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2193936) WARNING 08-27 18:09:10 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2193941) WARNING 08-27 18:09:10 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2193939) WARNING 08-27 18:09:10 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 08-27 18:09:10 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2193935) WARNING 08-27 18:09:10 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2193940) WARNING 08-27 18:09:10 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 08-27 18:09:10 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fd8bc99f310>, local_subscribe_port=58501, remote_subscribe_port=None)
INFO 08-27 18:09:10 model_runner.py:879] Starting to load model mistralai/Mixtral-8x7B-Instruct-v0.1...
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:10 model_runner.py:879] Starting to load model mistralai/Mixtral-8x7B-Instruct-v0.1...
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:10 model_runner.py:879] Starting to load model mistralai/Mixtral-8x7B-Instruct-v0.1...
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:10 model_runner.py:879] Starting to load model mistralai/Mixtral-8x7B-Instruct-v0.1...
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:10 model_runner.py:879] Starting to load model mistralai/Mixtral-8x7B-Instruct-v0.1...
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:10 model_runner.py:879] Starting to load model mistralai/Mixtral-8x7B-Instruct-v0.1...
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:10 model_runner.py:879] Starting to load model mistralai/Mixtral-8x7B-Instruct-v0.1...
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:10 model_runner.py:879] Starting to load model mistralai/Mixtral-8x7B-Instruct-v0.1...
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:11 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:11 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:11 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:11 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:11 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:11 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:11 weight_utils.py:236] Using model weights format ['*.safetensors']
INFO 08-27 18:09:11 weight_utils.py:236] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/19 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 5% Completed | 1/19 [00:00<00:02, 7.08it/s]
Loading safetensors checkpoint shards: 11% Completed | 2/19 [00:00<00:02, 7.82it/s]
Loading safetensors checkpoint shards: 16% Completed | 3/19 [00:00<00:02, 7.48it/s]
Loading safetensors checkpoint shards: 21% Completed | 4/19 [00:00<00:02, 6.95it/s]
Loading safetensors checkpoint shards: 26% Completed | 5/19 [00:00<00:02, 6.53it/s]
Loading safetensors checkpoint shards: 32% Completed | 6/19 [00:00<00:02, 5.97it/s]
Loading safetensors checkpoint shards: 37% Completed | 7/19 [00:01<00:01, 6.17it/s]
Loading safetensors checkpoint shards: 42% Completed | 8/19 [00:01<00:01, 6.11it/s]
Loading safetensors checkpoint shards: 47% Completed | 9/19 [00:01<00:01, 6.20it/s]
Loading safetensors checkpoint shards: 53% Completed | 10/19 [00:01<00:01, 6.38it/s]
Loading safetensors checkpoint shards: 58% Completed | 11/19 [00:01<00:01, 6.03it/s]
Loading safetensors checkpoint shards: 63% Completed | 12/19 [00:01<00:01, 5.85it/s]
Loading safetensors checkpoint shards: 68% Completed | 13/19 [00:02<00:01, 5.56it/s]
Loading safetensors checkpoint shards: 74% Completed | 14/19 [00:02<00:00, 5.77it/s]
Loading safetensors checkpoint shards: 79% Completed | 15/19 [00:02<00:00, 5.83it/s]
Loading safetensors checkpoint shards: 84% Completed | 16/19 [00:02<00:00, 5.99it/s]
Loading safetensors checkpoint shards: 89% Completed | 17/19 [00:02<00:00, 5.98it/s]
Loading safetensors checkpoint shards: 95% Completed | 18/19 [00:02<00:00, 5.48it/s]
Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:03<00:00, 5.51it/s]
Loading safetensors checkpoint shards: 100% Completed | 19/19 [00:03<00:00, 6.01it/s]
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:14 model_runner.py:890] Loading model weights took 10.8853 GB
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:14 model_runner.py:890] Loading model weights took 10.8853 GB
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:14 model_runner.py:890] Loading model weights took 10.8853 GB
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:14 model_runner.py:890] Loading model weights took 10.8853 GB
INFO 08-27 18:09:14 model_runner.py:890] Loading model weights took 10.8853 GB
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:14 model_runner.py:890] Loading model weights took 10.8853 GB
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:15 model_runner.py:890] Loading model weights took 10.8853 GB
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:15 model_runner.py:890] Loading model weights took 10.8853 GB
INFO 08-27 18:09:20 distributed_gpu_executor.py:56] # GPU blocks: 107636, # CPU blocks: 16384
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:21 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:21 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:21 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:21 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:21 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:21 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:21 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:21 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:21 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:21 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:21 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:21 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-27 18:09:21 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:21 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-27 18:09:21 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:21 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2193937) INFO 08-27 18:09:36 model_runner.py:1300] Graph capturing finished in 15 secs.
(VllmWorkerProcess pid=2193941) INFO 08-27 18:09:36 model_runner.py:1300] Graph capturing finished in 16 secs.
(VllmWorkerProcess pid=2193939) INFO 08-27 18:09:36 model_runner.py:1300] Graph capturing finished in 15 secs.
(VllmWorkerProcess pid=2193935) INFO 08-27 18:09:36 model_runner.py:1300] Graph capturing finished in 16 secs.
(VllmWorkerProcess pid=2193940) INFO 08-27 18:09:36 model_runner.py:1300] Graph capturing finished in 16 secs.
(VllmWorkerProcess pid=2193936) INFO 08-27 18:09:36 model_runner.py:1300] Graph capturing finished in 16 secs.
(VllmWorkerProcess pid=2193938) INFO 08-27 18:09:36 model_runner.py:1300] Graph capturing finished in 15 secs.
INFO 08-27 18:09:36 model_runner.py:1300] Graph capturing finished in 15 secs.
INFO 08-27 18:09:37 api_server.py:209] vLLM to use /tmp/tmp4co5wqy1 as PROMETHEUS_MULTIPROC_DIR
WARNING 08-27 18:09:37 serving_embedding.py:188] embedding_mode is False. Embedding API will not work.
INFO 08-27 18:09:37 launcher.py:20] Available routes are:
INFO 08-27 18:09:37 launcher.py:28] Route: /openapi.json, Methods: GET, HEAD
INFO 08-27 18:09:37 launcher.py:28] Route: /docs, Methods: GET, HEAD
INFO 08-27 18:09:37 launcher.py:28] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 08-27 18:09:37 launcher.py:28] Route: /redoc, Methods: GET, HEAD
INFO 08-27 18:09:37 launcher.py:28] Route: /health, Methods: GET
INFO 08-27 18:09:37 launcher.py:28] Route: /tokenize, Methods: POST
INFO 08-27 18:09:37 launcher.py:28] Route: /detokenize, Methods: POST
INFO 08-27 18:09:37 launcher.py:28] Route: /v1/models, Methods: GET
INFO 08-27 18:09:37 launcher.py:28] Route: /version, Methods: GET
INFO 08-27 18:09:37 launcher.py:28] Route: /v1/chat/completions, Methods: POST
INFO 08-27 18:09:37 launcher.py:28] Route: /v1/completions, Methods: POST
INFO 08-27 18:09:37 launcher.py:28] Route: /v1/embeddings, Methods: POST
INFO 08-27 18:09:37 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing
INFO: Started server process [2193718]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Also, the same works for FP8 with neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8!
(vllm-rel) ➜ ~ vllm serve neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 --tensor-parallel-size 8
INFO 08-27 18:13:48 api_server.py:440] vLLM API server version 0.5.5
INFO 08-27 18:13:48 api_server.py:441] args: Namespace(model_tag='neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7f8381a60280>)
INFO 08-27 18:13:48 api_server.py:144] Multiprocessing frontend to use ipc:///tmp/18e76b0e-7b36-4b76-b091-9d9c89bd34ee for RPC Path.
INFO 08-27 18:13:48 api_server.py:161] Started engine process with PID 2195440
INFO 08-27 18:13:52 config.py:813] Defaulting to use mp for distributed inference
INFO 08-27 18:13:52 llm_engine.py:184] Initializing an LLM engine (v0.5.5) with config: model='neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8', speculative_config=None, tokenizer='neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 08-27 18:13:52 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 40 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-27 18:13:52 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=2195578) INFO 08-27 18:13:52 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2195579) INFO 08-27 18:13:52 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2195580) INFO 08-27 18:13:52 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2195582) INFO 08-27 18:13:53 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2195581) INFO 08-27 18:13:53 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2195585) INFO 08-27 18:13:53 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2195586) INFO 08-27 18:13:53 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2195581) INFO 08-27 18:13:55 utils.py:975] Found nccl from library libnccl.so.2
INFO 08-27 18:13:55 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2195581) INFO 08-27 18:13:55 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-27 18:13:55 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2195580) INFO 08-27 18:13:55 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2195578) INFO 08-27 18:13:55 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2195580) INFO 08-27 18:13:55 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2195579) INFO 08-27 18:13:55 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2195578) INFO 08-27 18:13:55 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2195585) INFO 08-27 18:13:55 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2195582) INFO 08-27 18:13:55 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2195579) INFO 08-27 18:13:55 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2195586) INFO 08-27 18:13:55 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2195582) INFO 08-27 18:13:55 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2195585) INFO 08-27 18:13:55 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2195586) INFO 08-27 18:13:55 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2195582) WARNING 08-27 18:13:55 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2195585) WARNING 08-27 18:13:55 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2195586) WARNING 08-27 18:13:55 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2195580) WARNING 08-27 18:13:55 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2195578) WARNING 08-27 18:13:55 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2195581) WARNING 08-27 18:13:55 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2195579) WARNING 08-27 18:13:55 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 08-27 18:13:55 custom_all_reduce.py:122] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 08-27 18:13:55 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fc240fd3880>, local_subscribe_port=59205, remote_subscribe_port=None)
INFO 08-27 18:13:55 model_runner.py:879] Starting to load model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8...
(VllmWorkerProcess pid=2195578) INFO 08-27 18:13:55 model_runner.py:879] Starting to load model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8...
(VllmWorkerProcess pid=2195585) INFO 08-27 18:13:55 model_runner.py:879] Starting to load model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8...
(VllmWorkerProcess pid=2195586) INFO 08-27 18:13:55 model_runner.py:879] Starting to load model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8...
(VllmWorkerProcess pid=2195581) INFO 08-27 18:13:55 model_runner.py:879] Starting to load model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8...
(VllmWorkerProcess pid=2195582) INFO 08-27 18:13:55 model_runner.py:879] Starting to load model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8...
(VllmWorkerProcess pid=2195580) INFO 08-27 18:13:55 model_runner.py:879] Starting to load model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8...
(VllmWorkerProcess pid=2195579) INFO 08-27 18:13:55 model_runner.py:879] Starting to load model neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8...
WARNING 08-27 18:13:55 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=2195585) WARNING 08-27 18:13:55 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=2195582) WARNING 08-27 18:13:55 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=2195586) WARNING 08-27 18:13:55 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=2195581) WARNING 08-27 18:13:55 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=2195579) WARNING 08-27 18:13:55 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=2195580) WARNING 08-27 18:13:55 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=2195578) WARNING 08-27 18:13:55 fp8.py:46] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
INFO 08-27 18:13:56 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2195580) INFO 08-27 18:13:56 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2195585) INFO 08-27 18:13:56 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2195582) INFO 08-27 18:13:56 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2195581) INFO 08-27 18:13:56 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2195578) INFO 08-27 18:13:56 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2195579) INFO 08-27 18:13:56 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2195586) INFO 08-27 18:13:56 weight_utils.py:236] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/10 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 10% Completed | 1/10 [00:00<00:01, 6.42it/s]
Loading safetensors checkpoint shards: 20% Completed | 2/10 [00:00<00:01, 6.24it/s]
Loading safetensors checkpoint shards: 30% Completed | 3/10 [00:00<00:01, 5.96it/s]
Loading safetensors checkpoint shards: 40% Completed | 4/10 [00:00<00:01, 5.90it/s]
Loading safetensors checkpoint shards: 50% Completed | 5/10 [00:00<00:00, 5.79it/s]
Loading safetensors checkpoint shards: 60% Completed | 6/10 [00:01<00:00, 5.74it/s]
Loading safetensors checkpoint shards: 80% Completed | 8/10 [00:01<00:00, 6.50it/s]
Loading safetensors checkpoint shards: 90% Completed | 9/10 [00:01<00:00, 6.09it/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:01<00:00, 5.91it/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:01<00:00, 6.00it/s]
WARNING 08-27 18:13:57 utils.py:721] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
(VllmWorkerProcess pid=2195580) WARNING 08-27 18:13:57 utils.py:721] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
(VllmWorkerProcess pid=2195582) WARNING 08-27 18:13:58 utils.py:721] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
INFO 08-27 18:13:58 model_runner.py:890] Loading model weights took 5.4791 GB
(VllmWorkerProcess pid=2195580) INFO 08-27 18:13:58 model_runner.py:890] Loading model weights took 5.4791 GB
(VllmWorkerProcess pid=2195585) WARNING 08-27 18:13:58 utils.py:721] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
(VllmWorkerProcess pid=2195581) WARNING 08-27 18:13:58 utils.py:721] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
(VllmWorkerProcess pid=2195582) INFO 08-27 18:13:58 model_runner.py:890] Loading model weights took 5.4791 GB
(VllmWorkerProcess pid=2195585) INFO 08-27 18:13:58 model_runner.py:890] Loading model weights took 5.4791 GB
(VllmWorkerProcess pid=2195578) WARNING 08-27 18:13:58 utils.py:721] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
(VllmWorkerProcess pid=2195581) INFO 08-27 18:13:58 model_runner.py:890] Loading model weights took 5.4791 GB
(VllmWorkerProcess pid=2195579) WARNING 08-27 18:13:58 utils.py:721] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
(VllmWorkerProcess pid=2195578) INFO 08-27 18:13:58 model_runner.py:890] Loading model weights took 5.4791 GB
(VllmWorkerProcess pid=2195586) WARNING 08-27 18:13:58 utils.py:721] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
(VllmWorkerProcess pid=2195579) INFO 08-27 18:13:58 model_runner.py:890] Loading model weights took 5.4791 GB
(VllmWorkerProcess pid=2195586) INFO 08-27 18:13:58 model_runner.py:890] Loading model weights took 5.4791 GB
INFO 08-27 18:14:03 distributed_gpu_executor.py:56] # GPU blocks: 129516, # CPU blocks: 16384
(VllmWorkerProcess pid=2195578) INFO 08-27 18:14:04 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2195578) INFO 08-27 18:14:04 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2195582) INFO 08-27 18:14:04 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2195582) INFO 08-27 18:14:04 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2195586) INFO 08-27 18:14:05 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2195586) INFO 08-27 18:14:05 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2195581) INFO 08-27 18:14:05 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2195581) INFO 08-27 18:14:05 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2195579) INFO 08-27 18:14:05 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2195579) INFO 08-27 18:14:05 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2195585) INFO 08-27 18:14:05 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2195585) INFO 08-27 18:14:05 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2195580) INFO 08-27 18:14:05 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2195580) INFO 08-27 18:14:05 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-27 18:14:05 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-27 18:14:05 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2195581) INFO 08-27 18:14:20 model_runner.py:1300] Graph capturing finished in 15 secs.
INFO 08-27 18:14:20 model_runner.py:1300] Graph capturing finished in 15 secs.
(VllmWorkerProcess pid=2195582) INFO 08-27 18:14:20 model_runner.py:1300] Graph capturing finished in 16 secs.
(VllmWorkerProcess pid=2195578) INFO 08-27 18:14:20 model_runner.py:1300] Graph capturing finished in 16 secs.
(VllmWorkerProcess pid=2195585) INFO 08-27 18:14:20 model_runner.py:1300] Graph capturing finished in 15 secs.
(VllmWorkerProcess pid=2195580) INFO 08-27 18:14:20 model_runner.py:1300] Graph capturing finished in 15 secs.
(VllmWorkerProcess pid=2195579) INFO 08-27 18:14:20 model_runner.py:1300] Graph capturing finished in 15 secs.
(VllmWorkerProcess pid=2195586) INFO 08-27 18:14:20 model_runner.py:1300] Graph capturing finished in 15 secs.
INFO 08-27 18:14:20 api_server.py:209] vLLM to use /tmp/tmphfyfhzvv as PROMETHEUS_MULTIPROC_DIR
WARNING 08-27 18:14:20 serving_embedding.py:188] embedding_mode is False. Embedding API will not work.
INFO 08-27 18:14:20 launcher.py:20] Available routes are:
INFO 08-27 18:14:20 launcher.py:28] Route: /openapi.json, Methods: GET, HEAD
INFO 08-27 18:14:20 launcher.py:28] Route: /docs, Methods: GET, HEAD
INFO 08-27 18:14:20 launcher.py:28] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 08-27 18:14:20 launcher.py:28] Route: /redoc, Methods: GET, HEAD
INFO 08-27 18:14:20 launcher.py:28] Route: /health, Methods: GET
INFO 08-27 18:14:20 launcher.py:28] Route: /tokenize, Methods: POST
INFO 08-27 18:14:20 launcher.py:28] Route: /detokenize, Methods: POST
INFO 08-27 18:14:20 launcher.py:28] Route: /v1/models, Methods: GET
INFO 08-27 18:14:20 launcher.py:28] Route: /version, Methods: GET
INFO 08-27 18:14:20 launcher.py:28] Route: /v1/chat/completions, Methods: POST
INFO 08-27 18:14:20 launcher.py:28] Route: /v1/completions, Methods: POST
INFO 08-27 18:14:20 launcher.py:28] Route: /v1/embeddings, Methods: POST
INFO 08-27 18:14:20 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing
INFO: Started server process [2195363]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
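For completeness, once the server reports the routes above, a quick request against /v1/chat/completions is enough to confirm that FP8 inference works end to end. A minimal sketch, assuming the default port 8000 from the log, no --api-key, and the model name as served above:

```python
# Smoke test against the running vLLM OpenAI-compatible server.
# Assumes port 8000 (as in the log) and no API key configured.
import requests

resp = requests.post(
    "http://0.0.0.0:8000/v1/chat/completions",
    json={
        "model": "neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```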
@mgoin Thanks for confirming. Everything is identical between your setup and mine, apart from the fact that I am using L4 GPUs and you are using L40S. However, one difference I noted is that my cluster (I'm using Databricks g6.48x) is on a much older driver, 535, whereas you are on 555. Maybe that's the issue; I will raise it with Databricks.
That said, would it be worth adding an option to disable fused_moe for cases like this, where the driver cannot be updated so easily?
The error is happening in the Triton compiler, so it seems like an unenforced requirement from Triton. We could try to make a wrapper for this, but we would have to find the minimum driver...
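As a rough illustration of what such a guard could look like (the actual minimum driver version is exactly what still needs to be determined, so the threshold below is only a placeholder assumption):

```python
# Hypothetical preflight check before taking the Triton fused_moe path.
# MIN_DRIVER_MAJOR is a placeholder assumption, not a confirmed requirement.
import subprocess

import torch

MIN_DRIVER_MAJOR = 550  # placeholder; the real minimum would have to be verified


def driver_major_version() -> int:
    # nvidia-smi reports the installed driver version, e.g. "555.42.02"
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    return int(out.strip().splitlines()[0].split(".")[0])


def fused_moe_fp8_supported() -> bool:
    major, minor = torch.cuda.get_device_capability()
    # Newer Triton releases handle FP8 conversion on Ada (8.9); older ones
    # require Hopper (9.0). A sufficiently new driver is assumed as well.
    return (major, minor) >= (8, 9) and driver_major_version() >= MIN_DRIVER_MAJOR


if not fused_moe_fp8_supported():
    print("This setup may need an unfused MoE path or a newer driver/Triton.")
```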
I tried updating the driver; it installed, but I get permission denied when I try to reboot the GPU. Databricks is such a pain. I'm basically stuck with driver 535.161.07.
@mgoin I got the same error on A10s as well, this time with DeepSeek Lite. Same place, the fused_moe kernel.
Your current environment
For setup, I am using version 0.5 and the vllm_openai target of the Dockerfile with these arguments:
🐛 Describe the bug
When I load Mixtral-8x22B-Instruct-v0.1-FP8 onto 8 L40S GPUs, it causes this error:
Any help would be much appreciated!