Techinix opened this issue 7 months ago
Any error message or repro?
+1 Same issue. No error message. The log just says:
INFO 04-24 14:55:32 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
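For anyone hitting this, a quick sanity check is to confirm that flash-attn is actually visible to the same Python interpreter that runs vLLM. A minimal sketch (the distribution name flash-attn and the import path flash_attn are the usual ones; adjust if your install differs):

# Minimal environment check: print the versions pip has registered and
# confirm that `import flash_attn` (roughly what vLLM's selector tries)
# succeeds in this interpreter.
import importlib.metadata as md

for dist in ("vllm", "flash-attn", "torch"):
    try:
        print(f"{dist}: {md.version(dist)}")
    except md.PackageNotFoundError:
        print(f"{dist}: not installed")

try:
    import flash_attn  # noqa: F401
    print("import flash_attn: OK")
except ImportError as exc:
    print("import flash_attn failed:", exc)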
Same here: "Cannot use FlashAttention backend because the flash_attn package is not found". I'm using the PyTorch image (nvcr.io/nvidia/pytorch:24.03-py3) with this script:
export HOME=/dev/shm/
export PATH=$HOME/.local/bin/:$PATH

# Install vLLM 0.4.1 (this does not pull in flash-attn, which is optional)
pip install vllm==v0.4.1

# Reset any previous Ray state
ray disable-usage-stats
ray stop --force

export OPENBLAS_NUM_THREADS=32
export OMP_NUM_THREADS=32
pkill python

# Start a single-node Ray head, then the OpenAI-compatible server with tensor parallelism over 8 GPUs
ray start --head --num-cpus=32 --num-gpus=8
python -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x22B-Instruct-v0.1 --dtype float16 --tensor-parallel-size 8
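Once the server above is up, a minimal request like the sketch below is enough to drive the "Avg prompt/generation throughput" metrics lines that vLLM logs, which are the numbers worth comparing with and without flash-attn. It assumes the default port 8000, the requests package, and a placeholder prompt:

# Hypothetical smoke test against the OpenAI-compatible server started above.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "mistralai/Mixtral-8x22B-Instruct-v0.1",
        "prompt": "Say hello in one sentence.",
        "max_tokens": 32,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["text"])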
I actually successfully installed flash-attn 2.5.7 with vLLM 0.4.1, and it is detected by vLLM ("Using FlashAttention backend"). But the performance stays the same (no speed improvement at all).
Actually it only breaks when I upgrade from 0.4.0.post1 to 0.4.1. I think it has something to do with one of these bad boys:

Installing collected packages: triton, nvidia-nccl-cu12, torch, xformers, vllm
  Attempting uninstall: triton
    Found existing installation: triton 2.2.0
    Uninstalling triton-2.2.0:
      Successfully uninstalled triton-2.2.0
  Attempting uninstall: nvidia-nccl-cu12
    Found existing installation: nvidia-nccl-cu12 2.19.3
    Uninstalling nvidia-nccl-cu12-2.19.3:
      Successfully uninstalled nvidia-nccl-cu12-2.19.3
  Attempting uninstall: torch
    Found existing installation: torch 2.2.1
    Uninstalling torch-2.2.1:
      Successfully uninstalled torch-2.2.1
  Attempting uninstall: xformers
    Found existing installation: xformers 0.0.25
    Uninstalling xformers-0.0.25:
      Successfully uninstalled xformers-0.0.25
  Attempting uninstall: vllm
    Found existing installation: vllm 0.4.1
    Uninstalling vllm-0.4.1:
      Successfully uninstalled vllm-0.4.1
Successfully installed nvidia-nccl-cu12-2.18.1 torch-2.1.2 triton-2.1.0 vllm-0.4.0.post1 xformers-0.0.23.post1
Flash attention stops working entirely, even when importing it directly (flash_attn).
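That matches the package list above: 0.4.1 pulls in torch 2.2.1, and a flash-attn wheel that was built against the previous torch can stop importing after the upgrade, which vLLM then reports only as "package is not found". A small sketch to surface the real import error (flash_attn_varlen_func is assumed here as the flash-attn 2.x entry point that vLLM's FlashAttention backend calls):

# Surface the underlying import failure instead of vLLM's quiet fallback.
import traceback

try:
    import flash_attn
    from flash_attn import flash_attn_varlen_func  # noqa: F401
    print("flash_attn", flash_attn.__version__, "imported cleanly")
except Exception:
    traceback.print_exc()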
Same problem for me, with flash-attn 2.5.7 installed. I went back to 0.4.0.post1.
I actually successfully installed flash-attn 2.5.7 with vLLM 0.4.1, and it is detected by vLLM ("Using FlashAttention backend"). But the performance stays the same (no speed improvement at all).
Same problem
Same
Is the problem still there? I get the same speed with 0.4.0.post1 and 0.4.1. In both cases I have flash_attn installed, and in both cases the OpenAI server detects FlashAttention when it starts:
INFO 05-05 13:49:23 selector.py:28] Using FlashAttention backend.
But this is the same speed as I get without FlashAttention (using the XFormers backend).
I actually successfully installed flash-attn 2.5.7 with vLLM 0.4.1, and it is detected by vLLM ("Using FlashAttention backend"). But the performance stays the same (no speed improvement at all).
Same problem
same problem
I am experiencing the same issue. I tested with vLLM v0.4.1 and flash_attn v2.5.7. I also tested with the GQA and MHA models, both with and without Tensor Parallelism, and with input lengths of 1024, 2048, 4096, 8192, and 16384. However, the result is the same.
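If you want to compare the two backends head-to-head without reinstalling anything, here is a rough offline benchmark sketch. It assumes your vLLM build honors the VLLM_ATTENTION_BACKEND override (run it once with "XFORMERS" and once with "FLASH_ATTN"); the model and prompt are placeholders. For decode-heavy workloads the attention backend is often not the bottleneck, so similar numbers between the two runs are not necessarily a bug.

# Rough A/B benchmark sketch. VLLM_ATTENTION_BACKEND and its values
# ("XFORMERS", "FLASH_ATTN") are assumptions about your vLLM version;
# if unsupported, compare runs with and without flash-attn installed instead.
import os, time

os.environ.setdefault("VLLM_ATTENTION_BACKEND", "XFORMERS")  # or "FLASH_ATTN"

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", dtype="float16")
params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Summarize the history of GPUs in one paragraph."] * 32

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start
gen_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{gen_tokens} tokens in {elapsed:.1f}s -> {gen_tokens / elapsed:.1f} tok/s")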
Heck, I have this too:
INFO 06-25 06:57:45 selector.py:120] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-25 06:57:45 selector.py:51] Using XFormers backend.
INFO 06-25 06:57:48 selector.py:120] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-25 06:57:48 selector.py:51] Using XFormers backend.
INFO 06-25 06:57:49 weight_utils.py:207] Using model weights format ['*.safetensors']
INFO 06-25 06:57:49 weight_utils.py:250] No model.safetensors.index.json found in remote.
INFO 06-25 06:58:04 model_runner.py:146] Loading model weights took 3.6769 GB
INFO 06-25 06:58:09 gpu_executor.py:83] # GPU blocks: 1044, # CPU blocks: 512
INFO 06-25 06:58:11 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-25 06:58:11 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-25 06:58:21 model_runner.py:924] Graph capturing finished in 10 secs.
INFO: Started server process [41253]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
INFO 06-25 06:58:46 async_llm_engine.py:553] Received request 28c0b20f7f39
Currently I am getting the following speed:
INFO 06-25 06:58:47 metrics.py:341] Avg prompt throughput: 45.2 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.1%, CPU KV cache usage: 0.0%.
INFO 06-25 06:58:52 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 8.6%, CPU KV cache usage: 0.0%.
INFO 06-25 06:58:57 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 48.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 10.1%, CPU KV cache usage: 0.0%.
Do you think that if flash attention were available, my generation speed would improve? Thanks!
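For what it's worth, the selector line in your log already answers this: FlashAttention-2 needs an Ampere-class GPU or newer (compute capability 8.0+), so on a Volta/Turing card vLLM will always fall back to XFormers, and installing flash-attn will not change these numbers. A quick check:

# Check whether this GPU can use FlashAttention-2 at all (needs SM 8.0+).
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("FlashAttention-2 supported:", (major, minor) >= (8, 0))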
I actually successfully installed flash-attn 2.5.7 with vLLM 0.4.1, and it is detected by vLLM ("Using FlashAttention backend"). But the performance stays the same (no speed improvement at all).
Hi @ruifengma, I'm not sure if this answered your issue. https://github.com/vllm-project/vllm/issues/485#issuecomment-1693009046
Running into the same problem:
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=xxx" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model openbmb/MiniCPM-V-2_6 \
--trust-remote-code
gives:
INFO 08-12 10:00:18 api_server.py:339] vLLM API server version 0.5.4
INFO 08-12 10:00:18 api_server.py:340] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='openbmb/MiniCPM-V-2_6', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 370, in <module>
asyncio.run(run_server(args))
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 342, in run_server
async with build_async_engine_client(args) as async_engine_client:
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
if (model_is_embedding(args.model, args.trust_remote_code)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 64, in model_is_embedding
return ModelConfig(model=model_name,
File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 158, in __init__
self.hf_config = get_config(self.model, trust_remote_code, revision,
File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/config.py", line 46, in get_config
config = AutoConfig.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 981, in from_pretrained
config_class = get_class_from_dynamic_module(
File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 502, in get_class_from_dynamic_module
final_module = get_cached_module_file(
File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 365, in get_cached_module_file
get_cached_module_file(
File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 327, in get_cached_module_file
modules_needed = check_imports(resolved_module_file)
File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 182, in check_imports
raise ImportError(
ImportError: This modeling file requires the following packages that were not found in your environment: flash_attn. Run `pip install flash_attn`
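Note that this failure happens before vLLM ever selects an attention backend: with trust_remote_code=True, transformers loads the model's custom config code, and its check_imports step raises because that code declares flash_attn as a hard requirement. A minimal reproduction independent of vLLM (assuming transformers is installed and the repo is reachable):

# Reproduces the same ImportError path shown in the traceback above,
# without vLLM in the loop.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "openbmb/MiniCPM-V-2_6",
    trust_remote_code=True,  # runs the repo's custom configuration code
)
print(type(config).__name__)

Installing flash_attn inside the container (for example via a derived image), as the error message itself suggests, makes this check pass.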
Same thing with a SkyPilot deployment on AWS:
INFO 08-26 14:56:26 api_server.py:441] args: Namespace(host='0.0.0.0', port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='test', lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='openbmb/MiniCPM-V-2_6', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
(sky-service-97f7, pid=29148) Traceback (most recent call last):
(sky-service-97f7, pid=29148) File "/opt/conda/envs/vllm/lib/python3.9/runpy.py", line 197, in _run_module_as_main
(sky-service-97f7, pid=29148) return _run_code(code, main_globals, None,
(sky-service-97f7, pid=29148) File "/opt/conda/envs/vllm/lib/python3.9/runpy.py", line 87, in _run_code
(sky-service-97f7, pid=29148) exec(code, run_globals)
(sky-service-97f7, pid=29148) File "/opt/conda/envs/vllm/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 476, in <module>
(sky-service-97f7, pid=29148) asyncio.run(run_server(args))
(sky-service-97f7, pid=29148) File "/opt/conda/envs/vllm/lib/python3.9/asyncio/runners.py", line 44, in run
(sky-service-97f7, pid=29148) return loop.run_until_complete(main)
(sky-service-97f7, pid=29148) File "/opt/conda/envs/vllm/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
(sky-service-97f7, pid=29148) return future.result()
(sky-service-97f7, pid=29148) File "/opt/conda/envs/vllm/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 443, in run_server
(sky-service-97f7, pid=29148) async with build_async_engine_client(args) as async_engine_client:
(sky-service-97f7, pid=29148) File "/opt/conda/envs/vllm/lib/python3.9/contextlib.py", line 181, in __aenter__
(sky-service-97f7, pid=29148) return await self.gen.__anext__()
(sky-service-97f7, pid=29148) File "/opt/conda/envs/vllm/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 117, in build_async_engine_client
(sky-service-97f7, pid=29148) if (model_is_embedding(args.model, args.trust_remote_code,
(sky-service-97f7, pid=29148) File "/opt/conda/envs/vllm/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 71, in model_is_embedding
(sky-service-97f7, pid=29148) return ModelConfig(model=model_name,
(sky-service-97f7, pid=29148) File "/opt/conda/envs/vllm/lib/python3.9/site-packages/vllm/config.py", line 169, in __init__
(sky-service-97f7, pid=29148) self.hf_config = get_config(self.model, trust_remote_code, revision,
(sky-service-97f7, pid=29148) File "/opt/conda/envs/vllm/lib/python3.9/site-packages/vllm/transformers_utils/config.py", line 64, in get_config
(sky-service-97f7, pid=29148) config = AutoConfig.from_pretrained(
(sky-service-97f7, pid=29148) File "/opt/conda/envs/vllm/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 985, in from_pretrained
(sky-service-97f7, pid=29148) config_class = get_class_from_dynamic_module(
(sky-service-97f7, pid=29148) File "/opt/conda/envs/vllm/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 502, in get_class_from_dynamic_module
(sky-service-97f7, pid=29148) final_module = get_cached_module_file(
(sky-service-97f7, pid=29148) File "/opt/conda/envs/vllm/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 365, in get_cached_module_file
(sky-service-97f7, pid=29148) get_cached_module_file(
(sky-service-97f7, pid=29148) File "/opt/conda/envs/vllm/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 327, in get_cached_module_file
(sky-service-97f7, pid=29148) modules_needed = check_imports(resolved_module_file)
(sky-service-97f7, pid=29148) File "/opt/conda/envs/vllm/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 182, in check_imports
(sky-service-97f7, pid=29148) raise ImportError(
(sky-service-97f7, pid=29148) ImportError: This modeling file requires the following packages that were not found in your environment: flash_attn. Run `pip install flash_attn`
ERROR: Job 1 failed with return code list: [1]
Your current environment
Collecting environment information...
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100 80GB PCIe
Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: AuthenticAMD
Model name: AMD EPYC 74F3 24-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 1
Stepping: 1
BogoMIPS: 6387.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip pku ospke vaes vpclmulqdq rdpid fsrm arch_capabilities
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 512 KiB (8 instances)
L1i cache: 512 KiB (8 instances)
L2 cache: 4 MiB (8 instances)
L3 cache: 128 MiB (8 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] torch==2.2.1
[pip3] torchsummary==1.5.1
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu12==2.18.1.0.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0   CPU Affinity   NUMA Affinity   GPU NUMA ID
GPU0   X     0-7            0               N/A

Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks
How would you like to use vllm
I was using it just fine before upgrading to the new 0.4.1 release; flash attention doesn't work anymore even though it's installed.