vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Flash Attention not working any more #4322

Open Techinix opened 7 months ago

Techinix commented 7 months ago

Your current environment

2024-04-24 06:04:07 (27.2 MB/s) - ‘collect_env.py’ saved [24877/24877]

Collecting environment information...
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100 80GB PCIe
Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: AuthenticAMD
Model name: AMD EPYC 74F3 24-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 1
Stepping: 1
BogoMIPS: 6387.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip pku ospke vaes vpclmulqdq rdpid fsrm arch_capabilities
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 512 KiB (8 instances)
L1i cache: 512 KiB (8 instances)
L2 cache: 4 MiB (8 instances)
L3 cache: 128 MiB (8 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] torch==2.2.1
[pip3] torchsummary==1.5.1
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu12==2.18.1.0.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-7             0               N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

How would you like to use vllm

I was using it just fine before upgrading to the new release 0.4.1. Flash Attention doesn't work anymore, even though it is installed.

simon-mo commented 7 months ago

Any error message or repro?

AmoghM commented 7 months ago

+1 Same issue. No error message. The log just says: INFO 04-24 14:55:32 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
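
A quick way to confirm whether the package vLLM is looking for is actually importable in the same environment it runs in (a minimal check, not tied to any particular vLLM version):

python -c "import flash_attn; print(flash_attn.__version__)"
pip show flash-attn vllm | grep -E '^(Name|Version|Location)'

If the import fails, or the packages live in a different Python environment than the one launching vLLM, the backend selector will fall back to something like XFormers, exactly as the log above says.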

aldopareja commented 7 months ago

Same here: "Cannot use FlashAttention backend because the flash_attn package is not found".

aldopareja commented 7 months ago

Using the PyTorch image (nvcr.io/nvidia/pytorch:24.03-py3) with this script:

export HOME=/dev/shm/
export PATH=$HOME/.local/bin/:$PATH
pip install vllm==v0.4.1
ray disable-usage-stats
ray stop --force
export OPENBLAS_NUM_THREADS=32
export OMP_NUM_THREADS=32
pkill python
ray start --head --num-cpus=32 --num-gpus=8
python -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x22B-Instruct-v0.1 --dtype float16 --tensor-parallel-size 8

ruifengma commented 7 months ago

I actually successfully installed flash-attn 2.5.7 with vllm 0.4.1, and it is detected by vLLM (Using FlashAttention backend). But the performance stays the same (no speed improvement at all).
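
One way to make sure the comparison is really FlashAttention vs. XFormers is to pin the backend explicitly for each run. Recent vLLM versions (including 0.4.x, as far as I can tell) read a VLLM_ATTENTION_BACKEND environment variable; treat the exact name and accepted values as something to verify against your installed version:

VLLM_ATTENTION_BACKEND=XFORMERS python -m vllm.entrypoints.openai.api_server --model <your-model>
VLLM_ATTENTION_BACKEND=FLASH_ATTN python -m vllm.entrypoints.openai.api_server --model <your-model>

The "Using ... backend" line at startup then tells you which backend is actually active for each run.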

Techinix commented 7 months ago

Actually, it only breaks when I upgrade from 0.4.0.post1 to 0.4.1. I think it has something to do with one of these packages (pip output from rolling back to 0.4.0.post1):

Installing collected packages: triton, nvidia-nccl-cu12, torch, xformers, vllm
  Attempting uninstall: triton
    Found existing installation: triton 2.2.0
    Uninstalling triton-2.2.0:
      Successfully uninstalled triton-2.2.0
  Attempting uninstall: nvidia-nccl-cu12
    Found existing installation: nvidia-nccl-cu12 2.19.3
    Uninstalling nvidia-nccl-cu12-2.19.3:
      Successfully uninstalled nvidia-nccl-cu12-2.19.3
  Attempting uninstall: torch
    Found existing installation: torch 2.2.1
    Uninstalling torch-2.2.1:
      Successfully uninstalled torch-2.2.1
  Attempting uninstall: xformers
    Found existing installation: xformers 0.0.25
    Uninstalling xformers-0.0.25:
      Successfully uninstalled xformers-0.0.25
  Attempting uninstall: vllm
    Found existing installation: vllm 0.4.1
    Uninstalling vllm-0.4.1:
      Successfully uninstalled vllm-0.4.1
Successfully installed nvidia-nccl-cu12-2.18.1 torch-2.1.2 triton-2.1.0 vllm-0.4.0.post1 xformers-0.0.23.post1

Flash Attention stops working entirely, even when importing flash_attn directly.
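
Since the switch between 0.4.0.post1 and 0.4.1 also swaps torch (2.1.2 vs. 2.2.1), a flash-attn wheel built against the old torch can stop importing afterwards, typically with an "undefined symbol" ImportError. A hedged sketch of the usual fix is to rebuild it against whatever torch is now installed:

pip uninstall -y flash-attn
pip install flash-attn --no-build-isolation   # rebuild against the currently installed torch

The --no-build-isolation flag matters here so the build sees your installed torch rather than a temporary build environment.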

ferrybaltimore commented 7 months ago

Same problem for me, with flash-attn 2.5.7 installed. I went back to 0.4.0.post1.

JaheimLee commented 7 months ago

I actually successfully installed flash-attn 2.5.7 with vllm 0.4.1, and it is detected by vLLM (Using FlashAttention backend). But the performance stays the same (no speed improvement at all).

Same problem

DaBossCoda commented 6 months ago

Same

algrshn commented 6 months ago

Is the problem still there? I get the same speed with 0.4.0.post1 and 0.4.1. In both cases I have flash_attn installed, and in both cases the OpenAI server detects Flash Attention at startup:

INFO 05-05 13:49:23 selector.py:28] Using FlashAttention backend.

But this is the same speed as without FlashAttention (i.e., with the XFormers backend).
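
For an apples-to-apples number, it may help to measure throughput outside the server with the benchmark script that ships in the vLLM source tree, once per backend (script path and flags are taken from the repo around v0.4.x; check --help on your checkout):

git clone https://github.com/vllm-project/vllm.git && cd vllm
VLLM_ATTENTION_BACKEND=XFORMERS python benchmarks/benchmark_throughput.py --model <your-model> --input-len 1024 --output-len 256 --num-prompts 200
VLLM_ATTENTION_BACKEND=FLASH_ATTN python benchmarks/benchmark_throughput.py --model <your-model> --input-len 1024 --output-len 256 --num-prompts 200

As far as I understand the 0.4.x code, the FlashAttention backend mainly affects prefill while decode still goes through the paged-attention kernel, which would explain why single-stream generation throughput barely changes between the two backends.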

Yaomt commented 5 months ago

I actually successfully installed flash-attn 2.5.7 with vllm 0.4.1, and it is detected by vLLM (Using FlashAttention backend). But the performance stays the same (no speed improvement at all).

Same problem

same problem

llsj14 commented 5 months ago

I am experiencing the same issue. I tested with vLLM v0.4.1 and flash_attn v2.5.7. I also tested GQA and MHA models, both with and without tensor parallelism, and with input lengths of 1024, 2048, 4096, 8192, and 16384. The result is the same in every case.

AayushSameerShah commented 5 months ago

Heck, I have this too:

INFO 06-25 06:57:45 selector.py:120] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-25 06:57:45 selector.py:51] Using XFormers backend.
INFO 06-25 06:57:48 selector.py:120] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-25 06:57:48 selector.py:51] Using XFormers backend.
INFO 06-25 06:57:49 weight_utils.py:207] Using model weights format ['*.safetensors']
INFO 06-25 06:57:49 weight_utils.py:250] No model.safetensors.index.json found in remote.
INFO 06-25 06:58:04 model_runner.py:146] Loading model weights took 3.6769 GB
INFO 06-25 06:58:09 gpu_executor.py:83] # GPU blocks: 1044, # CPU blocks: 512
INFO 06-25 06:58:11 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-25 06:58:11 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-25 06:58:21 model_runner.py:924] Graph capturing finished in 10 secs.
INFO:     Started server process [41253]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
INFO 06-25 06:58:46 async_llm_engine.py:553] Received request 28c0b20f7f39

Currently I am getting the following speed:


INFO 06-25 06:58:47 metrics.py:341] Avg prompt throughput: 45.2 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.1%, CPU KV cache usage: 0.0%.
INFO 06-25 06:58:52 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 8.6%, CPU KV cache usage: 0.0%.
INFO 06-25 06:58:57 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 48.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 10.1%, CPU KV cache usage: 0.0%.

Do you think that, if Flash Attention were available, my generation speed would improve? Thanks!
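
The selector message above is the key part: FlashAttention-2 requires an Ampere-or-newer GPU (compute capability 8.0+), so on Volta/Turing cards vLLM falls back to XFormers no matter whether flash-attn is installed. A quick check (plain PyTorch, nothing vLLM-specific):

python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"

If this prints a capability like (7, 0) or (7, 5), installing flash-attn will not change the backend or the speed on that GPU.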

zyxnlp commented 4 months ago

I actually successfully installed flash-attn 2.5.7 with vllm 0.4.1, and it is detected by vLLM (Using FlashAttention backend). But the performance stays the same (no speed improvement at all).

Hi @ruifengma, I'm not sure if this answered your issue. https://github.com/vllm-project/vllm/issues/485#issuecomment-1693009046

AmazingTurtle commented 3 months ago

Running into the same problem:

docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=xxx" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model openbmb/MiniCPM-V-2_6 \
    --trust-remote-code

gives

INFO 08-12 10:00:18 api_server.py:339] vLLM API server version 0.5.4
INFO 08-12 10:00:18 api_server.py:340] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='openbmb/MiniCPM-V-2_6', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 370, in <module>
    asyncio.run(run_server(args))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 342, in run_server
    async with build_async_engine_client(args) as async_engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
    if (model_is_embedding(args.model, args.trust_remote_code)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 64, in model_is_embedding
    return ModelConfig(model=model_name,
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 158, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/config.py", line 46, in get_config
    config = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 981, in from_pretrained
    config_class = get_class_from_dynamic_module(
  File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 502, in get_class_from_dynamic_module
    final_module = get_cached_module_file(
  File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 365, in get_cached_module_file
    get_cached_module_file(
  File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 327, in get_cached_module_file
    modules_needed = check_imports(resolved_module_file)
  File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 182, in check_imports
    raise ImportError(
ImportError: This modeling file requires the following packages that were not found in your environment: flash_attn. Run `pip install flash_attn`
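
Note that this is a different failure from the original issue: the MiniCPM-V remote code (loaded via trust_remote_code) declares flash_attn as a hard import, and the traceback shows the stock vllm/vllm-openai image does not ship it. A hypothetical workaround, assuming the image's default entrypoint is the OpenAI server module, is to install flash-attn inside the container before starting the server:

docker run --gpus all --ipc=host -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=xxx" \
    --entrypoint bash \
    vllm/vllm-openai:latest \
    -c "pip install flash-attn --no-build-isolation && \
        python3 -m vllm.entrypoints.openai.api_server --model openbmb/MiniCPM-V-2_6 --trust-remote-code"

Baking the same pip install into a small derived image avoids reinstalling on every start.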

reizam commented 3 months ago

Same thing with a SkyPilot deployment on AWS:

INFO 08-26 14:56:26 api_server.py:441] args: Namespace(host='0.0.0.0', port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='test', lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='openbmb/MiniCPM-V-2_6', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
(sky-service-97f7, pid=29148) Traceback (most recent call last):
(sky-service-97f7, pid=29148)   File "/opt/conda/envs/vllm/lib/python3.9/runpy.py", line 197, in _run_module_as_main
(sky-service-97f7, pid=29148)     return _run_code(code, main_globals, None,
(sky-service-97f7, pid=29148)   File "/opt/conda/envs/vllm/lib/python3.9/runpy.py", line 87, in _run_code
(sky-service-97f7, pid=29148)     exec(code, run_globals)
(sky-service-97f7, pid=29148)   File "/opt/conda/envs/vllm/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 476, in <module>
(sky-service-97f7, pid=29148)     asyncio.run(run_server(args))
(sky-service-97f7, pid=29148)   File "/opt/conda/envs/vllm/lib/python3.9/asyncio/runners.py", line 44, in run
(sky-service-97f7, pid=29148)     return loop.run_until_complete(main)
(sky-service-97f7, pid=29148)   File "/opt/conda/envs/vllm/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
(sky-service-97f7, pid=29148)     return future.result()
(sky-service-97f7, pid=29148)   File "/opt/conda/envs/vllm/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 443, in run_server
(sky-service-97f7, pid=29148)     async with build_async_engine_client(args) as async_engine_client:
(sky-service-97f7, pid=29148)   File "/opt/conda/envs/vllm/lib/python3.9/contextlib.py", line 181, in __aenter__
(sky-service-97f7, pid=29148)     return await self.gen.__anext__()
(sky-service-97f7, pid=29148)   File "/opt/conda/envs/vllm/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 117, in build_async_engine_client
(sky-service-97f7, pid=29148)     if (model_is_embedding(args.model, args.trust_remote_code,
(sky-service-97f7, pid=29148)   File "/opt/conda/envs/vllm/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 71, in model_is_embedding
(sky-service-97f7, pid=29148)     return ModelConfig(model=model_name,
(sky-service-97f7, pid=29148)   File "/opt/conda/envs/vllm/lib/python3.9/site-packages/vllm/config.py", line 169, in __init__
(sky-service-97f7, pid=29148)     self.hf_config = get_config(self.model, trust_remote_code, revision,
(sky-service-97f7, pid=29148)   File "/opt/conda/envs/vllm/lib/python3.9/site-packages/vllm/transformers_utils/config.py", line 64, in get_config
(sky-service-97f7, pid=29148)     config = AutoConfig.from_pretrained(
(sky-service-97f7, pid=29148)   File "/opt/conda/envs/vllm/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 985, in from_pretrained
(sky-service-97f7, pid=29148)     config_class = get_class_from_dynamic_module(
(sky-service-97f7, pid=29148)   File "/opt/conda/envs/vllm/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 502, in get_class_from_dynamic_module
(sky-service-97f7, pid=29148)     final_module = get_cached_module_file(
(sky-service-97f7, pid=29148)   File "/opt/conda/envs/vllm/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 365, in get_cached_module_file
(sky-service-97f7, pid=29148)     get_cached_module_file(
(sky-service-97f7, pid=29148)   File "/opt/conda/envs/vllm/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 327, in get_cached_module_file
(sky-service-97f7, pid=29148)     modules_needed = check_imports(resolved_module_file)
(sky-service-97f7, pid=29148)   File "/opt/conda/envs/vllm/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 182, in check_imports
(sky-service-97f7, pid=29148)     raise ImportError(
(sky-service-97f7, pid=29148) ImportError: This modeling file requires the following packages that were not found in your environment: flash_attn. Run `pip install flash_attn`
ERROR: Job 1 failed with return code list: [1]