mlc-ai / llm-perf-bench


ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects #31

Closed VincentXWD closed 10 months ago

VincentXWD commented 10 months ago

Dear authors, thanks for your efforts on such great LLM benchmarking work. I have been studying this repository and ran into some problems while deploying the project.

When I was running the command below:

docker build --no-cache -t llm-perf-exllama-v2:v0.1        -f ./docker/Dockerfile.cu121.exllama_v2 .

It fails with the error: ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects.

I was able to benchmark "MLC LLM" successfully, but I hit this error when working with "Exllama V2".

the "nvidia-smi" outputs:

Fri Oct 27 01:08:12 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:06:00.0 Off |                  N/A |
| 30%   36C    P8               16W / 350W|      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090         On | 00000000:41:00.0 Off |                  N/A |
| 30%   42C    P8               22W / 350W|      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090         On | 00000000:61:00.0 Off |                  N/A |
| 39%   35C    P8               16W / 350W|      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Here is the full log of step 7:

Step 7/7 : RUN source ~/.bashrc && micromamba activate python311                       &&     MAX_JOBS=4 python -m pip install flash-attn --no-build-isolation
 ---> Running in 28d2c70206c1
Collecting flash-attn
  Downloading flash_attn-2.3.3.tar.gz (2.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.3/2.3 MB 24.8 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: torch in /root/micromamba/envs/python311/lib/python3.11/site-packages (from flash-attn) (2.2.0.dev20231026)
Collecting einops (from flash-attn)
  Downloading einops-0.7.0-py3-none-any.whl.metadata (13 kB)
Requirement already satisfied: packaging in /root/micromamba/envs/python311/lib/python3.11/site-packages (from flash-attn) (22.0)
Collecting ninja (from flash-attn)
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
Requirement already satisfied: filelock in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (3.9.0)
Requirement already satisfied: typing-extensions in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (4.8.0)
Requirement already satisfied: sympy in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (1.12)
Requirement already satisfied: networkx in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (3.2)
Requirement already satisfied: jinja2 in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (3.1.2)
Requirement already satisfied: fsspec in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (2023.10.0)
Requirement already satisfied: MarkupSafe>=2.0 in /root/micromamba/envs/python311/lib/python3.11/site-packages (from jinja2->torch->flash-attn) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in /root/micromamba/envs/python311/lib/python3.11/site-packages (from sympy->torch->flash-attn) (1.2.1)
Downloading einops-0.7.0-py3-none-any.whl (44 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 kB 1.7 MB/s eta 0:00:00
Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 307.2/307.2 kB 12.3 MB/s eta 0:00:00
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py): started
  Building wheel for flash-attn (setup.py): finished with status 'error'
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [40 lines of output]
      No CUDA runtime is found, using CUDA_HOME='/root/micromamba/envs/python311'
      fatal: not a git repository (or any of the parent directories): .git

      torch.__version__  = 2.2.0.dev20231026

      running bdist_wheel
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-mjt1ot_6/flash-attn_d426c8d13b1a498c98aa462ee75f2537/setup.py", line 288, in <module>
          setup(
        File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/__init__.py", line 103, in setup
          return distutils.core.setup(**attrs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 185, in setup
          return run_commands(dist)
                 ^^^^^^^^^^^^^^^^^^
        File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
          dist.run_commands()
        File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
          self.run_command(cmd)
        File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/dist.py", line 989, in run_command
          super().run_command(command)
        File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/tmp/pip-install-mjt1ot_6/flash-attn_d426c8d13b1a498c98aa462ee75f2537/setup.py", line 265, in run
          wheel_url, wheel_filename = get_wheel_url()
                                      ^^^^^^^^^^^^^^^
        File "/tmp/pip-install-mjt1ot_6/flash-attn_d426c8d13b1a498c98aa462ee75f2537/setup.py", line 234, in get_wheel_url
          torch_cuda_version = parse(torch.version.cuda)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/root/micromamba/envs/python311/lib/python3.11/site-packages/packaging/version.py", line 52, in parse
          return Version(version)
                 ^^^^^^^^^^^^^^^^
        File "/root/micromamba/envs/python311/lib/python3.11/site-packages/packaging/version.py", line 195, in __init__
          match = self._regex.search(version)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
      TypeError: expected string or bytes-like object, got 'NoneType'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for flash-attn
  Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects
The command '/bin/bash -ec source ~/.bashrc && micromamba activate python311                       &&     MAX_JOBS=4 python -m pip install flash-attn --no-build-isolation' returned a non-zero code: 1

I would really appreciate it if anyone could have a look at this problem and give some advice. Thanks!

junrushao commented 10 months ago

Hey thanks for asking!

From your error message, I'm not entirely sure what happened with the line below:

      No CUDA runtime is found, using CUDA_HOME='/root/micromamba/envs/python311'

FlashAttention doesn't necessarily help with performance here, so if that's the source of the issue, it is probably fine to disable it.
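
If it helps narrow things down, a quick sanity check is to open Python inside the python311 micromamba environment during the build and look at what PyTorch reports. This is just a sketch on my side, not part of the Dockerfile:

    import torch
    # On a CUDA-enabled build, torch.version.cuda is a string like "12.1";
    # on a CPU-only build it is None, which is exactly what makes
    # flash-attn's setup.py crash in parse(torch.version.cuda).
    print(torch.__version__)
    print(torch.version.cuda)
    print(torch.cuda.is_available())

If torch.version.cuda prints None, the environment ended up with a CPU-only PyTorch, and the flash-attn wheel build will fail exactly as in your traceback.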

VincentXWD commented 10 months ago

@junrushao Thanks for replying. I can now build the container and run ExLlama V2 as expected. Here is my debug log from working around the issue; it may be helpful for updating the Dockerfile.


First, I commented out this line in docker/Dockerfile.cu121.exllama_v2:

# pip install flash-attn --no-build-isolation

With flash-attn skipped, the image builds, but inference still could not run properly:

(python311) root@f039446:/exllamav2# python test_inference.py -m $MODEL_PATH -p "What is the meaning of life?" -t $OUTPUT_LEN
No CUDA runtime is found, using CUDA_HOME='/root/micromamba/envs/python311'
Traceback (most recent call last):
  File "/exllamav2/exllamav2/ext.py", line 14, in <module>
    import exllamav2_ext
ModuleNotFoundError: No module named 'exllamav2_ext'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/exllamav2/test_inference.py", line 2, in <module>
    from exllamav2 import(
  File "/exllamav2/exllamav2/__init__.py", line 3, in <module>
    from exllamav2.model import ExLlamaV2
  File "/exllamav2/exllamav2/model.py", line 17, in <module>
    from exllamav2.cache import ExLlamaV2CacheBase
  File "/exllamav2/exllamav2/cache.py", line 2, in <module>
    from exllamav2.ext import exllamav2_ext as ext_c
  File "/exllamav2/exllamav2/ext.py", line 124, in <module>
    exllamav2_ext = load \
                    ^^^^^^
  File "/root/micromamba/envs/python311/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1308, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "/root/micromamba/envs/python311/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/root/micromamba/envs/python311/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1810, in _write_ninja_file_and_build_library
    _write_ninja_file_to_build_library(
  File "/root/micromamba/envs/python311/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2201, in _write_ninja_file_to_build_library
    cuda_flags = common_cflags + COMMON_NVCC_FLAGS + _get_cuda_arch_flags()
                                                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/micromamba/envs/python311/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1982, in _get_cuda_arch_flags
    arch_list[-1] += '+PTX'
    ~~~~~~~~~^^^^
IndexError: list index out of range
(python311) root@f039446:/exllamav2# pip install exllamav2_ext
ERROR: Could not find a version that satisfies the requirement exllamav2_ext (from versions: none)
ERROR: No matching distribution found for exllamav2_ext

It seems the exllamav2_ext package was not built successfully, so I skipped it temporarily.
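
For context, the IndexError in _get_cuda_arch_flags appears to be a symptom rather than the root cause: torch.utils.cpp_extension builds its list of target compute architectures from the GPUs that PyTorch itself can see, and here it sees none, so the list is empty. A quick way to confirm (my own sketch, not something from the repo):

    import torch
    # If PyTorch cannot see any CUDA device, cpp_extension has no
    # architecture to target and arch_list[-1] fails with IndexError.
    print(torch.cuda.is_available())
    print(torch.cuda.device_count())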

As for the error about the CUDA runtime: I could still run nvidia-smi inside the container:

(python311) root@f039446:/exllamav2# nvidia-smi
Fri Oct 27 08:05:19 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:06:00.0 Off |                  N/A |
| 30%   35C    P8               16W / 350W|      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090         On | 00000000:41:00.0 Off |                  N/A |
| 30%   41C    P8               22W / 350W|      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090         On | 00000000:61:00.0 Off |                  N/A |
| 39%   35C    P8               16W / 350W|      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Then I checked the installed PyTorch and its version, and noticed it was not a CUDA build:

(python311) root@f039446:/exllamav2# python
Python 3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:40:35) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.randn([1]).cuda()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/micromamba/envs/python311/lib/python3.11/site-packages/torch/cuda/__init__.py", line 289, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
>>> torch.__version__
'2.2.0.dev20231026'
>>>

I re-installed PyTorch:

pip3 install torch torchvision torchaudio

Looks good.

(python311) root@f039446:/exllamav2# python
Python 3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:40:35) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.1.0+cu121'
>>> torch.randn([1]).cuda()
tensor([-0.7436], device='cuda:0')
>>>
(python311) root@f039446:/exllamav2# python test_inference.py -m $MODEL_PATH -p "What is the meaning of life?" -t $OUTPUT_LEN
 -- Model: /workspace/Llama-2-7B-GPTQ/
 -- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Warmup...
 -- Generating...

What is the meaning of life? I don’t know, it depends on who you ask. everybody has its own answers to this question
But I do believe that we are here for a reason and I think that this world is not our final destination
The universe is too big to be an accident
I mean there are more than 100 billion galaxies in just one small part of the observable universe (which means that only about 5% of the entire universe can be observed from Earth)
So, if all those galaxies exist, then why would Earth be the only planet where life could possibly evolve and thrive ? This is absurd... So obviously, this earth was created by some higher power or maybe even many different powers.
And what happens after death ...
We have no evidence that anything like consciousness continues once we die. In fact, most likely nothing does. We know nothing about what happens when we die so we cannot speculate as to whether it may be a good thing or bad because we simply don't know... It seems pretty clear though that the fear of death is completely irrational. The chances of being killed by something at any moment is much greater now while alive than after dying.. That is why they say "life is the ult

 -- Response generated in 1.81 seconds, 256 tokens, 141.56 tokens/second (includes prompt eval.)
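
One side note on the fix: if the default index ever resolves to a CPU-only build again, pip can be pointed explicitly at the CUDA 12.1 wheel index. I haven't verified this exact command inside the image, so treat it as a suggestion only:

    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121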
junrushao commented 10 months ago

Ah, I got it! This suggests that the PyTorch dependency resolution went wrong and a CPU-only PyTorch got installed accidentally, even though we explicitly ask for a CUDA build in the Dockerfile line below:

    micromamba create ... pytorch "pytorch-cuda==12.1" ...

In fact, this exact issue happened earlier today when I was trying to install a CUDA-capable PyTorch on my end. I was puzzled, but assumed I had mistakenly made a typo somewhere...

Thanks for reporting this issue! It helps me confirm that there might have been a change in the upstream PyTorch builds. I'm not sure whether it's a transient change that will go away in a few days; if not, please feel free to send a PR switching the PyTorch installation to the pip-based one.
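
For reference, a pip-based installation step in the Dockerfile could look roughly like the one below. This is only a sketch of the idea (the cu121 index URL is PyTorch's standard wheel index for CUDA 12.1), not the exact change:

    RUN source ~/.bashrc && micromamba activate python311 && \
        python -m pip install torch --index-url https://download.pytorch.org/whl/cu121

Pointing pip at that index forces it to resolve a CUDA 12.1 build of PyTorch instead of whatever the default channel currently publishes.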