huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' #2144

Open Hhhh8 opened 1 week ago

Hhhh8 commented 1 week ago

System Info

OS version: WSL 2, Ubuntu 22.04
Model: llama3-8B-Instruct
Hardware: no GPU

There is no GPU, but I installed the nvcc toolchain in WSL with `sudo apt install nvidia-cuda-toolkit`. `$CUDA_HOME` and `$LD_LIBRARY_PATH` are not set.

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

$ nvidia-smi
Command 'nvidia-smi' not found
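
The same environment check can be scripted. A minimal sketch (nothing TGI-specific; it just mirrors the shell commands and environment variables above):

```python
import os
import shutil

# Mirror the manual checks above: is nvcc on PATH, is nvidia-smi on PATH,
# and are the CUDA-related environment variables set?
checks = {
    "nvcc": shutil.which("nvcc"),
    "nvidia-smi": shutil.which("nvidia-smi"),
    "CUDA_HOME": os.environ.get("CUDA_HOME"),
    "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
}
for name, value in checks.items():
    print(f"{name}: {value if value else 'not found / unset'}")
```

On the system described above, this would report `nvcc` found but `nvidia-smi` missing and both variables unset.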

Reproduction

  1. In the WSL shell, I ran the command below:

    docker run --shm-size 1g -p 8080:80 \
    -v ${hf_model_download_path}:/data \
    -e HF_TOKEN=${my_hf_api_token} \
    --name tgi \
    ghcr.io/huggingface/text-generation-inference:latest --model-id meta-llama/Meta-Llama-3-8B-Instruct --disable-custom-kernels
  2. The error log:

    ...
    2024-06-29T07:29:12.599418Z  INFO download: text_generation_launcher: Successfully downloaded weights.
    2024-06-29T07:29:12.637348Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
    2024-06-29T07:29:21.467781Z  INFO text_generation_launcher: Detected system cpu
    2024-06-29T07:29:22.678981Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
    2024-06-29T07:29:25.389048Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.layers.layernorm' (/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/layernorm.py)
    2024-06-29T07:29:32.697783Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
    2024-06-29T07:29:42.713623Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
    ...
  3. So I entered the Docker container directly, ran the install step myself, and got the following error:

$ docker run --rm --entrypoint /bin/bash -it  \
  -e HF_TOKEN=${my_hf_api_token} \
  -v ${hf_model_download_path}:/data -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest  

root@984a3b8b4a4c:/usr/src/server# pip install flash-attn==v2.5.9.post1
Collecting flash-attn==v2.5.9.post1
  Downloading flash_attn-2.5.9.post1.tar.gz (2.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 3.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [23 lines of output]
      No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
      fatal: not a git repository (or any of the parent directories): .git
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-mc4cargc/flash-attn_b8fe41f0c83d4045a248ec2027dda9da/setup.py", line 113, in <module>
          _, bare_metal_version = get_cuda_bare_metal_version(CUDA_HOME)
        File "/tmp/pip-install-mc4cargc/flash-attn_b8fe41f0c83d4045a248ec2027dda9da/setup.py", line 65, in get_cuda_bare_metal_version
          raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True)
        File "/opt/conda/lib/python3.10/subprocess.py", line 421, in check_output
          return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
        File "/opt/conda/lib/python3.10/subprocess.py", line 503, in run
          with Popen(*popenargs, **kwargs) as process:
        File "/opt/conda/lib/python3.10/subprocess.py", line 971, in __init__
          self._execute_child(args, executable, preexec_fn, close_fds,
        File "/opt/conda/lib/python3.10/subprocess.py", line 1863, in _execute_child
          raise child_exception_type(errno_num, err_msg, err_filename)
      FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'

      torch.__version__  = 2.3.0

      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
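
For what it's worth, the failure point is visible in the traceback: `get_cuda_bare_metal_version` runs `<CUDA_HOME>/bin/nvcc -V` through `subprocess`, and `Popen` raises `FileNotFoundError` when the binary is absent. A simplified sketch of that check (the version parsing in the real setup.py is omitted):

```python
import os
import subprocess

def get_cuda_bare_metal_version(cuda_dir):
    # flash-attn's setup.py invokes `<CUDA_HOME>/bin/nvcc -V`; if the binary
    # does not exist, Popen raises the FileNotFoundError seen above.
    return subprocess.check_output(
        [os.path.join(cuda_dir, "bin", "nvcc"), "-V"], universal_newlines=True
    )

cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
try:
    print(get_cuda_bare_metal_version(cuda_home).splitlines()[-1])
except FileNotFoundError:
    print(f"No such file or directory: '{cuda_home}/bin/nvcc'")
```

Inside the TGI container there is no `/usr/local/cuda/bin/nvcc`, so the `except` branch is taken, which is why `pip install flash-attn` cannot even generate metadata there.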

Expected behavior

Even though I removed the --gpus flag and added the --disable-custom-kernels flag as the TGI GitHub instructions suggest, the Flash Attention error still occurs. Please tell me how I can run TGI on CPU.

Note: To use NVIDIA GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. For running the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the --gpus all flag and add --disable-custom-kernels, please note CPU is not the intended platform for this project, so performance might be subpar.

danieldk commented 4 days ago

I think flash attention might be a red herring here. The error:

2024-06-29T07:29:25.389048Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.layers.layernorm' (/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/layernorm.py)

indicates that FastLayerNorm cannot be imported. This happens because, without a GPU, the system type is detected as CPU, and FastLayerNorm only has CUDA, ROCm, and IPEX implementations.
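
To illustrate the dispatch being described, here is a hedged sketch (this is not TGI's actual `layernorm.py`; the helper name and return values are illustrative only): the backend-specific module only defines an implementation for the GPU backends, so on a CPU-detected system there is nothing to import.

```python
def fast_layer_norm_for(system: str) -> str:
    """Illustrative helper: return a FastLayerNorm implementation name
    for the given backend, mimicking the behaviour described above."""
    impls = {
        "cuda": "FastLayerNormCUDA",
        "rocm": "FastLayerNormROCm",
        "ipex": "FastLayerNormIPEX",
    }
    if system not in impls:
        # On a CPU-only system no implementation exists, which surfaces as
        # the "cannot import name 'FastLayerNorm'" warning in the launcher log.
        raise ImportError(
            f"cannot import name 'FastLayerNorm' (no {system} implementation)"
        )
    return impls[system]

print(fast_layer_norm_for("cuda"))
try:
    fast_layer_norm_for("cpu")
except ImportError as e:
    print(e)
```

So the Flash Attention warning is downstream of the missing CPU layer-norm implementation, not a Flash Attention problem per se.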