hpcaitech / FastFold

Optimizing AlphaFold Training and Inference on GPU Clusters
Apache License 2.0

fatal error: cuda.h: No such file or directory #114

Open zzy221127 opened 1 year ago

zzy221127 commented 1 year ago

Dear author:

I am trying to test FastFold. I followed the "Installation Using Conda" instructions (as far as I can tell, there is no command to verify that the installation succeeded).

I ran inference.py with the following command:
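One lightweight way to check an installation, sketched here as an assumption rather than an official FastFold test, is to ask Python whether the packages this thread depends on (torch, fastfold, and the optional triton) can be found in the active environment. The helper name `check_modules` is hypothetical.

```python
import importlib.util


def check_modules(names):
    """Map each top-level module name to True if it can be found, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # torch and fastfold are required; triton is optional in this thread.
    for name, found in check_modules(["torch", "fastfold", "triton"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

This only confirms that the packages are discoverable; it does not compile or launch any GPU kernel, so a later failure such as a missing cuda.h would still go undetected.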

#################################
conda activate fastfold
python /home/FastFold/inference.py used.fasta /database/alphafold2-data/pdb_mmcif/mmcif_files/ \
    --output_dir /mydir/output \
    --cpus 80 \
    --gpus 3 \
    --param_path /database/alphafold2-data/params/params_model_1.npz \
    --uniref90_database_path /database/alphafold2-data/uniref90/uniref90.fasta \
    --mgnify_database_path /database/alphafold2-data/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path /database/alphafold2-data/pdb70/pdb70 \
    --uniclust30_database_path /database/alphafold2-data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path /database/alphafold2-data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --jackhmmer_binary_path /home/Software/miniconda3/envs/fastfold/bin/jackhmmer \
    --hhblits_binary_path /home/Software/miniconda3/envs/fastfold/bin/hhblits \
    --hhsearch_binary_path /home/Software/miniconda3/envs/fastfold/bin/hhsearch \
    --kalign_binary_path /home/Software/miniconda3/envs/fastfold/bin/kalign
#################################

The jackhmmer → hhsearch → jackhmmer → hhblits steps seemed to run correctly.

Then I got the error shown below.

I am wondering what it means and what I should do to run FastFold properly.

##########error message##################

/tmp/tmp4wm30exa/main.c:2:10: fatal error: cuda.h: No such file or directory
    2 | #include "cuda.h"
      |          ^~~~
/tmp/tmp65558a3s/main.c:2:10: fatal error: cuda.h: No such file or directory
    2 | #include "cuda.h"
      |          ^~~~
compilation terminated.
compilation terminated.
Traceback (most recent call last):
  File "/home/FastFold/inference.py", line 513, in <module>
    main(args)
  File "/home/FastFold/inference.py", line 150, in main
    inference_monomer_model(args)
  File "/home/FastFold/inference.py", line 415, in inference_monomer_model
    torch.multiprocessing.spawn(inference_model, nprocs=args.gpus, args=(args.gpus, result_q, batch, args))
  File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "<string>", line 21, in _layer_norm_fwd_fused
KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-7929002797455b30efce6e41eddc6b57-3aa563e00c5c695dd945e23b09a86848-bb0203f280ee2aaa28bc6e4eff4090f3-ff946bd4b3b4a4cbdf8cedc6e1c658e0-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, 'i32', 'i32', 'fp32'), (256,), (True, True, True, True, True, True, (True, False), (True, False), (False,)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/FastFold/inference.py", line 135, in inference_model
    out = model(batch)
  File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/FastFold/fastfold/model/hub/alphafold.py", line 507, in forward
    outputs, m_1_prev, z_prev, x_prev = self.iteration(
  File "/home/FastFold/fastfold/model/hub/alphafold.py", line 232, in iteration
    m_1_prev, z_prev = self.recycling_embedder(
  File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/FastFold/fastfold/model/fastnn/ops.py", line 1097, in forward
    m_update = self.layer_norm_m(m)
  File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/FastFold/fastfold/model/fastnn/kernel/layer_norm.py", line 52, in forward
    return self.kernel_forward(input)
  File "/home/FastFold/fastfold/model/fastnn/kernel/layer_norm.py", line 56, in kernel_forward
    return LayerNormTritonFunc.apply(input, self.normalized_shape, self.weight, self.bias,
  File "/home/FastFold/fastfold/model/fastnn/kernel/triton/layer_norm.py", line 164, in forward
    _layer_norm_fwd_fused[(M,)](
  File "/home/triton/python/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "<string>", line 41, in _layer_norm_fwd_fused
  File "/home/triton/python/triton/compiler.py", line 1239, in compile
    so = _build(fn.name, src_path, tmpdir)
  File "/home/triton/python/triton/compiler.py", line 1169, in _build
    ret = subprocess.check_call(cc_cmd)
  File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp65558a3s/main.c', '-O3', '-I/usr/local/cuda/include', '-I/home/Software/miniconda3/envs/fastfold/include/python3.8', '-I/tmp/tmp65558a3s', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmp65558a3s/_layer_norm_fwd_fused.cpython-38-x86_64-linux-gnu.so', '-L/usr/lib/x86_64-linux-gnu']' returned non-zero exit status 1.
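The failing gcc command above searches only the directories given by its -I flags, so one way to narrow the problem down is to check which of those directories actually contain cuda.h. The following is a minimal sketch; the helper names `include_dirs` and `dirs_with_header` are hypothetical, and the flag list is abridged from the error message.

```python
import os


def include_dirs(cc_cmd):
    """Return the directories passed to the compiler as -I<dir> flags."""
    return [arg[2:] for arg in cc_cmd if arg.startswith("-I") and len(arg) > 2]


def dirs_with_header(cc_cmd, header="cuda.h"):
    """Return the include directories that actually contain the header."""
    return [d for d in include_dirs(cc_cmd) if os.path.isfile(os.path.join(d, header))]


if __name__ == "__main__":
    # Include flags copied (abridged) from the CalledProcessError above.
    cmd = [
        "/usr/bin/gcc",
        "-I/usr/local/cuda/include",
        "-I/home/Software/miniconda3/envs/fastfold/include/python3.8",
    ]
    print(dirs_with_header(cmd) or "cuda.h not found in any -I directory")
```

An empty result would suggest that the CUDA headers live somewhere other than /usr/local/cuda/include on this machine, or are not installed at all.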

Shenggan commented 1 year ago

Could you please check your CUDA environment? You need the nvcc compiler to be available:

nvcc -V

If you do not have the CUDA compiler, your conda environment may contain only the CUDA runtime. In that case, you can either install the full CUDA toolkit from the NVIDIA website or install the CUDA development packages in conda.
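Having nvcc on PATH does not by itself guarantee that the development headers are installed next to it. As a rough sketch, assuming the usual toolkit layout where nvcc sits in `<root>/bin` and headers in `<root>/include`, the check could look like this (the function names are hypothetical):

```python
import os
import shutil


def toolkit_root_from_nvcc(nvcc_path):
    """Guess the toolkit root as the parent of the directory holding nvcc (<root>/bin/nvcc)."""
    real = os.path.realpath(nvcc_path)  # follow symlinks such as /usr/local/cuda
    return os.path.dirname(os.path.dirname(real))


def has_dev_headers(root):
    """True if <root>/include/cuda.h exists, i.e. development headers are present."""
    return os.path.isfile(os.path.join(root, "include", "cuda.h"))


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        print("nvcc is not on PATH; the environment may contain only the CUDA runtime")
    else:
        root = toolkit_root_from_nvcc(nvcc)
        print(f"nvcc: {nvcc}")
        print(f"cuda.h under {root}/include: {has_dev_headers(root)}")
```

If the layout assumption does not hold for a particular install, the guessed root will be wrong and the header has to be located by other means.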

zzy221127 commented 1 year ago

Thank you. Below is what 'nvcc -V' shows; it seems the CUDA compiler is already installed.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

Shenggan commented 1 year ago

OK, could you please share your CUDA path (the output of which nvcc) and how you installed Triton?

The simplest workaround is to uninstall Triton; the code will then fall back to the CUDA kernels.
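Falling back when an optional package is absent is a common pattern. The sketch below only illustrates the idea and is not FastFold's actual code: the function name `choose_layer_norm_backend` and the labels "triton" and "cuda" are hypothetical.

```python
import importlib.util


def choose_layer_norm_backend(module_available=None):
    """Return "triton" when the triton package can be found, otherwise "cuda".

    module_available is an injectable predicate (hypothetical, for testing);
    by default it asks importlib whether the module exists.
    """
    if module_available is None:
        module_available = lambda name: importlib.util.find_spec(name) is not None
    return "triton" if module_available("triton") else "cuda"


if __name__ == "__main__":
    print("layer norm backend:", choose_layer_norm_backend())
```

Under this pattern, running pip uninstall triton changes the selected backend without any code change, which matches the behaviour described in this comment.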

zzy221127 commented 1 year ago

Thank you so much! After your kind reminder, it turned out to be a problem with my Triton installation.

I first installed Triton with this command:

pip install triton==2.0.0.dev20221005

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting triton==2.0.0.dev20221005
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/11/f3/db2d366485b3160419f8415e0293aac6daaa018d7a02b9c0a40f89a137bf/triton-2.0.0.dev20221005-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.7 MB)
Requirement already satisfied: torch in /home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages (from triton==2.0.0.dev20221005) (1.13.0+cu117)
Requirement already satisfied: filelock in /home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages (from triton==2.0.0.dev20221005) (3.8.0)
Requirement already satisfied: cmake in /home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages (from triton==2.0.0.dev20221005) (3.24.3)
Requirement already satisfied: typing-extensions in /home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages (from torch->triton==2.0.0.dev20221005) (4.4.0)
Installing collected packages: triton
Successfully installed triton-2.0.0.dev20221005

That seemed OK.

Then I used the following command to install Triton again, this time from source:

git clone https://github.com/openai/triton.git ~/triton \
 && cd ~/triton/python \
 && pip install -e . -i https://pypi.tuna.tsinghua.edu.cn/simple --default-timeout=10000000

This produced the error message below. Do you have any suggestions?

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Obtaining file:///home/triton/python
  Preparing metadata (setup.py) ... done
Requirement already satisfied: cmake in /home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages (from triton==2.0.0) (3.24.3)
Requirement already satisfied: filelock in /home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages (from triton==2.0.0) (3.8.0)
Requirement already satisfied: torch in /home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages (from triton==2.0.0) (1.13.0+cu117)
Requirement already satisfied: typing-extensions in /home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages (from torch->triton==2.0.0) (4.4.0)
Installing collected packages: triton
  Attempting uninstall: triton
    Found existing installation: triton 2.0.0
    Uninstalling triton-2.0.0:
      Successfully uninstalled triton-2.0.0
  Running setup.py develop for triton
    error: subprocess-exited-with-error

    × python setup.py develop did not run successfully.
    │ exit code: 1
    ╰─> [59 lines of output]
        running develop
        running egg_info
        writing triton.egg-info/PKG-INFO
        writing dependency_links to triton.egg-info/dependency_links.txt
        writing requirements to triton.egg-info/requires.txt
        writing top-level names to triton.egg-info/top_level.txt
        reading manifest file 'triton.egg-info/SOURCES.txt'
        reading manifest template 'MANIFEST.in'
        writing manifest file 'triton.egg-info/SOURCES.txt'
        running build_ext
        /home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
          warnings.warn(
        /home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
          warnings.warn(
        Traceback (most recent call last):
          File "<string>", line 2, in <module>
          File "<pip-setuptools-caller>", line 34, in <module>
          File "/home/triton/python/setup.py", line 152, in <module>
            setup(
          File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/__init__.py", line 87, in setup
            return distutils.core.setup(**attrs)
          File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 185, in setup
            return run_commands(dist)
          File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
            dist.run_commands()
          File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 968, in run_commands
            self.run_command(cmd)
          File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/dist.py", line 1217, in run_command
            super().run_command(command)
          File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
            cmd_obj.run()
          File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/command/develop.py", line 34, in run
            self.install_for_development()
          File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/command/develop.py", line 114, in install_for_development
            self.run_command('build_ext')
          File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
            self.distribution.run_command(command)
          File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/dist.py", line 1217, in run_command
            super().run_command(command)
          File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
            cmd_obj.run()
          File "/home/triton/python/setup.py", line 114, in run
            self.build_extension(ext)
          File "/home/triton/python/setup.py", line 118, in build_extension
            thirdparty_cmake_args = get_thirdparty_packages(triton_cache_path)
          File "/home/triton/python/setup.py", line 74, in get_thirdparty_packages
            file.extractall(path=package_root_dir)
          File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/tarfile.py", line 2028, in extractall
            self.extract(tarinfo, path, set_attrs=not tarinfo.isdir(),
          File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/tarfile.py", line 2069, in extract
            self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
          File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/tarfile.py", line 2141, in _extract_member
            self.makefile(tarinfo, targetpath)
          File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/tarfile.py", line 2190, in makefile
            copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
          File "/home/Software/miniconda3/envs/fastfold/lib/python3.8/tarfile.py", line 249, in copyfileobj
            raise exception("unexpected end of data")
        tarfile.ReadError: unexpected end of data
        downloading and extracting https://github.com/llvm/llvm-project/releases/download/llvmorg-15.0.4/clang+llvm-15.0.4-powerpc64le-linux-ubuntu-18.04.5.tar.xz ...
        [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
  Rolling back uninstall of triton
  Moving to /home/Software/miniconda3/envs/fastfold/lib/python3.8/site-packages/triton.egg-link
   from /tmp/pip-uninstall-q6f21a3r/triton.egg-link

Shenggan commented 1 year ago

The log suggests a network problem: the LLVM archive could not be downloaded completely from GitHub. You should use pip install triton==2.0.0.dev20221005 to install that specific version of Triton, because the main branch of Triton is not stable. If you keep struggling with Triton, just uninstall it and run again.
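The "unexpected end of data" error in the log is what Python's tarfile module raises when an archive stops before a member is fully read, which is consistent with an interrupted download. A minimal sketch for checking a downloaded archive before extracting it follows; the helper name `archive_is_complete` is hypothetical and this is not part of Triton's setup.py.

```python
import lzma
import tarfile


def archive_is_complete(path, chunk_size=1 << 20):
    """Return True if every member of the tar archive can be read to the end."""
    try:
        with tarfile.open(path, mode="r:*") as tar:
            for member in tar:
                stream = tar.extractfile(member)  # None for directories, links, etc.
                if stream is None:
                    continue
                while stream.read(chunk_size):
                    pass
    except (tarfile.TarError, EOFError, OSError, lzma.LZMAError):
        # A truncated download typically surfaces as one of these errors.
        return False
    return True
```

A passing check only shows that the archive is readable end to end; comparing a checksum against the one published with the release would be a stronger test.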

zzy221127 commented 1 year ago

Dear Shenggan:

After uninstalling Triton, I successfully ran the inference.py script with no errors.

The output is one relaxed.pdb, one unrelaxed.pdb, and one "alignments" folder. Is that right?

It definitely feels much faster than running AlphaFold2,

but I am wondering: without Triton, am I still able to "leverage the power of FastFold"?

Shenggan commented 1 year ago

Yes, those are the expected output files.

You already get a large speedup from the CUDA kernels when Triton is not installed. The Triton kernel is currently experimental. On the NVIDIA Ampere platform it can give some additional speedup (perhaps 10% to 20%).

You could try triton==2.0.0.dev20221005 again and work out why it cannot find cuda.h. One thing to try is setting the CUDA_HOME environment variable to your CUDA path.
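As a sketch of that suggestion, the helper below builds an environment with CUDA_HOME set, refusing a directory that has no include/cuda.h. Whether a given Triton version actually reads CUDA_HOME is an assumption taken from this comment and not verified here; the function name `env_with_cuda_home` is hypothetical.

```python
import os


def env_with_cuda_home(cuda_root, base_env=None):
    """Return a copy of the environment with CUDA_HOME set to cuda_root.

    Raises FileNotFoundError if cuda_root/include/cuda.h is missing, since
    pointing CUDA_HOME at a runtime-only install would not fix the error.
    """
    header = os.path.join(cuda_root, "include", "cuda.h")
    if not os.path.isfile(header):
        raise FileNotFoundError(f"no cuda.h at {header}")
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_HOME"] = cuda_root
    return env


if __name__ == "__main__":
    # /usr/local/cuda is the include root seen in the gcc command earlier in
    # this thread; replace it with the parent of the directory `which nvcc` prints.
    try:
        env = env_with_cuda_home("/usr/local/cuda")
        print("CUDA_HOME =", env["CUDA_HOME"])
    except FileNotFoundError as exc:
        print(exc)
```

The returned dictionary can be passed as the env argument of subprocess.run when launching inference.py, which keeps the change local to that one run instead of altering the shell profile.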