state-spaces / mamba

Mamba SSM architecture
Apache License 2.0

Mamba Installation Failed; PyTorch+ROCm version 6.0 & 6.1 not working #412

Open eliranwong opened 4 months ago

eliranwong commented 4 months ago

I tried to install mamba with two containers on Ubuntu 22.04 LTS, one with ROCm 6.0.2 & PyTorch+rocm6.0 installed, another with ROCm 6.1.2 & PyTorch+rocm6.1 installed.

Notes on my ROCm 6.1.2 setup: https://github.com/eliranwong/incus_container_gui_setup/blob/main/ubuntu_22.04_LTS_latest_rocm_tested.md

Notes on my ROCm 6.0.2 setup: https://github.com/eliranwong/incus_container_gui_setup/blob/main/ubuntu_22.04_LTS_tested.md

I already applied the patch https://github.com/state-spaces/mamba/blob/main/rocm_patch/rocm6_0.patch in the container running ROCm 6.0.2.

When I ran pip install mamba-ssm, I encountered these errors:

With PyTorch + rocm 6.0

          with open(fin_path, encoding='utf-8') as fin:
      FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-install-iy9acbtt/mamba-ssm_aae2c1df8bb54f62a59b41fd74fafbe0/csrc/selective_scan/selective_scan.cpp'

      torch.__version__  = 2.3.1+rocm6.0

With PyTorch + rocm 6.1

          with open(fin_path, encoding='utf-8') as fin:
      FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-install-onfy5yn9/mamba-ssm_70828647adec4a73aa94f62ff7a0c1d1/csrc/selective_scan/selective_scan.cpp'

      torch.__version__  = 2.5.0.dev20240618+rocm6.1
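
The torch.__version__ lines in both logs carry the ROCm build in the local version tag after the "+". A minimal sketch for extracting it, for example to check which ROCm release a given PyTorch wheel targets before picking a patch. The helper name rocm_tag is hypothetical; it is not part of PyTorch or mamba.

```python
import re
from typing import Optional


def rocm_tag(torch_version: str) -> Optional[str]:
    """Return the ROCm version from a PyTorch version string's local tag, or None.

    Hypothetical helper for illustration; it only parses the string and does
    not import torch.
    """
    match = re.search(r"\+rocm(\d+(?:\.\d+)*)", torch_version)
    return match.group(1) if match else None


# The first two strings are copied from the logs above; the third is a CUDA
# build tag used as a counterexample.
print(rocm_tag("2.3.1+rocm6.0"))              # 6.0
print(rocm_tag("2.5.0.dev20240618+rocm6.1"))  # 6.1
print(rocm_tag("2.3.1+cu121"))                # None
```

In a live environment the same function could be called on torch.__version__ directly.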

eliranwong commented 4 months ago

I just tried to install directly on the host, but no luck, same errors:

...
        File "/home/eliran/apps/mamba/lib/python3.10/site-packages/torch/utils/hipify/hipify_python.py", line 826, in preprocessor
          with open(fin_path, encoding='utf-8') as fin:
      FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-install-2d_pvv7e/mamba-ssm_19b07ba0f4b54a60b6feb761a9d6d942/csrc/selective_scan/selective_scan.cpp'

      torch.__version__  = 2.3.1+rocm6.0

lfb-julien commented 4 months ago

Same problem on a Docker host, with ROCm 6.0 (with your patch applied) or with ROCm 6.1.

pip install mamba-ssm:

  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [19 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-y3jhawqw/mamba-ssm_1617e4dcea5044fabfc486e5325fed98/setup.py", line 239, in <module>
          CUDAExtension(
        File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1098, in CUDAExtension
          hipify_result = hipify_python.hipify(
        File "/usr/local/lib/python3.10/dist-packages/torch/utils/hipify/hipify_python.py", line 1150, in hipify
          preprocess_file_and_save_result(output_directory, filepath, all_files, header_include_dirs,
        File "/usr/local/lib/python3.10/dist-packages/torch/utils/hipify/hipify_python.py", line 206, in preprocess_file_and_save_result
          result = preprocessor(output_directory, filepath, all_files, header_include_dirs, stats,
        File "/usr/local/lib/python3.10/dist-packages/torch/utils/hipify/hipify_python.py", line 826, in preprocessor
          with open(fin_path, encoding='utf-8') as fin:
      FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-install-y3jhawqw/mamba-ssm_1617e4dcea5044fabfc486e5325fed98/csrc/selective_scan/selective_scan.cpp'

      torch.__version__  = 2.5.0.dev20240620+rocm6.1

      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

I also get an error when compiling.
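
The traceback above ends in hipify trying to open csrc/selective_scan/selective_scan.cpp inside the directory pip unpacked, and not finding it. A minimal sketch of a check that reports which expected source files are absent from a source tree before a build is attempted. The function name missing_sources and the choice of "." as the tree root are assumptions for illustration, not part of mamba or PyTorch.

```python
from pathlib import Path
from typing import Iterable, List


def missing_sources(root: str, relative_paths: Iterable[str]) -> List[str]:
    """Return the entries of relative_paths that are not regular files under root."""
    base = Path(root)
    return [rel for rel in relative_paths if not (base / rel).is_file()]


# The path below is the one named in the FileNotFoundError above.
expected = ["csrc/selective_scan/selective_scan.cpp"]

# "." is an assumption: run this from the top of an unpacked mamba source tree.
print(missing_sources(".", expected))
```

An empty list means every listed file is present; any path that is printed is one the build would fail to open.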

ajassani commented 4 months ago

We don't have support for direct pip installation yet. Can you try building from source:

    git clone https://github.com/state-spaces/mamba.git
    cd mamba
    pip install .

Let me know if that works. Thanks!

gabeweisz commented 4 months ago

Instead of cloning the repository first, you can check out, build, and install in one step:

    pip install git+https://github.com/state-spaces/mamba.git

eliranwong commented 4 months ago

pip install git+https://github.com/state-spaces/mamba.git

I tried this, but the build failed with these errors:

      In file included from /opt/rocm-6.0.2/include/hipcub/backend/rocprim/hipcub.hpp:40:
      /opt/rocm-6.0.2/include/hipcub/backend/rocprim/block/block_load.hpp:134:20: error: no member named 'load' in 'rocprim::block_load<unsigned long, 32, 1, rocprim::block_load_method::block_load_warp_transpose>'
              base_type::load(block_iter, items, valid_items, temp_storage_);
                         ^
      /home/ubuntu/mamba/mamba/csrc/selective_scan/selective_scan_common_hip.h:187:56: note: in instantiation of function template specialization 'hipcub::BlockLoad<unsigned long, 32, 1, hipcub::BLOCK_LOAD_WARP_TRANSPOSE>::Load<unsigned long *>' requested here
              typename Ktraits::BlockLoadVecT(smem_load_vec).Load(
                                                             ^
      /home/ubuntu/mamba/mamba/csrc/selective_scan/selective_scan_bwd_kernel_hip.cuh:159:9: note: in instantiation of function template specialization 'load_input<Selective_Scan_bwd_kernel_traits<32, 4, true, true, true, true, true, c10::BFloat16, c10::complex<float>>>' requested here
              load_input<Ktraits>(u, u_vals, smem_load, params.seqlen - chunk * kChunkSize);
              ^
      /home/ubuntu/mamba/mamba/csrc/selective_scan/selective_scan_bwd_kernel_hip.cuh:513:40: note: in instantiation of function template specialization 'selective_scan_bwd_kernel<Selective_Scan_bwd_kernel_traits<32, 4, true, true, true, true, true, c10::BFloat16, c10::complex<float>>>' requested here
                              auto kernel = &selective_scan_bwd_kernel<Ktraits>;
                                             ^
      /home/ubuntu/mamba/mamba/csrc/selective_scan/selective_scan_bwd_kernel_hip.cuh:548:13: note: in instantiation of function template specialization 'selective_scan_bwd_launch<32, 4, c10::BFloat16, c10::complex<float>>' requested here
                  selective_scan_bwd_launch<32, 4, input_t, weight_t>(params, stream);
                  ^
      fatal error: too many errors emitted, stopping now [-ferror-limit=]
      1 warning and 20 errors generated when compiling for host.
      error: command '/opt/rocm-6.0.2/bin/hipcc' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for mamba-ssm
  Running setup.py clean for mamba-ssm
Failed to build mamba-ssm
Installing collected packages: ninja, urllib3, triton, tqdm, safetensors, regex, pyyaml, idna, einops, charset-normalizer, certifi, requests, huggingface-hub, tokenizers, transformers, mamba-ssm
  Running setup.py install for mamba-ssm ...

gabeweisz commented 4 months ago

I ran this successfully on the rocm/pytorch:latest Docker image. Can you try it there?

ajassani commented 4 months ago

@eliranwong We have reproduced this issue and are working to fix it. Thanks for reporting it!

ajassani commented 4 months ago

@eliranwong There is a bug related to the warp size on Radeon GPUs in one of the ROCm libraries, and we are working on a fix. In the meantime, we have a temporary fix that compiles the same kernel launch parameters for both Instinct and Radeon; the performance hit is negligible. The fix is on this branch of our repo: https://github.com/rocm-port/mamba-rocm/tree/radeon_tempfix
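
For context on the warp-size difference mentioned above, here is a minimal sketch that maps a gfx target name to its default wavefront size. It assumes the usual defaults: 64 lanes on GCN/CDNA targets (gfx9 and earlier, which includes Instinct) and 32 lanes on RDNA-based Radeon targets (gfx10 and later). The function name is hypothetical, and the thread does not say which library or code path contains the bug, so this only illustrates why the two GPU families can need different launch parameters.

```python
def default_wavefront_size(gfx_arch: str) -> int:
    """Map a gfx target such as 'gfx90a' or 'gfx1100' to its default wavefront size.

    Hypothetical helper; the 32/64 split is the assumption stated in the text above.
    """
    name = gfx_arch.split(":")[0]  # drop feature suffixes such as ':xnack-'
    if not name.startswith("gfx") or len(name) < 6:
        raise ValueError(f"unrecognized gfx target: {gfx_arch!r}")
    # The last two characters encode minor version and stepping;
    # whatever precedes them is the major version.
    major = int(name[3:-2])
    # gfx10 and later (RDNA) default to 32 lanes; earlier targets use 64.
    return 32 if major >= 10 else 64


for arch in ("gfx90a", "gfx942", "gfx1030", "gfx1100"):
    print(arch, default_wavefront_size(arch))
```

A build script could use a lookup like this to pick per-target kernel parameters, whereas the temporary fix described above uses one set of parameters for both families.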

eliranwong commented 4 months ago

Your work and updates are much appreciated. Thanks a lot.

george-adams1 commented 3 weeks ago

I'm still having this error even with the new PR.