[BUG] torch 2.4 rocm 6.1| libc10_cuda.so: cannot open shared object file: No such file or directory

unclemusclez commented 3 weeks ago

Prerequisites

[X] I have read the documentation.
[X] I have checked other issues for similar problems.

Backend

Local

Interface Used

CLI

CLI Command

autotrain --config training.yml

task: sentence-transformers:pair_score
base_model: /mnt/p/Models/HuggingFace/hub/models--mistralai--Mamba-Codestral-7B-v0.1/snapshots/a92680008f39180a70b2c22145963a93caa84ccc
project_name: mamaba-codestral-7b-oh-devinator
log: tensorboard
backend: local

data:
  path: skratos115/opendevin_DataDevinator
  train_split: train
  valid_split: null
  column_mapping:
    sentence1_column: instruction
    sentence2_column: prompt
    sentence3_column: solution
    target_column: grade

params:
  max_seq_length: 512
  epochs: 5
  batch_size: 8
  lr: 0.00004
  optimizer: adamw_torch
  scheduler: linear
  gradient_accumulation: 4
  mixed_precision: null
  add_eos_token: true
  block_size: 1024
  model_max_length: -1
  use_flash_attention_2: false
  disable_gradient_checkpointing: false
  logging_steps: -1
  eval_strategy: epoch
  mixed_precision: fp16
  warmup_ratio: 0.1
  weight_decay: 0.0
  max_grad_norm: 1.0
  model_ref: null
  max_prompt_length: 512
  max_completion_length: null
  unsloth: false

UI Screenshots & Parameters

No response

Error Logs

ERROR    | 2024-08-20 20:03:50 | autotrain.trainers.common:wrapper:120 - train has failed due to an exception: Traceback (most recent call last):
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1603, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/models/mamba2/modeling_mamba2.py", line 42, in <module>
    if is_mamba_2_ssm_available():
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 396, in is_mamba_2_ssm_available
    import mamba_ssm
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/mamba_ssm/__init__.py", line 3, in <module>
    from mamba_ssm.ops.selective_scan_interface import selective_scan_fn, mamba_inner_fn
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py", line 16, in <module>
    import selective_scan_cuda
ImportError: libc10_cuda.so: cannot open shared object file: No such file or directory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/autotrain/trainers/common.py", line 117, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/autotrain/trainers/sent_transformers/__main__.py", line 158, in train
    model = SentenceTransformer(
            ^^^^^^^^^^^^^^^^^^^^
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/sentence_transformers/SentenceTransformer.py", line 299, in __init__
    modules = self._load_auto_model(
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/sentence_transformers/SentenceTransformer.py", line 1324, in _load_auto_model
    transformer_model = Transformer(
                        ^^^^^^^^^^^^
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/sentence_transformers/models/Transformer.py", line 54, in __init__
    self._load_model(model_name_or_path, config, cache_dir, **model_args)
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/sentence_transformers/models/Transformer.py", line 85, in _load_model
    self.auto_model = AutoModel.from_pretrained(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    model_class = _get_model_class(config, cls._model_mapping)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 384, in _get_model_class
    supported_models = model_mapping[type(config)]
                       ~~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 735, in __getitem__
    return self._load_attr_from_module(model_type, model_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 749, in _load_attr_from_module
    return getattribute_from_module(self._modules[module_name], attr)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 693, in getattribute_from_module
    if hasattr(module, attr):
       ^^^^^^^^^^^^^^^^^^^^^
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1593, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1605, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.mamba2.modeling_mamba2 because of the following error (look up to see its traceback):
libc10_cuda.so: cannot open shared object file: No such file or directory

ERROR    | 2024-08-20 20:03:50 | autotrain.trainers.common:wrapper:121 - Failed to import transformers.models.mamba2.modeling_mamba2 because of the following error (look up to see its traceback):
libc10_cuda.so: cannot open shared object file: No such file or directory

Additional Information

Using torch==2.4.0+rocm6.1 on WSL2 Linux, without conda. libc10_cuda.so appears to be a libtorch file. Any feedback on the training methods is certainly welcome.
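
One way to check the libtorch guess is to look at which libc10* libraries the installed torch package actually ships. A minimal sketch, assuming that a ROCm build of torch provides libc10_hip.so rather than libc10_cuda.so, which would explain why an extension compiled against a CUDA build fails to load. torch_lib_dir and find_c10_libs are hypothetical helper names, not part of torch:

```python
import importlib.util
from pathlib import Path


def torch_lib_dir():
    """Return the `lib` directory of the installed torch package, or None."""
    # find_spec locates the package without importing it.
    spec = importlib.util.find_spec("torch")
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent / "lib"


def find_c10_libs(lib_dir):
    """List the libc10* shared objects found in lib_dir, sorted by name."""
    return sorted(p.name for p in Path(lib_dir).glob("libc10*.so*"))


if __name__ == "__main__":
    lib_dir = torch_lib_dir()
    if lib_dir is None:
        print("torch is not installed in this environment")
    else:
        print("torch lib dir:", lib_dir)
        print("c10 libraries:", find_c10_libs(lib_dir))
```

If libc10_cuda.so is absent from the list, the selective_scan_cuda extension in the traceback was most likely built against a different torch build than the one installed, and would need to be rebuilt against the ROCm wheel.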

abhishekkrthakur commented 3 weeks ago

ROCm has not been tested. I'm assuming many models don't support it.

unclemusclez commented 3 weeks ago

I believe I fixed this with:

# Clone bitsandbytes repo, ROCm backend is currently enabled on multi-backend-refactor branch
git clone --depth 1 -b multi-backend-refactor https://github.com/TimDettmers/bitsandbytes.git && cd bitsandbytes/

# Install dependencies
pip install -r requirements-dev.txt

# Compile & install
apt-get install -y build-essential cmake  # install build tools dependencies, unless present
cmake -DCOMPUTE_BACKEND=hip -S .  # Use -DBNB_ROCM_ARCH="gfx90a;gfx942" to target specific gpu arch
make
pip install -e .   # `-e` for "editable" install, when developing BNB (otherwise leave that out)

From: https://huggingface.co/docs/bitsandbytes/main/en/installation#compile-from-source

Note the comment: # Use -DBNB_ROCM_ARCH="gfx90a;gfx942" to target specific gpu arch

In my case the line should read: cmake -DCOMPUTE_BACKEND=hip -DBNB_ROCM_ARCH="gfx1100" -S .
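
The order of the cmake arguments matters because -S expects the source directory immediately after it. A small sketch that assembles the configure command so the path always follows -S. build_cmake_args is a hypothetical helper, and the flags are only the ones quoted in the build steps above:

```python
import shlex


def build_cmake_args(source_dir=".", backend="hip", rocm_arch=None):
    """Assemble the cmake configure command for a bitsandbytes source build."""
    args = ["cmake", f"-DCOMPUTE_BACKEND={backend}"]
    if rocm_arch:
        # Several architectures are joined with ';', e.g. "gfx90a;gfx942".
        args.append(f"-DBNB_ROCM_ARCH={rocm_arch}")
    # -S must be immediately followed by the source directory.
    args += ["-S", source_dir]
    return args


if __name__ == "__main__":
    # Print a copy-pasteable command line for a gfx1100 card.
    print(shlex.join(build_cmake_args(rocm_arch="gfx1100")))
```

The list form can also be passed directly to subprocess.run, which avoids shell quoting problems with the ';' separator when more than one architecture is given.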

There seems to be an underlying issue with HIP, which I describe here: https://github.com/huggingface/autotrain-advanced/issues/737

huggingface / autotrain-advanced

[BUG] torch 2.4 rocm 6.1| libc10_cuda.so: cannot open shared object file: No such file or directory #735