unclemusclez opened 3 weeks ago
ROCm has not been tested. I'm assuming many models don't support it.
I believe I fixed this with:
# Clone bitsandbytes repo, ROCm backend is currently enabled on multi-backend-refactor branch
git clone --depth 1 -b multi-backend-refactor https://github.com/TimDettmers/bitsandbytes.git && cd bitsandbytes/
# Install dependencies
pip install -r requirements-dev.txt
# Compile & install
apt-get install -y build-essential cmake # install build tools dependencies, unless present
cmake -DCOMPUTE_BACKEND=hip -S . # Use -DBNB_ROCM_ARCH="gfx90a;gfx942" to target specific gpu arch
make
pip install -e . # `-e` for "editable" install, when developing BNB (otherwise leave that out)
from: https://huggingface.co/docs/bitsandbytes/main/en/installation#compile-from-source
note: # Use -DBNB_ROCM_ARCH="gfx90a;gfx942" to target specific gpu arch
in my case the line should read (the source directory has to come right after -S, so the arch flag goes before it):
cmake -DCOMPUTE_BACKEND=hip -DBNB_ROCM_ARCH="gfx1100" -S .
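The right value for -DBNB_ROCM_ARCH can be read off the "gfx..." strings that `rocminfo` prints for each agent. A minimal sketch of extracting them (the helper name is my own, and a sample `rocminfo` excerpt is embedded so this runs without a ROCm install):

```python
import re

def parse_gfx_archs(rocminfo_output: str) -> list[str]:
    # Collect the unique gfx architecture names from rocminfo output
    return sorted(set(re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_output)))

# Sample rocminfo excerpt; on a real system you would capture
# `rocminfo` output instead of using this embedded string.
sample = """
  Name:                    gfx1100
  Marketing Name:          AMD Radeon RX 7900 XTX
  Name:                    amdgcn-amd-amdhsa--gfx1100
"""

# Semicolon-joined form matches the -DBNB_ROCM_ARCH="..." syntax
print(";".join(parse_gfx_archs(sample)))  # → gfx1100
```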
There seems to be an underlying issue with HIP, which I identified here: https://github.com/huggingface/autotrain-advanced/issues/737
Prerequisites
Backend
Local
Interface Used
CLI
CLI Command
autotrain --config training.yml
UI Screenshots & Parameters
No response
Error Logs
Additional Information
Using torch==2.4.0+rocm6.1 on WSL2 Linux... not using conda.
libc10_cuda.so
seems to be a libtorch file. Any feedback on the training methods is certainly welcome.
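For what it's worth, whether a given torch wheel is a ROCm build can be read off its version string: ROCm wheels carry a +rocm local tag (e.g. 2.4.0+rocm6.1), CUDA wheels a +cu tag, and plain versions are CPU-only. A small self-contained check (the helper name is my own; torch itself is not imported, so this runs anywhere):

```python
def torch_backend_from_version(version: str) -> str:
    # Infer the accelerator backend from a torch version string such as
    # "2.4.0+rocm6.1", "2.4.0+cu121", or a plain "2.4.0" CPU wheel.
    _, _, local = version.partition("+")
    if local.startswith("rocm"):
        return "rocm"
    if local.startswith("cu"):
        return "cuda"
    return "cpu"

print(torch_backend_from_version("2.4.0+rocm6.1"))  # → rocm
print(torch_backend_from_version("2.4.0+cu121"))    # → cuda
print(torch_backend_from_version("2.4.0"))          # → cpu
```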