mathis-lambert commented 1 year ago

Hi,

I might be on the wrong place but, this kind of issues has already been raised on PyTorch repo so i'm trying here.

Context

I want to fine-tune Llama2-70B-chat-hf with any dataset on an Nvidia H100 instance running with CUDA 12.2 v2. To fine-tune it, i chose autotrain-advanced with Python 3.10.

First try

For the first try, i've simply made a venv and installed autotrain-advanced, then run :

$ - autotrain setup

So far, everything has gone successfully... After that, i'm running my train command :

$ - autotrain llm --train --project_name test_llm --model meta-llama/Llama-2-70b-chat-hf --data_path knowrohit07/know_sql  --use_peft --trainer sft --learning_rate 2e-4  --use_int4 --train_batch_size 24 --token [my_token]

And then, i've got this error :

/scratch/torchbuild/lib/python3.10/site-packages/torch/cuda/__init__.py:173: UserWarning:
**NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.**
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

What i tried to handle it

I tried so many things to handle this PyTorch issue :

Install older CUDA Driver : 22.1 (which is supported by the latest PyTorch Nightly version)
Build PyTorch from source in a venv as it is suggested following the PyTorch's repo process
Build with and without conda/mkl
Build on different CUDA Versions

Conclusion

Always this same warning saying me that the PyTorch version isn't compatible for sm_90 capabilities (H100). And ... as reported by ML Engineer at Nvidia : https://github.com/pytorch/pytorch/issues/90761#issuecomment-1673709633

I'm gonna post this also on the PyTorch repo but if someone got the same issue, and fixed it, i won't say no to a little help.

If you need deeper context, let me know and i'll provide it.

Many thanks.

mathis-lambert commented 1 year ago

Also my env information :

Collecting environment information... /scratch/torchbuild/lib/python3.10/site-packages/torch/cuda/init.py:173: UserWarning: NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86. If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name)) PyTorch version: 2.0.1+cu117 Is debug build: False CUDA used to build PyTorch: 11.7 ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64) GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version: Could not collect CMake version: version 3.27.5 Libc version: glibc-2.35

Python version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime) Python platform: Linux-5.15.0-83-generic-x86_64-with-glibc2.35 Is CUDA available: True CUDA runtime version: 12.1.105 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA H100 PCIe Nvidia driver version: 535.104.05 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True

CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 52 bits physical, 57 bits virtual Byte Order: Little Endian CPU(s): 24 On-line CPU(s) list: 0-23 Vendor ID: AuthenticAMD Model name: AMD EPYC 9334 32-Core Processor CPU family: 25 Model: 17 Thread(s) per core: 1 Core(s) per socket: 24 Socket(s): 1 Stepping: 1 BogoMIPS: 5399.98 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean pausefilter pfthreshold v_vmsave_vmload vgif avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm arch_capabilities Virtualization: AMD-V Hypervisor vendor: KVM Virtualization type: full L1d cache: 1.5 MiB (24 instances) L1i cache: 1.5 MiB (24 instances) L2 cache: 12 MiB (24 instances) L3 cache: 384 MiB (24 instances) NUMA node(s): 1 NUMA node0 CPU(s): 0-23 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected

Versions of relevant libraries: [pip3] numpy==1.26.0 [pip3] optree==0.9.2 [pip3] pytorch-triton==2.1.0+6e4932cda8 [pip3] torch==2.0.1 [pip3] triton==2.0.0 [pip3] triton-nightly==2.1.0.dev20230822000928 [conda] blas 1.0 mkl [conda] magma-cuda121 2.6.1 1 pytorch [conda] mkl 2023.1.0 h213fc3f_46343 [conda] mkl-include 2023.1.0 h06a4308_46343 [conda] mkl-service 2.4.0 py311h5eee18b_1 [conda] mkl_fft 1.3.6 py311ha02d727_1 [conda] mkl_random 1.2.2 py311ha02d727_1 [conda] numpy 1.24.3 py311h08b1b3b_1 [conda] numpy-base 1.24.3 py311hf175353_1 [conda] numpydoc 1.5.0 py311h06a4308_0 [conda] pytorch 2.0.1 cpu_py311h6d93b4c_0

abhishekkrthakur commented 1 year ago

Does the dockerfile work or still the same error?

mathis-lambert commented 1 year ago

@abhishekkrthakur I'll try asap, and i will keep you in touch. Thanks

mathis-lambert commented 1 year ago

Okay, here's wha worked for me to use H100 Gpu capabilities :

export TORCH_CUDA_ARCH_LIST="9.0"

Build Magma from source :


git clone --single-branch --branch v2.7.1 --depth 1 https://bitbucket.org/icl/magma.git
cd magma

echo -e "GPU_TARGET = sm_86\nBACKEND = cuda\nFORT = false" > make.inc make generate

export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib:/usr/local/cuda/targets/x86_64-linux/lib${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}" export CUDA_DIR="/usr/local/cuda/12.2" export CONDA_LIB=${CONDA_PREFIX}/lib

be careful here; they didn't accept sm_89 so I had to round it down to major version, sm_80

make clean && rm -rf build/

TARGETARCH=amd64 cmake -H. -Bbuild -DUSE_FORTRAN=OFF -DGPU_TARGET="Ampere" -DBUILD_SHARED_LIBS=OFF -DBUILD_STATIC_LIBS=ON -DCMAKE_CXX_FLAGS="-fPIC" -DCMAKE_C_FLAGS="-fPIC" -DMKLROOT=${CONDA_PREFIX} -DCUDA_NVCC_FLAGS="-Xfatbin;-compress-all;-DHAVE_CUBLAS;-std=c++11;--threads=0;" -GNinja

sudo mkdir /usr/local/magma/

sudo cmake --build build -j $(nproc) --target install

sudo cp build/include/ /usr/local/magma/include/ sudo cp build/lib/.so /usr/local/magma/lib/ sudo cp build/lib/.a /usr/local/magma/lib/ sudo cp build/lib/pkgconfig/.pc /usr/local/magma/lib/pkgconfig/ sudo cp /usr/local/magma/include/ ${CONDA_PREFIX}/include/ sudo cp /usr/local/magma/lib/.a ${CONDA_PREFIX}/lib/ sudo cp /usr/local/magma/lib/.so ${CONDA_PREFIX}/lib/ sudo cp /usr/local/magma/lib/pkgconfig/.pc ${CONDA_PREFIX}/lib/pkgconfig/

- Build [PyTorch from source](https://github.com/pytorch/pytorch#from-source)
- Build [Xformers from source](https://github.com/facebookresearch/xformers#installing-xformers)
```bash
# (Optional) Makes the build much faster
pip install ninja
# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
# (this can take dozens of minutes)

After that :

verify your installation with :

pip show torch
pip show xformers

and check if the version is followed by the git repo.

Then, you can type :

python -m torch.utils.collect_env

and check if the CUDA version used to build torch is the same as your current CUDA Version.

Finally, when after launched :

autotrain setup

(you might reinstall torch & xformers after the setup via pip)

If you have any questions, tell me.

huggingface / autotrain-advanced

H100 Compatibility - PyTorch Issues #281

Context

First try

What i tried to handle it

Conclusion

Also my env information :

be careful here; they didn't accept sm_89 so I had to round it down to major version, sm_80

After that :