triton-lang / triton

Development repository for the Triton language and compiler
https://triton-lang.org/
MIT License

Value 'sm_89' is not defined for option 'gpu-name' #1121

Open Pedrexus opened 1 year ago

Pedrexus commented 1 year ago

Hello all,

Summary

I am having issues running the xformers softmax function. Internally, Triton seems to fail and the function falls back to the torch implementation. Is there no support for the NVIDIA RTX 4090 right now?

Environment Details

I'm using PyTorch 1.13.1 through Docker and CUDA 12, with an i9 13900k and a 4090.

$ python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.13.1
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.25.0
Libc version: glibc-2.27

Python version: 3.10.8 (main, Nov  4 2022, 13:48:29) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 525.78.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] botorch==0.8.0
[pip3] gpytorch==1.9.0
[pip3] numpy==1.22.3
[pip3] pytorch-lightning==1.8.6
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.13.1
[pip3] torch-optimizer==0.3.0
[pip3] torchelastic==0.2.2
[pip3] torchmetrics==0.11.1
[pip3] torchtext==0.14.1
[pip3] torchvision==0.14.1
[conda] blas                      1.0                         mkl  
[conda] botorch                   0.8.0                    pypi_0    pypi
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] gpytorch                  1.9.0                    pypi_0    pypi
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0           py310h7f8727e_0  
[conda] mkl_fft                   1.3.1           py310hd6ae3a3_0  
[conda] mkl_random                1.2.2           py310h00e6091_0  
[conda] numpy                     1.22.3          py310hfa59a62_0  
[conda] numpy-base                1.22.3          py310h9585f30_0  
[conda] pytorch                   1.13.1          py3.10_cuda11.6_cudnn8.3.2_0    pytorch
[conda] pytorch-cuda              11.6                 h867d48c_1    pytorch
[conda] pytorch-lightning         1.8.6                    pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] pytorch-ranger            0.1.1                    pypi_0    pypi
[conda] torch-optimizer           0.3.0                    pypi_0    pypi
[conda] torchelastic              0.2.2                    pypi_0    pypi
[conda] torchmetrics              0.11.1                   pypi_0    pypi
[conda] torchtext                 0.14.1                    py310    pytorch
[conda] torchvision               0.14.1              py310_cu116    pytorch

Minimum Reproducible Example

>>> import torch
>>> from xformers.triton import softmax as triton_softmax
>>> triton_softmax(torch.rand(2, 1000, device="cuda:0"))
2023-01-30 07:14:49 [warning  ] Triton softmax kernel register spillover or invalid image caught.Deactivating this kernel, please file an issue int the xFormers repository [xformers] backend=nccl rank=0 world_size=1
2023-01-30 07:14:49 [warning  ] Internal Triton PTX codegen error: ptxas fatal   : Value 'sm_89' is not defined for option 'gpu-name' [xformers] backend=nccl rank=0 world_size=1
tensor([[0.0007, 0.0013, 0.0010,  ..., 0.0008, 0.0008, 0.0009],
        [0.0010, 0.0007, 0.0009,  ..., 0.0006, 0.0007, 0.0007]],
       device='cuda:0')
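For context, sm_89 is the compute capability of Ada Lovelace GPUs such as the RTX 4090, and ptxas only gained support for it in CUDA 11.8, so the error above usually means the ptxas that Triton invokes comes from an older toolkit. A quick check is sketched below; whether Triton picks up the ptxas on the PATH or a bundled copy depends on the Triton version, so treat it as a rough diagnostic rather than a definitive answer.

import subprocess
import torch

# Compute capability reported by the driver; an RTX 4090 should print (8, 9), i.e. sm_89.
print(torch.cuda.get_device_capability(0))

# CUDA version PyTorch was built against (11.6 here, which predates sm_89 support).
print(torch.version.cuda)

# Version of the ptxas found on the PATH; sm_89 requires CUDA 11.8 or newer.
print(subprocess.run(["ptxas", "--version"], capture_output=True, text=True).stdout)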

original issue: https://github.com/facebookresearch/xformers/issues/659

aljungberg commented 1 year ago

Update: never mind this post; I overlooked that nanoGPT's compilation support requires PyTorch 2.0.0. I've attached notes below on how to make nanoGPT work, for anyone hitting this via Google.

Original post:

Having a similar issue training https://github.com/karpathy/nanoGPT on an RTX 4090. The error message is slightly different, perhaps due to a difference in Triton version. I compiled the latest main as of today inside the latest NVIDIA Docker container, nvcr.io/nvidia/pytorch:23.01-py3.

python train.py config/train_shakespeare_char.py
...
Initializing a new model from scratch
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
number of parameters: 10.65M
using fused AdamW: False
compiling the model... (takes a ~minute)
/usr/local/lib/python3.8/dist-packages/torch/nn/utils/stateless.py:44: UserWarning: functional_call was passed multiple values for tied weights. This behavior is deprecated and will be an error in future versions
  warnings.warn("functional_call was passed multiple values for tied weights. "
/usr/local/lib/python3.8/dist-packages/torch/storage.py:315: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
  warnings.warn(message, UserWarning)
/usr/local/lib/python3.8/dist-packages/torch/nn/utils/stateless.py:44: UserWarning: functional_call was passed multiple values for tied weights. This behavior is deprecated and will be an error in future versions
  warnings.warn("functional_call was passed multiple values for tied weights. "
'sm_89' is not a recognized processor for this target (ignoring processor)
'sm_89' is not a recognized processor for this target (ignoring processor)

(Repeated many times.)

The compilation doesn't fail, but training with the compiled model does (the loss becomes nan after the first iteration). With compile=False, the model trains and seems to operate normally.
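A minimal sanity check, sketched below with a placeholder model and a dummy input rather than nanoGPT's actual training loop, isolates whether the compiled forward pass itself is producing the NaNs:

import torch
import torch.nn as nn

def check_compiled_forward(model, example_input):
    # Compare an eager forward pass against the torch.compile'd one (PyTorch 2.0+).
    compiled_model = torch.compile(model)
    with torch.no_grad():
        eager_out = model(example_input)
        compiled_out = compiled_model(example_input)
    assert not torch.isnan(compiled_out).any(), "compiled forward produced NaNs"
    assert torch.allclose(eager_out, compiled_out, atol=1e-3), "eager/compiled outputs diverge"

check_compiled_forward(nn.Linear(64, 64).cuda(), torch.randn(8, 64, device="cuda"))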

python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.14.0a0+44dac51
Is debug build: False
CUDA used to build PyTorch: 12.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.24.1
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 14 2022, 12:59:47)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-60-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 12.0.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 525.78.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.7.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.2
[pip3] pytorch-quantization==2.1.2
[pip3] torch==1.14.0a0+44dac51
[pip3] torch-tensorrt==1.4.0.dev0
[pip3] torchtext==0.13.0a0+fae8e8c
[pip3] torchvision==0.15.0a0
[conda] Could not collect

(To the original poster: this is probably not relevant, but note that you say you're using CUDA 12 while your logs show CUDA runtime version: 11.6.124.)

Workaround:

As noted in the update above, you actually need PyTorch 2.0.0.

# use cuda 11.8 version of nvidia's docker image, e.g.
cd nanoGPT
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it --rm -v ${PWD}:/ext nvcr.io/nvidia/pytorch:22.12-py3
# upgrade it to use pytorch 2.0.0 from nightly builds
pip uninstall torch torchvision
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

# use latest triton
cd
git clone https://github.com/openai/triton.git;
cd triton/python;
pip install cmake; # build time dependency
pip install -e .

# also needed for nanoGPT
pip install tiktoken

python train.py config/train_shakespeare_char.py --device=cuda --compile=True

Now you'll get this error: PTX .version 7.4 does not support .target sm_89

From here I don't know. I'm not sure PyTorch is actually using the latest Triton installed as per the above; I saw some mention of it bundling a pinned version.
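One quick way to see which Triton build is actually on the import path (a sketch; it doesn't settle whether inductor pins its own copy, but it shows what import triton resolves to):

import torch
import triton

# Which Triton installation will be imported, and which version is it?
# If this points at a bundled/pinned copy, the source build above isn't the one in use.
print(triton.__version__, triton.__file__)
print(torch.__version__)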

According to this post, torch.compile does not support the Ada Lovelace architecture using the default backend, but you can switch to another one. In nanoGPT's train.py, change model = torch.compile(model) # requires PyTorch 2.0 to model = torch.compile(model, backend='nvprims_aten') # requires PyTorch 2.0.
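Concretely, that is a one-line change in train.py. The sketch below stands in a tiny nn.Linear for nanoGPT's GPT model, and whether the nvprims_aten backend exists depends on the PyTorch 2.0 nightly in use:

import torch
import torch.nn as nn

model = nn.Linear(8, 8).cuda()  # stand-in for nanoGPT's GPT model

# List the compile backends this PyTorch build offers before picking one.
print(torch._dynamo.list_backends())

# model = torch.compile(model)  # default backend: hits the sm_89 PTX error on Ada here
model = torch.compile(model, backend='nvprims_aten')  # requires PyTorch 2.0

print(model(torch.randn(4, 8, device='cuda')))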

This version runs much more slowly than not compiling at all, though. So you can just try,

python train.py config/train_shakespeare_char.py --device=cuda --compile=False

Which trains in 9 minutes for me (vs 20 minutes with the compiled version).

ConnorBaker commented 1 year ago

'sm_89' is not a recognized processor for this target (ignoring processor)

IIRC that compiler message is because LLVM 14 (which Triton uses at the moment) doesn't support Ada or Hopper architectures. LLVM 15 does, so it should go away once https://github.com/openai/triton/pull/1070 is merged.

Pedrexus commented 1 year ago

Is this still an issue? I see https://github.com/openai/triton/pull/1070 was merged.
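For anyone re-checking with a current Triton build, the original repro can simply be run again (a sketch reusing the snippet from the top of the issue; newer xformers releases may have moved or removed xformers.triton.softmax):

import torch
from xformers.triton import softmax as triton_softmax

x = torch.rand(2, 1000, device="cuda:0")
out = triton_softmax(x)

# With a Triton that targets LLVM >= 15 and a CUDA 11.8+ ptxas, there should be no
# sm_89 warning and no fallback to torch; each row of a softmax output sums to ~1.
print(out.sum(dim=-1))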