Open Pedrexus opened 1 year ago
Update: never mind this post, I overlooked that nanoGPT's compilation support requires PyTorch 2.0.0. I've attached notes on how to make nanoGPT work below for anyone hitting this via Google.
Original post:
Having a similar issue training https://github.com/karpathy/nanoGPT on an RTX 4090. The error message is slightly different, perhaps due to a difference in Triton version. I compiled the latest main as of today within the latest NVIDIA Docker container, nvcr.io/nvidia/pytorch:23.01-py3.
python train.py config/train_shakespeare_char.py
...
Initializing a new model from scratch
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
number of parameters: 10.65M
using fused AdamW: False
compiling the model... (takes a ~minute)
/usr/local/lib/python3.8/dist-packages/torch/nn/utils/stateless.py:44: UserWarning: functional_call was passed multiple values for tied weights. This behavior is deprecated and will be an error in future versions
warnings.warn("functional_call was passed multiple values for tied weights. "
/usr/local/lib/python3.8/dist-packages/torch/storage.py:315: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
warnings.warn(message, UserWarning)
/usr/local/lib/python3.8/dist-packages/torch/nn/utils/stateless.py:44: UserWarning: functional_call was passed multiple values for tied weights. This behavior is deprecated and will be an error in future versions
warnings.warn("functional_call was passed multiple values for tied weights. "
'sm_89' is not a recognized processor for this target (ignoring processor)
'sm_89' is not a recognized processor for this target (ignoring processor)
(Repeated many times.)
The compilation doesn't fail, but training with the compiled model does (the loss shows nan after the 1st iteration). With compile=False, the model trains and seems to operate normally.
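For anyone who wants the run to stop as soon as this happens instead of silently continuing with a nan loss, here is a minimal sketch of a guard that could be called once per iteration. The function name assert_finite_loss is made up for illustration and is not part of nanoGPT; in train.py you would pass it something like loss.item().

```python
import math


def assert_finite_loss(loss_value: float, iteration: int) -> float:
    """Return the loss unchanged, or raise if it is NaN or infinite.

    Hypothetical helper, not part of nanoGPT: call it once per training
    iteration with a plain Python float (e.g. loss.item()).
    """
    if not math.isfinite(loss_value):
        raise FloatingPointError(
            f"non-finite loss {loss_value!r} at iteration {iteration}; "
            "try rerunning with --compile=False"
        )
    return loss_value
```

This only detects the symptom; it does not explain why the compiled model diverges.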
python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.14.0a0+44dac51
Is debug build: False
CUDA used to build PyTorch: 12.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.24.1
Libc version: glibc-2.31
Python version: 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-60-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 12.0.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 525.78.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.7.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.7.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.22.2
[pip3] pytorch-quantization==2.1.2
[pip3] torch==1.14.0a0+44dac51
[pip3] torch-tensorrt==1.4.0.dev0
[pip3] torchtext==0.13.0a0+fae8e8c
[pip3] torchvision==0.15.0a0
[conda] Could not collect
(To the original poster: this is probably not relevant, but you say you're using CUDA 12 while your logs say CUDA runtime version: 11.6.124.)
Workaround:
As noted in the update, you actually need PyTorch 2.0.0.
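A quick way to check the installed version before turning on compilation is to parse the version string. The sketch below is an assumption-laden helper, not something nanoGPT ships: the function names are hypothetical, and the 2.0 threshold is simply the requirement stated in this thread. In practice you would pass it torch.__version__.

```python
import re
from typing import Tuple


def torch_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.14.0a0+44dac51'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_compile_requirement(version: str) -> bool:
    """True if the version is at least 2.0 (the requirement noted in this thread)."""
    return torch_major_minor(version) >= (2, 0)


# The container's build from the collect_env output above does not qualify.
print(meets_compile_requirement("1.14.0a0+44dac51"))
```

With the nightly install below, the same check should pass, since nightly version strings start with 2.0.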
# use cuda 11.8 version of nvidia's docker image, e.g.
cd nanoGPT
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it --rm -v ${PWD}:/ext nvcr.io/nvidia/pytorch:22.12-py3
# upgrade it to use pytorch 2.0.0 from nightly builds
pip uninstall torch torchvision
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
# use latest triton
cd
git clone https://github.com/openai/triton.git;
cd triton/python;
pip install cmake; # build time dependency
pip install -e .
# also needed for nanoGPT
pip install tiktoken
python train.py config/train_shakespeare_char.py --device=cuda --compile=True
Now you'll get this error: PTX .version 7.4 does not support .target sm_89
From here I don't know. I'm not sure PyTorch is actually using the latest Triton installed as per the above; I saw some mention of it bundling a pinned version.
According to this post, torch.compile does not support the Ada Lovelace architecture using the default backend, but you can switch to another one. In nanoGPT's train.py, change
model = torch.compile(model) # requires PyTorch 2.0
to
model = torch.compile(model, backend='nvprims_aten') # requires PyTorch 2.0
This version runs much more slowly than not compiling at all, though, so you can simply disable compilation:
python train.py config/train_shakespeare_char.py --device=cuda --compile=False
That trains in 9 minutes for me (vs. 20 minutes with the compiled version).
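If you share one train.py across several machines, the choice could be automated from the GPU's compute capability. This is only a sketch under the assumptions of this thread: the set of affected architectures (sm_89 for Ada, sm_90 for Hopper) reflects the toolchain state described here and will go stale as Triton and PyTorch are updated, and should_compile is a made-up name. The capability tuple is what torch.cuda.get_device_capability() returns, e.g. (8, 9) on an RTX 4090.

```python
from typing import Tuple

# Assumption taken from this thread: architectures the default torch.compile
# backend could not target at the time (Ada Lovelace sm_89, Hopper sm_90).
ARCHS_WITHOUT_DEFAULT_BACKEND = {(8, 9), (9, 0)}


def should_compile(capability: Tuple[int, int], requested: bool = True) -> bool:
    """Decide whether to call torch.compile for a GPU with this capability.

    Hypothetical helper: returns False when the user disabled compilation
    or when the GPU is in the known-problematic set above.
    """
    if not requested:
        return False
    return capability not in ARCHS_WITHOUT_DEFAULT_BACKEND


# Usage idea in train.py (requires torch and a CUDA device, so left as a comment):
#   if should_compile(torch.cuda.get_device_capability(), requested=compile):
#       model = torch.compile(model)
```

Skipping compilation rather than switching backends follows the timings above, where the uncompiled run was the faster of the two.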
'sm_89' is not a recognized processor for this target (ignoring processor)
IIRC that compiler message is because LLVM 14 (which Triton uses at the moment) doesn't support Ada or Hopper architectures. LLVM 15 does, so it should go away once https://github.com/openai/triton/pull/1070 is merged.
Is this still an issue? I see https://github.com/openai/triton/pull/1070 was merged.
Hello all,
Summary
I am having issues running the xformers softmax function. Internally, it seems triton fails and the function is falling back to the torch implementation. Is there no support for NVIDIA RTX 4090 right now?
Environment Details
I'm using PyTorch 1.13.1 through Docker and CUDA 12, with an i9 13900k and a 4090.
Minimum Reproducible Example
original issue: https://github.com/facebookresearch/xformers/issues/659