karpathy / llm.c

LLM training in simple, raw C/CUDA

Assertion `graph->check_support(cudnn_handle).is_good()' failed #366

Open wfoy opened 6 months ago

wfoy commented 6 months ago

I'm getting the following error when running ./train_gpt2cu after building with make train_gpt2cu USE_CUDNN=1:

allocated 237 MiB for model parameters
allocated 1703 MiB for activations
train_gpt2cu: train_gpt2.cu:582: auto lookup_cache_or_build_graph_fwd(Args ...) [with Args = {int, int, int, int, bool}]: Assertion `graph->check_support(cudnn_handle).is_good()' failed.
[ip-172-31-71-31:07018] *** Process received signal ***
[ip-172-31-71-31:07018] Signal: Aborted (6)
[ip-172-31-71-31:07018] Signal code:  (-6)
[ip-172-31-71-31:07018] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7d3327442520]
[ip-172-31-71-31:07018] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7d33274969fc]
[ip-172-31-71-31:07018] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7d3327442476]
[ip-172-31-71-31:07018] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7d33274287f3]
[ip-172-31-71-31:07018] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7d332742871b]
[ip-172-31-71-31:07018] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7d3327439e96]
[ip-172-31-71-31:07018] [ 6] ./train_gpt2cu(+0xc09cb)[0x5f73a90349cb]
[ip-172-31-71-31:07018] [ 7] ./train_gpt2cu(+0x2a0a2)[0x5f73a8f9e0a2]
[ip-172-31-71-31:07018] [ 8] ./train_gpt2cu(+0x2b543)[0x5f73a8f9f543]
[ip-172-31-71-31:07018] [ 9] ./train_gpt2cu(+0x15a64)[0x5f73a8f89a64]
[ip-172-31-71-31:07018] [10] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7d3327429d90]
[ip-172-31-71-31:07018] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7d3327429e40]
[ip-172-31-71-31:07018] [12] ./train_gpt2cu(+0x177e5)[0x5f73a8f8b7e5]
[ip-172-31-71-31:07018] *** End of error message ***
[1]    7018 IOT instruction (core dumped)  ./train_gpt2cu

I'm running CUDA 12.4 on Ubuntu 22.04. Any help or pointers would be great, thanks!

Anerudhan commented 6 months ago

Can you add which GPU device and cuDNN version you are using? A log with CUDNN_LOGLEVEL_DBG=3 would be useful for debugging as well.

https://docs.nvidia.com/deeplearning/cudnn/latest/reference/troubleshooting.html
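
For example, a shell sketch for capturing the log (the log file name here is just an illustration):

export CUDNN_LOGLEVEL_DBG=3
./train_gpt2cu 2>&1 | tee cudnn_debug.log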

wfoy commented 6 months ago

Fixed by upgrading the cuDNN version; I was previously on 8.9.2, which broke with the above error.
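
(For anyone checking their own setup: one common way to read the installed cuDNN version is from cudnn_version.h; the header path varies by install, so the one below is an assumption.)

grep -A 2 'define CUDNN_MAJOR' /usr/include/cudnn_version.h
# prints the CUDNN_MAJOR / CUDNN_MINOR / CUDNN_PATCHLEVEL defines;
# on some installs the header lives under /usr/include/x86_64-linux-gnu/ or the CUDA toolkit include dir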

ifromeast commented 5 months ago

After compiling with make train_gpt2cu USE_CUDNN=1 and running ./train_gpt2cu, there is an error:

+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| input dataset prefix  | data/tiny_shakespeare                              |
| output log file       | NULL                                               |
| batch size B          | 4                                                  |
| sequence length T     | 1024                                               |
| learning rate         | 3.000000e-04                                       |
| max_steps             | -1                                                 |
| val_loss_every        | 20                                                 |
| val_max_batches       | 20                                                 |
| sample_every          | 20                                                 |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA GeForce RTX 4090                            |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
| load_filename         | gpt2_124M_bf16.bin                                 |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 74                                                 |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
| num_processes         | 1                                                  |
+-----------------------+----------------------------------------------------+
num_parameters: 124475904 ==> bytes: 248951808
allocated 237 MiB for model parameters
allocated 1703 MiB for activations
[CUDNN ERROR] at file cudnn_att.cpp:141:
[cudnn_frontend] Error: No execution plans built successfully.

My CUDA is 12.4, cuDNN is 9.1, and cudnn-frontend is 1.4.0, on Ubuntu 22.04.

Anerudhan commented 5 months ago

Hi @ifromeast

Is it possible for you to dump the cuDNN log?

If you set export CUDNN_LOGLEVEL_DBG=3, the log will be dumped to your stdout.

The log will look something like this:

I! CuDNN (v90100 70) function cudnnCreate() called:
i!     handle: location=host; addr=0x563c9b4a01a0;
i! Time: 2024-05-13T07:49:20.051230 (0d+0h+0m+0s since start)
i! Process=975; Thread=975; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90100 70) function cudnnGraphLibraryConfigInit() called:
i!     apiLog: type=cudnnLibConfig_t; val=CUDNN_STANDARD;
i! Time: 2024-05-13T07:49:20.051266 (0d+0h+0m+0s since start)
i! Process=975; Thread=975; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90100 70) function cudnnGetVersion() called:
i! Time: 2024-05-13T07:49:20.216976 (0d+0h+0m+0s since start)
i! Process=975; Thread=975; GPU=NULL; Handle=NULL; StreamId=NULL.

I am able to run the exact same configuration locally.

+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| input dataset prefix  | data/tiny_shakespeare                              |
| output log file       | NULL                                               |
| batch size B          | 4                                                  |
| sequence length T     | 1024                                               |
| learning rate         | 3.000000e-04                                       |
| max_steps             | -1                                                 |
| val_loss_every        | 20                                                 |
| val_max_batches       | 20                                                 |
| sample_every          | 20                                                 |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA GeForce RTX 4090                            |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
| load_filename         | gpt2_124M_bf16.bin                                 |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 74                                                 |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
| num_processes         | 1                                                  |
+-----------------------+----------------------------------------------------+
num_parameters: 124475904 ==> bytes: 248951808
allocated 237 MiB for model parameters
allocated 1703 MiB for activations
val loss 4.505090
allocated 237 MiB for parameter gradients
allocated 30 MiB for activation gradients
allocated 474 MiB for AdamW optimizer state m
allocated 474 MiB for AdamW optimizer state v
allocated 474 MiB for master copy of params
step    1/74: train loss 4.370480 (acc 4.370480) (298.699646 ms, 13712.771484 tok/s)
step    2/74: train loss 4.502850 (acc 4.502850) (34.138111 ms, 119983.187500 tok/s)
step    3/74: train loss 4.414629 (acc 4.414629) (34.011135 ms, 120212.890625 tok/s)
step    4/74: train loss 3.958204 (acc 3.958204) (34.105343 ms, 120172.781250 tok/s)
step    5/74: train loss 3.607100 (acc 3.607100) (34.020351 ms, 120233.632812 tok/s)
step    6/74: train loss 3.782271 (acc 3.782271) (34.085888 ms, 120218.898438 tok/s)
ifromeast commented 5 months ago

Hi @Anerudhan, thank you so much for the advice to print the log. Here is what I got:

E! CuDNN (v90101 17) function cudnnBackendFinalize() called:
e!     Info: Traceback contains 4 message(s)
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: rtc->loadModule()
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: ptr.isSupported()
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: engine_post_checks(*engine_iface, engine.getPerfKnobs(), req_size, engine.getTargetSMCount())
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: finalize_internal()
e! Time: 2024-05-13T16:09:12.824746 (0d+0h+0m+2s since start)
e! Process=781621; Thread=781621; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90101 17) function cudnnGetErrorString() called:
i!     status: type=int; val=5000;
i! Time: 2024-05-13T16:09:12.824882 (0d+0h+0m+2s since start)
i! Process=781621; Thread=781621; GPU=NULL; Handle=NULL; StreamId=NULL.

Do you know why this happens? I am new to CUDA. Thank you so much!

Anerudhan commented 5 months ago

Could be a driver or toolkit issue. What version of driver are you on?

nvidia-smi
Mon May 13 08:31:45 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+

Update instructions: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network

ifromeast commented 5 months ago

Could be a driver or toolkit issue. What version of driver are you on?

This is my driver version:

Mon May 13 16:38:13 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:00:08.0 Off |                  Off |
| 30%   28C    P8             22W /  450W |      11MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off |   00000000:00:09.0 Off |                  Off |
| 30%   27C    P8             18W /  450W |      11MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
ifromeast commented 5 months ago

@Anerudhan cudnn-frontend was updated last week; have you updated it?

ifromeast commented 5 months ago

Similarly, the error occurs when running the standalone attention kernel:

(llm-env) root@ubuntu22:~/llm.c/dev/cuda# nvcc -I../../cudnn-frontend/include -DENABLE_CUDNN -O3 --use_fast_math -lcublas -lcublasLt -lcudnn attention_forward.cu -o attention_forward
(llm-env) root@ubuntu22:~/llm.c/dev/cuda# ./attention_forward 10
enable_tf32: 1
Using kernel 10
Checking block size 32.
attention_forward: attention_forward.cu:1143: auto lookup_cache_or_build_graph_fwd(Args ...) [with Args = {int, int, int, int, bool}]: Assertion `graph->check_support(cudnn_handle).is_good()' failed.

Is there anything wrong with my cuDNN or cudnn-frontend?

Anerudhan commented 5 months ago

Hi @ifromeast, I am still trying to reproduce the issue (yes, I have the latest cudnn-frontend and cuDNN).

This does not look like a cuDNN issue. I suspect this happens because of the multi-GPU setup.

Is it possible for you to try two scenarios?

a) Try setting CUDA_VISIBLE_DEVICES=0,-1,1 and check if the execution is successful for you.

b) (Independent of the case above) Try setting CUDA_MODULE_LOADING=EAGER and CUDA_MODULE_DATA_LOADING=EAGER.

Thanks Anerudhan
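
For example (a shell sketch, assuming the same ./train_gpt2cu binary as above):

# scenario a: restrict/reorder the visible devices
CUDA_VISIBLE_DEVICES=0,-1,1 ./train_gpt2cu

# scenario b: force eager module loading
CUDA_MODULE_LOADING=EAGER CUDA_MODULE_DATA_LOADING=EAGER ./train_gpt2cu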

simonguozirui commented 5 months ago

I am having the exact same issue as @ifromeast. My CUDA version is 12.4, cuDNN version is 9.1.1.17-1, and cudnn-frontend is 1.4.0, on Debian 11.

Anerudhan commented 5 months ago

Hi @simonguozirui, is it on a multi-GPU 4090 setup as well?

Is it possible for you to try two scenarios?

a) Try setting CUDA_VISIBLE_DEVICES=0,-1,1 and check if the execution is successful for you.

b) (Independent of the case above) Try setting CUDA_MODULE_LOADING=EAGER and CUDA_MODULE_DATA_LOADING=EAGER.

Thanks

simonguozirui commented 5 months ago

Hey @Anerudhan! Thanks so much for the suggestions. I tried both of those, but unfortunately neither changes the behavior. I am on a T4 GPU (single-GPU setup). Things break for me at graph->check_support(cudnn_handle) as well.

Curious which cuDNN and frontend versions you are using, so I can reference them while debugging.

Anerudhan commented 5 months ago

I am using cudnn-frontend 1.4.0 and CUDA 12.4 (I have CUDA 12.3 installed as well for debugging).

I think the issue is that the cuDNN SDPA operation is not supported on the T4 (a Turing GPU); it requires Ampere or later. If you run with export CUDNN_LOGLEVEL_DBG=2, you will see more helpful messages.

Thanks
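
A minimal sketch of how one could guard for this up front (an illustrative helper, not the llm.c source; cudaGetDeviceProperties is the standard CUDA runtime call):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Illustrative only: cuDNN's SDPA (flash attention) path needs compute capability >= 8.0 (Ampere).
void require_sdpa_support(int device) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    printf("GPU: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    if (prop.major < 8) {
        fprintf(stderr, "cuDNN SDPA requires Ampere (sm_80) or newer; "
                        "rebuild without USE_CUDNN=1 on this GPU.\n");
        exit(EXIT_FAILURE);
    }
}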

simonguozirui commented 5 months ago

@Anerudhan thanks, I will try on an Ampere GPU too. With the new log level I see some messages like i! descriptor: type=CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR; val=NOT_IMPLEMENTED; curious if you know what might be causing that.

Anerudhan commented 5 months ago

Those are info messages (i!) and are harmless; they just capture the library state. I would be more interested in messages that are warnings (w!) or errors (e!).
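
For example, a quick shell sketch to filter the log down to just warnings and errors (assuming the log goes to stdout as above):

CUDNN_LOGLEVEL_DBG=3 ./train_gpt2cu 2>&1 | grep -E '^[WwEe]!'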

simonguozirui commented 5 months ago

Hi @Anerudhan, I checked. There are no errors (e!) and only one warning (w!); here it is:

W! CuDNN (v90101 17) function cudnnBackendFinalize() called:
w!     Info: Traceback contains 2 message(s)
w!         Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: userGraph->getEntranceNodesSize() != 2
w!         Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: numUserNodes != 5 && numUserNodes != 6
w! Time: 2024-05-16T18:12:27.288935 (0d+0h+0m+0s since start)
w! Process=349188; Thread=349188; GPU=NULL; Handle=NULL; StreamId=NULL.
h53 commented 5 months ago

Same error as @ifromeast; by the way, I tested it on WSL.

(base) h53@Nyx:~/repo/llm.c$ make train_gpt2cu USE_CUDNN=1
---------------------------------------------
✓ cuDNN found, will run with flash-attention
✓ OpenMP found
✗ OpenMPI is not found, disabling multi-GPU support
---> On Linux you can try install OpenMPI with `sudo apt install openmpi-bin openmpi-doc libopenmpi-dev`
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/usr/local/cuda-12.3/bin/nvcc -O3 -t=0 --use_fast_math --generate-code arch=compute_89,code=[compute_89,sm_89] -DENABLE_CUDNN -DENABLE_BF16 train_gpt2.cu cudnn_att.o -lcublas -lcublasLt -lcudnn -I/home/h53/cudnn-frontend/include  -o train_gpt2cu 
(base) h53@Nyx:~/repo/llm.c$ ./train_gpt2cu 
Multi-GPU support is disabled. Using a single GPU.
+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/tinyshakespeare/tiny_shakespeare_train.bin |
| val data pattern      | dev/data/tinyshakespeare/tiny_shakespeare_val.bin  |
| output log dir        | NULL                                               |
| checkpoint_every      | 0                                                  |
| resume                | 0                                                  |
| micro batch size B    | 4                                                  |
| sequence length T     | 1024                                               |
| total batch size      | 4096                                               |
| learning rate (LR)    | 3.000000e-04                                       |
| warmup iterations     | 0                                                  |
| final LR fraction     | 1.000000e+00                                       |
| weight decay          | 0.000000e+00                                       |
| grad_clip             | 1.000000e+00                                       |
| max_steps             | -1                                                 |
| val_loss_every        | 20                                                 |
| val_max_steps         | 20                                                 |
| sample_every          | 20                                                 |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
| recompute             | 1                                                  |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA GeForce RTX 4060 Ti                         |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
| load_filename         | gpt2_124M_bf16.bin                                 |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 74                                                 |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
| run hellaswag         | no                                                 |
+-----------------------+----------------------------------------------------+
| Zero Optimization is disabled                                              |
| num_processes         | 1                                                  |
| zero_stage            | 0                                                  |
+-----------------------+----------------------------------------------------+
HellaSwag eval not found at dev/data/hellaswag/hellaswag_val.bin, skipping its evaluation
You can run `python dev/data/hellaswag.py` to export and use it with `-h 1`.
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=4 * seq_len T=1024 * num_processes=1 and total_batch_size=4096
=> setting grad_accum_steps=1
allocating 1439 MiB for activations

W! CuDNN (v90101 17) function cudnnBackendFinalize() called:
w!         Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: userGraph->getEntranceNodesSize() != 2
w!         Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: numUserNodes != 5 && numUserNodes != 6
w! Time: 2024-05-28T07:01:07.380708 (0d+0h+0m+0s since start)
w! Process=210452; Thread=210452; GPU=NULL; Handle=NULL; StreamId=NULL.

E! CuDNN (v90101 17) function cudnnBackendFinalize() called:
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: rtc->loadModule()
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: ptr.isSupported()
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: engine_post_checks(*engine_iface, engine.getPerfKnobs(), req_size, engine.getTargetSMCount())
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: finalize_internal()
e! Time: 2024-05-28T07:01:07.741508 (0d+0h+0m+0s since start)
e! Process=210452; Thread=210452; GPU=NULL; Handle=NULL; StreamId=NULL.

E! CuDNN (v90101 17) function cudnnBackendFinalize() called:
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: rtc->loadModule()
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: ptr.isSupported()
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: engine_post_checks(*engine_iface, engine.getPerfKnobs(), req_size, engine.getTargetSMCount())
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: finalize_internal()
e! Time: 2024-05-28T07:01:07.960496 (0d+0h+0m+0s since start)
e! Process=210452; Thread=210452; GPU=NULL; Handle=NULL; StreamId=NULL.

E! CuDNN (v90101 17) function cudnnBackendFinalize() called:
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: rtc->loadModule()
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: ptr.isSupported()
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: engine_post_checks(*engine_iface, engine.getPerfKnobs(), req_size, engine.getTargetSMCount())
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: finalize_internal()
e! Time: 2024-05-28T07:01:08.192814 (0d+0h+0m+1s since start)
e! Process=210452; Thread=210452; GPU=NULL; Handle=NULL; StreamId=NULL.

[CUDNN ERROR] at file cudnn_att.cpp:141:
[cudnn_frontend] Error: No execution plans built successfully.
(base) h53@Nyx:~/repo/llm.c$ nvidia-smi
Tue May 28 07:02:59 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.65                 Driver Version: 551.86         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     On  |   00000000:01:00.0  On |                  N/A |
|  0%   37C    P8              8W /  165W |    1231MiB /  16380MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        33      G   /Xwayland                                   N/A      |
+-----------------------------------------------------------------------------------------+
(base) h53@Nyx:~/repo/llm.c$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:16:49_PDT_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0
yangcheng commented 5 months ago

I have a similar error, running on Ubuntu 22.04.

(base) ubuntu:~/llm.c$ nvidia-smi
Fri Jun  7 06:23:37 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-16GB           Off |   00000000:00:1E.0 Off |                    0 |
| N/A   32C    P0             23W /  300W |       1MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The full log is too long with CUDNN_LOGLEVEL_DBG=3; the last few lines are:

I! CuDNN (v90101 17) function cudnnBackendGetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_ENGINEHEUR_RESULTS (202);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     requestedElementCount: type=int64_t; val=0;
i!     elementCount: location=host; addr=0x7ffcd475f490;
i!     arrayOfElements: location=host; addr=NULL_PTR;
i! Time: 2024-06-07T06:07:28.129271 (0d+0h+0m+3s since start)
i! Process=19625; Thread=19625; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90101 17) function cudnnBackendGetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_ENGINEHEUR_RESULTS (202);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     requestedElementCount: type=int64_t; val=0;
i!     elementCount: location=host; addr=0x7ffcd475f3d8;
i!     arrayOfElements: location=host; addr=NULL_PTR;
i! Time: 2024-06-07T06:07:28.129400 (0d+0h+0m+3s since start)
i! Process=19625; Thread=19625; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90101 17) function cudnnBackendDestroyDescriptor() called:
i!     descriptor: type=CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR; val=NOT_IMPLEMENTED;
i! Time: 2024-06-07T06:07:28.129525 (0d+0h+0m+3s since start)
i! Process=19625; Thread=19625; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90101 17) function cudnnGetErrorString() called:
i!     status: type=int; val=0;
i! Time: 2024-06-07T06:07:28.129586 (0d+0h+0m+3s since start)
i! Process=19625; Thread=19625; GPU=NULL; Handle=NULL; StreamId=NULL.

[CUDNN ERROR] at file llmc/cudnn_att.cpp:112:
[cudnn_frontend] Error: No execution plans built successfully.
yangcheng commented 5 months ago

Fixed by upgrading the cuDNN version; I was previously on 8.9.2, which broke with the above error.

After upgrading from cuDNN 9.1.1 to 9.2.0, I get a new error. Which version are you using? Thanks.

I! CuDNN (v90101 17) function cudnnBackendGetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_ENGINEHEUR_RESULTS (202);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     requestedElementCount: type=int64_t; val=0;
i!     elementCount: location=host; addr=0x7ffcd475f3d8;
i!     arrayOfElements: location=host; addr=NULL_PTR;
i! Time: 2024-06-07T06:07:28.129400 (0d+0h+0m+3s since start)
i! Process=19625; Thread=19625; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90101 17) function cudnnBackendDestroyDescriptor() called:
i!     descriptor: type=CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR; val=NOT_IMPLEMENTED;
i! Time: 2024-06-07T06:07:28.129525 (0d+0h+0m+3s since start)
i! Process=19625; Thread=19625; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90101 17) function cudnnGetErrorString() called:
i!     status: type=int; val=0;
i! Time: 2024-06-07T06:07:28.129586 (0d+0h+0m+3s since start)
i! Process=19625; Thread=19625; GPU=NULL; Handle=NULL; StreamId=NULL.

[CUDNN ERROR] at file llmc/cudnn_att.cpp:112:
[cudnn_frontend] Error: No execution plans built successfully.
ahnseunghae commented 1 month ago

@h53 @yangcheng I encountered the same issue. It only works on Ampere or later GPUs. Therefore, I switched from a V100 to an A100.
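
For reference, a quick way to check whether a GPU is Ampere or later is to query its compute capability (the compute_cap query field requires a reasonably recent driver):

nvidia-smi --query-gpu=name,compute_cap --format=csv
# Ampere and later report 8.0+ (A100 = 8.0, RTX 4090 = 8.9); V100 is 7.0 and T4 is 7.5, hence the failures above.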