jp1924 closed this issue 1 year ago.
This should be defined in a file called cupti_runtime_cbid.h somewhere in the CUDA includes. I think it should be guarded by the #if defined(CUPTI_API_VERSION) && CUPTI_API_VERSION >= 17 on the previous line. Can you check whether your cupti_version.h matches? It might be that CUPTI_API_VERSION should be increased or decreased by 1.
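For reference, here is a minimal, self-contained sketch of the guard pattern being described; the function name isKernelLaunchCbid is made up for illustration and this is not kineto's actual code:

#include <cupti.h>

// Minimal sketch of guarding a CUPTI callback id that only exists in newer
// headers. CUpti_CallbackId and both CBID values come from the CUPTI headers.
static bool isKernelLaunchCbid(CUpti_CallbackId cbid) {
  switch (cbid) {
    case CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000:
#if defined(CUPTI_API_VERSION) && CUPTI_API_VERSION >= 17
    // Without this guard, older CUPTI headers fail to compile on the next
    // line, which is the error being discussed in this issue.
    case CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernelExC_v11060:
#endif
      return true;
    default:
      return false;
  }
}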
Thanks for the answer @davidberard98!
I realized that in cupti_version.h, the CUPTI_API_VERSION is 17.
And is CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernelExC_v11060 defined in cupti_runtime_cbid.h? You can compare to, say, https://gitlab.com/nvidia/headers/cuda-individual/cupti/-/blob/main/cupti_runtime_cbid.h?ref_type=heads - what's the last ID defined in your copy of cupti_runtime_cbid.h? My understanding was that CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernelExC_v11060 should be available in versions >= 17, but maybe this isn't true?
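One quick way to answer both questions is a tiny standalone probe that only compiles when the id exists; the file name probe_cbid.cpp and the include path are assumptions for a default CUDA install:

// probe_cbid.cpp - hypothetical standalone probe, not part of kineto.
// Build (include path is an assumption for a default CUDA install):
//   g++ probe_cbid.cpp -I/usr/local/cuda/extras/CUPTI/include -o probe_cbid
#include <cstdio>
#include <cupti.h>

int main() {
  std::printf("CUPTI_API_VERSION = %d\n", CUPTI_API_VERSION);
  std::printf("last id + 1 (CBID_SIZE) = %d\n",
              static_cast<int>(CUPTI_RUNTIME_TRACE_CBID_SIZE));
  // If the next line fails to compile, your cupti_runtime_cbid.h predates the
  // CUPTI that ships the cudaLaunchKernelExC callback id.
  std::printf("cudaLaunchKernelExC cbid = %d\n",
              static_cast<int>(CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernelExC_v11060));
  return 0;
}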
I can confirm that compiling torch (branch: release/2.1) fails due to this issue when using CUDA 11.7, but works fine when compiling with CUDA 11.8.
Sorry for the late reply! @davidberard98
I compared my cupti_runtime_cbid.h with the cupti_runtime_cbid.h in the CUPTI repo as you suggested, and they are different. I think it's because my version of CUPTI is not up to date. The environment I'm working in is a container built on the nvidia/cuda:11.7.1-cudnn8-devel-ubuntu22.04 image, and something in that container is off: I think I'm running an older version of CUPTI.
Thanks for the help!
This is my cupti_runtime_cbid.h:
// *************************************************************************
// Definitions of indices for API functions, unique across entire API
// *************************************************************************
// This file is generated. Any changes you make will be lost during the next clean build.
// CUDA public interface, for type definitions and cu* function prototypes
typedef enum CUpti_runtime_api_trace_cbid_enum {
CUPTI_RUNTIME_TRACE_CBID_INVALID = 0,
CUPTI_RUNTIME_TRACE_CBID_cudaDriverGetVersion_v3020 = 1,
CUPTI_RUNTIME_TRACE_CBID_cudaRuntimeGetVersion_v3020 = 2,
CUPTI_RUNTIME_TRACE_CBID_cudaGetDeviceCount_v3020 = 3,
CUPTI_RUNTIME_TRACE_CBID_cudaGetDeviceProperties_v3020 = 4,
...
CUPTI_RUNTIME_TRACE_CBID_cudaGraphNodeSetEnabled_v11060 = 426,
CUPTI_RUNTIME_TRACE_CBID_cudaGraphNodeGetEnabled_v11060 = 427,
CUPTI_RUNTIME_TRACE_CBID_cudaArrayGetMemoryRequirements_v11060 = 428,
CUPTI_RUNTIME_TRACE_CBID_cudaMipmappedArrayGetMemoryRequirements_v11060 = 429,
CUPTI_RUNTIME_TRACE_CBID_SIZE = 430,
CUPTI_RUNTIME_TRACE_CBID_FORCE_INT = 0x7fffffff
} CUpti_runtime_api_trace_cbid;
CUpti_runtime_api_trace_cbid_enum only goes up to 430.
@davidberard98 FYI, enum values above 430 are only added after CUDA 11.8 (diffs), so the dependency here blocks newer PyTorch builds on CUDA 11.7.
@xflash96 do you know the relationship between CUDA version, CUPTI version, and the number of enum values? The reason we changed this to check the CUPTI version instead of the CUDA version is that we had reports of people with CUDA 11.8 and a very old CUPTI version that didn't have this enum; we thought the CUPTI version would more accurately correspond to the available enum values, but it seems like this isn't the case.
@davidberard98 For CUPTI versioning vs CUDA (toolkit) versioning, v17 corresponds to CUDA 11.6 according to cupti_version.h. The enum values should grow with the toolkit if the toolkit is installed as a set. I guess most CI depends on NVIDIA's docker image for CUDA versioning, so that might serve as a ground truth for what's included. At least the current method doesn't work for the NVIDIA CUDA 11.7.1 docker image.
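If it helps to see the mismatch directly inside the container, here is a small diagnostic sketch; the file name version_check.cpp and the include paths are assumptions for a default CUDA install:

// version_check.cpp - hypothetical diagnostic, not part of kineto.
// Prints the CUDA toolkit version seen at compile time next to the CUPTI API
// version, which is where the 11.7.1 image's mismatch shows up.
// Build: g++ version_check.cpp -I/usr/local/cuda/include \
//            -I/usr/local/cuda/extras/CUPTI/include -o version_check
#include <cstdio>
#include <cuda.h>   // defines CUDA_VERSION, e.g. 11070 for CUDA 11.7
#include <cupti.h>  // pulls in cupti_version.h, which defines CUPTI_API_VERSION

int main() {
  std::printf("CUDA_VERSION      = %d\n", CUDA_VERSION);
  std::printf("CUPTI_API_VERSION = %d\n", CUPTI_API_VERSION);
  return 0;
}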
https://github.com/pytorch/kineto/issues/809 is changing it to v18
Seems like the main branch of PyTorch still can't build without manually changing /home/user/pytorch/third_party/kineto/libkineto/src/CuptiActivity.cpp:247 to 18. I'm using CUDA 12.0.
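For anyone applying the same workaround, the edit being described is roughly the following; treat it as a sketch, since the line number and surrounding code vary by kineto commit:

// In third_party/kineto/libkineto/src/CuptiActivity.cpp, bump the version guard:
//
//   before: #if defined(CUPTI_API_VERSION) && CUPTI_API_VERSION >= 17
//   after:  #if defined(CUPTI_API_VERSION) && CUPTI_API_VERSION >= 18
//
// pytorch/kineto#809 makes the equivalent change upstream.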
kineto pin hasn't been updated in pytorch yet
Hi PyTorch team,
When I build PyTorch from source, I encounter the following error.
When I replace CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernelExC_v11060 with CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000, Ninja builds normally.
Do you know why this is happening?
That code was modified in #788.
Env
OS: Ubuntu 22.04
CUDA: 11.7
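To make explicit why the replacement above compiles on CUDA 11.7, here is an illustrative sketch; the function name is made up, and the real check lives in kineto's CuptiActivity.cpp:

#include <cupti.h>

// The header excerpt earlier in this thread stops at
// CUPTI_RUNTIME_TRACE_CBID_SIZE == 430, so the cudaLaunchKernelExC id is not
// defined there and referencing it fails to compile.
static bool isLaunchCbidCuda117Workaround(CUpti_CallbackId cbid) {
  // CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernelExC_v11060 is intentionally not
  // referenced here, because that cupti_runtime_cbid.h does not define it.
  return cbid == CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000;
}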