pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

`torch.cuda.memory_summary()` can give `KeyError` #117130

Open Jasha10 opened 6 months ago

Jasha10 commented 6 months ago

🐛 Describe the bug

Calling torch.cuda.memory_summary() can raise a KeyError, for example when it is the first CUDA call in a fresh interpreter session:

$ python
Python 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:20:04) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch.cuda
>>> torch.cuda.memory_summary()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/homestar/miniconda3/envs/utils/lib/python3.10/site-packages/torch/cuda/memory.py", line 496, in memory_summary
    current = stats[prefix + "current"]
KeyError: 'allocated_bytes.all.current'
>>> torch.cuda.memory_stats()
OrderedDict()
>>>

The root cause is that torch.cuda.memory_stats(), which torch.cuda.memory_summary() uses internally, returned an empty OrderedDict that lacks the expected keys, such as 'allocated_bytes.all.current'.
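The failure mode can be illustrated without a GPU. This is a minimal sketch, not the PyTorch source: `read_current` is a hypothetical helper, and the only detail taken from the report is the key name and the empty `OrderedDict` that `memory_stats()` returned. It shows that a plain subscript lookup on the empty mapping raises `KeyError`, while a `.get()` lookup returns `None`.

```python
from collections import OrderedDict


def read_current(stats, prefix="allocated_bytes.all."):
    """Hypothetical helper: return the '<prefix>current' stat, or None if absent."""
    return stats.get(prefix + "current")


# Stand-in for what torch.cuda.memory_stats() returned in the report above.
empty_stats = OrderedDict()

# A plain subscript lookup, like the one in the traceback, fails on the empty mapping.
try:
    empty_stats["allocated_bytes.all.current"]
    raised = False
except KeyError:
    raised = True

print("subscript lookup raised KeyError:", raised)
print("defensive lookup returned:", read_current(empty_stats))
```

Whether a missing key should be reported as zero, skipped, or surfaced as a clearer error is a design decision for the maintainers; the sketch only shows where the exception comes from.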

Versions

$ python collect_env.py
Collecting environment information...
PyTorch version: 2.1.1
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA T1000 8GB
GPU 2: NVIDIA GeForce RTX 3090

Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      48 bits physical, 48 bits virtual
CPU(s):                             64
On-line CPU(s) list:                0-63
Thread(s) per core:                 2
Core(s) per socket:                 32
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          AuthenticAMD
CPU family:                         25
Model:                              8
Model name:                         AMD Ryzen Threadripper PRO 5975WX 32-Cores
Stepping:                           2
Frequency boost:                    enabled
CPU MHz:                            1800.000
CPU max MHz:                        7006.6401
CPU min MHz:                        1800.0000
BogoMIPS:                           7186.26
Virtualization:                     AMD-V
L1d cache:                          1 MiB
L1i cache:                          1 MiB
L2 cache:                           16 MiB
L3 cache:                           128 MiB
NUMA node0 CPU(s):                  0-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm

Versions of relevant libraries:
[pip3] autoflake8==0.4.1
[pip3] flake8==6.1.0
[pip3] mypy==1.7.1
[pip3] mypy-extensions==1.0.0
[pip3] mypy-protobuf==3.5.0
[pip3] numpy==1.22.4
[pip3] pytorch-warmup==0.1.1
[pip3] torch==2.1.1
[pip3] torchdata==0.7.1
[pip3] torchvision==0.16.1
[pip3] triton==2.1.0
[conda] blas                      1.0                         mkl    conda-forge
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] libblas                   3.9.0            16_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            16_linux64_mkl    conda-forge
[conda] libjpeg-turbo             2.0.0                h9bf148f_0    pytorch
[conda] liblapack                 3.9.0            16_linux64_mkl    conda-forge
[conda] mkl                       2022.2.1         h84fe81f_16997    conda-forge
[conda] numpy                     1.22.4          py310h4ef5377_0    conda-forge
[conda] pytorch                   2.1.1           py3.10_cuda12.1_cudnn8.9.2_0    pytorch
[conda] pytorch-cuda              12.1                 ha16c6d3_5    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] pytorch-warmup            0.1.1                    pypi_0    pypi
[conda] torchdata                 0.7.1                     py310    pytorch
[conda] torchtriton               2.1.0                     py310    pytorch
[conda] torchvision               0.16.1              py310_cu121    pytorch

cc @ptrblck

tringwald commented 6 months ago

I can reliably reproduce this when calling torch.cuda.memory_summary() with a device given as an int or str (e.g. cuda:0). The problem is that in this case _lazy_init() is never called. Some other functions, such as torch.cuda.reset_peak_memory_stats('cuda:0'), are affected too.
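Until a fix lands, one possible user-side workaround is to initialize CUDA explicitly before asking for the summary. The sketch below assumes that calling the public `torch.cuda.init()` first is enough to make `memory_stats()` return populated keys; that assumption has not been verified against every PyTorch version. `memory_summary_after_init` and its `cuda` parameter are hypothetical names introduced here, and the parameter exists only so the helper can be exercised with a stand-in object on a machine without a GPU.

```python
def memory_summary_after_init(device=None, cuda=None):
    """Return torch.cuda.memory_summary(device) after explicitly initializing CUDA.

    Returns None when CUDA is not available. `cuda` defaults to `torch.cuda`
    and is injectable only for testing (a hypothetical convenience).
    """
    if cuda is None:
        import torch  # imported lazily so the helper can be defined without torch installed

        cuda = torch.cuda

    if not cuda.is_available():
        return None

    # Assumption: explicit initialization avoids the empty memory_stats() result.
    cuda.init()
    return cuda.memory_summary(device)
```

On a CUDA machine, `print(memory_summary_after_init("cuda:0"))` would be the typical call.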

this commented 1 month ago

This issue is related to https://github.com/pytorch/pytorch/issues/49952. Unfortunately, PR https://github.com/pytorch/pytorch/pull/51179 did not fix it in the memory_summary function.

tringwald commented 1 month ago

It seems that https://github.com/pytorch/pytorch/pull/117143 was never merged and was closed as stale.