pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

MulAggregation produces NaNs on non-Nan and non-Inf values #8726

Open will-leeson opened 9 months ago

will-leeson commented 9 months ago

🐛 Describe the bug

When using the MulAggregation class, I receive nan values when I believe I shouldn't. I'm not quite sure how to produce a minimal example, but I can give information about some tensors that produce nan values when given to a MulAggregation object. Starting with a (1319, 600) sized tensor, the MulAggregation layer produces a (1, 600) sized tensor in which 11 of the 600 values are nan. To determine which are nan and gather some statistics about them, I ran the following, where x is the output of my GNN:

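pre_pool = x  # keep the pre-pooling activations for inspection

# Confirm the GNN output contains no nan or inf before pooling
print(torch.isnan(x).any())
print(torch.isinf(x).any())

x = self.pool(x, batch)  # self.pool is the MulAggregation layer

# Indices of the output columns that came out as nan
bad_dim = torch.where(torch.isnan(x)[0])[0]

# Pre-pooling values of those columns
bad_slice = torch.index_select(pre_pool, 1, bad_dim)

print(bad_slice.size())
print(torch.max(bad_slice,dim=0)[0])
print(torch.min(bad_slice,dim=0)[0])
print(torch.sum(bad_slice,dim=0))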

Here is the output

tensor(False)
tensor(False)
torch.Size([1319, 11])
tensor([ 3.4977,  4.6247, 16.2319,  3.3301,  4.4815,  8.6890, 11.1488,  8.2191,
        12.1529, 13.7837,  9.8832], grad_fn=<MaxBackward0>)
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], grad_fn=<MinBackward0>)
tensor([453.3732, 421.0027, 373.4271, 422.8755, 430.7148, 415.5174, 385.8845,
        419.5864, 370.6562, 343.8626, 386.3253], grad_fn=<SumBackward1>)

Note that there are no nan or inf values in the tensor before the pooling, as shown by the isinf and isnan checks. The maximum value in each dimension is relatively small, between 3 and 17. The minimum value in each dimension is 0, which makes sense as there is a relu after the GNN. The sum across each dimension ranges between 300 and 500.

These all seem relatively small, and what's stranger is that I would expect the pool to produce all 0s instead of nans, as the minimum value across each dimension is 0 and 0 times anything is 0. This is dependent on the data. About 1 in 10 of my data points seem to produce this. I'm sorry I couldn't produce a better minimal example. I can print out the entire value of the tensor that produces the issue if it helps.

Versions

Collecting environment information...
PyTorch version: 2.1.0
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.2.0-39-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA TITAN RTX
Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
Stepping: 4
CPU max MHz: 4500.0000
CPU min MHz: 1200.0000
BogoMIPS: 6599.98
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req md_clear flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 320 KiB (10 instances)
L1i cache: 320 KiB (10 instances)
L2 cache: 10 MiB (10 instances)
L3 cache: 13.8 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-19
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable

Versions of relevant libraries:
[pip3] numpy==1.26.0
[pip3] torch==2.1.0
[pip3] torch_geometric==2.4.0
[pip3] torch-scatter==2.1.2
[pip3] torch-sparse==0.6.18
[pip3] torchaudio==2.1.0
[pip3] torchvision==0.16.0
[pip3] triton==2.1.0
[conda] blas 1.0 mkl
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
[conda] mkl 2023.1.0 h213fc3f_46343
[conda] mkl-service 2.4.0 py311h5eee18b_1
[conda] mkl_fft 1.3.8 py311h5eee18b_0
[conda] mkl_random 1.2.4 py311hdb19cb5_0
[conda] numpy 1.26.0 py311h08b1b3b_0
[conda] numpy-base 1.26.0 py311hf175353_0
[conda] pytorch 2.1.0 py3.11_cuda12.1_cudnn8.9.2_0 pytorch
[conda] pytorch-cuda 12.1 ha16c6d3_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch-scatter 2.1.2 py311_torch_2.1.0_cu121 pyg
[conda] pytorch-sparse 0.6.18 py311_torch_2.1.0_cu121 pyg
[conda] torch-geometric 2.4.0 pypi_0 pypi
[conda] torchaudio 2.1.0 py311_cu121 pytorch
[conda] torchtriton 2.1.0 py311 pytorch
[conda] torchvision 0.16.0 py311_cu121 pytorch

rusty1s commented 9 months ago

Thanks for reporting. Do you mind sharing your input data to reproduce this?

Also: Does this happen both on CPU/GPU? Does it happen when you run ...

will-leeson commented 9 months ago

Do you mind sharing your input data to reproduce this?

I'm happy to share any data that helps. I'm attaching a tensor here that produces both nans and infs when put through a MulAggregation. Its dimension is (5126, 600) with a max value of 100 and a min value of 0, with no nans or infs. Let me know if you need any other data.

Does this happen both on CPU/GPU? Does it happen when you run ...

Yes. I tried both CPU and GPU, and also tried both with and without the code you provided. Every combination produced NaNs.

rusty1s commented 9 months ago

Mh, I can even reproduce this when using regular PyTorch prod():

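import numpy as np
import torch

from torch_geometric.utils import scatter

# Load the attached tensor
x = np.load('bad_tensor.npz')['arr_0']
x = torch.from_numpy(x)

# Assign every row to group 0 so that scatter reduces over the whole tensor
index = torch.zeros(x.size(0), dtype=torch.long)

# Plain PyTorch product over the rows
print(x.prod(dim=0).isnan().sum())
print(x.prod(dim=0).isinf().sum())

# Same reduction through PyG's scatter
out = scatter(x, index, reduce='mul')
print(out.isnan().sum())
print(out.isinf().sum())

rusty1s commented 9 months ago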

I think the issue is that prod is just unstable for such large reductions. In most cases, the sequential multiplication runs into inf, and inf * 0 results in NaN.
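
The mechanism described above can be illustrated with a small sketch. It is not the code path `prod` actually takes: `sequential_prod` is a hypothetical helper that multiplies strictly left to right, and the column of values is made up. The real reduction order inside PyTorch may differ, which would fit the behaviour being data-dependent.

```python
import torch


def sequential_prod(values: torch.Tensor) -> torch.Tensor:
    """Multiply a 1-D tensor left to right in its own dtype (illustrative helper)."""
    acc = torch.ones((), dtype=values.dtype)
    for v in values:
        acc = acc * v
    return acc


# Made-up float32 column: twenty entries of 100 followed by a single zero.
col = torch.cat([torch.full((20,), 100.0), torch.zeros(1)])

partial = sequential_prod(col[:20])  # 100**20 = 1e40 exceeds the float32 max (~3.4e38)
result = sequential_prod(col)        # the overflowed partial product meets the zero

print(partial)  # tensor(inf)
print(result)   # tensor(nan)

# If the zero is seen first, the running product never overflows.
print(sequential_prod(col.flip(0)))  # tensor(0.)
```

So a zero only yields 0 if it is multiplied in before the running product has overflowed; once the accumulator is inf, the zero turns it into nan.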

will-leeson commented 9 months ago

Ahh, I see. Is it worth bringing to the attention of the PyTorch people, or is this an "expected" issue?

rusty1s commented 9 months ago

I think this is expected behavior.
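
The thread does not propose a workaround. As one possible sketch, zeros can be masked out before the product and restored afterwards, so that a zero never meets an overflowed partial product. `zero_aware_prod` is a hypothetical helper, not part of PyTorch or PyG, it reduces over a single dimension rather than per-graph groups, and the data is made up.

```python
import torch


def zero_aware_prod(x: torch.Tensor, dim: int = 0) -> torch.Tensor:
    """Product along `dim` that returns exactly 0 wherever a zero is present.

    Hypothetical helper for illustration; not part of PyTorch or PyG.
    """
    is_zero = x == 0
    has_zero = is_zero.any(dim=dim)
    # Swap zeros for ones so a zero never meets an overflowed partial product.
    prod = torch.where(is_zero, torch.ones_like(x), x).prod(dim=dim)
    # Put the mathematically correct 0 back for slices that contained a zero.
    return torch.where(has_zero, torch.zeros_like(prod), prod)


# Made-up float32 data: 50 rows, 3 columns.
x = torch.full((50, 3), 100.0)
x[-1, 0] = 0.0  # column 0: large values plus one zero
x[:, 2] = 0.5   # column 2: small but representable product (2**-50)

out = zero_aware_prod(x)
print(out)                # column 0 is 0, column 1 is inf, column 2 is finite
print(out.isnan().any())  # tensor(False)
```

Columns whose true product exceeds the floating-point range still come out as inf, which is the expected overflow rather than a nan. Casting to float64 would not help for the tensor in this issue: with entries up to 100 and thousands of rows, the product can exceed the float64 maximum (about 1.8e308) as well.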