Precision Differences in Using `__add__`, `layer_norm` and `linear`

🐛 Describe the bug

Description:

I have encountered some precision differences when using the __add__, layer_norm and linear in the following way.

Code to Reproduce:

import torch.nn.functional as f
import torch

# Precision error is calculated based on the Chebyshev distance.

args = torch.load('__add__.pt')

output = torch.Tensor.__add__(args['parameter:0'], args['parameter:1'])

args = torch.load('layer_norm.pt')

output = f.layer_norm(output, args['parameter:1'], args['parameter:2'], args['parameter:3'], args['parameter:4'])

args = torch.load('linear.pt')

output = f.linear(output, args['parameter:1'], args['parameter:2'])

Precision Differences:

__add__:
- CPU: 88219.25781
- GPU: 88219.25781
- Precision Differences: 0.
layer_norm:
- CPU: 64.07816
- GPU: 64.07813
- Precision Differences: 3.05176e-05
linear:
- CPU: 140435.31250
- GPU: 140435.03125
- Precision Differences: 0.28125

System Info:

PyTorch Version: 2.3.1+cu121
OS: Ubuntu 20.04.3 LTS
Python version: Python 3.8.0
CUDA version: 12.1

Expected Behavior: The calculation results of every API are consistent across CPU and GPU, with precision differences less than 1e-3.

Actual Behavior: The CPU and GPU results of the third API linear show differences greater than the accepted threshold 1e-3.

Additional Context: 5200 pt files

Thank you for your attention to this matter. Please let me know if any further information is required.

Versions

PyTorch version: 2.3.1+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64) GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 Clang version: Could not collect CMake version: Could not collect Libc version: glibc-2.31

Python version: 3.8.0 (default, Nov 6 2019, 21:49:08) [GCC 7.3.0] (64-bit runtime) Python platform: Linux-5.15.0-100-generic-x86_64-with-glibc2.10 Is CUDA available: True CUDA runtime version: 10.1.243 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA GeForce RTX 2080 Ti GPU 1: NVIDIA GeForce RTX 2080 Ti GPU 2: NVIDIA GeForce RTX 2080 Ti

Nvidia driver version: 550.54.14 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True

CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 45 bits physical, 48 bits virtual CPU(s): 80 On-line CPU(s) list: 40,58,63,66 Off-line CPU(s) list: 0-39,41-57,59-62,64,65,67-79 Thread(s) per core: 0 Core(s) per socket: 40 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 106 Model name: Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz Stepping: 6 CPU MHz: 2199.999 BogoMIPS: 4399.99 L1d cache: 3.8 MiB L1i cache: 2.5 MiB L2 cache: 100 MiB L3 cache: 78 MiB NUMA node0 CPU(s): 0-39 NUMA node1 CPU(s): 40-79 Vulnerability Gather data sampling: Vulnerable: No microcode Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear flush_l1d arch_capabilities

Versions of relevant libraries: [pip3] numpy==1.24.4 [pip3] torch==2.3.1 [pip3] torchaudio==2.3.1 [pip3] torchvision==0.18.1 [pip3] triton==2.3.1 [conda] numpy 1.24.4 pypi_0 pypi [conda] torch 2.3.1 pypi_0 pypi [conda] torchaudio 2.3.1 pypi_0 pypi [conda] torchvision 0.18.1 pypi_0 pypi [conda] triton 2.3.1 pypi_0 pypi

pytorch / pytorch