Closed: stas00 closed this issue 3 years ago.
I tried to look for a similar traceback and found this:
https://forums.developer.nvidia.com/t/kernel-call-trace-observed-when-calling-cudafreehost-cudahostalloc-for-buffers-on-amd-cpu-with-nvi/72930
which suggests a potential issue with the IOMMU controller.
But since the problem happens on 2 identical machines, this is odd - though it could, of course, be a problem on both.
Meanwhile I also tried other pytorch versions and cuda versions (within 10.x) and saw no change (but don't have 11.x at the moment).
Google gets only a small handful of hits for: https://www.google.com/search?q=Call+Trace+%22iommu_unmap_page%22+%22__unmap_single.isra%22 So perhaps this is very specific to that particular machine setup.
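For anyone wanting to test the iommu theory on a similar setup, here is a rough sketch, assuming a GRUB-based Ubuntu box; amd_iommu=off is a standard kernel parameter, but I haven't verified that it makes this trace go away:

# check whether the AMD IOMMU is active
$ dmesg | grep -i -e iommu -e amd-vi

# to disable it, add amd_iommu=off to GRUB_CMDLINE_LINUX_DEFAULT
# in /etc/default/grub, then regenerate the config and reboot
$ sudo update-grub
$ sudo reboot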
The problem is still there on yet another box - running transformers/deepspeed tests crashes the box:
AMD Ryzen 9 again.
ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [YES] ...... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/sgugger/.pyenv/versions/3.7.9/envs/base/lib/python3.7/site-packages/torch']
torch version .................... 1.8.1+cu111
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/sgugger/git/DeepSpeed/deepspeed']
deepspeed info ................... 0.4.0+ccc522c, ccc522c, master
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 113
Model name: AMD Ryzen 9 3900X 12-Core Processor
Stepping: 0
Frequency boost: enabled
CPU MHz: 3056.768
CPU max MHz: 6078.5151
CPU min MHz: 2200.0000
BogoMIPS: 7588.61
Virtualization: AMD-V
L1d cache: 384 KiB
L1i cache: 384 KiB
L2 cache: 6 MiB
L3 cache: 64 MiB
NUMA node0 CPU(s): 0-23
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibpb stibp vmmcall sev_es fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
Update, 2 months later: now using pytorch-1.8.1 (still cuda-10.2), the problem reported in the OP on the 2 identical machines is gone. All tests pass without crashing the machine. It's hard to tell whether something changed in deepspeed or pytorch, since the rest of the environment hasn't changed. Actually, that's not quite true: I'm pretty sure the RAM was upped from 16 to 64GB, so that might be it.
The case from the most recent report https://github.com/microsoft/DeepSpeed/issues/925#issuecomment-853984722 is still there. This too is PyTorch 1.8.1, but with cuda-11.1! And I managed to get the kernel trace (this machine runs Arch-based Manjaro, whereas the previous ones were Ubuntu):
Also, this one has only 16GB of RAM, a good chunk of which is already taken by desktop apps, while the test suite peaks at around 13GB resident. So it's probably the lack of a cgroups setup that lets these processes crash the system.
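(For reference, one way to measure that resident peak is GNU time's -v flag, which reports the maximum resident set size; the pytest selection below is illustrative, not the exact command I ran.)

# /usr/bin/time is GNU time, not the shell builtin; it prints its stats to stderr
$ /usr/bin/time -v python -m pytest tests/deepspeed 2>&1 | grep 'Maximum resident'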
$ journalctl --since=today
Jun 08 18:37:23 brahms kernel: python3.7: page allocation failure: order:0, mode:0x50cc0(GFP_KERNEL|__GFP_NORETRY|__GFP_COMP), nodemask=(null),cpuset=/,mems_all>
Jun 08 18:37:23 brahms kernel: CPU: 7 PID: 582916 Comm: python3.7 Tainted: P OE 5.10.36-2-MANJARO #1
Jun 08 18:37:23 brahms kernel: Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 2606 08/13/2020
Jun 08 18:37:23 brahms kernel: Call Trace:
Jun 08 18:37:23 brahms kernel: dump_stack+0x6b/0x83
Jun 08 18:37:23 brahms kernel: warn_alloc.cold+0x78/0xdc
Jun 08 18:37:23 brahms kernel: __alloc_pages_slowpath.constprop.0+0xce8/0xd20
Jun 08 18:37:23 brahms kernel: ? blk_attempt_plug_merge+0xb0/0xe0
Jun 08 18:37:23 brahms kernel: ? __sbitmap_get_word+0x2a/0x80
Jun 08 18:37:23 brahms kernel: ? sched_clock+0x5/0x10
Jun 08 18:37:23 brahms kernel: ? sched_clock_cpu+0xc/0xb0
Jun 08 18:37:23 brahms kernel: ? sched_clock+0x5/0x10
Jun 08 18:37:23 brahms kernel: ? sched_clock_cpu+0xc/0xb0
Jun 08 18:37:23 brahms kernel: __alloc_pages_nodemask+0x338/0x370
Jun 08 18:37:23 brahms kernel: allocate_slab+0x341/0x4c0
Jun 08 18:37:23 brahms kernel: ___slab_alloc+0x3cc/0x5a0
Jun 08 18:37:23 brahms kernel: ? __uvm_kvmalloc+0x2c/0x80 [nvidia_uvm]
Jun 08 18:37:23 brahms kernel: ? uvm_va_range_create_external+0x2c/0xd0 [nvidia_uvm]
Jun 08 18:37:23 brahms kernel: ? kmem_cache_alloc+0x15c/0x2a0
Jun 08 18:37:23 brahms kernel: ? __uvm_kvmalloc+0x2c/0x80 [nvidia_uvm]
Jun 08 18:37:23 brahms kernel: __slab_alloc.constprop.0+0x27/0x40
Jun 08 18:37:23 brahms kernel: __kmalloc+0x2a5/0x2e0
Jun 08 18:37:23 brahms kernel: __uvm_kvmalloc+0x2c/0x80 [nvidia_uvm]
Jun 08 18:37:23 brahms kernel: uvm_ioctl+0x825/0x1f30 [nvidia_uvm]
Jun 08 18:37:23 brahms kernel: ? update_load_avg+0x7e/0x630
Jun 08 18:37:23 brahms kernel: ? sched_clock+0x5/0x10
Jun 08 18:37:23 brahms kernel: uvm_unlocked_ioctl_entry+0x143/0x190 [nvidia_uvm]
Jun 08 18:37:23 brahms kernel: __x64_sys_ioctl+0x83/0xb0
Jun 08 18:37:23 brahms kernel: do_syscall_64+0x33/0x40
Jun 08 18:37:23 brahms kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 08 18:37:23 brahms kernel: RIP: 0033:0x7f2b8e97fe6b
Jun 08 18:37:23 brahms kernel: Code: ff ff ff 85 c0 79 8b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f >
Jun 08 18:37:23 brahms kernel: RSP: 002b:00007fff8262ed98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun 08 18:37:23 brahms kernel: RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f2b8e97fe6b
Jun 08 18:37:23 brahms kernel: RDX: 00007fff8262f1f0 RSI: 0000000000000021 RDI: 0000000000000005
Jun 08 18:37:23 brahms kernel: RBP: 00007f294dae57c0 R08: 00007f294dae5850 R09: 0000000000000000
Jun 08 18:37:23 brahms kernel: R10: 00007f2863400000 R11: 0000000000000246 R12: 00007fff8262edb0
Jun 08 18:37:23 brahms kernel: R13: 00007fff8262f1f0 R14: 00007fff8262edc8 R15: 00007f294dae57c0
Jun 08 18:37:23 brahms kernel: Mem-Info:
Jun 08 18:37:23 brahms kernel: active_anon:164186 inactive_anon:2558446 isolated_anon:32
active_file:19260 inactive_file:15666 isolated_file:0
unevictable:25 dirty:27 writeback:145
slab_reclaimable:58981 slab_unreclaimable:138040
mapped:185648 shmem:2390787 pagetables:46378 bounce:0
free:71958 free_pcp:5 free_cma:0
Jun 08 18:37:23 brahms kernel: Node 0 active_anon:656744kB inactive_anon:10233784kB active_file:77040kB inactive_file:62664kB unevictable:100kB isolated(anon):1>
Jun 08 18:37:23 brahms kernel: Node 0 DMA free:13844kB min:64kB low:80kB high:96kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB ina>
Jun 08 18:37:23 brahms kernel: lowmem_reserve[]: 0 2648 15880 15880 15880
Jun 08 18:37:23 brahms kernel: Node 0 DMA32 free:89496kB min:36576kB low:39388kB high:42200kB reserved_highatomic:0KB active_anon:152968kB inactive_anon:2313356>
Jun 08 18:37:23 brahms kernel: lowmem_reserve[]: 0 0 13232 13232 13232
Jun 08 18:37:23 brahms kernel: Node 0 Normal free:184492kB min:182832kB low:196896kB high:210960kB reserved_highatomic:2048KB active_anon:503536kB inactive_anon>
Jun 08 18:37:23 brahms kernel: lowmem_reserve[]: 0 0 0 0 0
Jun 08 18:37:23 brahms kernel: Node 0 DMA: 5*4kB (U) 2*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 2*2048kB (UM) 2*409>
Jun 08 18:37:23 brahms kernel: Node 0 DMA32: 1241*4kB (ME) 3047*8kB (UME) 1088*16kB (UME) 549*32kB (UME) 394*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB>
Jun 08 18:37:23 brahms kernel: Node 0 Normal: 5756*4kB (UMEH) 13991*8kB (UMEH) 1170*16kB (UMEH) 964*32kB (UMH) 8*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*20>
Jun 08 18:37:23 brahms kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jun 08 18:37:23 brahms kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 08 18:37:23 brahms kernel: 2473574 total pagecache pages
Jun 08 18:37:23 brahms kernel: 47760 pages in swap cache
Jun 08 18:37:23 brahms kernel: Swap cache stats: add 7034904, delete 6987914, find 590427/985303
Jun 08 18:37:23 brahms kernel: Free swap = 49790460kB
Jun 08 18:37:23 brahms kernel: Total swap = 65535996kB
Jun 08 18:37:23 brahms kernel: 4171203 pages RAM
Jun 08 18:37:23 brahms kernel: 0 pages HighMem/MovableOnly
Jun 08 18:37:23 brahms kernel: 81710 pages reserved
Jun 08 18:37:23 brahms kernel: 0 pages cma reserved
Jun 08 18:37:23 brahms kernel: 0 pages hwpoisoned
Jun 08 18:37:23 brahms kernel: SLUB: Unable to allocate memory on node -1, gfp=0x10cc0(GFP_KERNEL|__GFP_NORETRY)
Jun 08 18:37:23 brahms kernel: cache: kmalloc-2k, object size: 2048, buffer size: 2048, default order: 3, min order: 0
Jun 08 18:37:23 brahms kernel: node 0: slabs: 1151, objs: 18318, free: 0
@RezaYazdaniAminabadi
So that last machine was upgraded to 32GB of CPU RAM and all is good now. The crash came down to the system not coping well with 0% free RAM (even though swap was still available).
I highly recommend configuring cgroups on machines with little RAM, to protect the main system from memory-hungry processes; and of course adding a serious chunk of swap goes a long way too.
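A minimal sketch of both, assuming systemd with cgroup v2 (the 12G cap, the test path and the 64G swapfile size are arbitrary illustrations, not recommendations):

# run the test suite in a transient cgroup capped at 12G of RAM, so when it
# overflows it gets swapped or OOM-killed instead of taking the desktop down
$ systemd-run --user --scope -p MemoryMax=12G python -m pytest tests/deepspeed

# add a serious chunk of swap
$ sudo fallocate -l 64G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile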
Thanks @stas00
I will work on adding some configurable parameters for that. Hopefully we can fix this for those low-RAM systems.
Reza
I'm trying my latest HF/transformers deepspeed tests on 4 different machines.
The AMD Ryzen 9 one is either taking forever or it crashes.
The machine from the last entry spews a ton of these and then crashes.
OK, the other difference is the CUDA versions.
The tests are very light - tiny batches for just a few iterations - so nothing is being stressed: the gpus are mostly idle. I don't think the RAM difference makes any difference either.
I'm using the release candidate branch that @jeffra made: https://github.com/microsoft/DeepSpeed/tree/multi-z3-prs. But it has been like this for a while now - I originally thought it was just some odd issue with this one machine, but now I'm seeing an identical problem on a second, identical machine.
OK, dmesg has a ton of these. Here is one of the problematic machines:
The problematic CPU on both machines:
Also, I used to run deepspeed just fine on this machine a few months ago, with the same cuda-10.2. Could it be related to the changes introduced by https://github.com/microsoft/DeepSpeed/pull/735, since that PR was made to address deepspeed segfaulting on AMD? Edit: I reverted the changes from this PR, rebuilt, and the problem is the same - so that PR didn't introduce this problem.
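(For completeness, the revert-and-rebuild cycle was roughly the sketch below; the exact commands may have differed. <merge-sha> stands in for the PR's merge commit, and DS_BUILD_OPS=1 is deepspeed's env var for prebuilding all ops.)

# revert the PR's merge commit (-m 1 picks the master-side parent)
$ git revert -m 1 <merge-sha>

# rebuild the extensions rather than relying on stale JIT caches
$ DS_BUILD_OPS=1 pip install -e . --no-cache-dir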
The problem happens both with master and with 0.3.13.
@jeffra, @RezaYazdaniAminabadi