Closed: stas00 closed this issue 3 years ago.
I tried to look for a similar traceback and found this:
https://forums.developer.nvidia.com/t/kernel-call-trace-observed-when-calling-cudafreehost-cudahostalloc-for-buffers-on-amd-cpu-with-nvi/72930
which suggests a potential issue with the IOMMU controller.
But since the problem happens on 2 identical machines, this is odd - though it could, of course, be a problem on both.
Meanwhile I also tried other pytorch versions and cuda versions (within 10.x) and saw no change (but don't have 11.x at the moment).
Google gets only a small handful of hits for: https://www.google.com/search?q=Call+Trace+%22iommu_unmap_page%22+%22__unmap_single.isra%22 So perhaps this is very specific to that particular machine setup.
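For anyone wanting to test the iommu theory on a similar setup, here is a rough sketch, assuming a GRUB-based Ubuntu box; amd_iommu=off is a standard kernel parameter, but I haven't verified that it makes this trace go away:

# check whether the AMD IOMMU is active
$ dmesg | grep -i -e iommu -e amd-vi

# to disable it, add amd_iommu=off to GRUB_CMDLINE_LINUX_DEFAULT
# in /etc/default/grub, then regenerate the config and reboot
$ sudo update-grub
$ sudo reboot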
The problem is still there on yet another box - running transformers/deepspeed tests crashes the box:
AMD Ryzen 9 again.
ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [YES] ...... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/sgugger/.pyenv/versions/3.7.9/envs/base/lib/python3.7/site-packages/torch']
torch version .................... 1.8.1+cu111
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/sgugger/git/DeepSpeed/deepspeed']
deepspeed info ................... 0.4.0+ccc522c, ccc522c, master
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 113
Model name: AMD Ryzen 9 3900X 12-Core Processor
Stepping: 0
Frequency boost: enabled
CPU MHz: 3056.768
CPU max MHz: 6078.5151
CPU min MHz: 2200.0000
BogoMIPS: 7588.61
Virtualization: AMD-V
L1d cache: 384 KiB
L1i cache: 384 KiB
L2 cache: 6 MiB
L3 cache: 64 MiB
NUMA node0 CPU(s): 0-23
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibpb stibp vmmcall sev_es fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
Update, 2 months later: now using pytorch-1.8.1 (still cuda-10.2), the problem reported in the OP on the 2 identical machines is gone. All tests pass without crashing the machine. It's hard to tell whether something changed in deepspeed or pytorch, since the rest of the environment hasn't changed. Actually, that's not quite true: I'm pretty sure the RAM was upped from 16 to 64GB, so that might be it.
The case from the most recent report https://github.com/microsoft/DeepSpeed/issues/925#issuecomment-853984722 is still there. This too is PyTorch 1.8.1, but with cuda-11.1! And I managed to get the kernel trace (this machine runs Arch-based Manjaro, whereas the previous ones were Ubuntu):
Also, this one has only 16GB of RAM, a good chunk of which is already taken by desktop apps, while the test suite peaks at around 13GB resident. So it's probably the lack of a cgroups setup that lets these processes crash the system.
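(For reference, one way to measure that resident peak is GNU time's -v flag, which reports the maximum resident set size; the pytest selection below is illustrative, not the exact command I ran.)

# /usr/bin/time is GNU time, not the shell builtin; it prints its stats to stderr
$ /usr/bin/time -v python -m pytest tests/deepspeed 2>&1 | grep 'Maximum resident'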
$ journalctl --since=today
Jun 08 18:37:23 brahms kernel: python3.7: page allocation failure: order:0, mode:0x50cc0(GFP_KERNEL|__GFP_NORETRY|__GFP_COMP), nodemask=(null),cpuset=/,mems_all>
Jun 08 18:37:23 brahms kernel: CPU: 7 PID: 582916 Comm: python3.7 Tainted: P OE 5.10.36-2-MANJARO #1
Jun 08 18:37:23 brahms kernel: Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 2606 08/13/2020
Jun 08 18:37:23 brahms kernel: Call Trace:
Jun 08 18:37:23 brahms kernel: dump_stack+0x6b/0x83
Jun 08 18:37:23 brahms kernel: warn_alloc.cold+0x78/0xdc
Jun 08 18:37:23 brahms kernel: __alloc_pages_slowpath.constprop.0+0xce8/0xd20
Jun 08 18:37:23 brahms kernel: ? blk_attempt_plug_merge+0xb0/0xe0
Jun 08 18:37:23 brahms kernel: ? __sbitmap_get_word+0x2a/0x80
Jun 08 18:37:23 brahms kernel: ? sched_clock+0x5/0x10
Jun 08 18:37:23 brahms kernel: ? sched_clock_cpu+0xc/0xb0
Jun 08 18:37:23 brahms kernel: ? sched_clock+0x5/0x10
Jun 08 18:37:23 brahms kernel: ? sched_clock_cpu+0xc/0xb0
Jun 08 18:37:23 brahms kernel: __alloc_pages_nodemask+0x338/0x370
Jun 08 18:37:23 brahms kernel: allocate_slab+0x341/0x4c0
Jun 08 18:37:23 brahms kernel: ___slab_alloc+0x3cc/0x5a0
Jun 08 18:37:23 brahms kernel: ? __uvm_kvmalloc+0x2c/0x80 [nvidia_uvm]
Jun 08 18:37:23 brahms kernel: ? uvm_va_range_create_external+0x2c/0xd0 [nvidia_uvm]
Jun 08 18:37:23 brahms kernel: ? kmem_cache_alloc+0x15c/0x2a0
Jun 08 18:37:23 brahms kernel: ? __uvm_kvmalloc+0x2c/0x80 [nvidia_uvm]
Jun 08 18:37:23 brahms kernel: __slab_alloc.constprop.0+0x27/0x40
Jun 08 18:37:23 brahms kernel: __kmalloc+0x2a5/0x2e0
Jun 08 18:37:23 brahms kernel: __uvm_kvmalloc+0x2c/0x80 [nvidia_uvm]
Jun 08 18:37:23 brahms kernel: uvm_ioctl+0x825/0x1f30 [nvidia_uvm]
Jun 08 18:37:23 brahms kernel: ? update_load_avg+0x7e/0x630
Jun 08 18:37:23 brahms kernel: ? sched_clock+0x5/0x10
Jun 08 18:37:23 brahms kernel: uvm_unlocked_ioctl_entry+0x143/0x190 [nvidia_uvm]
Jun 08 18:37:23 brahms kernel: __x64_sys_ioctl+0x83/0xb0
Jun 08 18:37:23 brahms kernel: do_syscall_64+0x33/0x40
Jun 08 18:37:23 brahms kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 08 18:37:23 brahms kernel: RIP: 0033:0x7f2b8e97fe6b
Jun 08 18:37:23 brahms kernel: Code: ff ff ff 85 c0 79 8b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f >
Jun 08 18:37:23 brahms kernel: RSP: 002b:00007fff8262ed98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun 08 18:37:23 brahms kernel: RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f2b8e97fe6b
Jun 08 18:37:23 brahms kernel: RDX: 00007fff8262f1f0 RSI: 0000000000000021 RDI: 0000000000000005
Jun 08 18:37:23 brahms kernel: RBP: 00007f294dae57c0 R08: 00007f294dae5850 R09: 0000000000000000
Jun 08 18:37:23 brahms kernel: R10: 00007f2863400000 R11: 0000000000000246 R12: 00007fff8262edb0
Jun 08 18:37:23 brahms kernel: R13: 00007fff8262f1f0 R14: 00007fff8262edc8 R15: 00007f294dae57c0
Jun 08 18:37:23 brahms kernel: Mem-Info:
Jun 08 18:37:23 brahms kernel: active_anon:164186 inactive_anon:2558446 isolated_anon:32
active_file:19260 inactive_file:15666 isolated_file:0
unevictable:25 dirty:27 writeback:145
slab_reclaimable:58981 slab_unreclaimable:138040
mapped:185648 shmem:2390787 pagetables:46378 bounce:0
free:71958 free_pcp:5 free_cma:0
Jun 08 18:37:23 brahms kernel: Node 0 active_anon:656744kB inactive_anon:10233784kB active_file:77040kB inactive_file:62664kB unevictable:100kB isolated(anon):1>
Jun 08 18:37:23 brahms kernel: Node 0 DMA free:13844kB min:64kB low:80kB high:96kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB ina>
Jun 08 18:37:23 brahms kernel: lowmem_reserve[]: 0 2648 15880 15880 15880
Jun 08 18:37:23 brahms kernel: Node 0 DMA32 free:89496kB min:36576kB low:39388kB high:42200kB reserved_highatomic:0KB active_anon:152968kB inactive_anon:2313356>
Jun 08 18:37:23 brahms kernel: lowmem_reserve[]: 0 0 13232 13232 13232
Jun 08 18:37:23 brahms kernel: Node 0 Normal free:184492kB min:182832kB low:196896kB high:210960kB reserved_highatomic:2048KB active_anon:503536kB inactive_anon>
Jun 08 18:37:23 brahms kernel: lowmem_reserve[]: 0 0 0 0 0
Jun 08 18:37:23 brahms kernel: Node 0 DMA: 5*4kB (U) 2*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 2*2048kB (UM) 2*409>
Jun 08 18:37:23 brahms kernel: Node 0 DMA32: 1241*4kB (ME) 3047*8kB (UME) 1088*16kB (UME) 549*32kB (UME) 394*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB>
Jun 08 18:37:23 brahms kernel: Node 0 Normal: 5756*4kB (UMEH) 13991*8kB (UMEH) 1170*16kB (UMEH) 964*32kB (UMH) 8*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*20>
Jun 08 18:37:23 brahms kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jun 08 18:37:23 brahms kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 08 18:37:23 brahms kernel: 2473574 total pagecache pages
Jun 08 18:37:23 brahms kernel: 47760 pages in swap cache
Jun 08 18:37:23 brahms kernel: Swap cache stats: add 7034904, delete 6987914, find 590427/985303
Jun 08 18:37:23 brahms kernel: Free swap = 49790460kB
Jun 08 18:37:23 brahms kernel: Total swap = 65535996kB
Jun 08 18:37:23 brahms kernel: 4171203 pages RAM
Jun 08 18:37:23 brahms kernel: 0 pages HighMem/MovableOnly
Jun 08 18:37:23 brahms kernel: 81710 pages reserved
Jun 08 18:37:23 brahms kernel: 0 pages cma reserved
Jun 08 18:37:23 brahms kernel: 0 pages hwpoisoned
Jun 08 18:37:23 brahms kernel: SLUB: Unable to allocate memory on node -1, gfp=0x10cc0(GFP_KERNEL|__GFP_NORETRY)
Jun 08 18:37:23 brahms kernel: cache: kmalloc-2k, object size: 2048, buffer size: 2048, default order: 3, min order: 0
Jun 08 18:37:23 brahms kernel: node 0: slabs: 1151, objs: 18318, free: 0
@RezaYazdaniAminabadi
So that last machine was upgraded to 32GB of CPU RAM and all is good now. The crash came down to the system not coping well with 0% free RAM (even though swap was still available).
I highly recommend configuring cgroups on machines with little RAM, to protect the main system from memory-hungry processes; and of course adding a serious chunk of swap goes a long way too.
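A minimal sketch of both, assuming systemd with cgroup v2 (the 12G cap, the test path and the 64G swapfile size are arbitrary illustrations, not recommendations):

# run the test suite in a transient cgroup capped at 12G of RAM, so when it
# overflows it gets swapped or OOM-killed instead of taking the desktop down
$ systemd-run --user --scope -p MemoryMax=12G python -m pytest tests/deepspeed

# add a serious chunk of swap
$ sudo fallocate -l 64G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile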
Thanks @stas00
I will work on adding some configurable parameters for that. Hopefully we can fix this for those low-RAM systems.
Reza
I'm trying my latest HF/transformers deepspeed tests on 4 different machines.
The AMD Ryzen 9 one is either taking forever or it crashes.
The machine from the last entry spews a ton of these and then crashes.
OK, the other difference is the CUDA versions.
The tests are very light - tiny batches for just a few iterations - so nothing is being stressed: the gpus are mostly idle. I don't think the RAM difference makes any difference either.
I'm using the release candidate branch that @jeffra made: https://github.com/microsoft/DeepSpeed/tree/multi-z3-prs. But it has been like this for a while now - I originally thought it was just some odd issue with this one machine, but now I'm seeing an identical problem on a second, identical machine.
OK, dmesg has a ton of these. Here is one of the problematic machines:
The problematic CPU on both machines:
Also, I used to run deepspeed just fine on this machine a few months ago, with the same cuda-10.2. Could it be related to the changes introduced by https://github.com/microsoft/DeepSpeed/pull/735, since that PR was made to address deepspeed segfaulting on AMD? Edit: I reverted the changes from this PR, rebuilt, and the problem is the same - so that PR didn't introduce this problem.
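(For completeness, the revert-and-rebuild cycle was roughly the sketch below; the exact commands may have differed. <merge-sha> stands in for the PR's merge commit, and DS_BUILD_OPS=1 is deepspeed's env var for prebuilding all ops.)

# revert the PR's merge commit (-m 1 picks the master-side parent)
$ git revert -m 1 <merge-sha>

# rebuild the extensions rather than relying on stale JIT caches
$ DS_BUILD_OPS=1 pip install -e . --no-cache-dir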
The problem happens both with master and with 0.3.13.
@jeffra, @RezaYazdaniAminabadi