pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Python sometimes crashes inexplicably, with "/var/log/kern.log" displaying index errors #127275

Open danyow-cheung opened 3 months ago

danyow-cheung commented 3 months ago

🐛 Describe the bug

I ran two Python programs on a Linux machine, both using the same GPU. Sometimes the Python programs crash. Even when I just start a program, /var/log/kern.log immediately shows an error message. Here is some information:

2024-05-15T10:09:25.118507+08:00 trana6b-Default-string kernel: [  826.919949] UBSAN: array-index-out-of-bounds in /tmp/selfgz2789/NVIDIA-Linux-x86_64-535.154.05/kernel/nvidia-uvm/uvm_pmm_gpu.c:857:39
2024-05-15T10:09:25.118508+08:00 trana6b-Default-string kernel: [  826.919950] index 0 is out of range for type 'uvm_gpu_chunk_t *[*]'
2024-05-15T10:09:25.118508+08:00 trana6b-Default-string kernel: [  826.919951] CPU: 11 PID: 5286 Comm: python Tainted: P           OE      6.5.0-9-generic #9-Ubuntu
2024-05-15T10:09:25.118508+08:00 trana6b-Default-string kernel: [  826.919952] Hardware name: Gigabyte Technology Co., Ltd. Z790 AORUS ELITE AX/Z790 AORUS ELITE AX, BIOS FHe 12/08/2023
2024-05-15T10:09:25.118508+08:00 trana6b-Default-string kernel: [  826.919953] Call Trace:
2024-05-15T10:09:25.118508+08:00 trana6b-Default-string kernel: [  826.919953]  <TASK>
2024-05-15T10:09:25.118509+08:00 trana6b-Default-string kernel: [  826.919954]  dump_stack_lvl+0x48/0x70
2024-05-15T10:09:25.118509+08:00 trana6b-Default-string kernel: [  826.919956]  dump_stack+0x10/0x20
2024-05-15T10:09:25.118509+08:00 trana6b-Default-string kernel: [  826.919957]  __ubsan_handle_out_of_bounds+0xc6/0x110
2024-05-15T10:09:25.118509+08:00 trana6b-Default-string kernel: [  826.919959]  merge_gpu_chunk+0xc6/0x1d0 [nvidia_uvm]
2024-05-15T10:09:25.118510+08:00 trana6b-Default-string kernel: [  826.919987]  free_chunk_with_merges+0x13d/0x180 [nvidia_uvm]
2024-05-15T10:09:25.118510+08:00 trana6b-Default-string kernel: [  826.920013]  free_chunk+0xa4/0xd0 [nvidia_uvm]
2024-05-15T10:09:25.118510+08:00 trana6b-Default-string kernel: [  826.920039]  uvm_pmm_gpu_free+0xbf/0xf0 [nvidia_uvm]
2024-05-15T10:09:25.118510+08:00 trana6b-Default-string kernel: [  826.920064]  phys_mem_deallocate+0x33/0xd0 [nvidia_uvm]
2024-05-15T10:09:25.118511+08:00 trana6b-Default-string kernel: [  826.920093]  uvm_page_tree_put_ptes_async+0x4d5/0x580 [nvidia_uvm]
2024-05-15T10:09:25.118511+08:00 trana6b-Default-string kernel: [  826.920123]  uvm_page_table_range_vec_deinit+0x3e/0xd0 [nvidia_uvm]
2024-05-15T10:09:25.118511+08:00 trana6b-Default-string kernel: [  826.920151]  uvm_ext_gpu_map_destroy+0xd7/0x1f0 [nvidia_uvm]
2024-05-15T10:09:25.118511+08:00 trana6b-Default-string kernel: [  826.920176]  uvm_va_range_destroy+0x324/0x590 [nvidia_uvm]
2024-05-15T10:09:25.118511+08:00 trana6b-Default-string kernel: [  826.920203]  ? _nv025923rm+0x2b/0xf0 [nvidia]
2024-05-15T10:09:25.118512+08:00 trana6b-Default-string kernel: [  826.920401]  ? _nv043203rm+0xe9/0x1c0 [nvidia]
2024-05-15T10:09:25.118512+08:00 trana6b-Default-string kernel: [  826.920648]  uvm_api_free+0x188/0x320 [nvidia_uvm]
2024-05-15T10:09:25.118512+08:00 trana6b-Default-string kernel: [  826.920667]  uvm_ioctl+0xf6e/0x1cd0 [nvidia_uvm]
2024-05-15T10:09:25.118512+08:00 trana6b-Default-string kernel: [  826.920683]  ? _raw_spin_lock_irqsave+0xe/0x20
2024-05-15T10:09:25.118513+08:00 trana6b-Default-string kernel: [  826.920684]  ? os_acquire_spinlock+0x12/0x30 [nvidia]
2024-05-15T10:09:25.118513+08:00 trana6b-Default-string kernel: [  826.920828]  ? os_release_spinlock+0x1a/0x30 [nvidia]
2024-05-15T10:09:25.118513+08:00 trana6b-Default-string kernel: [  826.920970]  ? _nv047682rm+0xed/0x1d0 [nvidia]
2024-05-15T10:09:25.118513+08:00 trana6b-Default-string kernel: [  826.921113]  ? _nv043407rm+0x77/0xd0 [nvidia]
2024-05-15T10:09:25.118513+08:00 trana6b-Default-string kernel: [  826.921263]  ? _nv011756rm+0x86/0xa0 [nvidia]
2024-05-15T10:09:25.118514+08:00 trana6b-Default-string kernel: [  826.921413]  ? _raw_spin_lock_irqsave+0xe/0x20
2024-05-15T10:09:25.118514+08:00 trana6b-Default-string kernel: [  826.921414]  ? _raw_spin_lock_irqsave+0xe/0x20
2024-05-15T10:09:25.118514+08:00 trana6b-Default-string kernel: [  826.921415]  ? thread_context_non_interrupt_add+0x13a/0x2c0 [nvidia_uvm]
2024-05-15T10:09:25.118514+08:00 trana6b-Default-string kernel: [  826.921439]  uvm_unlocked_ioctl_entry.part.0+0x7b/0xf0 [nvidia_uvm]
2024-05-15T10:09:25.118515+08:00 trana6b-Default-string kernel: [  826.921455]  ? nvidia_ioctl+0x369/0x8a0 [nvidia]
2024-05-15T10:09:25.118515+08:00 trana6b-Default-string kernel: [  826.921595]  ? kfree+0x78/0x120
2024-05-15T10:09:25.118515+08:00 trana6b-Default-string kernel: [  826.921596]  ? nvidia_ioctl+0x369/0x8a0 [nvidia]
2024-05-15T10:09:25.118515+08:00 trana6b-Default-string kernel: [  826.921736]  uvm_unlocked_ioctl_entry+0x6b/0x90 [nvidia_uvm]
2024-05-15T10:09:25.118516+08:00 trana6b-Default-string kernel: [  826.921752]  __x64_sys_ioctl+0xa0/0xf0
2024-05-15T10:09:25.118516+08:00 trana6b-Default-string kernel: [  826.921753]  do_syscall_64+0x59/0x90
2024-05-15T10:09:25.118516+08:00 trana6b-Default-string kernel: [  826.921754]  ? syscall_exit_to_user_mode+0x37/0x60
2024-05-15T10:09:25.118516+08:00 trana6b-Default-string kernel: [  826.921756]  ? do_syscall_64+0x68/0x90
2024-05-15T10:09:25.118516+08:00 trana6b-Default-string kernel: [  826.921757]  ? rcu_core_si+0xe/0x20
2024-05-15T10:09:25.118517+08:00 trana6b-Default-string kernel: [  826.921757]  ? __do_softirq+0xd6/0x346
2024-05-15T10:09:25.118517+08:00 trana6b-Default-string kernel: [  826.921759]  ? hrtimer_interrupt+0x11f/0x250
2024-05-15T10:09:25.118517+08:00 trana6b-Default-string kernel: [  826.921759]  ? exit_to_user_mode_prepare+0x30/0xb0
2024-05-15T10:09:25.118517+08:00 trana6b-Default-string kernel: [  826.921761]  ? irqentry_exit_to_user_mode+0x17/0x20
2024-05-15T10:09:25.118519+08:00 trana6b-Default-string kernel: [  826.921762]  ? irqentry_exit+0x43/0x50
2024-05-15T10:09:25.118520+08:00 trana6b-Default-string kernel: [  826.921763]  ? sysvec_apic_timer_interrupt+0x4b/0xd0
2024-05-15T10:09:25.118520+08:00 trana6b-Default-string kernel: [  826.921764]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
2024-05-15T10:09:25.118520+08:00 trana6b-Default-string kernel: [  826.921765] RIP: 0033:0x7fd25c72396f
2024-05-15T10:09:25.118520+08:00 trana6b-Default-string kernel: [  826.921769] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
2024-05-15T10:09:25.118521+08:00 trana6b-Default-string kernel: [  826.921770] RSP: 002b:00007fd2111db6d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
2024-05-15T10:09:25.118521+08:00 trana6b-Default-string kernel: [  826.921770] RAX: ffffffffffffffda RBX: 00007fd048ef01c0 RCX: 00007fd25c72396f
2024-05-15T10:09:25.118521+08:00 trana6b-Default-string kernel: [  826.921771] RDX: 00007fd2111db730 RSI: 0000000000000022 RDI: 0000000000000005
2024-05-15T10:09:25.118530+08:00 trana6b-Default-string kernel: [  826.921771] RBP: 00007fd2111db780 R08: 0000000000000000 R09: 0000000000000000
2024-05-15T10:09:25.118530+08:00 trana6b-Default-string kernel: [  826.921772] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
2024-05-15T10:09:25.118530+08:00 trana6b-Default-string kernel: [  826.921772] R13: 00007fd048ef01c0 R14: 00007fd2111db730 R15: 0000000000000005
2024-05-15T10:09:25.118530+08:00 trana6b-Default-string kernel: [  826.921773]  </TASK>
2024-05-15T10:09:25.118530+08:00 trana6b-Default-string kernel: [  826.921775] ================================================================================

I asked for help on the NVIDIA forum, but they asked me to seek help from PyTorch.

Here's the link: https://forums.developer.nvidia.com/t/linux-always-crash-when-run-the-python-program-using-gpu/292905

Looking forward to a reply.

Versions

torch version is different

nvidia-smi driver version :

Driver Version: 535.154.05   CUDA Version: 12.2 

nvcc --version

Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
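The version details above could also be gathered from inside Python, which makes it easier to see which CUDA version the installed torch build targets. The following is only a minimal sketch: the helper name `collect_versions` is hypothetical, and it assumes `torch` may or may not be importable in the environment. PyTorch also ships `python -m torch.utils.collect_env`, which prints a fuller report.

```python
import importlib
import platform
import sys


def collect_versions():
    """Return a dict of version strings relevant to this kind of report.

    `collect_versions` is a hypothetical helper, not part of PyTorch.
    """
    info = {
        "python": sys.version.split()[0],
        "kernel": platform.release(),
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # torch is not installed in this interpreter
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    # CUDA version the torch binary was built against (None for CPU-only builds)
    info["torch_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Running this in each of the two environments would show whether their torch builds differ.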

cc @ptrblck @msaroufim

soulitzer commented 3 months ago

Thanks for the report. Do you have a short self-contained script that can be used to reproduce the issue?

malfet commented 3 months ago

A crash in the UVM driver strongly suggests that at least part of the problem is there, as a kernel driver should never crash, even if user input is completely wrong.

danyow-cheung commented 3 months ago

> Thanks for the report. Do you have a short self-contained script that can be used to reproduce the issue?

Thank you for replying. I used bert-vits2 for inference.

The main problem is that I don't know where or when the program crashes. In addition, when I start the program with bert-vits2, an error immediately appears in the log, but the program still works.
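One way to narrow down where and when a Python process dies is the standard-library `faulthandler` module, which dumps the Python traceback of all threads when the process receives a fatal signal such as SIGSEGV. This is only a sketch: the helper name `enable_crash_traces` and the file name `crash_trace.log` are hypothetical, and it will not produce a traceback if the kernel kills the process without delivering one of those signals.

```python
import faulthandler
import time


def enable_crash_traces(path="crash_trace.log"):
    """Write Python tracebacks to `path` on SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL.

    Hypothetical helper. The returned file object must stay open for as
    long as faulthandler is enabled, so keep a reference to it.
    """
    log = open(path, "a")
    # Record the start time so a later traceback can be matched to kern.log
    log.write(f"--- process started {time.strftime('%Y-%m-%dT%H:%M:%S%z')} ---\n")
    log.flush()
    faulthandler.enable(file=log, all_threads=True)
    return log
```

Calling `enable_crash_traces()` at the top of the inference script, before importing torch, would leave a traceback in the file if the interpreter segfaults. Setting the environment variable `PYTHONFAULTHANDLER=1` enables the same handler on stderr without code changes.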

I would also like to provide more information about this error. If needed, I can share my inference setup for bert-vits2.

2024-05-27T09:52:30.369058+08:00 trana6a-Default-string kernel: [  238.766333] Modules linked in: nvidia_uvm(POE) veth xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables libcrc32c nfnetlink br_netfilter bridge stp llc rfcomm snd_seq_dummy snd_hrtimer nls_utf8 cifs cifs_arc4 cifs_md4 fscache netfs ccm overlay cmac algif_hash algif_skcipher af_alg bnep binfmt_misc nls_iso8859_1 intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_hda_codec_realtek snd_soc_acpi soundwire_generic_allocation snd_hda_codec_generic soundwire_bus ledtrig_audio snd_hda_codec_hdmi snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine x86_pkg_temp_thermal intel_powerclamp snd_hda_intel coretemp snd_intel_dspcfg
2024-05-27T09:52:30.369060+08:00 trana6a-Default-string kernel: [  238.766369]  snd_intel_sdw_acpi iwlmvm nvidia_drm(POE) snd_hda_codec kvm_intel nvidia_modeset(POE) snd_hda_core snd_hwdep mac80211 kvm snd_pcm btusb btrtl btbcm snd_seq_midi btintel snd_seq_midi_event btmtk libarc4 snd_rawmidi nvidia(POE) irqbypass crct10dif_pclmul bluetooth cmdlinepart polyval_clmulni snd_seq polyval_generic iwlwifi ghash_clmulni_intel spi_nor ecdh_generic aesni_intel ecc snd_seq_device mtd snd_timer crypto_simd cryptd rapl snd cfg80211 mei_me spi_intel_pci intel_cstate i2c_i801 gigabyte_wmi wmi_bmof intel_lpss_pci drm_kms_helper intel_lpss spi_intel soundcore i2c_smbus mei idma64 intel_hid acpi_tad acpi_pad sparse_keymap mac_hid msr parport_pc ppdev lp parport drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 hid_generic usbhid hid nvme crc32_pclmul r8169 nvme_core ahci video xhci_pci realtek libahci xhci_pci_renesas nvme_common wmi pinctrl_alderlake
2024-05-27T09:52:30.369060+08:00 trana6a-Default-string kernel: [  238.766408] CR2: ffffb699c264ff58
2024-05-27T09:52:30.369061+08:00 trana6a-Default-string kernel: [  238.766409] ---[ end trace 0000000000000000 ]---
2024-05-27T09:52:30.369061+08:00 trana6a-Default-string kernel: [  238.833290] RIP: 0010:0xffffb699c264ff58
2024-05-27T09:52:30.369062+08:00 trana6a-Default-string kernel: [  238.833303] Code: ff ff b8 99 33 92 ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 e6 00 40 92 ff ff ff ff <38> 9d 3d 9d 0b 7f 00 00 80 99 3d 9d 0b 7f 00 00 80 03 00 00 00 00
2024-05-27T09:52:30.369062+08:00 trana6a-Default-string kernel: [  238.833304] RSP: 0018:ffffb699c264fed8 EFLAGS: 00010046
2024-05-27T09:52:30.369063+08:00 trana6a-Default-string kernel: [  238.833306] RAX: 0000000000000000 RBX: ffffb699c264ff58 RCX: 0000000000000000
2024-05-27T09:52:30.369063+08:00 trana6a-Default-string kernel: [  238.833307] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
2024-05-27T09:52:30.369064+08:00 trana6a-Default-string kernel: [  238.833307] RBP: ffffb699c264fed8 R08: 0000000000000000 R09: 0000000000000000
2024-05-27T09:52:30.369064+08:00 trana6a-Default-string kernel: [  238.833308] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
2024-05-27T09:52:30.369064+08:00 trana6a-Default-string kernel: [  238.833309] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
2024-05-27T09:52:30.369065+08:00 trana6a-Default-string kernel: [  238.833310] FS:  00007f0b973f86c0(0000) GS:ffff8e96ff840000(0000) knlGS:0000000000000000
2024-05-27T09:52:30.369065+08:00 trana6a-Default-string kernel: [  238.833311] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-05-27T09:52:30.369066+08:00 trana6a-Default-string kernel: [  238.833312] CR2: ffffb699c264ff58 CR3: 000000010ac06000 CR4: 0000000000750ee0
2024-05-27T09:52:30.369066+08:00 trana6a-Default-string kernel: [  238.833313] PKRU: 55555554
2024-05-27T09:52:30.369067+08:00 trana6a-Default-string kernel: [  238.833314] note: python[2945] exited with irqs disabled

My environment when using bert-vits2:

torch                     2.0.0
torchaudio                2.0.0
torchvision               0.15.0
vector_quantize_pytorch   1.12.17

When I run

import torch;print(torch.version.cuda)

the log shows

2024-05-29T17:01:41.081082+08:00 trana6a-Default-string kernel: [198781.508082] pip[1071441]: segfault at 0 ip 0000000000000000 sp 00007ffc96360338 error 14 in python3.10[400000+1f000] likely on CPU 8 (core 16, socket 0)
2024-05-29T17:01:41.081093+08:00 trana6a-Default-string kernel: [198781.508089] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
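Since the log excerpts in this thread are long, it may help to pull out only the lines that mark a crash, together with their timestamps. The sketch below assumes the line layout seen above (`timestamp host kernel: [uptime] message`). The marker strings are taken from these excerpts and are not a complete list of kernel error messages, and the name `find_crash_events` is hypothetical.

```python
import re

# Matches lines such as:
# 2024-05-15T10:09:25.118507+08:00 host kernel: [  826.919949] UBSAN: ...
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+) (?P<host>\S+) kernel: "
    r"\[\s*(?P<uptime>\d+\.\d+)\] (?P<message>.*)$"
)

# Substrings that appear in the crash-related lines quoted in this thread
MARKERS = ("UBSAN:", "segfault at", "exited with irqs disabled")


def find_crash_events(lines):
    """Return (timestamp, message) pairs for lines containing a crash marker."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # not in the expected kern.log layout
        message = match.group("message")
        if any(marker in message for marker in MARKERS):
            events.append((match.group("timestamp"), message.strip()))
    return events
```

Passing an open `/var/log/kern.log` file object to `find_crash_events` yields one entry per crash marker, which can then be compared with the time each Python program was started.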