madMAx43v3r / chia-gigahorse


Cuda V3 start to fail with latest Nvidia drivers #301

Closed cdgraff closed 7 months ago

cdgraff commented 7 months ago
Gigahorse 3.0 k32 CUDA plotter - 94d0ea0
Plot Format: mmx-v3.0
Network Port: 8444 [chia]
No. GPUs: 1
No. Streams: 4
Direct IO: No
Final Destination: /chia/farm4/
Final Destination: /chia/farm2/
Final Destination: /chia/farm3/
Final Destination: /chia/farm1/
Bucket Chunk Size: 8 MiB
Max Pinned Memory: 480 GiB
Number of Plots: infinite
Initialization took 0.15 sec
Crafting plot 1 out of -1 (2024/04/07 01:53:38)
Phase 1 took 143.429 sec, 4286910974 proofs (0.998124)
Phase 2 took 5.557 sec, 7.90309 GB/s up, 0.182787 GB/s down
[P3] Setup took 0.041 sec
P3 upload thread failed with: CUDA error 719: unspecified launch failure
P3 upload thread failed with: P3 download thread failed with: CUDA error 719: unspecified launch failureP3 download thread failed with: CUDA error 719: unspecified launch failureCUDA error 719: unspecified launch failure
P3 upload thread failed with: CUDA error 719: unspecified launch failure
P3 upload thread failed with: CUDA error 719: unspecified launch failure

terminate called after throwing an instance of 'std::runtime_error'
terminate called recursively
terminate called recursively
terminate called recursively
  what():

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
NVIDIA GeForce RTX 3090  

root 13332 24.2 0.0 0 0 pts/2 Zl+ 01:53 13:20 [cuda_plot_k32_v] <defunct>

cdgraff commented 7 months ago

Dmesg output:

[14328.198496] NVRM: GPU at PCI:0000:0a:00: GPU-760d0170-b7c5-6817-9414-6cc7112645de
[14328.198504] NVRM: Xid (PCI:0000:0a:00): 31, pid=13332, name=cuda_plot_k32_v, Ch 00000010, intr 00000000. MMU Fault: ENGINE CE2 HUBCLIENT_CE0 faulted @ 0x7f29_5b8c9000. Fault is of type FAULT_PRIV_VIOLATION ACCESS_TYPE_VIRT_READ
[14370.687545] NVRM: Xid (PCI:0000:0a:00): 62, pid='<unknown>', name=<unknown>, 2023b230 20238a3a 2023e8b6 2023eac4 2023ec3c 2023ea86 00000000 00000000
[14370.688139] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000008
[14372.202098] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000009
[14372.273528] sched: RT throttling activated
[14376.322743] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 0000000a
[14380.322701] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 0000000b
[14384.322652] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 0000000c
[14388.322709] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 0000000d
[14392.322666] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 0000000e
[14397.322674] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 0000000f
[14400.337883] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000000
[14400.338789] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000001
[14400.339670] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000002
[14400.340552] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000003
[14400.341457] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000004
[14400.342350] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000005
[14400.343255] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000006
[14400.344143] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000008
[14400.345119] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000009
[14400.346101] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 0000000a
[14400.347073] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 0000000b
[14400.348043] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 0000000c
[14400.349013] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 0000000d
[14400.349995] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 0000000e
[14400.350969] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 0000000f
[14400.351941] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000010
[14400.352859] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000011
[14400.353779] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000012
[14400.354696] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000013
[14400.355609] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000014
[14400.356533] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000015
[14400.357455] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000016
[14400.358374] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000017
[14400.359865] NVRM: Xid (PCI:0000:0a:00): 31, pid=13332, name=cuda_plot_k32_v, Ch 00000014, intr 00000000. MMU Fault: ENGINE HOST3 HUBCLIENT_ESC faulted @ 0xa4_1bf48000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[16996.831966] watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [cuda_plot_k32_v:13491]
[16996.916195] Modules linked in: nvidia_uvm(POE) xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay cdc_ether usbnet r8152 mii nvidia_drm(POE) nvidia_modeset(POE) iwlmvm binfmt_misc mac80211 nvidia(POE) snd_hda_codec_realtek libarc4 snd_hda_codec_generic ledtrig_audio intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common snd_hda_intel nls_iso8859_1 iwlwifi edac_mce_amd snd_intel_dspcfg snd_intel_sdw_acpi btusb snd_hda_codec btrtl btbcm drm_kms_helper snd_hda_core btintel kvm cec snd_hwdep snd_pcm rc_core bluetooth joydev fb_sys_fops snd_timer syscopyarea cfg80211 input_leds rapl ecdh_generic sysfillrect snd ecc eeepc_wmi sysimgblt wmi_bmof soundcore ccp k10temp mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr drm efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10
[16996.916230]  raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_logitech_hidpp hid_logitech_dj hid_generic usbhid mfd_aaeon hid crct10dif_pclmul asus_wmi crc32_pclmul sparse_keymap ghash_clmulni_intel sha256_ssse3 video sha1_ssse3 aesni_intel platform_profile crypto_simd cryptd r8169 nvme ahci xhci_pci i2c_piix4 libahci nvme_core realtek xhci_pci_renesas wmi
[16996.916247] CPU: 8 PID: 13491 Comm: cuda_plot_k32_v Tainted: P           OE     5.15.0-101-generic #111-Ubuntu
[16996.916249] Hardware name: System manufacturer System Product Name/TUF GAMING X570-PLUS (WI-FI), BIOS 4805 08/14/2023
[16996.916250] RIP: 0010:_nv041366rm+0x3b/0x80 [nvidia]
[16996.916435] Code: d3 89 de 48 8d 55 0f c6 45 0f 00 e8 9f ae 59 ff 80 7d 0f 00 41 89 c4 75 11 41 39 5d 10 76 20 49 8b 45 00 c1 eb 02 44 8b 24 98 <5b> 44 89 e0 41 5c 41 5d 48 83 c5 10 c3 0f 1f 84 00 00 00 00 00 be
[16996.916436] RSP: 0018:ffffb58c03adb740 EFLAGS: 00000212
[16996.916437] RAX: ffffb58c05000000 RBX: 00000000002e0c2c RCX: 0000000000b830b0
[16996.916438] RDX: ffff9515f7885c1f RSI: 0000000000b830b0 RDI: ffff951630af8008
[16996.916438] RBP: ffff9515f7885c10 R08: 0000000000000020 R09: 0000000000000000
[16996.916439] R10: 0000000000b830b0 R11: 0000000000000000 R12: 0000000080010005
[16996.916439] R13: ffff951630af8be8 R14: 0000000000000000 R15: 0000000000000000
[16996.916440] FS:  0000000000000000(0000) GS:ffff9533aec00000(0000) knlGS:0000000000000000
[16996.916441] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16996.916441] CR2: 000055d4d8339080 CR3: 0000001157010000 CR4: 0000000000750ee0
[16996.916442] PKRU: 55555554
[16996.916443] Call Trace:
[16996.916444]  <IRQ>
[16996.916446]  ? show_trace_log_lvl+0x1d6/0x2ea
[16996.916450]  ? show_trace_log_lvl+0x1d6/0x2ea
[16996.916451]  ? show_regs.part.0+0x23/0x29
[16996.916452]  ? show_regs.cold+0x8/0xd
[16996.916454]  ? watchdog_timer_fn+0x1be/0x220
[16996.916456]  ? lockup_detector_update_enable+0x60/0x60
[16996.916456]  ? __hrtimer_run_queues+0x107/0x230
[16996.916459]  ? clockevents_program_event+0xad/0x130
[16996.916461]  ? hrtimer_interrupt+0x101/0x220
[16996.916462]  ? __sysvec_apic_timer_interrupt+0x61/0xe0
[16996.916464]  ? sysvec_apic_timer_interrupt+0x7b/0x90
[16996.916467]  </IRQ>
[16996.916467]  <TASK>
[16996.916467]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[16996.916470]  ? _nv041366rm+0x3b/0x80 [nvidia]
[16996.916626]  ? _nv014101rm+0x10f/0x170 [nvidia]
[16996.916882]  ? _nv034250rm+0xd7/0x120 [nvidia]
[16996.917130]  ? _nv034442rm+0x1df/0x360 [nvidia]
[16996.917377]  ? _nv029355rm+0xf0/0x1f0 [nvidia]
[16996.917516]  ? _nv029355rm+0xc0/0x1f0 [nvidia]
[16996.917652]  ? _nv012160rm+0x4f8/0x5c0 [nvidia]
[16996.917787]  ? _nv022499rm+0x3b3/0x750 [nvidia]
[16996.918003]  ? _nv033439rm+0x14f/0x340 [nvidia]
[16996.918162]  ? _nv033439rm+0x11f/0x340 [nvidia]
[16996.918320]  ? _nv036579rm+0x393/0x460 [nvidia]
[16996.918464]  ? _nv017155rm+0xa7/0x1a0 [nvidia]
[16996.918656]  ? _nv047093rm+0x181/0x1d0 [nvidia]
[16996.918838]  ? _nv045287rm+0xdd/0x130 [nvidia]
[16996.918981]  ? _nv045288rm+0x53/0x80 [nvidia]
[16996.919122]  ? _nv045286rm+0x2f/0x40 [nvidia]
[16996.919261]  ? _nv045267rm+0x80/0x80 [nvidia]
[16996.919400]  ? _nv039055rm+0x9f/0x100 [nvidia]
[16996.919540]  ? _nv039055rm+0x6d/0x100 [nvidia]
[16996.919679]  ? _nv039010rm+0x25b/0x4f0 [nvidia]
[16996.919820]  ? rm_gpu_ops_channel_destroy+0x20/0x60 [nvidia]
[16996.919972]  ? nvUvmGetSafeStack+0x93/0xc0 [nvidia]
[16996.920091]  ? nvUvmInterfaceChannelDestroy+0x23/0x80 [nvidia]
[16996.920209]  ? channel_destroy+0xaf/0x220 [nvidia_uvm]
[16996.920223]  ? channel_pool_destroy+0x2f/0x90 [nvidia_uvm]
[16996.920233]  ? uvm_channel_manager_destroy.part.0+0x7b/0xc0 [nvidia_uvm]
[16996.920242]  ? uvm_channel_manager_destroy+0x13/0x20 [nvidia_uvm]
[16996.920251]  ? remove_gpu+0x1e5/0x440 [nvidia_uvm]
[16996.920261]  ? uvm_gpu_release_locked+0x2c/0x70 [nvidia_uvm]
[16996.920271]  ? uvm_va_space_destroy+0x57d/0x6d0 [nvidia_uvm]
[16996.920281]  ? uvm_release.constprop.0+0xa3/0x130 [nvidia_uvm]
[16996.920290]  ? uvm_release_entry.part.0.isra.0+0x80/0xb0 [nvidia_uvm]
[16996.920299]  ? security_file_free+0x54/0x60
[16996.920301]  ? kmem_cache_free+0x272/0x290
[16996.920303]  ? __call_rcu+0xa8/0x270
[16996.920305]  ? uvm_release_entry+0x2a/0x30 [nvidia_uvm]
[16996.920314]  ? __fput+0x9f/0x280
[16996.920316]  ? ____fput+0xe/0x20
[16996.920317]  ? task_work_run+0x6d/0xb0
[16996.920319]  ? do_exit+0x217/0x3c0
[16996.920321]  ? do_group_exit+0x3b/0xb0
[16996.920322]  ? get_signal+0x150/0x900
[16996.920323]  ? send_signal+0xe9/0x130
[16996.920325]  ? arch_do_signal_or_restart+0xde/0x100
[16996.920327]  ? do_send_specific+0x61/0xa0
[16996.920329]  ? exit_to_user_mode_loop+0xc4/0x160
[16996.920331]  ? exit_to_user_mode_prepare+0xa0/0xb0
[16996.920332]  ? syscall_exit_to_user_mode+0x27/0x50
[16996.920333]  ? __x64_sys_tgkill+0x29/0x40
[16996.920334]  ? do_syscall_64+0x69/0xc0
[16996.920335]  ? syscall_exit_to_user_mode+0x35/0x50
[16996.920336]  ? __do_sys_getpid+0x1e/0x30
[16996.920337]  ? do_syscall_64+0x69/0xc0
[16996.920338]  ? exit_to_user_mode_prepare+0x37/0xb0
[16996.920339]  ? irqentry_exit_to_user_mode+0x17/0x20
[16996.920340]  ? irqentry_exit+0x1d/0x30
[16996.920342]  ? exc_page_fault+0x89/0x170
[16996.920343]  ? entry_SYSCALL_64_after_hwframe+0x62/0xcc
[16996.920344]  </TASK>

madMAx43v3r commented 7 months ago

Did you try a reboot? unspecified launch failure is usually a driver issue.
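To see at a glance which Xid codes the driver reported and which process they were attributed to, the dmesg output above can be tallied with a small script. This is only a sketch: the regular expression is an assumption based on the `NVRM: Xid (...)` lines posted in this thread, and the sample lines are trimmed copies of them. The meaning of each Xid number should be looked up in NVIDIA's Xid documentation.

```python
import re
from collections import Counter

# Assumed line shape, taken from the dmesg lines in this thread, e.g.
# [14328.198504] NVRM: Xid (PCI:0000:0a:00): 31, pid=13332, name=cuda_plot_k32_v, Ch 00000010
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>[^,]+), name=(?P<name>[^,]+)"
)


def tally_xids(dmesg_text):
    """Count (xid, process name) pairs found in dmesg output."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            counts[(int(match.group("xid")), match.group("name"))] += 1
    return counts


# Trimmed copies of lines from the first dmesg dump in this issue.
SAMPLE = """\
[14328.198504] NVRM: Xid (PCI:0000:0a:00): 31, pid=13332, name=cuda_plot_k32_v, Ch 00000010
[14370.687545] NVRM: Xid (PCI:0000:0a:00): 62, pid='<unknown>', name=<unknown>, 2023b230
[14370.688139] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000008
[14372.202098] NVRM: Xid (PCI:0000:0a:00): 45, pid='<unknown>', name=<unknown>, Ch 00000009
"""

if __name__ == "__main__":
    for (xid, name), count in sorted(tally_xids(SAMPLE).items()):
        print(f"Xid {xid:>3}  {name:<20} x{count}")
```

In practice the text would come from `dmesg` (for example saved to a file first) rather than from the inline sample.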

cdgraff commented 7 months ago

Yes! I rebooted, and I also swapped the disk and did a clean Ubuntu install. The problem started on the previous disk, where I tried changing the Nvidia driver (too many installations of different drivers on the same OS). As nothing fixed the issue, I removed that disk and installed a new one, clean, with just that driver version.

On the previous installation I was using 525.x, and the issues started after I upgraded to the latest Nvidia driver suggested by Ubuntu.

I tested the GPU by running the "ProofOfSpace" tool:

[chiapos] Using 16 / 24 CPU threads
[chiapos] Using 1 / 1 CUDA devices
Total success:  1020 / 1000, 102 %
Total failures: 0 / 1000, 0 %
Total filtered: 1020 / 1020, 100 %
Partial Difficulty: 5000 (0.011114 % chance)
Max Farm Size @512: 0.807275 PiB (physical)
Max Farm Size @256: 0.403638 PiB (physical)
Max Farm Size @128: 0.201819 PiB (physical)
Average time to compute quality: 0.198631 sec
Maximum time to compute full proof: 0 sec

cdgraff commented 7 months ago

After ProofOfSpace finished, I got this in dmesg:


[41827.352988] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[41827.356851] nvidia-uvm: Loaded the UVM driver, major device number 506.
[42431.725053] process 'chiapos/linux/x86_64/ProofOfSpace' started with executable stack
[42786.945786] NVRM: GPU at PCI:0000:0a:00: GPU-760d0170-b7c5-6817-9414-6cc7112645de
[42786.945788] NVRM: Xid (PCI:0000:0a:00): 62, pid='<unknown>', name=<unknown>, 20247fda 2024118a 20255152 202553d2 2021a5be 00000000 00000000 00000000
[42846.417300] BUG: kernel NULL pointer dereference, address: 0000000000000000
[42846.417302] #PF: supervisor read access in kernel mode
[42846.417303] #PF: error_code(0x0000) - not-present page
[42846.417303] PGD 0 P4D 0
[42846.417305] Oops: 0000 [#1] SMP NOPTI
[42846.417307] CPU: 18 PID: 1040 Comm: nv_open_q Tainted: P           OE     5.15.0-101-generic #111-Ubuntu
[42846.417308] Hardware name: System manufacturer System Product Name/TUF GAMING X570-PLUS (WI-FI), BIOS 4805 08/14/2023
[42846.417309] RIP: 0010:_nv043111rm+0x6c/0x5d0 [nvidia]
[42846.417554] Code: 83 22 07 00 00 3c ff 0f 84 e1 03 00 00 8d 50 01 88 93 22 07 00 00 84 c0 75 d3 80 bb 21 07 00 00 00 74 ca 48 8b 83 10 07 00 00 <80> 38 00 0f 84 fb 02 00 00 66 83 78 1a ff 0f 84 f0 02 00 00 31 f6
[42846.417556] RSP: 0018:ffffa32d42477a78 EFLAGS: 00010202
[42846.417557] RAX: 0000000000000000 RBX: ffff8b3159f20008 RCX: 0000000000000000
[42846.417558] RDX: 0000000000000001 RSI: ffff8b3159f20008 RDI: ffff8b3225060008
[42846.417558] RBP: ffff8b45d7d1ace0 R08: ffff8b45d7d1a8f0 R09: ffff8b45d7d1a8f0
[42846.417559] R10: 00000000282e5738 R11: 000000006612bb94 R12: 0000000000000000
[42846.417559] R13: ffff8b3225060008 R14: 0000000000000001 R15: ffff8b45da7b0010
[42846.417560] FS:  0000000000000000(0000) GS:ffff8b4feee80000(0000) knlGS:0000000000000000
[42846.417561] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[42846.417561] CR2: 0000000000000000 CR3: 00000001071ac000 CR4: 0000000000750ee0
[42846.417562] PKRU: 55555554
[42846.417562] Call Trace:
[42846.417564]  <TASK>
[42846.417565]  ? show_trace_log_lvl+0x1d6/0x2ea
[42846.417569]  ? show_trace_log_lvl+0x1d6/0x2ea
[42846.417571]  ? show_regs.part.0+0x23/0x29
[42846.417572]  ? __die_body.cold+0x8/0xd
[42846.417573]  ? __die+0x2b/0x37
[42846.417574]  ? page_fault_oops+0x13b/0x170
[42846.417577]  ? os_acquire_spinlock+0x12/0x30 [nvidia]
[42846.417694]  ? do_user_addr_fault+0x321/0x670
[42846.417696]  ? os_acquire_spinlock+0x12/0x30 [nvidia]
[42846.417811]  ? exc_page_fault+0x77/0x170
[42846.417814]  ? asm_exc_page_fault+0x27/0x30
[42846.417817]  ? _nv043111rm+0x6c/0x5d0 [nvidia]
[42846.418035]  ? _nv043111rm+0x1f/0x5d0 [nvidia]
[42846.418249]  ? _nv013553rm+0xda/0x210 [nvidia]
[42846.418465]  ? _nv044241rm+0x1fd/0x260 [nvidia]
[42846.418713]  ? _nv042300rm+0xd1/0x1d0 [nvidia]
[42846.418951]  ? _nv013342rm+0x5a/0xd0 [nvidia]
[42846.419182]  ? _nv044241rm+0x1fd/0x260 [nvidia]
[42846.419435]  ? _nv011236rm+0xe1/0x160 [nvidia]
[42846.419601]  ? discard_slab+0x38/0x60
[42846.419603]  ? _nv044241rm+0x1fd/0x260 [nvidia]
[42846.419857]  ? _nv050797rm+0x20/0x2e0 [nvidia]
[42846.420071]  ? _nv014767rm+0x50/0x100 [nvidia]
[42846.420280]  ? _nv044241rm+0x1fd/0x260 [nvidia]
[42846.420536]  ? _nv014811rm+0xf1/0x2f0 [nvidia]
[42846.420753]  ? _nv044241rm+0x1fd/0x260 [nvidia]
[42846.421010]  ? _nv017517rm+0x35/0x110 [nvidia]
[42846.421177]  ? _nv018637rm+0x13b/0x3d0 [nvidia]
[42846.421337]  ? _nv026649rm+0x97/0x1a0 [nvidia]
[42846.421589]  ? _nv000773rm+0x1b3/0x313 [nvidia]
[42846.421741]  ? _nv000720rm+0x482/0x20e0 [nvidia]
[42846.421886]  ? rm_init_adapter+0xcd/0xf0 [nvidia]
[42846.422033]  ? ttwu_queue_wakelist+0x131/0x1c0
[42846.422035]  ? wake_up_process+0x15/0x20
[42846.422037]  ? nv_open_device+0x5a7/0xab0 [nvidia]
[42846.422155]  ? nvidia_open_deferred+0x39/0xa0 [nvidia]
[42846.422272]  ? _main_loop+0x8c/0x140 [nvidia]
[42846.422391]  ? nvidia_modeset_resume+0x30/0x30 [nvidia]
[42846.422509]  ? kthread+0x12a/0x150
[42846.422511]  ? set_kthread_struct+0x50/0x50
[42846.422512]  ? ret_from_fork+0x22/0x30
[42846.422515]  </TASK>
[42846.422515] Modules linked in: nvidia_uvm(POE) xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay nvidia_drm(POE) nvidia_modeset(POE) iwlmvm binfmt_misc mac80211 snd_hda_codec_realtek nvidia(POE) nls_iso8859_1 libarc4 snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi intel_rapl_msr intel_rapl_common snd_hda_intel snd_intel_dspcfg edac_mce_amd snd_intel_sdw_acpi snd_hda_codec btusb drm_kms_helper btrtl btbcm snd_hda_core iwlwifi btintel cec snd_hwdep bluetooth snd_pcm rc_core kvm cfg80211 ecdh_generic fb_sys_fops syscopyarea joydev snd_timer ecc sysfillrect input_leds snd sysimgblt rapl soundcore ccp k10temp eeepc_wmi wmi_bmof mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr drm efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov
[42846.422544]  async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid mfd_aaeon asus_wmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sparse_keymap sha256_ssse3 video sha1_ssse3 aesni_intel platform_profile crypto_simd ahci r8169 cryptd nvme xhci_pci i2c_piix4 libahci xhci_pci_renesas nvme_core realtek wmi
[42846.422557] CR2: 0000000000000000
[42846.422558] ---[ end trace 441ac92c3b0c8fb1 ]---
[42846.450171] pstore: crypto_comp_compress failed, ret = -22!
[42846.562084] RIP: 0010:_nv043111rm+0x6c/0x5d0 [nvidia]
[42846.562317] Code: 83 22 07 00 00 3c ff 0f 84 e1 03 00 00 8d 50 01 88 93 22 07 00 00 84 c0 75 d3 80 bb 21 07 00 00 00 74 ca 48 8b 83 10 07 00 00 <80> 38 00 0f 84 fb 02 00 00 66 83 78 1a ff 0f 84 f0 02 00 00 31 f6
[42846.562318] RSP: 0018:ffffa32d42477a78 EFLAGS: 00010202
[42846.562319] RAX: 0000000000000000 RBX: ffff8b3159f20008 RCX: 0000000000000000
[42846.562319] RDX: 0000000000000001 RSI: ffff8b3159f20008 RDI: ffff8b3225060008
[42846.562320] RBP: ffff8b45d7d1ace0 R08: ffff8b45d7d1a8f0 R09: ffff8b45d7d1a8f0
[42846.562320] R10: 00000000282e5738 R11: 000000006612bb94 R12: 0000000000000000
[42846.562320] R13: ffff8b3225060008 R14: 0000000000000001 R15: ffff8b45da7b0010
[42846.562321] FS:  0000000000000000(0000) GS:ffff8b4feee80000(0000) knlGS:0000000000000000
[42846.562322] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[42846.562322] CR2: 0000000000000000 CR3: 00000001071ac000 CR4: 0000000000750ee0
[42846.562323] PKRU: 55555554

cdgraff commented 7 months ago

I can confirm that the GPU is working well: I tested running AI models on the same hardware and everything works as expected.

madMAx43v3r commented 7 months ago

BUG: kernel NULL pointer dereference, address: 0000000000000000

Looks like a driver bug to me. Can you revert to an older driver like 535?

sudo apt remove nvidia-driver-*
sudo apt install nvidia-driver-535

And disable auto update: sudo apt remove unattended-upgrades
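After a downgrade and reboot it can be worth confirming which driver branch is actually loaded, since a leftover package or an automatic update could silently switch it back. A minimal sketch, assuming `nvidia-smi` is on the PATH and using its `--query-gpu=driver_version` query; the helper names and the expected branch `535` are illustrative only:

```python
import subprocess


def parse_major(version):
    """Return the branch number of a driver version string such as '535.161.07'."""
    return int(version.strip().split(".")[0])


def installed_driver_version():
    """Ask nvidia-smi for the driver version of the first GPU, or None on failure."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    lines = result.stdout.splitlines()
    return lines[0].strip() if lines else None


if __name__ == "__main__":
    expected_major = 535  # assumption: the branch suggested in this comment
    version = installed_driver_version()
    if version is None:
        print("nvidia-smi not available or failed")
    elif parse_major(version) == expected_major:
        print(f"driver {version} is on the expected {expected_major} branch")
    else:
        print(f"driver {version} is NOT on the {expected_major} branch")
```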

cdgraff commented 7 months ago

It looks different, but it still fails with an error:

Driver Version: 535.161.07 CUDA Version: 12.2

Created disk buffer /mnt/raid0/cuda_plot_tmp2_1712586287418419.tmp
[P1] Setup took 0.318 sec
[P1] Table 1 took 15.579 sec, 4294967296 entries, 0 GB/s up, 2.31087 GB/s down
[P1] Table 2 took 16.077 sec, 4294833302 entries, 1.99042 GB/s up, 4.19861 GB/s down
[P1] Table 3 took 21.048 sec, 4294617404 entries, 2.85054 GB/s up, 5.34497 GB/s down
[P1] Table 4 took 28.139 sec, 4294113857 entries, 3.5535 GB/s up, 5.27741 GB/s down
[P1] Table 5 took 35.229 sec, 4293218602 entries, 3.74617 GB/s up, 4.85398 GB/s down
[P1] Table 6 took 14.094 sec, 4291356755 entries, 5.39017 GB/s up, 7.56988 GB/s down
[P1] Table 7 took 9.449 sec, 4287590547 entries, 8.88236 GB/s up, 6.66748 GB/s down
Phase 1 took 140.132 sec, 4287590547 proofs (0.998282)
mark_used(): out of range: 5448287559
mark_used(): out of range: 5448287942
Phase 2 took 5.548 sec, 7.91716 GB/s up, 0.183083 GB/s down
[P3] Setup took 0.041 sec
P3 download thread failed with: CUDA error 719: unspecified launch failureP3 download thread failed with: CUDA error 719: unspecified launch failure
P3 upload thread failed with: CUDA error 719: unspecified launch failureP3 download thread failed with: P3 upload thread failed with: P3 upload thread failed with: CUDA error 719: unspecified launch failure

CUDA error 719: unspecified launch failure
CUDA error 719: unspecified launch failure
P3 upload thread failed with: CUDA error 719: unspecified launch failureP3 download thread failed with: CUDA error 719: unspecified launch failure

terminate called recursively
terminate called after throwing an instance of 'std::runtime_error'
terminate called recursively
terminate called recursively
terminate called recursively

Dmesg:


[  836.759676] NVRM: GPU at PCI:0000:0a:00: GPU-760d0170-b7c5-6817-9414-6cc7112645de
[  836.759680] NVRM: Xid (PCI:0000:0a:00): 62, pid='<unknown>', name=<unknown>, 00000000 00000000 00000000 00000000 00220030 00300000 00000000 00000000
[  836.762804] NVRM: Xid (PCI:0000:0a:00): 45, pid=1125, name=ollama, Ch 00000008
[  836.766227] NVRM: Xid (PCI:0000:0a:00): 45, pid=1125, name=ollama, Ch 00000009
[  836.769523] NVRM: Xid (PCI:0000:0a:00): 45, pid=1125, name=ollama, Ch 0000000a
[  836.772789] NVRM: Xid (PCI:0000:0a:00): 45, pid=1125, name=ollama, Ch 0000000b
[  836.776044] NVRM: Xid (PCI:0000:0a:00): 45, pid=1125, name=ollama, Ch 0000000c
[  836.779293] NVRM: Xid (PCI:0000:0a:00): 45, pid=1125, name=ollama, Ch 0000000d
[  836.782548] NVRM: Xid (PCI:0000:0a:00): 45, pid=1125, name=ollama, Ch 0000000e
[  836.785801] NVRM: Xid (PCI:0000:0a:00): 45, pid=1125, name=ollama, Ch 0000000f
[  836.789057] NVRM: Xid (PCI:0000:0a:00): 45, pid=1996, name=cuda_plot_k32_v, Ch 00000018
[  836.897654] NVRM: Xid (PCI:0000:0a:00): 45, pid=1996, name=cuda_plot_k32_v, Ch 00000019
[  836.898422] NVRM: Xid (PCI:0000:0a:00): 45, pid=1996, name=cuda_plot_k32_v, Ch 0000001a
[  836.899182] NVRM: Xid (PCI:0000:0a:00): 45, pid=1996, name=cuda_plot_k32_v, Ch 0000001b
[  836.899945] NVRM: Xid (PCI:0000:0a:00): 45, pid=1996, name=cuda_plot_k32_v, Ch 0000001c
[  836.900716] NVRM: Xid (PCI:0000:0a:00): 45, pid=1996, name=cuda_plot_k32_v, Ch 0000001d
[  836.901482] NVRM: Xid (PCI:0000:0a:00): 45, pid=1996, name=cuda_plot_k32_v, Ch 0000001e
[  836.902244] NVRM: Xid (PCI:0000:0a:00): 45, pid=1996, name=cuda_plot_k32_v, Ch 0000001f
[  866.948561] NVRM: Xid (PCI:0000:0a:00): 31, pid=1996, name=cuda_plot_k32_v, Ch 00000027, intr 00000000. MMU Fault: ENGINE CE3 HUBCLIENT_CE1 faulted @ 0x7ef8_6458b000. Fault is of type FAULT_RO_VIOLATION ACCESS_TYPE_VIRT_WRITE

madMAx43v3r commented 7 months ago

It happens every time on first plot? And single 3090? Is it overclocked?

mark_used(): out of range: 5448287559
mark_used(): out of range: 5448287942

Because this looks like unstable VRAM.

It could also be a failing -2 drive, you're plotting with 128G RAM right?
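A quick way to see why those two values look wrong: every Phase 1 table in the log has at most 4294967296 (2^32) entries, and both reported indices are well above that. The sketch below only does this arithmetic. What `mark_used()` actually checks against is not shown in this thread, so using 2^32 as the limit is an assumption, and the helper names are made up for illustration.

```python
import re

K = 32
# Assumed upper bound: 2**32, which matches the Table 1 entry count in the log.
MAX_ENTRIES = 1 << K

OUT_OF_RANGE_RE = re.compile(r"mark_used\(\): out of range: (\d+)")


def out_of_range_indices(log_text):
    """Extract the index values reported by mark_used() from plotter output."""
    return [int(m.group(1)) for m in OUT_OF_RANGE_RE.finditer(log_text)]


def excess_over_limit(index, limit=MAX_ENTRIES):
    """How far an index lies beyond the assumed table size."""
    return index - limit


# The two lines quoted above.
LOG = """\
mark_used(): out of range: 5448287559
mark_used(): out of range: 5448287942
"""

if __name__ == "__main__":
    for index in out_of_range_indices(LOG):
        print(f"{index} exceeds 2^{K} by {excess_over_limit(index)}")
```

Indices this far outside the table cannot come from a valid entry, which is consistent with corrupted data somewhere between VRAM, host RAM, and the temp drive; the arithmetic alone does not say which.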

cdgraff commented 7 months ago

Yes, always on the first plot, single 3090, not overclocked, and yes, 128 GB RAM with RAID0.

On the same hardware I plotted around 2k C31 plots during the last 10 days. I normally download the latest version from Git and test it... I don't know if it's related to that, or whether something could be broken somewhere else. I'll try without RAID for -2.


# mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Mon Apr  8 14:22:55 2024
        Raid Level : raid0
        Array Size : 1953260544 (1862.77 GiB 2000.14 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Mon Apr  8 14:22:55 2024
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

            Layout : -unknown-
        Chunk Size : 512K

Consistency Policy : none

              Name : edge-ml-1:0  (local to host edge-ml-1)
              UUID : c8c16ec2:8bc43c31:90f7d8a0:cf54af07
            Events : 0

    Number   Major   Minor   RaidDevice State
       0     259        0        0      active sync   /dev/nvme0n1
       1     259        1        1      active sync   /dev/nvme1n1

madMAx43v3r commented 7 months ago

Try RAID1 to see if the NVMe are still good.
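Independently of the RAID level, a rough write-then-read check of the temp directory can catch gross corruption. The following is a sketch, not a proper drive test: the function name is made up, and because the read may be served from the page cache rather than the device, a pass proves little while a mismatch is a strong hint.

```python
import hashlib
import os
import tempfile


def write_and_verify(directory, size_mib=64, chunk_mib=8):
    """Write random data to a temp file in `directory`, read it back,
    and compare SHA-256 digests. Returns True when they match."""
    chunk = chunk_mib * 1024 * 1024
    written = hashlib.sha256()
    fd, path = tempfile.mkstemp(dir=directory, suffix=".verify")
    try:
        with os.fdopen(fd, "wb") as f:
            remaining = size_mib * 1024 * 1024
            while remaining > 0:
                block = os.urandom(min(chunk, remaining))
                f.write(block)
                written.update(block)
                remaining -= len(block)
            f.flush()
            os.fsync(f.fileno())  # push the data to the device before reading
        read_back = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                read_back.update(block)
        return written.hexdigest() == read_back.hexdigest()
    finally:
        os.remove(path)


if __name__ == "__main__":
    # The system temp dir is used here only so the example runs anywhere.
    ok = write_and_verify(tempfile.gettempdir(), size_mib=8)
    print("match" if ok else "MISMATCH")
```

To exercise the plotting drive, the directory would be the -2 mount point (`/mnt/raid0` in the log above) with a size larger than available RAM.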

cdgraff commented 7 months ago

I tested each disk individually without RAID, and also tested with an SSD instead of the NVMe drives; same error.

I'll test swapping the GPU for another one, to rule out something related to the GPU card.

cdgraff commented 7 months ago

OK, I replaced the 3090 with a 3060 and the failures stopped. But the 3090 appears to work without problems when mining other coins and when used for AI, which makes me think it may not be broken but incompatible with something... I'll close the ticket, as this looks like something really particular to my hardware.