vdsm / virtual-dsm

Virtual DSM in a Docker container.
MIT License

Vdsm auto shutdown problem #367

Closed SoraKasvgano closed 11 months ago

SoraKasvgano commented 11 months ago

Thanks for this great project! Everything works fine except that DSM shuts itself down automatically, almost every 6 hours.

LOG CENTER:

user:SYSTEM_ADMIN System started counting down to shutdown

After checking the logs, I found NMI problems. How can I prevent it from shutting down? Thanks!

[112107.911050] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 41s! [snmpd:12366]
[112108.039582] Modules linked in: fuse vhost_scsi(O) vhost(O) tcm_loop(O) iscsi_target_mod(O) target_core_user(O) target_core_ep(O) target_core_multi_file(O) target_core_file(O) target_core_iblock(O) target_core_mod(O) syno_extent_pool(PO) rodsp_ep(O) udf isofs synoacl_vfs(O) btrfs ecryptfs zstd_decompress zstd_compress xxhash xor raid6_pq 8021q usb_storage aesni_intel glue_helper lrw gf128mul ablk_helper kvmx64_synobios(O) hid_generic usbhid hid usblp ixgbevf(O) igbvf(O) i40evf(O) bnxt_en(O) qede(O) qed(O) be2net(O) zlib_deflate dm_crypt sg dm_snapshot dm_bufio dm_mod crc_itu_t crc_ccitt psnap p8022 llc hfsplus md4 hmac sit tunnel4 ipv6 arc4 crc32c_intel cryptd ecb aes_x86_64 authenc des_generic ansi_cprng cts md5 cbc vxlan ip6_udp_tunnel udp_tunnel ip_tunnel zram loop virtio_rng virtio_console virtio_scsi virtio_net virtio_blk virtio_pci virtio_ring virtio sha256_generic synorbd(O) synofsbd(O) etxhci_hcd xhci_hcd uhci_hcd ehci_pci ehci_hcd usbcore usb_common [last unloaded: kvmx64_synobios]
[112108.601035] CPU: 0 PID: 12366 Comm: snmpd Tainted: P           O    4.4.302+ #69057
[112108.603367] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[112108.606621] task: ffff880017d8cd40 ti: ffff880017e98000 task.ti: ffff880017e98000
[112108.609307] RIP: 0010:[<ffffffff81005cc4>]  [<ffffffff81005cc4>] arch_irq_stat_cpu+0x14/0x90
[112108.627036] RSP: 0018:ffff880017e9bd58  EFLAGS: 00010246
[112108.628867] RAX: ffff88003f900000 RBX: 0000000000000001 RCX: 0000000000019d00
[112108.631184] RDX: 0000000000019d00 RSI: 0000000000000000 RDI: 0000000000000001
[112108.633393] RBP: ffff880017e9be38 R08: 000065ec2ec152d5 R09: 0000000000000003
[112108.635695] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000024d4555
[112108.638023] R13: 0000000005f444b9 R14: 00000000000142e0 R15: 0000000000000001
[112108.640183] FS:  00007f67feab4e00(0000) GS:ffff88003f800000(0000) knlGS:0000000000000000
[112108.642596] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[112108.644441] CR2: 00007fe90c0d84d0 CR3: 0000000017d23000 CR4: 00000000003606f0
[112108.648987] Stack:
[112108.649696]  ffffffff811f7d6d 0000000201623380 00000000654a4540 ffff88000771fe80
[112108.652096]  0000000000014340 00000000000da509 00000000051ef236 0000000000000000
[112108.654452]  0000000000000000 0000000000117db9 0000000000000000 000000000087e923
[112108.656839] Call Trace:
[112108.661398]  [<ffffffff811f7d6d>] ? show_stat+0x1ad/0x630
[112108.663535]  [<ffffffff811a9b38>] seq_read+0xa8/0x3d0
[112108.665071]  [<ffffffff811ef340>] proc_reg_read+0x40/0x90
[112108.666928]  [<ffffffff81180369>] ? rw_verify_area+0x49/0xd0
[112108.668753]  [<ffffffff8117f776>] __vfs_read+0x16/0x30
[112108.670251]  [<ffffffff8118047d>] vfs_read+0x8d/0x140
[112108.671810]  [<ffffffff81181361>] SyS_read+0x61/0xd0
[112108.674164]  [<ffffffff8154aa21>] entry_SYSCALL_64_fastpath+0x1e/0x9a
[112108.676120] Code: b0 d2 6f 81 e8 8e 43 1a 00 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 90 89 ff 48 c7 c2 00 9d 01 00 48 8b 04 fd 60 4a 8d 81 48 89 d1 <8b> 7c 08 04 8b 74 08 08 48 01 fe 8b 7c 08 0c 48 01 f7 8b 74 08 
[112108.688665] Sending NMI to other CPUs:
[112108.708235] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.386 msecs
[112108.708964] NMI backtrace for cpu 1
[112108.708964] CPU: 1 PID: 6431 Comm: smbd Tainted: P           O    4.4.302+ #69057
[112108.708964] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[112108.708965] task: ffff88002a3d3140 ti: ffff880028530000 task.ti: ffff880028530000
[112108.708965] RIP: 0010:[<ffffffff812d8ffe>]  [<ffffffff812d8ffe>] __radix_tree_lookup+0x7e/0xa0
[112108.708965] RSP: 0000:ffff880028533d78  EFLAGS: 00000202
[112108.708966] RAX: ffff880005d98d88 RBX: ffffffff8185a700 RCX: 0000000000000000
[112108.708966] RDX: ffff880005d98f10 RSI: 0000000000000001 RDI: ffffea000043a740
[112108.708966] RBP: ffff880028533d88 R08: 0200000000005fec R09: ffff880028533d80
[112108.708966] R10: 0000000000000000 R11: ffff880005d98d88 R12: 0200000000005fec
[112108.708966] R13: 0000000000000000 R14: ffffffff8185a6f8 R15: 0200000000005fec
[112108.708967] FS:  00007fe908878ec0(0000) GS:ffff88003f900000(0000) knlGS:0000000000000000
[112108.708967] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[112108.708967] CR2: 00007fe909bd0818 CR3: 0000000011c9f000 CR4: 00000000003606f0
[112108.708967] Stack:
[112108.708968]  ffffffff812d9033 ffff88003d5a8328 ffff880028533da8 ffffffff8112b4b9
[112108.708968]  00007fe909bd0818 0000000000000000 ffff880028533de8 ffffffff8112c238
[112108.708968]  ffffffff81094881 00007fe909bd0818 0000000000000054 ffff8800379df100
[112108.708968] Call Trace:
[112108.708969]  [<ffffffff812d9033>] ? radix_tree_lookup_slot+0x13/0x20
[112108.708969]  [<ffffffff8112b4b9>] find_get_entry+0x19/0x70
[112108.708969]  [<ffffffff8112c238>] pagecache_get_page+0x28/0x1c0
[112108.708969]  [<ffffffff81094881>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[112108.708970]  [<ffffffff8116a2b7>] lookup_swap_cache+0x27/0x60
[112108.708970]  [<ffffffff81157f49>] handle_mm_fault+0x739/0x1430
[112108.708970]  [<ffffffff8104826d>] __do_page_fault+0x16d/0x3e0
[112108.708970]  [<ffffffff8104854c>] trace_do_page_fault+0x3c/0xf0
[112108.708970]  [<ffffffff8104358f>] do_async_page_fault+0x4f/0x70
[112108.708971]  [<ffffffff8154cc28>] async_page_fault+0x28/0x30
[112108.709019] Code: 00 61 8d 81 72 3f 8d 14 76 8d 4c 12 fa eb 08 83 e9 06 83 ee 01 74 1d 4c 89 c2 49 89 c3 48 d3 ea 83 e2 3f 48 8d 54 d0 28 48 8b 3a <48> 89 f8 48 85 ff 75 dc c3 4d 85 d2 74 03 4d 89 1a 4d 85 c9 74 
❯ Received shutdown request through NMI..
❯ VirtualDSM Agent: Shutting down..

What is this function for?

https://github.com/vdsm/virtual-dsm/blob/master/agent/agent.sh
function checkNMI {

  local nmi
  # strip everything except the digits 1-9, so the result is
  # non-empty whenever any per-CPU NMI counter is nonzero
  nmi=$(cat /proc/interrupts | grep NMI | sed 's/[^1-9]*//g')

  if [ "$nmi" != "" ]; then

    info "Received shutdown request through NMI.."

    /usr/syno/sbin/synoshutdown -s > /dev/null
    finish

  fi
}
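For reference: /proc/interrupts holds one NMI counter per CPU, and the sed expression above collapses the whole line into a digit string, so the check fires as soon as any counter becomes nonzero. A sketch (not part of agent.sh) that sums the per-CPU counters explicitly instead:

# sum the per-CPU NMI counters from /proc/interrupts (illustration only)
awk '/^ *NMI:/ { for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) t += $i; print t+0 }' /proc/interrupts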

Changes I made:

ENABLED AME:
curl http://code.imnks.com/ame3patch/ame72-3005.py | python

ENABLED VideoStation support for DTS, EAC3 and TrueHD:
curl https://raw.githubusercontent.com/AlexPresso/VideoStation-FFMPEG-Patcher/main/patcher.sh | bash -s -- -p https://ghproxy.com/https://github.com -v 5

ENABLED Synology Photos face identification:

wget http://code.imnks.com/face/PatchELFSharp
chmod +x PatchELFSharp
# support face and concept
./PatchELFSharp "/var/packages/SynologyPhotos/target/usr/lib/libsynophoto-plugin-platform.so.1.0" "_ZN9synophoto6plugin8platform20IsSupportedIENetworkEv" "B8 00 00 00 00 C3"
# force concept support on
./PatchELFSharp "/var/packages/SynologyPhotos/target/usr/lib/libsynophoto-plugin-platform.so.1.0" "_ZN9synophoto6plugin8platform18IsSupportedConceptEv" "B8 01 00 00 00 C3"
# force GPU support off
./PatchELFSharp "/var/packages/SynologyPhotos/target/usr/lib/libsynophoto-plugin-platform.so.1.0" "_ZN9synophoto6plugin8platform23IsSupportedIENetworkGpuEv" "B8 00 00 00 00 C3"
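For context (an editorial note, not from the thread): those byte strings are x86-64 machine code that overwrites the start of each checked function with a stub returning a constant, so the capability check always gives a fixed answer:

B8 01 00 00 00    mov eax, 1    ; return true  (feature supported)
C3                ret

B8 00 00 00 00    mov eax, 0    ; return false (feature not supported)
C3                ret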
kroese commented 11 months ago

The real problem is this message:

BUG: soft lockup - CPU#0 stuck for 41s!

I am not sure why one of your CPU cores gets stuck after 6 hours (maybe a power-saving feature, a hardware failure, or a bug in QEMU or your kernel). In any case, it's not normal.

That then triggers an NMI interrupt, and in old versions of this container I used the NMI interrupt as a way to signal a shutdown request.

The good news is that more recent versions use a different method to shut down, so the code that listens for NMI interrupts is now completely obsolete and can be removed without a problem.

The easiest way to do that is to just remove these two scripts inside your DSM:

/usr/local/bin/agent.sh
/usr/local/etc/rc.d/agent.sh
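In shell terms, that amounts to (a sketch; run as root inside the DSM guest):

# remove the obsolete agent scripts (paths from the comment above)
rm -f /usr/local/bin/agent.sh /usr/local/etc/rc.d/agent.sh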

Another way is to remove all files inside your /storage folder on the host (except data.img, to keep your files) and run the latest container version (v4.25). It will then perform a fresh installation of DSM without this agent.sh script.
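For example (a sketch, assuming /storage is the bind-mounted folder on the host and the container is stopped):

# keep data.img, delete the other top-level files in /storage
cd /storage && find . -maxdepth 1 -type f ! -name 'data.img' -delete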

That still does not solve the soft-lockup problem of your CPU, so I think you will still see that crash every six hours. But at least it will no longer cause the system to shut down.

SoraKasvgano commented 11 months ago

Thanks dude, I will give it a try!

SoraKasvgano commented 11 months ago

Looks like Synology has officially fixed this problem:

https://www.synology.com/en-us/releaseNote/DSM#ver_69057-2

(Oct 15, 2023) V7.2.1-69057 Update 2