Open arista-nwolfe opened 4 days ago
As indicated in https://github.com/aristanetworks/sonic/issues/109, a kernel panic can occur on the supervisor during reboot tests (module API platform tests). The panic was introduced by the kernel upgrade to 6.1.94 (6.1.0-22-2) in https://github.com/sonic-net/sonic-buildimage/pull/19885
2024 Nov 14 23:21:43.961688 str2-7804-sup-1 INFO kernel: [ 1284.481956] br1: port 13(lc7.42) entered disabled state
2024 Nov 14 23:21:44.025742 str2-7804-sup-1 INFO lc-interface-config[52854]: remove interface lc7 slot_id=
2024 Nov 14 23:21:44.078064 str2-7804-sup-1 INFO kernel: [ 1284.597508] pcieport 0000:73:0d.0: pciehp: Timeout on hotplug command 0x1038 (issued 1183788 msec ago)
2024 Nov 14 23:21:44.693686 str2-7804-sup-1 ERR kernel: [ 1285.105921] pcieport 0000:73:02.0: Unable to change power state from D3hot to D0, device inaccessible
2024 Nov 14 23:21:46.309702 str2-7804-sup-1 INFO kernel: [ 1286.721489] pcieport 0000:73:0d.0: pciehp: Timeout on hotplug command 0x0000 (issued 2124 msec ago)
2024 Nov 14 23:21:46.309722 str2-7804-sup-1 ERR kernel: [ 1286.721728] pcieport 0000:73:02.0: Unable to change power state from D3cold to D0, device inaccessible
2024 Nov 14 23:21:46.345683 str2-7804-sup-1 INFO kernel: [ 1286.834051] pci_bus 0000:74: busn_res: [bus 74] is released
2024 Nov 14 23:21:46.345705 str2-7804-sup-1 INFO kernel: [ 1286.834570] pci 0000:73:02.0: Removing from iommu group 20
2024 Nov 14 23:21:46.345707 str2-7804-sup-1 INFO kernel: [ 1286.834649] pci 0000:75:00.0: Removing from iommu group 20
2024 Nov 14 23:21:46.345708 str2-7804-sup-1 WARNING kernel: [ 1286.839869] general protection fault, probably for non-canonical address 0x32b727d667b7999a: 0000 [#1] PREEMPT SMP PTI
2024 Nov 14 23:21:51.054518 str2-7804-sup-1 WARNING kernel: [ 1286.968107] CPU: 11 PID: 151 Comm: irq/46-pciehp Tainted: G OE 6.1.0-22-2-amd64 #1 Debian 6.1.94-1
2024 Nov 14 23:21:51.054538 str2-7804-sup-1 WARNING kernel: [ 1287.092181] Hardware name: Intel Camelback Mountain CRB/Camelback Mountain CRB, BIOS Aboot-norcal7-7.1.4-14169220 11/09/2019
2024 Nov 14 23:21:51.054540 str2-7804-sup-1 WARNING kernel: [ 1287.226668] RIP: 0010:pcie_config_aspm_link+0x48/0x330
2024 Nov 14 23:21:51.054541 str2-7804-sup-1 WARNING kernel: [ 1287.288242] Code: 48 8b 04 25 28 00 00 00 48 89 44 24 30 31 c0 8b 47 30 4c 8b 47 08 83 e3 7f c1 e8 0e f7 d3 89 c2 83 e0 7f 21 c3 83 e2 7f 21 f3 <41> 8b b6 a0 00 00 00 89 d8 83 e0 87 f6 c3 04 0f 44 d8 0f b7 47 30
2024 Nov 14 23:21:51.054543 str2-7804-sup-1 WARNING kernel: [ 1287.513355] RSP: 0000:ffffa81a0053bcb8 EFLAGS: 00010246
2024 Nov 14 23:21:51.054544 str2-7804-sup-1 WARNING kernel: [ 1287.575967] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
2024 Nov 14 23:21:51.054545 str2-7804-sup-1 WARNING kernel: [ 1287.661493] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9a41c6c35480
2024 Nov 14 23:21:51.054546 str2-7804-sup-1 WARNING kernel: [ 1287.747022] RBP: ffff9a41c6c35480 R08: ffff9a424d08bf49 R09: ffffa81a0053bc6c
2024 Nov 14 23:21:51.054547 str2-7804-sup-1 WARNING kernel: [ 1287.832549] R10: 0000000000000000 R11: 0000000000000004 R12: ffff9a41c1016000
2024 Nov 14 23:21:51.054548 str2-7804-sup-1 WARNING kernel: [ 1287.918078] R13: ffff9a41c5435028 R14: 32b727d667b798fa R15: ffff9a41c0ec3920
2024 Nov 14 23:21:51.054549 str2-7804-sup-1 WARNING kernel: [ 1288.003606] FS: 0000000000000000(0000) GS:ffff9a50ffcc0000(0000) knlGS:0000000000000000
2024 Nov 14 23:21:51.054550 str2-7804-sup-1 WARNING kernel: [ 1288.100593] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024 Nov 14 23:21:51.054550 str2-7804-sup-1 WARNING kernel: [ 1288.169454] CR2: 00007fb55fdf5030 CR3: 0000000101044001 CR4: 00000000003706e0
2024 Nov 14 23:21:51.054551 str2-7804-sup-1 WARNING kernel: [ 1288.254982] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2024 Nov 14 23:21:51.054552 str2-7804-sup-1 WARNING kernel: [ 1288.340509] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2024 Nov 14 23:21:51.054553 str2-7804-sup-1 WARNING kernel: [ 1288.426039] Call Trace:
2024 Nov 14 23:21:51.054554 str2-7804-sup-1 WARNING kernel: [ 1288.455317] <TASK>
2024 Nov 14 23:21:51.054555 str2-7804-sup-1 WARNING kernel: [ 1288.480430] ? __die_body.cold+0x1a/0x1f
2024 Nov 14 23:21:51.054555 str2-7804-sup-1 WARNING kernel: [ 1288.527428] ? die_addr+0x38/0x60
2024 Nov 14 23:21:51.054556 str2-7804-sup-1 WARNING kernel: [ 1288.567128] ? exc_general_protection+0x221/0x4a0
2024 Nov 14 23:21:51.054557 str2-7804-sup-1 WARNING kernel: [ 1288.623496] ? asm_exc_general_protection+0x22/0x30
2024 Nov 14 23:21:51.054558 str2-7804-sup-1 WARNING kernel: [ 1288.681954] ? pcie_config_aspm_link+0x48/0x330
2024 Nov 14 23:21:51.054559 str2-7804-sup-1 WARNING kernel: [ 1288.736243] pcie_aspm_exit_link_state+0xb9/0x120
2024 Nov 14 23:21:51.054559 str2-7804-sup-1 WARNING kernel: [ 1288.792612] pci_remove_bus_device+0xc8/0x110
2024 Nov 14 23:21:51.054560 str2-7804-sup-1 WARNING kernel: [ 1288.844818] pci_remove_bus_device+0x2e/0x110
2024 Nov 14 23:21:51.054561 str2-7804-sup-1 WARNING kernel: [ 1288.897026] pci_remove_bus_device+0x3e/0x110
2024 Nov 14 23:21:51.054562 str2-7804-sup-1 WARNING kernel: [ 1288.949234] pciehp_unconfigure_device+0x94/0x160
2024 Nov 14 23:21:51.054563 str2-7804-sup-1 WARNING kernel: [ 1289.005609] pciehp_disable_slot+0x69/0x100
2024 Nov 14 23:21:51.054564 str2-7804-sup-1 WARNING kernel: [ 1289.055731] pciehp_handle_presence_or_link_change+0x241/0x350
2024 Nov 14 23:21:51.054564 str2-7804-sup-1 WARNING kernel: [ 1289.125642] pciehp_ist+0x164/0x170
2024 Nov 14 23:21:51.054575 str2-7804-sup-1 WARNING kernel: [ 1289.167433] ? disable_irq_nosync+0x10/0x10
2024 Nov 14 23:21:51.054577 str2-7804-sup-1 WARNING kernel: [ 1289.217548] irq_thread_fn+0x1f/0x60
2024 Nov 14 23:21:51.054578 str2-7804-sup-1 WARNING kernel: [ 1289.260374] irq_thread+0xfa/0x1c0
2024 Nov 14 23:21:51.054578 str2-7804-sup-1 WARNING kernel: [ 1289.301116] ? irq_thread_fn+0x60/0x60
2024 Nov 14 23:21:51.054579 str2-7804-sup-1 WARNING kernel: [ 1289.346024] ? irq_thread_check_affinity+0xf0/0xf0
2024 Nov 14 23:21:51.054580 str2-7804-sup-1 WARNING kernel: [ 1289.403432] kthread+0xda/0x100
2024 Nov 14 23:21:51.054584 str2-7804-sup-1 WARNING kernel: [ 1289.441043] ? kthread_complete_and_exit+0x20/0x20
2024 Nov 14 23:21:51.054585 str2-7804-sup-1 WARNING kernel: [ 1289.498448] ret_from_fork+0x22/0x30
2024 Nov 14 23:21:51.054585 str2-7804-sup-1 WARNING kernel: [ 1289.541273] </TASK>
2024 Nov 14 23:21:51.054586 str2-7804-sup-1 WARNING kernel: [ 1289.567422] Modules linked in: nft_meta_bridge(E) 8021q(E) garp(E) mrp(E) lm75(E) linux_ngbde(OE) linux_knet_cb(OE) linux_bcm_knet(OE) psample(E) linux_user_bde(OE) linux_kernel_bde(OE) xt_hl(E) xt_tcpudp(E) ip6_tables(E) xt_conntrack(E) ebt_vlan(E) nft_compat(E) nf_tables(E) tmp468(OE) amax31790(OE) veth(E) pmbus(E) pmbus_core(E) nf_conntrack_netlink(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) xfrm_user(E) i2c_mux_pca9541(E) i2c_mux(E) optoe(E) lm90(E) at24(E) regmap_i2c(E) scd_hwmon(OE) i2c_dev(E) eeprom(E) bridge(E) stp(E) llc(E) nvme_fabrics(E) binfmt_misc(E) intel_rapl_msr(E) intel_rapl_common(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) sb_edac(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) bonding(E) tls(E) irqbypass(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha512_generic(E) sha256_ssse3(E) sha1_ssse3(E) aesni_intel(E) crypto_simd(E) cryptd(E) rapl(E) intel_cstate(E) intel_uncore(E) iTCO_wdt(E) evdev(E)
2024 Nov 14 23:21:51.054588 str2-7804-sup-1 WARNING kernel: [ 1289.567494] ofpart(E) intel_pmc_bxt(E) scd(OE) spi_nor(E) iTCO_vendor_support(E) pcspkr(E) mtd(E) intel_pch_thermal(E) uio(E) watchdog(E) sg(E) ioatdma(E) button(E) nfnetlink(E) fuse(E) efi_pstore(E) dm_mod(E) drm(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) loop(E) ext4(E) crc16(E) mbcache(E) jbd2(E) crc32c_generic(E) zstd(E) zstd_compress(E) nvme(E) nvme_core(E) nls_utf8(E) nls_cp437(E) nls_ascii(E) vfat(E) fat(E) overlay(E) squashfs(E) sd_mod(E) t10_pi(E) crc64_rocksoft(E) crc64(E) crc_t10dif(E) crct10dif_generic(E) ahci(E) libahci(E) ixgbe(E) xhci_pci(E) crct10dif_pclmul(E) spi_intel_platform(E) xfrm_algo(E) crct10dif_common(E) spi_intel(E) gpio_ich(E) libata(E) ehci_pci(E) dca(E) crc32_pclmul(E) xhci_hcd(E) ehci_hcd(E) mdio_devres(E) of_mdio(E) crc32c_intel(E) i2c_i801(E) scsi_mod(E) lpc_ich(E) fixed_phy(E) i2c_smbus(E) scsi_common(E) usbcore(E) tg3(E) fwnode_mdio(E) usb_common(E) libphy(E) mdio(E)
2024 Nov 14 23:21:51.054592 str2-7804-sup-1 WARNING kernel: [ 1291.578230] sched: RT throttling activated
2024 Nov 14 23:21:51.103876 str2-7804-sup-1 WARNING kernel: [ 1291.578551] ---[ end trace 0000000000000000 ]---
2024 Nov 14 23:21:51.220783 str2-7804-sup-1 WARNING kernel: [ 1291.682963] RIP: 0010:pcie_config_aspm_link+0x48/0x330
2024 Nov 14 23:21:51.220806 str2-7804-sup-1 WARNING kernel: [ 1291.744550] Code: 48 8b 04 25 28 00 00 00 48 89 44 24 30 31 c0 8b 47 30 4c 8b 47 08 83 e3 7f c1 e8 0e f7 d3 89 c2 83 e0 7f 21 c3 83 e2 7f 21 f3 <41> 8b b6 a0 00 00 00 89 d8 83 e0 87 f6 c3 04 0f 44 d8 0f b7 47 30
2024 Nov 14 23:21:51.508531 str2-7804-sup-1 WARNING kernel: [ 1291.969674] RSP: 0000:ffffa81a0053bcb8 EFLAGS: 00010246
2024 Nov 14 23:21:51.508552 str2-7804-sup-1 WARNING kernel: [ 1292.032297] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
2024 Nov 14 23:21:51.679604 str2-7804-sup-1 WARNING kernel: [ 1292.117829] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9a41c6c35480
2024 Nov 14 23:21:51.679626 str2-7804-sup-1 WARNING kernel: [ 1292.203366] RBP: ffff9a41c6c35480 R08: ffff9a424d08bf49 R09: ffffa81a0053bc6c
2024 Nov 14 23:21:51.850678 str2-7804-sup-1 WARNING kernel: [ 1292.288901] R10: 0000000000000000 R11: 0000000000000004 R12: ffff9a41c1016000
2024 Nov 14 23:21:51.850701 str2-7804-sup-1 WARNING kernel: [ 1292.374438] R13: ffff9a41c5435028 R14: 32b727d667b798fa R15: ffff9a41c0ec3920
2024 Nov 14 23:21:52.033223 str2-7804-sup-1 WARNING kernel: [ 1292.459975] FS: 0000000000000000(0000) GS:ffff9a50ffcc0000(0000) knlGS:0000000000000000
2024 Nov 14 23:21:52.033244 str2-7804-sup-1 WARNING kernel: [ 1292.556981] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024 Nov 14 23:21:52.108423 str2-7804-sup-1 WARNING pmon#chassisd: Unexpected: Module LINE-CARD4 (Slot 7) lost midplane connectivity
2024 Nov 14 23:21:52.187648 str2-7804-sup-1 WARNING kernel: [ 1292.625861] CR2: 00007fb55fdf5030 CR3: 0000000101044001 CR4: 00000000003706e0
2024 Nov 14 23:21:52.187669 str2-7804-sup-1 WARNING kernel: [ 1292.711408] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Message from syslogd@str2-7804-sup-1 at Nov 14 23:21:52 ...
kernel:[ 1292.882490] Kernel panic - not syncing: Fatal exception
2024 Nov 14 23:21:52.358732 str2-7804-sup-1 WARNING kernel: [ 1292.796949] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2024 Nov 14 23:21:52.358755 str2-7804-sup-1 EMERG kernel: [ 1292.882490] Kernel panic - not syncing: Fatal exception
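One detail in the register dump is worth spelling out. The bytes at the fault marker in the `Code:` line, `<41> 8b b6 a0 00 00 00`, decode as a 32-bit load from `R14 + 0xa0`, and `R14 + 0xa0` is exactly the non-canonical address reported in the general protection fault. So `pcie_config_aspm_link` was dereferencing a pointer in R14 that does not look like a kernel address. That is consistent with a stale or corrupted pointer, though the log alone does not establish which. The sketch below just checks that arithmetic; the helper names are made up for illustration, the decoder handles only this one instruction form, and the canonical-address check assumes 48-bit virtual addresses (4-level paging), which the log does not state.

```python
# Values copied from the oops above.
FAULT_ADDR = 0x32B727D667B7999A          # "non-canonical address" in the GPF line
R14 = 0x32B727D667B798FA                 # R14 in the register dump
FAULT_INSN = bytes.fromhex("418bb6a0000000")  # bytes starting at the <41> marker


def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63..(va_bits-1) are all equal (assumes 48-bit VAs by default)."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top in (0, all_ones)


def decode_mov_r32_mem_disp32(insn: bytes) -> tuple[int, int]:
    """Decode 'REX 8B /r' with mod=10 (disp32); returns (base register number, disp).

    Illustrative helper: it handles only this single encoding form
    (no SIB byte, REX.X unused).
    """
    rex, opcode, modrm = insn[0], insn[1], insn[2]
    assert rex & 0xF0 == 0x40, "expected a REX prefix"
    assert opcode == 0x8B, "expected MOV r32, r/m32"
    mod, rm = modrm >> 6, modrm & 0x7
    assert mod == 0b10 and rm != 0b100, "expected disp32 addressing without SIB"
    base = rm + (8 if rex & 0x1 else 0)   # REX.B extends the base register
    disp = int.from_bytes(insn[3:7], "little", signed=True)
    return base, disp


base_reg, disp = decode_mov_r32_mem_disp32(FAULT_INSN)
print(f"load from r{base_reg} + {disp:#x}")
print(f"r14 + disp = {R14 + disp:#x}")
print("fault address canonical:", is_canonical(FAULT_ADDR))
```

For comparison, the other pointer-like registers in the dump (for example RDI, `ffff9a41c6c35480`) pass the same canonical check, so R14 stands out as the bad value.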
Further investigation points to this specific change as the apparent cause of the kernel panic: https://github.com/torvalds/linux/commit/456d8aa37d0f56fc9e985e812496e861dcd6f2f2
The commit shows up when comparing the previous version (6.1.38) https://elixir.free-electrons.com/linux/v6.1.38/source/drivers/pci/pcie/aspm.c#L1003 with the newer version (6.1.94) https://elixir.free-electrons.com/linux/v6.1.94/source/drivers/pci/pcie/aspm.c#L1018
@saiarcot895, can you please help with this issue?