sophgo / linux-riscv

Linux kernel stable tree
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/

Unable to handle kernel paging request at virtual address ffffffd7fde00000 #117

Open ncopa opened 5 months ago

ncopa commented 5 months ago

Built from commit 83ab3eda46e651464f2715455ae66711882be116

nld-bld-1 [~]$ uname -a
Linux nld-bld-1 6.1.80-0-sophgo #1-Alpine SMP PREEMPT Wed, 06 Mar 2024 20:52:40 +0000 riscv64 Linux
[    0.000000] Linux version 6.1.80-0-sophgo (buildozer@build-edge-riscv64) (gcc (Alpine 13.2.1_git20231014) 13.2.1 20231014, GNU ld (GNU Binutils) 2.42) #1-Alpine SMP PREEMPT Wed, 06 Mar 2024 20:52:40 +0000
[    0.000000] OF: fdt: Ignoring memory range 0x0 - 0x2200000
[    0.000000] Machine model: Sophgo Mango
[    0.000000] earlycon: uart0 at MMIO32 0x0000007040000000 (options '')
[    0.000000] printk: bootconsole [uart0] enabled
[    0.000000] efi: UEFI not found.
[    0.000000] OF: NUMA: parsing numa-distance-map-v1
[    0.000000] NUMA: NODE_DATA [mem 0x7ffffde80-0x7ffffffff]
[    0.000000] NUMA: NODE_DATA [mem 0xfffffde80-0xfffffffff]
[    0.000000] NUMA: NODE_DATA [mem 0x17ffffde80-0x17ffffffff]
[    0.000000] NUMA: NODE_DATA [mem 0x1f02138dc0-0x1f0213af3f]
[    0.000000] Zone ranges:
[    0.000000]   DMA32    [mem 0x0000000002200000-0x00000000ffffffff]
[    0.000000]   Normal   [mem 0x0000000100000000-0x0000001f021fffff]
[    0.000000]   HighMem  [mem 0x0000001f02200000-0x0000001fffffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000002200000-0x00000000bfffffff]
[    0.000000]   node   0: [mem 0x0000000100000000-0x00000007ffffffff]
[    0.000000]   node   1: [mem 0x0000000800000000-0x0000000fffffffff]
[    0.000000]   node   2: [mem 0x0000001000000000-0x00000017ffffffff]
[    0.000000]   node   3: [mem 0x0000001800000000-0x0000001fffffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000002200000-0x00000007ffffffff]
[    0.000000] Initmem setup node 1 [mem 0x0000000800000000-0x0000000fffffffff]
[    0.000000] Initmem setup node 2 [mem 0x0000001000000000-0x00000017ffffffff]
[    0.000000] Initmem setup node 3 [mem 0x0000001800000000-0x0000001fffffffff]
[    0.000000] On node 0, zone DMA32: 8704 pages in unavailable ranges
[    0.000000] SBI specification v1.0 detected
[    0.000000] SBI implementation ID=0x1 Version=0x10002
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI SRST extension detected
[    0.000000] SBI HSM extension detected
[    0.000000] riscv: base ISA extensions acdfimv
[    0.000000] riscv: ELF capabilities acdfimv
[    0.000000] percpu: cpu 0 has no node -1 or node-local memory
[    0.000000] percpu: Embedded 19 pages/cpu s38440 r8192 d31192 u77824
[    0.000000] pcpu-alloc: s38440 r8192 d31192 u77824 alloc=19*4096
[    0.000000] pcpu-alloc: [0] 00 [0] 01 [0] 02 [0] 03 [0] 04 [0] 05 [0] 06 [0] 07 
[    0.000000] pcpu-alloc: [0] 08 [0] 09 [0] 10 [0] 11 [0] 12 [0] 13 [0] 14 [0] 15 
[    0.000000] pcpu-alloc: [0] 16 [0] 17 [0] 18 [0] 19 [0] 20 [0] 21 [0] 22 [0] 23 
[    0.000000] pcpu-alloc: [0] 24 [0] 25 [0] 26 [0] 27 [0] 28 [0] 29 [0] 30 [0] 31 
[    0.000000] pcpu-alloc: [0] 32 [0] 33 [0] 34 [0] 35 [0] 36 [0] 37 [0] 38 [0] 39 
[    0.000000] pcpu-alloc: [0] 40 [0] 41 [0] 42 [0] 43 [0] 44 [0] 45 [0] 46 [0] 47 
[    0.000000] pcpu-alloc: [0] 48 [0] 49 [0] 50 [0] 51 [0] 52 [0] 53 [0] 54 [0] 55 
[    0.000000] pcpu-alloc: [0] 56 [0] 57 [0] 58 [0] 59 [0] 60 [0] 61 [0] 62 [0] 63 
[    0.000000] Fallback order for Node 0: 0 1 2 3 
[    0.000000] Fallback order for Node 1: 1 0 3 2 
[    0.000000] Fallback order for Node 2: 2 3 0 1 
[    0.000000] Fallback order for Node 3: 3 2 1 0 
[    0.000000] Built 4 zonelists, mobility grouping on.  Total pages: 32779776
[    0.000000] Policy zone: HighMem
[    0.000000] Kernel command line: root=UUID=01e6ba3a-67ba-4381-8565-584e7cd1747b rw earlycon modules=sd-mod,usb-storage,ext4 console=tty1 console=ttyS0,115200
[    0.000000] Unknown kernel command line parameters "modules=sd-mod,usb-storage,ext4", will be passed to user space.
[    0.000000] printk: log_buf_len individual max cpu contribution: 8192 bytes
[    0.000000] printk: log_buf_len total cpu_extra contributions: 516096 bytes
[    0.000000] printk: log_buf_len min size: 262144 bytes
[    0.000000] printk: log_buf_len: 1048576 bytes
[    0.000000] printk: early log buf free: 258480(98%)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] software IO TLB: area num 64.
[    0.000000] software IO TLB: mapped [mem 0x00000000bbfff000-0x00000000bffff000] (64MB)
[    0.000000] Memory: 130914456K/133134336K available (7121K kernel code, 5310K rwdata, 2048K rodata, 2168K init, 660K bss, 2217852K reserved, 0K cma-reserved, 4159488K highmem)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=64, Nodes=4
[    0.000000] rcu: Preemptible hierarchical RCU implementation.
[    0.000000]  Trampoline variant of Tasks RCU enabled.
[    0.000000]  Tracing variant of Tasks RCU enabled.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] riscv-intc: 64 local interrupts mapped

...

[   22.268171] EXT4-fs (nvme0n1p2): orphan cleanup on readonly fs
[   22.282837] EXT4-fs (nvme0n1p2): mounted filesystem with ordered data mode. Quota mode: none.
[   22.300623] Mounting root: ok.
[   25.282583] sophgo-spifmc 7000180000.flash-controller: gd25lb512me (65536 Kbytes)
[   25.298302] sophgo-spifmc 7002180000.flash-controller: gd25lb512me (65536 Kbytes)
[   25.377832] r8169 0003:c5:00.0 eth0: RTL8125B, 00:e0:4c:68:01:31, XID 641, IRQ 60
[   25.377855] r8169 0003:c5:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
[   25.401498] r8169 0003:c9:00.0 eth1: RTL8125B, 00:e0:4c:68:01:32, XID 641, IRQ 61
[   25.401517] r8169 0003:c9:00.0 eth1: jumbo features [frames: 9194 bytes, tx checksumming: ko]
[   26.030170] NET: Registered PF_PACKET protocol family
[   26.484762] EXT4-fs (nvme0n1p2): re-mounted. Quota mode: none.
[   26.759285] ata5.00: exception Emask 0x10 SAct 0x200000 SErr 0xb00100 action 0x6
[   26.759303] ata5.00: irq_stat 0x08000000
[   26.759309] ata5: SError: { UnrecovData Dispar BadCRC LinkSeq }
[   26.759320] ata5.00: failed command: READ FPDMA QUEUED
[   26.759326] ata5.00: cmd 60/00:a8:e0:00:00/01:00:00:00:00/40 tag 21 ncq dma 131072 in
[   26.759326]          res 40/00:ac:e0:00:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
[   26.759340] ata5.00: status: { DRDY }
[   26.759350] ata5: hard resetting link
[   27.235248] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   27.236014] ata5.00: supports DRM functions and may not be fully accessible
[   27.237068] ata5.00: supports DRM functions and may not be fully accessible
[   27.237931] ata5.00: configured for UDMA/133
[   27.237990] sd 4:0:0:0: [sda] tag#21 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=0s
[   27.238002] sd 4:0:0:0: [sda] tag#21 Sense Key : 0x5 [current] 
[   27.238009] sd 4:0:0:0: [sda] tag#21 ASC=0x21 ASCQ=0x4 
[   27.238019] sd 4:0:0:0: [sda] tag#21 CDB: opcode=0x28 28 00 00 00 00 e0 00 01 00 00
[   27.238025] I/O error, dev sda, sector 224 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 2
[   27.238086] ata5: EH complete
[   27.324685] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none.
[   28.071241] RTL8226B_RTL8221B 2.5Gbps PHY r8169-3-c500:00: attached PHY driver (mii_bus:phy_addr=r8169-3-c500:00, irq=MAC)
[   28.271360] r8169 0003:c5:00.0 eth0: Link is Down
[   30.031234] random: crng init done
[   31.278294] r8169 0003:c5:00.0 eth0: Link is Up - 1Gbps/Full - flow control off
[   31.278328] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   31.494869] Unable to handle kernel paging request at virtual address ffffffd7fde00000
[   31.494891] Oops [#1]
[   31.514737] Modules linked in: af_packet r8169 realtek pwm_fan gpio_dwapb ofpart sophgo_spifmc spi_nor mtd evdev input_leds hid_generic usbhid hid ahci libahci libata nvme nvme_core amdgpu gpu_sched drm_buddy radeon drm_ttm_helper ttm drm_display_helper i2c_algo_bit xhci_pci xhci_hcd sdhci_sophgo sdhci_pltfm sdhci led_class simpledrm drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm drm_panel_orientation_quirks cfbfillrect cfbimgblt cfbcopyarea loop usb_storage usbcore usb_common
[   31.618484] CPU: 56 PID: 3184 Comm: modprobe Not tainted 6.1.80-0-sophgo #1-Alpine
[   31.646118] Hardware name: Sophgo Mango (DT)
[   31.670333] epc : __split_linear_mapping_pgd+0x2f0/0x4ac
[   31.695607]  ra : __set_memory+0x1ea/0x336
[   31.719460] epc : ffffffff8000b05e ra : ffffffff8000b404 sp : ffffffc8121fbb00
[   31.746539]  gp : ffffffff8114d478 tp : ffffffefff808ec0 t0 : ffffffeffed36f81
[   31.773767]  t1 : ffffffeffed364c0 t2 : 7473007265776f6c s0 : ffffffc8121fbbd0
[   31.801180]  s1 : 0000000000200000 a0 : ffffffd7fde00000 a1 : 0000000000000000
[   31.828847]  a2 : 000000003fffffff a3 : 0000000000200000 a4 : 0000000000000fff
[   31.856635]  a5 : 0000000000001000 a6 : ffffffeffed36f80 a7 : ffffffeffed36f80
[   31.884317]  s2 : 0000000000001000 s3 : ffffffffffffffff s4 : ffc00000000003ff
[   31.912121]  s5 : ffffffd7fde00000 s6 : 0000000040000000 s7 : ffffffff8116d000
[   31.940129]  s8 : ffffffffffe00000 s9 : 0000000000000000 s10: 0000000000200000
[   31.968229]  s11: ffffffd7fde00000 t3 : ffffffffffffffff t4 : 0000000000000000
[   31.996433]  t5 : ffffffeffed36f81 t6 : 0000000000040000
[   32.022563] status: 0000000200000120 badaddr: ffffffd7fde00000 cause: 000000000000000d
[   32.051780] [<ffffffff8000b05e>] __split_linear_mapping_pgd+0x2f0/0x4ac
[   32.079964] [<ffffffff8000b404>] __set_memory+0x1ea/0x336
[   32.107045] [<ffffffff8000b578>] set_memory_ro+0x10/0x18
[   32.133963] [<ffffffff800784be>] module_enable_ro+0x5a/0xe8
[   32.161031] [<ffffffff80077554>] load_module+0x114c/0x1826
[   32.187879] [<ffffffff80077d3e>] __do_sys_init_module+0x110/0x136
[   32.215293] [<ffffffff80077e38>] sys_init_module+0xc/0x14
[   32.241918] [<ffffffff800034de>] ret_from_syscall+0x0/0x2
[   32.268537] ---[ end trace 0000000000000000 ]---

We are not able to ssh to the machine. lsmod hangs.

xingxg2022 commented 5 months ago

Please keep the commits below:

Revert "riscv: Fix wrong usage of lm_alias() when splitting a huge linear mapping"
Revert "riscv: Fix set_direct_map_default_noflush() to reset _PAGE_EXEC"
Revert "riscv: Fix set_memory_XX() and set_direct_map_XX() by splitting huge linear mappings"

ncopa commented 5 months ago

Do you mean we should re-apply the reverted commits?

Should we also disable HIGHMEM in that case?

xingxg2022 commented 5 months ago

You only need one of the two.

So there are two solutions: 1) apply the previous three reverted commits, or 2) disable CONFIG_HIGHMEM.

ncopa commented 5 months ago

Disabling HIGHMEM did not work:

[    0.000000] Linux version 6.1.89-0-sophgo (buildozer@build-edge-riscv64) (gcc (Alpine 13.2.1_git20240309) 13.2.1 20240309, GNU ld (GNU Binutils) 2.42) #1-Alpine SMP PREEMPT Thu, 02 May 2024 07:45:04 +0000
[    0.000000] OF: fdt: Ignoring memory range 0x0 - 0x2200000
[    0.000000] Machine model: Sophgo Mango
[    0.000000] earlycon: uart0 at MMIO32 0x0000007040000000 (options '')
[    0.000000] printk: bootconsole [uart0] enabled
[    0.000000] Hide vector 0.7 extension
[    0.000000] efi: UEFI not found.
[    0.000000] Unable to handle kernel paging request at virtual address fffffff73ff3afa0
[    0.000000] Oops [#1]
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.89-0-sophgo #1-Alpine
[    0.000000] Hardware name: Sophgo Mango (DT)
[    0.000000] epc : __memset+0xb4/0xfc
[    0.000000]  ra : memblock_alloc_try_nid+0x74/0x84
[    0.000000] epc : ffffffff806eac48 ra : ffffffff808132c2 sp : ffffffff81003e60
[    0.000000]  gp : ffffffff8114d348 tp : ffffffff81012dc0 t0 : fffffff73ff3aef8
[    0.000000]  t1 : 0000001f42183000 t2 : 49464555203a6966 s0 : ffffffff81003ea0
[    0.000000]  s1 : 000000000004805c a0 : fffffff73ff3afa0 a1 : 0000000000000000
[    0.000000]  a2 : 000000000004805c a3 : fffffff73ff82ff8 a4 : 0000000000000054
[    0.000000]  a5 : ffffffff806eac48 a6 : 000000000004805c a7 : 0000000000000080
[    0.000000]  s2 : fffffff73ff3afa0 s3 : ffffffffffffffff s4 : ffffffc6fecfb000
[    0.000000]  s5 : ffffffff80828cc8 s6 : 0000000000000000 s7 : 0000000000000000
[    0.000000]  s8 : 0000000000000000 s9 : 0000000000000000 s10: 0000000000000000
[    0.000000]  s11: 0000000000000000 t3 : ffffffff80a09458 t4 : ffffffff80a09458
[    0.000000]  t5 : ffffffff80a09458 t6 : ffffffff80a09470
[    0.000000] status: 0000000200000100 badaddr: fffffff73ff3afa0 cause: 000000000000000f
[    0.000000] [<ffffffff806eac48>] __memset+0xb4/0xfc
[    0.000000] [<ffffffff80828ce6>] early_init_dt_alloc_memory_arch+0x1e/0x48
[    0.000000] [<ffffffff80525ab0>] __unflatten_device_tree+0x52/0x114
[    0.000000] [<ffffffff80829e9c>] unflatten_device_tree+0x2c/0x44
[    0.000000] [<ffffffff80803966>] setup_arch+0xd4/0x580
[    0.000000] [<ffffffff80800970>] start_kernel+0x96/0xa90
[    0.000000] ---[ end trace 0000000000000000 ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

(I also merged in v6.1.89)

xingxg2022 commented 5 months ago

To fix this error, you can reduce physical memory to less than 124G, which is the maximum size of the direct mapping of all physical memory in Sv39 (https://www.kernel.org/doc/html/next/riscv/vm-layout.html).

The cause is as follows: 1) memblock.current_limit is MEMBLOCK_ALLOC_ANYWHERE when CONFIG_HIGHMEM is disabled. 2) phys_to_virt(), called from memblock_alloc_internal(), returns a wrong virtual address because the value of the alloc variable is above 124G.
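
The arithmetic behind the 124G limit can be checked with a small sketch. It is not kernel code: `PAGE_OFFSET` and the 124 GB window size are taken from the Sv39 table in the linked vm-layout document and may differ in this tree, and `PHYS_BASE` is an assumption read off the "Ignoring memory range 0x0 - 0x2200000" line in the boot log. The function names only mirror the kernel helpers for readability.

```python
# Assumed Sv39 constants from the linked vm-layout document
# (they may not match this vendor kernel exactly).
PAGE_OFFSET = 0xFFFFFFD800000000   # start of the direct mapping
DIRECT_MAP_SIZE = 124 << 30        # 124 GB direct-mapping window

# Assumed start of usable RAM, from "Ignoring memory range 0x0 - 0x2200000".
PHYS_BASE = 0x2200000


def phys_to_virt(pa: int) -> int:
    """Linear-map translation: physical address plus a fixed offset."""
    return pa - PHYS_BASE + PAGE_OFFSET


def virt_to_phys(va: int) -> int:
    """Inverse of phys_to_virt() for linear-map addresses."""
    return va - PAGE_OFFSET + PHYS_BASE


def in_direct_map(va: int) -> bool:
    """True if va falls inside the assumed direct-mapping window."""
    return PAGE_OFFSET <= va < PAGE_OFFSET + DIRECT_MAP_SIZE


bad_va = 0xFFFFFFF73FF3AFA0        # badaddr from the early-boot oops
bad_pa = virt_to_phys(bad_va)      # physical address memblock handed out
limit_pa = PHYS_BASE + DIRECT_MAP_SIZE  # first physical address past the window

print(f"faulting VA      : {bad_va:#x}")
print(f"implied PA       : {bad_pa:#x}")
print(f"direct-map limit : {limit_pa:#x}")
print(f"inside direct map: {in_direct_map(bad_va)}")
```

Under these assumptions the implied physical address lies above the end of the window, so the virtual address produced for it is not backed by the linear mapping. The computed limit also coincides with the Normal/HighMem zone boundary printed in the first boot log, which suggests the assumed constants are at least consistent with this machine.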

ncopa commented 4 months ago

I re-applied the 3 patches, re-enabled HIGHMEM, merged in v6.1.90. But it failed with:

PHOTO-2024-05-10-11-10-31

ncopa commented 4 months ago

the kernel config: https://github.com/alpinelinux/aports/blob/f2d77e8277c2f77ee5236e9aaf53579631ef8a12/testing/linux-sophgo/sophgo.riscv64.config

clandmeter commented 2 months ago

Hi @xingxg2022, we are still having issues with this. We have 2 hosts, one of which is seeing this issue.

It would be really nice if we could get this second box running again, as it's our main build server for the rv64 arch for Alpine Linux.

xingxg2022 commented 2 months ago

Please keep our commits below:

1) Revert "riscv: Fix wrong usage of lm_alias() when splitting a huge linear mapping"
2) Revert "riscv: Fix set_direct_map_default_noflush() to reset _PAGE_EXEC"
3) Revert "riscv: Fix set_memory_XX() and set_direct_map_XX() by splitting huge linear mappings"

These reverts fix the issue.

The panic log shows that your kernel code still contains the three patches:

"riscv: Fix wrong usage of lm_alias() when splitting a huge linear mapping"
"riscv: Fix set_direct_map_default_noflush() to reset _PAGE_EXEC"
"riscv: Fix set_memory_XX() and set_direct_map_XX() by splitting huge linear mappings"
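
One way to see which of the three patches are still in effect in a given tree is to walk the commit subjects in order and track reverts. The sketch below is only an illustration: it assumes the subjects come from something like `git log --reverse --format=%s`, that reverts use the default `Revert "..."` subject, and the helper names and the sample history are hypothetical.

```python
# Subjects of the three upstream fixes named in this thread.
FIXES = [
    "riscv: Fix wrong usage of lm_alias() when splitting a huge linear mapping",
    "riscv: Fix set_direct_map_default_noflush() to reset _PAGE_EXEC",
    "riscv: Fix set_memory_XX() and set_direct_map_XX() by splitting huge linear mappings",
]

PREFIX = 'Revert "'


def unwrap(subject):
    """Strip nested Revert "..." wrappers; return (inner title, depth)."""
    depth = 0
    while subject.startswith(PREFIX) and subject.endswith('"'):
        subject = subject[len(PREFIX):-1]
        depth += 1
    return subject, depth


def patches_in_effect(subjects_oldest_first):
    """Map each fix title to True if the history leaves it applied."""
    state = {fix: False for fix in FIXES}
    for subject in subjects_oldest_first:
        title, depth = unwrap(subject.strip())
        if title in state:
            # Even depth: applied (or revert of a revert). Odd depth: reverted.
            state[title] = depth % 2 == 0
    return state


# Hypothetical history: all three fixes arrive, then only the first is reverted.
history = FIXES + [PREFIX + FIXES[0] + '"']
for title, applied in patches_in_effect(history).items():
    print("applied " if applied else "reverted", title)
```

Matching on subjects is a heuristic: a patch that was re-applied by hand under a different subject, or squashed into a merge, would not be detected this way.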