vmexit / trio-sosp23-ae

Lockup during block allocation #3

Open hayley-leblanc opened 8 months ago

hayley-leblanc commented 8 months ago

Hi Trio authors,

I am trying to run a simple workload that creates 1000 directories in ArckFS, and I am running into some errors that appear to be related to block allocation. I followed the setup steps in the README (excluding building the other file systems and the included benchmarks), then followed the initialization steps in the minimal working example document to mount ArckFS. My workload is compiled against libfs and sufs.so in the same way as the fsutils programs. I'm running ArckFS on a 64-core machine with Debian Bookworm and Linux kernel version 6.3.0, using one 128GB Intel Optane PMM for the experiment. I ported the KFS module to this kernel version (including modifications to drivers/dax/bus.c in the kernel itself), but the errors I am encountering seem unrelated to the port.
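
For reference, the workload itself is essentially just the following loop (a minimal sketch of my program, not the exact code; the /sufs mount-point path and the directory names here are placeholders):

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(void)
{
    char path[64];

    /* Create 1000 directories on the ArckFS mount. */
    for (int i = 0; i < 1000; i++) {
        snprintf(path, sizeof(path), "/sufs/dir-%d", i);
        if (mkdir(path, 0755) != 0) {
            perror(path);
            return 1;
        }
    }

    return 0;
}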

I did a bit of printk debugging and determined that these errors appear to trigger only when the cpu argument to sufs_new_blocks is 63 and the pm_node argument is 1.

The most common issue I'm seeing is a soft lockup, usually early in the workload. An example stack trace is included below.

[  316.367393] watchdog: BUG: soft lockup - CPU#20 stuck for 26s! [syscall_latency:3965]
[  316.375222] Modules linked in: sufs(OE) intel_rapl_msr intel_rapl_common nd_pmem nd_btt i10nm_edac ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intei
[  316.443151] CPU: 20 PID: 3965 Comm: syscall_latency Tainted: G           OE      6.3.0+ #16
[  316.451497] Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.4.4 10/07/2021
[  316.458977] RIP: 0010:native_queued_spin_lock_slowpath+0xa2/0x2b0
[  316.465068] Code: c0 f0 0f ba 2b 08 0f 92 c0 89 c2 c1 e2 08 b9 ff 00 ff ff 23 0b 09 d1 81 f9 00 01 00 00 73 1a 85 c9 74 0c 80 3b 00 74 07 f3 90 <80> 3b 00 75 f9 66 1
[  316.483813] RSP: 0018:ffffabf109dbbb80 EFLAGS: 00000206
[  316.489041] RAX: 0000000000000000 RBX: ffff9f5ac581c000 RCX: 0000000000000005
[  316.496172] RDX: 0000000000000000 RSI: 0000000000000005 RDI: ffff9f5ac581c000
[  316.503306] RBP: ffffabf109dbbba0 R08: 0000000000000040 R09: 0000000000000001
[  316.510439] R10: 0000000000000000 R11: ffffffffc17c0300 R12: 0000000000002000
[  316.517573] R13: ffff9f5ac581a000 R14: ffff9f5ac581c000 R15: 00007ffe96f50a30
[  316.524703] FS:  00007f88b667d7c0(0000) GS:ffff9f69ffe80000(0000) knlGS:0000000000000000
[  316.532789] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  316.538537] CR2: 00007f883e873000 CR3: 0000000159faa006 CR4: 0000000000770ee0
[  316.545670] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  316.552803] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  316.559936] PKRU: 55555554
[  316.562646] Call Trace:
[  316.565099]  <TASK>
[  316.567209]  _raw_spin_lock+0x2a/0x30
[  316.570872]  sufs_new_blocks+0x8e/0x560 [sufs]
[  316.575318]  ? obj_cgroup_charge_pages+0xa8/0xe0
[  316.579938]  ? __memcg_kmem_charge_page+0x152/0x200
[  316.584817]  ? __alloc_pages+0x1e7/0x360
[  316.588744]  ? __mod_lruvec_state+0x5d/0xb0
[  316.592928]  ? __mod_lruvec_page_state+0x9f/0x150
[  316.597633]  ? pmd_install+0xd2/0xe0
[  316.601214]  ? walk_system_ram_range+0x115/0x130
[  316.605833]  ? pgprot_writethrough+0x30/0x30
[  316.610108]  ? __get_locked_pte+0xce/0x100
[  316.614204]  ? insert_pfn+0x19a/0x1d0
[  316.617873]  ? vmf_insert_pfn_prot+0x12f/0x160
[  316.622317]  ? sufs_map_pages+0x4c/0x80 [sufs]
[  316.626765]  sufs_alloc_blocks_to_libfs+0x81/0x1b0 [sufs]
[  316.632164]  sufs_kfs_ioctl+0x13a/0x190 [sufs]
[  316.636610]  __se_sys_ioctl+0x7a/0xc0
[  316.640275]  __x64_sys_ioctl+0x21/0x30
[  316.644028]  do_syscall_64+0x45/0x90
[  316.647608]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  316.652658] RIP: 0033:0x7f88b6c795a9
[  316.656239] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 8
[  316.674984] RSP: 002b:00007ffe96f50a28 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  316.682552] RAX: ffffffffffffffda RBX: 00007ffe96f50a98 RCX: 00007f88b6c795a9
[  316.689683] RDX: 00007ffe96f50a30 RSI: 0000000000001007 RDI: 0000000000000003
[  316.696817] RBP: 00007ffe96f50a90 R08: 0000000001e09d50 R09: 0000000001e85dc8
[  316.703950] R10: 0000000001d8dcd8 R11: 0000000000000246 R12: 00007f88b6d84d80
[  316.711082] R13: 00007f88b6d84d80 R14: 000000000000003f R15: ffffc00000000000
[  316.718217]  </TASK>

I've also seen a couple of page-fault-related kernel panics, also involving CPU 63, but I haven't been able to reproduce them reliably enough to capture a stack trace; they seem to occur more or less randomly. When they do occur, the call trace is similar or identical to the one above.

I rebuilt the kernel with CONFIG_DEBUG_SPINLOCK enabled to get more information about the error. This enables a spinlock magic-value check that consistently fails for cpu 63 and pm_node 1.

[  398.645676] BUG: spinlock bad magic on CPU#62, syscall_latency/6104
[  398.651947]  lock: 0xffff8fa5eff8a400, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[  398.660123] CPU: 62 PID: 6104 Comm: syscall_latency Tainted: G           OE      6.3.0+ #15
[  398.668473] Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.4.4 10/07/2021
[  398.675951] Call Trace:
[  398.678405]  <TASK>
[  398.680511]  dump_stack_lvl+0x5c/0x80
[  398.684175]  dump_stack+0x14/0x20
[  398.687497]  spin_bug+0x9b/0xa0
[  398.690640]  do_raw_spin_lock+0x6c/0xa0
[  398.694483]  _raw_spin_lock+0x19/0x20
[  398.698146]  sufs_new_blocks+0x92/0x580 [sufs]
[  398.702592]  ? obj_cgroup_charge_pages+0xa8/0xe0
[  398.707211]  ? __memcg_kmem_charge_page+0x152/0x200
[  398.712091]  ? __alloc_pages+0x1e7/0x360
[  398.716017]  ? __mod_lruvec_state+0x5d/0xb0
[  398.720202]  ? _raw_spin_unlock+0x12/0x30
[  398.724217]  ? pmd_install+0xd2/0xe0
[  398.727796]  ? _raw_read_unlock+0x12/0x30
[  398.731808]  ? walk_system_ram_range+0x115/0x130
[  398.736429]  ? _raw_spin_lock+0x19/0x20
[  398.740268]  ? _raw_spin_unlock+0x12/0x30
[  398.744280]  ? insert_pfn+0x19a/0x1d0
[  398.747946]  ? vmf_insert_pfn_prot+0x12f/0x160
[  398.752392]  ? sufs_map_pages+0x4c/0x80 [sufs]
[  398.756837]  sufs_alloc_blocks_to_libfs+0x81/0x1b0 [sufs]
[  398.762237]  sufs_kfs_ioctl+0x13a/0x190 [sufs]
[  398.766684]  __se_sys_ioctl+0x7a/0xc0
[  398.770348]  __x64_sys_ioctl+0x21/0x30
[  398.774102]  do_syscall_64+0x45/0x90
[  398.777682]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  398.782735] RIP: 0033:0x7f64a46455a9
[  398.786314] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 8
[  398.805059] RSP: 002b:00007ffd04622848 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  398.812626] RAX: ffffffffffffffda RBX: 00007ffd046228b8 RCX: 00007f64a46455a9
[  398.819757] RDX: 00007ffd04622850 RSI: 0000000000001007 RDI: 0000000000000003
[  398.826891] RBP: 00007ffd046228b0 R08: 0000000001e09d51 R09: 0000000001e85dc8
[  398.834022] R10: 0000000001d8dcd8 R11: 0000000000000246 R12: 00007f64a4750d80
[  398.841156] R13: 00007f64a4750d80 R14: 000000000000003f R15: ffffc00000000000
[  398.848291]  </TASK>
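
For context, the "bad magic" report comes from the debug check in kernel/locking/spinlock_debug.c, roughly paraphrased below (not the exact kernel source). A lock that was never run through spin_lock_init(), for example because the pointer refers to memory past the end of an array, shows .magic: 00000000 instead of the expected constant, which is exactly what the splat above reports:

/* Rough paraphrase of the CONFIG_DEBUG_SPINLOCK check: every lock set up
 * with spin_lock_init() carries this magic value, so taking a "lock" that
 * lives in uninitialized or out-of-bounds memory reports bad magic. */
#define SPINLOCK_MAGIC 0xdead4ead

static void debug_spin_lock_before(raw_spinlock_t *lock)
{
    SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic");
}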

I noticed that sufs_get_free_list returns free_lists[cpu * sb->pm_nodes + pm_node], but sufs_alloc_block_free_lists only allocates an array of cpus * sb->pm_nodes entries, so it seems that sufs_get_free_list may be indexing past the end of the free-list array in the problematic case I'm hitting. I was hoping to fix this myself, but I haven't been able to work out the role of the pm_node value: why is it sometimes 0 and sometimes 1 when I am only using one PM device?
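
To illustrate what I mean, here is a paraphrase of the allocation/lookup pair (the names, types, and signatures are approximate; only sb->pm_nodes and the index expression are taken directly from the code):

static struct sufs_free_list *free_lists;

static int sufs_alloc_block_free_lists(int cpus)
{
    /* One free list per (cpu, pm_node) pair: cpus * sb->pm_nodes entries. */
    free_lists = kzalloc(cpus * sb->pm_nodes * sizeof(struct sufs_free_list),
                         GFP_KERNEL);
    return free_lists ? 0 : -ENOMEM;
}

static struct sufs_free_list *sufs_get_free_list(int cpu, int pm_node)
{
    /* With sb->pm_nodes == 1, pm_node == 1 gives index cpu + 1; for
     * cpu == 63 on this 64-CPU machine that is index 64, one entry past
     * the end of the 64-entry array, so the spinlock read from there was
     * never initialized. */
    return &free_lists[cpu * sb->pm_nodes + pm_node];
}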

Thank you in advance for your help!

vmexit commented 8 months ago

Hi,

Thanks for the bug report.

I suspect that the bug occurs because the application runs on two NUMA nodes, while only the PM on one NUMA node is passed to ArckFS. This is a scenario we forgot to handle in our prototype, and a call stack (starting from LibFS) like the one below may cause the issue.

sufs_libfs_do_new_dir_data_blocks
  --> sufs_libfs_new_blocks(..., sufs_libfs_current_node())
      // sufs_libfs_current_node() returns the current NUMA node of the
      // invoking CPU, which could be either 0 or 1 in this case
  --> sufs_libfs_cmd_alloc_blocks
  --> sufs_alloc_blocks_to_libfs(pm_node == 1)

Could you confirm that this is the scenario?

hayley-leblanc commented 8 months ago

Thanks for the quick response! Yes, the machine has two NUMA nodes but I'm only using one NVDIMM. I tried running my program with numactl --cpunodebind=0 and that seems to have resolved the problem, so I think you are correct. If there's a better (quick) fix I could implement, I'm happy to make a PR; otherwise I'll use numactl as a workaround.
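
For example, one direction on the LibFS side might be something like the sketch below (just my guess at a quick fix, not necessarily how you'd want it done; sufs_libfs_current_node() is taken from your call chain above, the other names are hypothetical):

/* Hypothetical helper: clamp the node used for block allocation to the
 * number of PM nodes actually registered with ArckFS, so a CPU on a NUMA
 * node without PM falls back to PM node 0 instead of producing an
 * out-of-range pm_node. */
static inline int sufs_libfs_alloc_pm_node(int pm_nodes)
{
    int node = sufs_libfs_current_node();  /* 0 or 1 on this machine */

    if (node >= pm_nodes)
        node = 0;

    return node;
}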