uclasystem / hermit

Hermit: Low-Latency, High-Throughput, and Transparent Remote Memory via Feedback-Directed Asynchrony
https://www.usenix.org/conference/nsdi23/presentation/qiao
31 stars 7 forks source link

Spinlock w/ corrupted value after installation of Hermit Kernel #5

Open yashlala opened 1 year ago

yashlala commented 1 year ago

Dear Yifan,

When installing the Hermit kernel as described in the README, the kernel ring buffer fills up with warnings about corrupted spinlocks. I wanted to check whether these messages are known issues w/ Hermit, or indicative of a larger problem (I was having issues w/ setup otherwise).

Environment

(identical to the Hermit README)

To Reproduce

Follow the README for the following steps:

dmesg will show that the kernel ring buffer is full of corrupted spinlock related messages. Eg:

[   31.809054] ------------[ cut here ]------------
[   31.809062] pvqspinlock: lock 0xffff9b627d208ff8 has corrupted value 0x0!
[   31.809076] WARNING: CPU: 3 PID: 912 at kernel/locking/qspinlock_paravirt.h:498 __pv_queued_spin_unlock_slowpath+0xbd/0xd0
[   31.809086] Modules linked in: nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common binfmt_misc kvm_amd ccp kvm joydev input_leds serio_raw irqbypass qemu_fw_cfg mac_hid sch_fq_codel msr drm ramoops reed_solomon efi_pstore virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd mlx5_core pci_hyperv_intf ahci cryptd mlxfw psmouse i2c_i801 ptp libahci i2c_smbus lpc_ich virtio_net pps_core virtio_blk net_failover failover
[   31.809123] CPU: 3 PID: 912 Comm: sshd Tainted: G        W         5.14.0-rc5 #1
[   31.809126] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
[   31.809127] RIP: 0010:__pv_queued_spin_unlock_slowpath+0xbd/0xd0
[   31.809130] Code: 07 00 48 63 7a 10 e8 92 9f f8 ff 66 90 c3 8b 05 91 e9 86 01 85 c0 74 01 c3 8b 17 48 89 fe 48 c7 c7 18 84 f4 82 e8 f0 01 98 00 <0f> 0b c3 0f 0b cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 0b cc
[   31.809133] RSP: 0018:ffffbf5cb2e1fce8 EFLAGS: 00010082
[   31.809135] RAX: 0000000000000000 RBX: ffff9b627d208000 RCX: 0000000000000027
[   31.809137] RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff9b65afad88f8
[   31.809138] RBP: ffff9b627d208000 R08: ffff9b65afad88f0 R09: ffffbf5cb2e1fb08
[   31.809139] R10: 0000000000000001 R11: 0000000000000001 R12: ffff9b627d837800
[   31.809140] R13: ffff9b627d208ff8 R14: ffffbf5cb2e1fdb0 R15: 0000000000000001
[   31.809143] FS:  00007f11331faf40(0000) GS:ffff9b65afac0000(0000) knlGS:0000000000000000
[   31.809145] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   31.809146] CR2: 00007f1133119000 CR3: 000000014b780005 CR4: 0000000000770ee0
[   31.809155] PKRU: 55555554
[   31.809156] Call Trace:
[   31.809159]  __raw_callee_save___pv_queued_spin_unlock_slowpath+0x11/0x20
[   31.809164]  .slowpath+0x9/0xe
[   31.809166]  hmt_async_reclaim+0x1fd/0x2e0
[   31.809173]  try_charge_memcg_profiling+0x657/0x820
[   31.809178]  ? _copy_to_user+0x1c/0x30
[   31.809185]  ? cp_new_stat+0x150/0x180
[   31.809188]  ? prepare_creds+0x22/0x280
[   31.809192]  obj_cgroup_charge_pages+0x42/0x100
[   31.809195]  ? prepare_creds+0x22/0x280
[   31.809196]  obj_cgroup_charge+0x74/0x110
[   31.809199]  ? get_obj_cgroup_from_current+0xad/0x160
[   31.809201]  kmem_cache_alloc+0x70/0x390
[   31.809205]  prepare_creds+0x22/0x280
[   31.809207]  do_faccessat+0x1bc/0x270
[   31.809211]  ? exit_to_user_mode_prepare_profiling.isra.0+0x35/0x180
[   31.809215]  do_syscall_64+0x3b/0x90
[   31.809223]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   31.809228] RIP: 0033:0x7f11335e713b
[   31.809230] Code: 77 05 c3 0f 1f 40 00 48 8b 15 51 dd 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 0f 1f 40 00 f3 0f 1e fa b8 15 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 21 dd 0d 00 f7 d8
[   31.809232] RSP: 002b:00007ffd4c495e58 EFLAGS: 00000246 ORIG_RAX: 0000000000000015
[   31.809234] RAX: ffffffffffffffda RBX: 00007ffd4c495fc0 RCX: 00007f11335e713b
[   31.809235] RDX: 000000000000000e RSI: 0000000000000000 RDI: 00007ffd4c495e60
[   31.809236] RBP: 00007ffd4c495eb0 R08: 0000000000000007 R09: 00303030312f7265
[   31.809237] R10: 0000562bf67e0010 R11: 0000000000000246 R12: 0000562bf68023a0
[   31.809237] R13: 0000562bf6831624 R14: 0000000000000000 R15: 0000000000000000
[   31.809240] ---[ end trace b6d8cf33f16604d8 ]---

A more complete dmesg log is linked here

Regards,

Yash

ivanium commented 1 year ago

Hi Yash,

I haven't encountered such issue before and it seems the error is not quite related to Hermit's logic. My current guess is that Hermit changed the cgroup struct definition and there may be some mismatches in your kernel header files and the compiled kernel image. Anyways, I will double check it soon and get back to you.

Thanks, Yifan

yashlala commented 1 year ago

Dear Yifan,

Thank you for the quick response!

That's a good hypothesis. I've tested it and a few other potential causes. Here are my test notes, hopefully they can help:

Possible Causes to Investigate: