Open plq opened 1 year ago
https://www.kernel.org/doc/html/latest/x86/x86_64/mm.html
The memory is in the “direct mapping of all physical memory (page_offset_base)” region, but was somehow unmapped. This may not be our bug: there could be a bug elsewhere in the kernel, or a device could have done a wild write into the page table. We can rule out a bit flip on the present bit, since the PTE would not contain a zero pointer if that had happened.
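The bit-flip argument can be sketched in a few lines of userspace C. On x86-64, bit 0 of a page table entry is the present bit, and the frame address and flags live in the other bits, so clearing only bit 0 leaves a non-zero entry behind. The helper names and the sample PTE value below are made up for illustration; this is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86-64: bit 0 of a page table entry is the Present bit. */
#define PTE_PRESENT UINT64_C(0x1)

/* Simulate a single-bit error that toggles only the Present bit. */
static uint64_t flip_present_bit(uint64_t pte)
{
    return pte ^ PTE_PRESENT;
}

/*
 * True if the entry is marked not-present but still carries other bits
 * (frame address, flags). A bit flip on Present would leave such an
 * entry; an all-zero entry does not match this pattern.
 */
static bool nonpresent_but_nonzero(uint64_t pte)
{
    return (pte & PTE_PRESENT) == 0 && pte != 0;
}
```

With a hypothetical entry such as `0x8000000123456063`, `flip_present_bit()` yields `0x8000000123456062`, which is not present but clearly not zero. A fully zeroed entry therefore points to something other than a single flipped bit.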
We can harden the code against its own bugs by adding assertions to our uses of kunmap()/kunmap_atomic() to catch cases where memory from that region (or, more specifically, memory from any region where kmap()/kmap_atomic() does not return an address) is passed to kunmap()/kunmap_atomic(). Given the expense of mapping and unmapping kernel memory, it might not hurt to make it a VERIFY3P() statement so that the assertion also runs on non-debug builds.
I am preparing for a two-week trip, so I might not send a patch to harden the code against this class of bugs until I return.
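As a rough userspace sketch of that idea (not the actual patch), the check could be a range test that runs before the unmap call and is compiled in unconditionally. Everything named here is hypothetical: `SKETCH_VERIFY` stands in for a VERIFY-style macro, `struct map_window` models the address range that the mapping call is assumed to return, and the counter is a placeholder for the real unmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for a VERIFY-style check: always active, unlike assert(). */
#define SKETCH_VERIFY(cond)                                         \
    do {                                                            \
        if (!(cond)) {                                              \
            fprintf(stderr, "VERIFY(%s) failed at %s:%d\n",         \
                    #cond, __FILE__, __LINE__);                     \
            abort();                                                \
        }                                                           \
    } while (0)

/* Hypothetical window of addresses the mapping call may hand out. */
struct map_window {
    uintptr_t start; /* inclusive */
    uintptr_t end;   /* exclusive */
};

static bool addr_in_window(const struct map_window *w, uintptr_t addr)
{
    return addr >= w->start && addr < w->end;
}

/*
 * Hypothetical checked unmap: refuse (abort) if the address did not
 * come from the mapping window. The counter increment marks where the
 * real unmap call would go.
 */
static void checked_unmap(const struct map_window *w, uintptr_t addr,
                          unsigned *unmap_count)
{
    SKETCH_VERIFY(addr_in_window(w, addr));
    (*unmap_count)++;
}
```

The range test is cheap next to the mapping work itself, which is the reason for keeping it on in production builds rather than only under a debug option.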
System information
Describe the problem you're observing
All filesystem operations froze. ZFS is not my root partition, yet any command that needed storage access froze. I rebooted using Alt+SysRq+B.
Describe how to reproduce the problem
The software I'm working on has a database stress test (SQLite), which I was running at full scale. I could not reproduce the problem afterwards, no matter what I tried.
Include any warning/errors/backtraces from the system logs