klange closed 5 years ago
This was a fun one.
After extensive investigation in GDB and the QEMU monitor, it was discovered that the compositor was getting mappings to low physical memory (where the kernel resides) when it called sbrk. Further investigation showed that the physical frame bitmap had been cleared. This eventually led to the discovery of a bug in the ramdisk freeing code, which was freeing a frame it did not fully own (the rest of that frame had been allocated through the placement pointer allocator... to the frame bitmap).
ext2 ramdisks were always a multiple of page size. The tarfs ramdisks are only multiples of 512 bytes, so there was a 1 in 8 chance everything would be fine.
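The 1 in 8 figure can be checked with a small sketch. It assumes 4 KiB pages (which is what 512 * 8 implies) and that tarball sizes are spread evenly over the possible 512-byte multiples; the names tail_slack and safe_residues are made up for illustration and are not from the kernel.

```c
#include <stddef.h>

#define PAGE_SIZE 4096u /* assumption: 4 KiB pages */
#define TAR_BLOCK 512u  /* tar archives are padded to 512-byte blocks */

/* Bytes in the ramdisk's final frame that the ramdisk does NOT own.
 * Zero means the ramdisk ends exactly on a page boundary. */
static size_t tail_slack(size_t ramdisk_size) {
    size_t rem = ramdisk_size % PAGE_SIZE;
    return rem ? PAGE_SIZE - rem : 0;
}

/* Of the eight possible 512-byte residues within one page, count how
 * many leave no slack, i.e. how many sizes are safe to free frame by frame. */
static unsigned safe_residues(void) {
    unsigned safe = 0;
    for (size_t size = TAR_BLOCK; size <= PAGE_SIZE; size += TAR_BLOCK) {
        if (tail_slack(size) == 0) {
            safe++;
        }
    }
    return safe;
}
```

Under those assumptions only one residue out of eight is safe. In every other case the last frame is shared with whatever the placement pointer allocator hands out next.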
10:49:35 <... klange> So I do this thing on my CDs where I take a read-only
ramdisk and extract it out into my in-memory read-write
tmpfs.
10:49:57 <... klange> Obviously if I have a ~20MB ramdisk, I want the space
that was used to hold that to be available for the
system after the move to the read-write tmpfs.
10:50:14 <... klange> So naturally I clear out the frames it was using so they
can be reclaimed by the pmm.
10:50:33 <... klange> I recently switched to using tarballs for those
ramdisks, from mini ext2 filesystems.
10:50:42 <... klange> The tarballs are easier to create and have less overhead.
10:51:09 <... klange> An interesting property of the ext2 filesystems is that
they were always a multiple of page size, due to block
requirements in ext2.
10:51:16 <... klange> Tarballs are only multiples of 512 bytes.
10:52:19 <... klange> The tarfs ramdisks are the last thing that gets loaded
by the bootloader, and the early placement pointer
allocator starts immediately afterwards.
10:52:36 <... klange> And the next that gets allocated is... the page frame
bitmap.
10:52:44 <... klange> That the pmm uses... for allocations...
10:53:34 <... klange> So everything's fine until the startup says "okay you
can remove the ramdisk now I'm done with it" and the
kernel goes and frees... the frame with the start of the
frame bitmap.
10:54:10 <.. klange> Which then gets reallocated to something stupid like a
bitmap with a font in it.
10:54:29 <.. klange> Which then marks large swaths of the kernel as available
for the PMM.
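One way to avoid the problem described in the log is to release only the frames that lie entirely inside the ramdisk. The following is a minimal sketch and not the actual kernel code: the toy frame_used table, MAX_FRAMES, and release_owned_frames are hypothetical stand-ins for the real frame bitmap and freeing routine, and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE  ((uintptr_t)0x1000) /* assumption: 4 KiB pages */
#define MAX_FRAMES 64u                 /* toy size, for illustration only */

/* Toy stand-in for the physical frame bitmap: 1 = in use, 0 = free. */
static uint8_t frame_used[MAX_FRAMES];

/* Release only the frames wholly contained in [start, start + length).
 * The start is rounded UP and the end is rounded DOWN, so a partial
 * frame at either edge (possibly shared with another allocation, such
 * as the frame bitmap itself) is never handed back.
 * Returns the number of frames released. */
static size_t release_owned_frames(uintptr_t start, size_t length) {
    uintptr_t end   = start + length;
    uintptr_t first = (start + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
    uintptr_t last  = end & ~(PAGE_SIZE - 1);
    size_t freed = 0;

    assert(last / PAGE_SIZE <= MAX_FRAMES);

    for (uintptr_t addr = first; addr < last; addr += PAGE_SIZE) {
        frame_used[addr / PAGE_SIZE] = 0;
        freed++;
    }
    return freed;
}
```

For example, a ramdisk at 0x4000 with length 0x2200 covers frames 4 and 5 completely and only the first 512 bytes of frame 6. This sketch releases frames 4 and 5 and leaves frame 6 marked as used. The cost is that up to one page minus 512 bytes stays pinned; page-aligning the ramdisk or the placement pointer would be an alternative.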
When switching out an ext2 ramdisk for a tar ramdisk, an issue rather consistently shows up when launching the compositor, causing crashes and even complete corruption of the VM environment. Initially, the tarfs driver itself was suspected, but a thorough analysis has cleared it of any wrongdoing - bounds checking and strict limits on copy lengths all checked out.
To start investigating this issue, I built a new set of memory allocation tracking tools. After several revisions and improvements to these tools, I believe the issue has been narrowed down to a corruption of memory used for kernel stacks, as well as corruption of memory used for file descriptor tables (the latter may be caused by the former, it's hard to tell). The addition of guard pages around kernel stacks suggests that something - possibly page directory management - is touching regions it should not be touching.