Buffer/fs bug - Githubissues

ghaerr / elks

Embeddable Linux Kernel Subset - Linux for 8086

Other

1.03k stars 108 forks source link

Buffer/fs bug #1367

Closed Mellvik closed 2 years ago

Mellvik commented 2 years ago

In short: Removing (rm) a 'large' (as in several k bytes) cause the calling process to hang (cannot be interrupted). Shortly after, the system will hang, some times 'hard' (as in 'ctl-alt-del' not working).

This does NOT happen when XMS support is compiled in, whether enabled via /bootopts or not. Presumably that's the reason we haven't seen this before.

elks17# cd /
elks17# ls
EXEC: '/bin/ls' env 134
OPEN '.' flags 0x0 = 4
CLOSE 4
bin          bootopts     dev          etc          fd           linux        
linux.buf    linux.nobuf  mnt          mnt2         mnt3         mnt4         
nlinux       root         tmp          xx           zz           
elks17# touch x
BLK 11 read fc002180:3
EXEC: '/bin/touch' env 141
BLK 808 read b0002180:3
OPEN 'x' flags 0x241 = 4
CLOSE 4
elks17# rm x
EXEC: '/bin/rm' env 138
BLK 1325 read 6c002180:3
BLK 1326 read 2180:3
BLK 1327 read 40002180:3
BLK 1328 read 48002180:3
BLK 1329 read f0002180:3
BLK 1330 read 4c002180:3
UNLINK 'x'
elks17# ls -l zz
EXEC: '/bin/ls' env 144
OPEN '/etc/passwd' flags 0x0 = 4
BLK 1429 read e4002180:3
CLOSE 4
OPEN '/etc/group' flags 0x0 = 4
BLK 1422 read 50002180:3
CLOSE 4
-rw-r--r--  1 root     root        81472  Jul 12 10:46 zz
elks17# rm zz
EXEC: '/bin/rm' env 139
UNLINK 'zz'                   <---------- hang

ghaerr commented 2 years ago

Hello @Mellvik,

I saw this one years ago, and its on my list. The reason for the hang is that the MINIX filesystem in some cases requires more local L1 locked buffers than there are available (8 unless FAT linked in, then 12). The rm process hangs in the kernel waiting for another buffer to become free, and rm already has all the buffers locked, so deadlock.

It is somewhat hard to believe that the kernel needs more than 12K buffers to remove a file, but this happens when the inode is using a double-indirect block list, I think. We're very short on kernel data segment space, so merely increasing the L1 size isn't a good option.

This does NOT happen when XMS support is compiled in,

Interesting. Perhaps its an L2/L1 issue then, not just L1.

I'll continue to look into this.

Thank you!

Mellvik commented 2 years ago

Thank you @ghaerr.

It's strange I haven't run into this one before. After all, the XMS support is a fairly recent addition. Without XMS ELKS hangs every time I (say) 'rm linux'. Then again I may have a setting (this is new one, new machine etc) that just happen to make the bits in the puzzle fit exactly.

—M

jul. 2022 kl. 19:05 skrev Gregory Haerr @.***>:

Hello @Mellvik https://github.com/Mellvik,

I saw this one years ago, and its on my list. The reason for the hang is that the MINIX filesystem in some cases requires more local L1 locked buffers than there are available (8 unless FAT linked in, then 12). The rm process hangs in the kernel waiting for another buffer to become free, and rm already has all the buffers locked, so deadlock.

It is somewhat hard to believe that the kernel needs more than 12K buffers to remove a file, but this happens when the inode is using a double-indirect block list, I think. We're very short on kernel data segment space, so merely increasing the L1 size isn't a good option.

This does NOT happen when XMS support is compiled in,

Interesting. Perhaps its an L2/L1 issue then, not just L1.

I'll continue to look into this.

Thank you!

— Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/1367#issuecomment-1182020992, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOENET6UELJKTD7W6FLVTWQWLANCNFSM53KDGGZQ. You are receiving this because you were mentioned.

ghaerr commented 2 years ago

Without XMS ELKS hangs every time I (say) 'rm linux'.

Oh, good to know we have a larger file available to delete. I'll use this for a test case and will up the priority.

After all, the XMS support is a fairly recent addition

Yes, I got a bit spoiled myself with a default .config with XMS and 2880k disk support. I'm back to running 1440k and no XMS, so I'll see whether this repeats over here.

Mellvik commented 2 years ago

Yes, I got a bit spoiled myself with a default .config with XMS and 2880k disk support. I'm back to running 1440k and no XMS, so I'll see whether this repeats over here.

Welcome to the club :-) . Well, I guess you're on QEMU so it's fast anyway… I got really spoiled recently when I started to run of of HD instead of floppy. He.., it's fast.

Crazy as it sounds, 'rm linux' is probably the command I use the most these days, building and transferring 50-100 kernels per day. :-)

BTW - I still have sys problems on the compaqs (MBR boot doesn't find partition boot), while the same works great on this other machine (40MHz 386SX). To be investigated.

-M

ghaerr commented 2 years ago

Without XMS ELKS hangs every time I (say) 'rm linux'. Oh, good to know we have a larger file available to delete. I'll use this for a test case and will up the priority.

I just tested with our "standard" configuration ibmpc-1440.config (used for the auto Github builds), and rm linux works great, no hang. Can you try using that configuration, or diffing it with your .config, to see what might be happening? It would be worthwhile to see whether QEMU hangs for you as well, perhaps this problem is not the same as I described.

I still have sys problems on the compaqs (MBR boot doesn't find partition boot), while the same works great on this other machine (40MHz 386SX). To be investigated.

Definitely interested to hear more about that.

Mellvik commented 2 years ago

I still have sys problems on the compaqs (MBR boot doesn't find partition boot), while the same works great on this other machine (40MHz 386SX). To be investigated.

Definitely interested to hear more about that.

OK, this mystery has been resolved if not fixed.

It turns out (news to me, not to you) that sys assumes the source fs is an image created by the build_hd_image.sh script so that the partition boot block (aka floppy boot block) is present at a predictable place. Which is what happened on my newer system. But the old system wasn't created that way, it evolved over time, all 3 Minix FSs being created manually, sys used only recently. IOW there never was a valid boot block to copy.

What I suggest, in order to make this more failsafe, is that either minix mkfs or sys copiy the bootblock explicitly (from file).

-M

ghaerr commented 2 years ago

sys assumes the source fs is an image created by the build_hd_image.sh script

Wow - I've never seen that script before. It is extremely outdated and should probably be removed, as ELKS now creates all HD images (hd32mbr-minix.img and hd32-minix.img) using Makefiles under the images/ directory, and doesn't require loopback filesystems to build. Of course, these "standard" HD images are really only testing, as few people have a 32MB HD!

so that the partition boot block (aka floppy boot block) is present at a predictable place. IOW there never was a valid boot block to copy.

Which boot block exactly are you talking about here - the MBR, or the boot block which is the first sector of every partition? I don't quite yet understand why the Compaq is failing on sys.

What I suggest, in order to make this more failsafe, is that either minix mkfs or sys copiy the bootblock explicitly (from file).

I've thought about adding another option that allows copying a boot sector as a seperate file, but sys (or makeboot) doesn't have that file available to it on a standard boot FD or HD system. It could be possible to compile-in a boot sector into makeboot. Once I fully understand what is happening on your Compaq, we can work to come up with a better solution.

Mellvik commented 2 years ago

Back to the rm hang problem:

This is repeatable only on one particular machine, so I'm going after it in that setting. I was hoping it had to do with the memcmp problem just reported, it wasn't.

-M

ghaerr commented 2 years ago

Back to the rm hang problem This is repeatable only on one particular machine, so I'm going after it in that setting

If this only repeats with the same compiled kernel on a different machine, then that's very strange - suspect possible memory problem? Does that machine have different amount of main memory (assume no XMS)?

Otherwise, I suspect may be able to reproduce with different CONFIG_FS_NR_EXT_BUFFERS= setting, as I have seen this problem before. Also possibly try recompiling with standard ibmpc-1440.config file for failing machine.

Mellvik commented 2 years ago

Thank you @ghaerr -

Back to the rm hang problem This is repeatable only on one particular machine, so I'm going after it in that setting

If this only repeats with the same compiled kernel on a different machine, then that's very strange - suspect possible memory problem? Does that machine have different amount of main memory (assume no XMS)?

Yes, it still happens only if XMS is not compiled in. And yes, the main memory is odd sized, 607K as reported by ELKS. The machine has 4M total, and the reason for the odd main memory size is the DiskOnChip, which is an interesting story by itself, for another time. I can actually remove the chip and increase main memory to 637k - I'll do that just for the heck of it. Otherwise, I suspect may be able to reproduce with different CONFIG_FS_NR_EXT_BUFFERS= setting, as I have seen this problem before. Also possibly try recompiling with standard ibmpc-1440.config file for failing machine.

Thanks, will check out that too! It would be good to figure this one out now that I have a repeatable setting.

—M

Mellvik commented 2 years ago

Update: This turns out to be a touch cookie indeed.

Given that the problem appears on this particular hardware only, not on QEMU, not on the compaqs, it would be reasonable to guess 'race condition'. However, since the problem goes (reliably, predictably) away when adding the right debug printks in the code, that seems unlikely. Also, if those printks don't have arguments, just text, they have no effect. So we have a stack pollution issue, a very elusive one: The printks can be in any of a number of routines and will have slightly different effects but always hang the process or the system. And frequently destroy the root file system, making it a very slow process.

More later.

-M

ghaerr commented 2 years ago

Also, if those printks don't have arguments, just text, they have no effect. So we have a stack pollution issue, a very elusive one: The printks can be in any of a number of routines and will have slightly different effects but always hang the process or the system.

Agreed. Now that you describe this new behavior, I have seen this before, and the kernel stack was being corrupted due to some strange behavior with the LED lights and keyboard controller code. I can't remember the exact issue, I'll look it back up, but it was finally identified and fixed by @tkchia. Does this particular system have anything unusual about the keyboard or its controller? Does the issue occur when no Caps Lock or other LED keys have ever been pressed?

Mellvik commented 2 years ago

Also, if those printks don't have arguments, just text, they have no effect. So we have a stack pollution issue, a very elusive one: The printks can be in any of a number of routines and will have slightly different effects but always hang the process or the system.

Agreed. Now that you describe this new behavior, I have seen this before, and the kernel stack was being corrupted due to some strange behavior with the LED lights and keyboard controller code. I can't remember the exact issue, I'll look it back up, but it was finally identified and fixed by @tkchia https://github.com/tkchia. Does this particular system have anything unusual about the keyboard or its controller? Does the issue occur when no Caps Lock or other LED keys have ever been pressed?

Yes, the keyboard is hardly ever touched (not quite true since it comes to rescue every time the FS has been corrupted, but still it's rare). I ran into an interesting case this morning in which one printk is just ignored, the next one - 2 lines down, prints. This is in minix/truncate.c which is one of the suspects.

Are any of the 'stack protection' options to gcc included in our toolchain btw? And what about the register declarations, are the ignored or actually used? I noticed gcc complains if you try to get the address of a register variable, which makes sense but may not be a reliable signal...

-M

ghaerr commented 2 years ago

Reviewing this issue a bit, I am now guessing that the kernel corruption, now more firmly evidenced by printk's not working (which occurred on the last similar issue), is ultimately happening to scribble on EXT memory buffers, which is causing the root file system corruption. It doesn't occur on XMS likely because the memory buffers are in non-addressable XMS memory.

Although not a for sure, we likely need to continue to look at the potential hardware incompatibility on your one system as the base cause, since none of this is (yet) repeatable on any other system. Thus the keyboard controller mention, which was the cause of the last similar problem. Any other dissimilarities between the failing system and your other PCs would be good to know, including whether the problem persists without any NIC or serial cards enabled/inserted (as well as possibly memory configuration changes).

Are any of the 'stack protection' options to gcc included in our toolchain btw?

Not really, I think there is a stack check option that emits a call to a stack checker that could be turned on, but ELKS itself checks the kernel stack pointer on every clock interrupt and system call entry, as well as checking the SS register for validity.

And what about the register declarations, are the ignored or actually used? I noticed gcc complains if you try to get the address of a register variable

I'm pretty sure register declarations are ignored with regards to forcing any register usage, but they're not entirely ignored as I've also seen the address of register variable error. I doubt we're dealing with a compiler issue. I will research the last issue a bit more and report back.

Mellvik commented 2 years ago

Thank you @ghaerr -

Although not a for sure, we likely need to continue to look at the potential hardware incompatibility on your one system as the base cause, since none of this is (yet) repeatable on any other system. Thus the keyboard controller mention, which was the cause of the last similar problem. Any other dissimilarities between the failing system and your other PCs would be good to know, including whether the problem persists without any NIC or serial cards enabled/inserted (as well as possibly memory configuration changes).

Agreed, that may well be the case. The ALI 6117C is a 386sx based SOC with most components on chip - interrupt controllers, timer/counter, DMA controllers, memory mapper, keyboard/mouse i/f (!) and much more. As mentioned in https://github.com/jbruchon/elks/issues/1303#issuecomment-1192324726 https://github.com/jbruchon/elks/issues/1303#issuecomment-1192324726 I've had IRQ5 problems with the system. Also mentioned before (this thread) is the main memory mapping, creating some odd main memory sizes (such as 607K available). This has been tested for the case in question and found to not influence the situation.

I doubt we're dealing with a compiler issue. I will research the last issue a bit more and report back.

Agreed - and thank you!

—M

ghaerr commented 2 years ago

Hello @Mellvik,

I have reviewed the previous issues relating to the ELKS kernel crashing, or acting strangely, and haven't yet been able to correlate to this problem. In order to eliminate the potential IRQ 1/keyboard controller as a culprit, try configuring for BIOS console instead of Direct console for this machine. The BIOS console doesn't interact directly with IRQ 1 nor the LEDs.

Will the sequence on your initial post always cause this issue? If so, that's good: the commands could be put in a shell script and executed immediately after boot, to more rapidly eliminate variables. I would suggest that this be done without an serial or NIC cards in the system as well.

Thank you!

Mellvik commented 2 years ago

Thank you @ghaerr. I have reviewed the previous issues relating to the ELKS kernel crashing, or acting strangely, and haven't yet been able to correlate to this problem. In order to eliminate the potential IRQ 1/keyboard controller as a culprit, try configuring for BIOS console instead of Direct console for this machine. The BIOS console doesn't interact directly with IRQ 1 nor the LEDs.

Yes, I will check that one in the morning. Agreed, eliminating possible cases is the only way forward. Will the sequence on your initial post always cause this issue? If so, that's good: the commands could be put in a shell script and executed immediately after boot, to more rapidly eliminate variables. I would suggest that this be done without an serial or NIC cards in the system as well.

I've been making some progress this past week. The problem is entirely repeatable, slight changes in the code change behaviour. I've got it to fail on the compaq, not with rm but mount, and mount also fails (now) on the SX. So it's not just unlink only anymore. Also, open with truncate will fail like rm, so truncate is still in the crosshairs. Moving a delay loop around in the code has turned out to be useful - and a race condition is still a possibility, although I have a hard time seeing how any of the events I'm looking at can be interrupted. If they were, my printk's would show it.

To be continued...

-M

ghaerr commented 2 years ago

The actual issue itself seems to be evolving... we know there's a kernel problem where the system can appear to hang due to not enough buffers. This manifests itself by sleeping the process in a deadlock, but the system itself is fine. That is, logging in from another terminal is possible.

With regards to this only happening on one system, you're now saying it's happening on two others? I would assume that this is likely because the problem wasn't iterated enough times on the newer systems, but shows up once in a while? And that the original system (with a different memory configuration) always hangs immediately?

Then there's the additional reported issue of kernel printk's not working... not sure how that fits in with the rest of the reported problems. Do printk's always not work, or only on the original problem system?

It appears the unlink system call calls the kernel rescheduler (possibly to allow other processes to release buffers). This means that the answer to your comment above about how events can be interrupted is that, if any other process is running (particularly ktcp, as it has select timeouts), the process can definitely get interrupted and switched out. Thus, all testing should be done on a bare bones system with as few other processes running as possible (suggest single-user mode, or init=/bin/sh in /bootopts).

I'm looking a bit at unlink versus ftruncate, and they're a bit complicated. So it would be nice to further narrow down all the reported issues before chasing too deep. You can answer these questions in your next report.

Thank you!

Mellvik commented 2 years ago

The actual issue itself seems to be evolving...

It is - with the addition and movement of printk's. we know there's a kernel problem where the system can appear to hang due to not enough buffers. This manifests itself by sleeping the process in a deadlock, but the system itself is fine. That is, logging in from another terminal is possible.

I don't think this is what we're looking at since the problem is immediate after boot, so plenty of buffers. The compaq actually hangs during boot because it's set up to mount 4 filesystems and stops in the 3rd. As to the type of hang, it varies - from silent hard hang via kernel messages about kernel stack corruption to process hangs and everything else working (and other variants) depending on the current printk's. Classic stack corruption indicators. With regards to this only happening on one system, you're now saying it's happening on two others?

Just one other - the 3rd is in hiatus to leave room for what I now call the ALI system, which is a single board computer the size of a NIC . I would assume that this is likely because the problem wasn't iterated enough times on the newer systems, but shows up once in a while?

I'm testing a kernel on the compaq only occasionally, And that the original system (with a different memory configuration) always hangs immediately?

With the printk settings I have now, both fail immediately (if differently) given the right commands. Then there's the additional reported issue of kernel printk's not working... not sure how that fits in with the rest of the reported problems. Do printk's always not work, or only on the original problem system?

That's 106 builds ago, but can be reproduced any time. It appears the unlink system call calls the kernel rescheduler (possibly to allow other processes to release buffers). This means that the answer to your comment above about how events can be interrupted is that, if any other process is running (particularly ktcp, as it has select timeouts), the process can definitely get interrupted and switched out. Thus, all testing should be done on a bare bones system with as few other processes running as possible (suggest single-user mode, or init=/bin/sh in /bootopts).

True, but the scheduler gets called only if there is a lack of resources or contention. There is not AFAIK. Then again I may be missing something.

More later.

M

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.

ghaerr commented 2 years ago

The compaq actually hangs during boot because it's set up to mount 4 filesystems and stops in the 3rd

Three mounts is a lot of activity - introduces variables in perhaps FAT and MINIX filesystems. Recommend leaving all auto-mounts off for presentation of bug reproduction. If this only reproduces when FAT mounts are made, for instance, we could be getting kernel stack overflow... we saw this before on @tyama501's machine where BIOS disk I/O used tons of stack, and increasing KSTACK_BYTES by 100 in fixed the problem. You can also turn on CONFIG_STRACE on the failing system and it will display every system call and its associated kernel stack used.

but the scheduler gets called only if there is a lack of resources or contention. There is not AFAIK. Then again I may be missing something.

Yes - my comment was from looking at the kernel source for unlink: it explicitly calls schedule, which will reschedule the current process immediately:

int minix_unlink(register struct inode *dir, char *name, size_t len)
{   
    int retval;
    register struct inode *inode;
    struct buffer_head *bh;
    struct minix_dir_entry *de;

    goto init_loop;
    do {
        iput(inode);
        unmap_brelse(bh);
        schedule();  <--- not sure why, but here it is
  init_loop: 
        retval = -ENOENT;
        inode = NULL;
        bh = minix_find_entry(dir, name, len, &de);
        if (!bh) goto end_unlink2;
        if (!(inode = iget(dir->i_sb, (ino_t) de->inode))) goto end_unlink1;
        retval = -EPERM;
        if (S_ISDIR(inode->i_mode)) goto end_unlink;
    } while (de->inode != inode->i_ino);

Just trying to help get your problem debugged. We need a reproduction with the barest boot conditions possible, no extra processes running, no auto-mounts, no extra cards inserted, if possible. If we can reproduce under those conditions, it will be a lot faster when I have to go searching the kernel code for problems...

I suppose the good news is that you're finding this on other systems, which means we should hopefully get a reproduction on QEMU at some point, which will help me debug it, since I have little access to real hardware.

Thank you!

Mellvik commented 2 years ago

Thank you @ghaerr, Being forced to describe the issues and observations is helpful in itself, and discussing them even more so. This rabbit hole turned out to be deep (again) - and no rabbit ...

The compaq actually hangs during boot because it's set up to mount 4 filesystems and stops in the 3rd

Three mounts is a lot of activity - introduces variables in perhaps FAT and MINIX filesystems.

Agreed. AAMF I was elated to see the problem on a second machine, so I kept the setup while verifying that mount triggers the problem on the first system too - in certain circumstances. Recommend leaving all auto-mounts off for presentation of bug reproduction. If this only reproduces when FAT mounts are made, for instance, we could be getting kernel stack overflow... we saw this before on @tyama501's machine where BIOS disk I/O used tons of stack, and increasing KSTACK_BYTES by 100 in fixed the problem.

That's one I haven't tried yet, will check that. You can also turn on CONFIG_STRACE on the failing system and it will display every system call and its associated kernel stack used.

Good idea, I've never used this one - being sceptical about its effect on timing and performance. No is the time to change that. but the scheduler gets called only if there is a lack of resources or contention. There is not AFAIK. Then again I may be missing something.

Yes - my comment was from looking at the kernel source for unlink: it explicitly calls schedule, which will reschedule the current process immediately:

int minix_unlink(register struct inode dir, char name, size_t len) {
int retval; register struct inode inode; struct buffer_head bh; struct minix_dir_entry *de;
goto init_loop;
do {
    iput(inode);
    unmap_brelse(bh);
    schedule();  <--- not sure why, but here it is
init_loop: retval = -ENOENT; inode = NULL; bh = minix_find_entry(dir, name, len, &de); if (!bh) goto end_unlink2; if (!(inode = iget(dir->i_sb, (ino_t) de->inode))) goto end_unlink1; retval = -EPERM; if (S_ISDIR(inode->i_mode)) goto end_unlink; } while (de->inode != inode->i_ino); The thing is - the goto before the loop (this code is full of them, seems like a Torvalds speciality) ensures that schedule() is skipped in 'normal' cases. I believe the schedule call is there because another process may be waiting for the buffer just released (2nd pass or later). Just trying to help get your problem debugged. We need a reproduction with the barest boot conditions possible, no extra processes running, no auto-mounts, no extra cards inserted, if possible. If we can reproduce under those conditions, it will be a lot faster when I have to go searching the kernel code for problems...

Agreed and all feedback/dialog is appreciated. I've been using QEMU for comparing behaviour and in tests for reproducibility. No luck in that dept so far, not even when taking the QEMU performance setting down to 10 or 1%. I suppose the good news is that you're finding this on other systems, which means we should hopefully get a reproduction on QEMU at some point, which will help me debug it, since I have little access to real hardware.

It's indeed good. Further, the fact that I'm seeing it in situations other than truncate() is good or bad depending on perspective. In any case its part of the digging. The rabbit is still in the wind.

-M

Randrianasulu commented 2 years ago

I haven't tried personally running ELKS in pcem or 86box, but they both provide more accurate cpu timing or even hdd timing (in 86box git) emulation ..

Mellvik commented 2 years ago

Thank you Andrew,

I didn't realize 86box can run on MacOs. It does and it even has a setting for emulating the ALI processor I'm using.

Very encouraging.

—M

jul. 2022 kl. 00:47 skrev Andrew Randrianasulu @.***>:

I haven't tried personally running ELKS in pcem or 86box, but they both provide more accurate cpu timing or even hdd timing (in 86box git) emulation ..

— Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/1367#issuecomment-1196055227, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOAXGTFICMVL7JB3MMDVWBTJPANCNFSM53KDGGZQ. You are receiving this because you were mentioned.

Mellvik commented 2 years ago

@ghaerr, the buffer bug is still elusive and progress so far is more eliminating what isn't the problem than getting really closer to it. A couple of possible bugs but no cigar.

While chasing down the differences between cases that work (XMS buffers) and cases that don't (ext buffers), here is some output I was hoping could provide a clue. The first sample (ignore the extra stuff, it's from printks indicating where we're at) is the ext64 case which fails in an endless loop in truncate_direct (td0). The numbers after the colon on the UNMAP: line are the parameters to xms_fmemcpyw (in map_buffer, fs/buffer.c). They are very different between the two cases - which is expected, but is there anything that stands out to you? The file being unlinked is named 'pp', a copy of /bootopts .... < 500 bytes.

EXEC: '/bin/rm' env 139
REMAP: 14
REMAP: 19
REMAP: 13
BLK 1593 read 400023c0:3
BLK 1594 read 23c0:3
BLK 1597 read 600023c0:3
BLK 1598 read 580023c0:3
BLK 1599 read 3c0023c0:3
BLK 1600 read 5c0023c0:3
REMAP: 14
BLK 20 read 440023c0:3
map:   17 try 8
UNMAP: 25 <- 8:6400:23c0:8826:13a0
REMAP: 14
UNLINK 'pp'
REMAP: 14
trEtd0map:   5 try 9
UNMAP: 18 <- 9:4800:23c0:8c26:13a0
td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0td0 etc. forever.

The second case is with XMS buffers active and works fine:

EXEC: '/bin/rm' env 139
REMAP: 14
REMAP: 19
REMAP: 13
BLK 1593 read 100000:4000
BLK 1594 read 100000:0
BLK 1597 read 100000:6000
BLK 1598 read 100000:5800
BLK 1599 read 100000:3c00
BLK 1600 read 100000:5c00
REMAP: 14
BLK 20 read 100000:4400
map:   17 try 8
UNMAP: 25 <- 8:6400:0:10:8876
REMAP: 14
UNLINK 'pp'
REMAP: 14
trEtd0map:   5 try 9
UNMAP: 18 <- 9:4800:0:10:8c76
td0td0td0td0td0td0trDtrItrXmfi:mb;map:   2 try 10
UNMAP: 19 <- 10:4c00:0:10:9076

Any input would be welcome.

Thank you -M

ghaerr commented 2 years ago

Hello @Mellvik,

Actually these two logs look remarkably similar - so much so that the block reads and buffer map/unmaps are identical, right until the moment where the system hangs (or rather, loops indefinitely in truncate, it seems).

I haven't had time to properly analyze MINIX truncate, to see 1) why it loops with a possible reschedule in the first place, and 2) how it might be hanging. However, in general ,the L1 buffer map/unmap is called (using map_buffer and unmap_buffer or unmap_brelse), to grab a disk buffer (in this case, looks like one of the disk mapping blocks for the inode being truncated), and then copies that external or XMS buffer into a local (DS-addressable) L1 buffer. The map_buffer routine itself, since we're limited in the kernel to only 12 or so of them) will sleep the process if there aren't enough free buffers, until unmap_buffer frees a buffer. So, in general, if any kernel process requires more than 12 buffers, and only a single process is using them, the system will hang.

I don't understand truncate well enough to see why it might need > 12 buffers: certainly there's buffers needed for each open directory in the namei directory filename processing routines, then another one for the inode itself, then probably one for each of the inode indirect block buffers, etc. But with a < 512 byte file, there's no indirect blocks, since MINIX keeps the first seven in the inode itself.

Also, there is no difference in the number of L1 buffers when running EXT vs XMS external buffers. It is possible, for some reason, that we have a kernel bug where an external buffer is not being released properly. You might test this be setting the EXT buffers to 128, instead of 64, and see whether that changes anything. I think ELKS will run with even more EXT buffers, for testing (all will be in main memory).

Finally, there may be a bug in truncate itself, where for some reason it loops thinking that it needs more L1 (mapped) buffers than it actually needs, and this is causing the busy loop.

The numbers after the colon on the UNMAP: line are the parameters to xms_fmemcpyw (in map_buffer, fs/buffer.c).

Instead of displaying the buffer addresses, suggest putting more printks in truncate, especially around the buffers it seems to want. The buffer block number can be printed using bh->b_blocknr, to better understand exactly which inode block truncate is looking at.

I assume there are NO other running processes on the system (no network, etc), right? This will matter with the rescheduling, which I discussed previously. If you have various filesystems mounted, it is possible that could be a problem also, as they could possibly be using an L1 buffer for each of them. It is entirely possible we are just running out of L1 buffers due to too many kernel resources being used (like active mounts, for instance). (If that's the case, and the problem goes away without the 4 mounted filesystems, then we can possibly correct that by having the filesystem re-grab the superblock for each, rather than keeping them mapped into L1). I haven't actually checked that they are locked in L1 space, just a guess.

I'll look more into this tomorrow.

Thank you!

Mellvik commented 2 years ago

Thank you @ghaerr - it seems I need a new angle on this one - including clearing out xms_fmemcpyw as a potential suspect.

Actually these two logs look remarkably similar - so much so that the block reads and buffer map/unmaps are identical, right until the moment where the system hangs (or rather, loops indefinitely in truncate, it seems).

Yes, the elusiveness of this problem is frustrating. It still seems that the problem is stack pollution (see below) so the symptoms appear in other places than the problem. I haven't had time to properly analyze MINIX truncate, to see 1) why it loops with a possible reschedule in the first place, and 2) how it might be hanging. However, in general ,the L1 buffer map/unmap is called (using map_buffer and unmap_buffer or unmap_brelse), to grab a disk buffer (in this case, looks like one of the disk mapping blocks for the inode being truncated), and then copies that external or XMS buffer into a local (DS-addressable) L1 buffer. The map_buffer routine itself, since we're limited in the kernel to only 12 or so of them) will sleep the process if there aren't enough free buffers, until unmap_buffer frees a buffer. So, in general, if any kernel process requires more than 12 buffers, and only a single process is using them, the system will hang.

I don't understand truncate well enough to see why it might need > 12 buffers: certainly there's buffers needed for each open directory in the namei directory filename processing routines, then another one for the inode itself, then probably one for each of the inode indirect block buffers, etc. But with a < 512 byte file, there's no indirect blocks, since MINIX keeps the first seven in the inode itself.

I've spent what feels like aeons in truncate by now and while it seems to be doing some strange things (and missing a cast to int in V1_trunc_direct, for the DIRECT_BLOCK macro, which I at one point figured was the problem), truncate is in the clear (btw truncate is called from only one place in the code and '__u32 i_size' is always 0, so the DIRECT_BLOCK macro is superfluous for now). The thing with truncate is that it turns out to be the place where the symptoms are easiest to trigger and reproduce, but its not the only place (mount is another, not as predictable, see below). V1_trunc_direct can fail at any point in the code depending on where the printks are placed. It can also be made to work reliably if the printks are place right - and some of the printks will be mysteriously skipped. BTW - while indirect and double indirect blocks are not involved (small file), they are always called anyway. Also, there is no difference in the number of L1 buffers when running EXT vs XMS external buffers. It is possible, for some reason, that we have a kernel bug where an external buffer is not being released properly. You might test this be setting the EXT buffers to 128, instead of 64, and see whether that changes anything. I think ELKS will run with even more EXT buffers, for testing (all will be in main memory).

I will check that. I've been using 32,64 and 80 EXT buffers so far, no difference. Finally, there may be a bug in truncate itself, where for some reason it loops thinking that it needs more L1 (mapped) buffers than it actually needs, and this is causing the busy loop.

The numbers after the colon on the UNMAP: line are the parameters to xms_fmemcpyw (in map_buffer, fs/buffer.c).

Instead of displaying the buffer addresses, suggest putting more printks in truncate, especially around the buffers it seems to want. The buffer block number can be printed using bh->b_blocknr, to better understand exactly which inode block truncate is looking at.

See above. Actually, buffer.c emits pretty much the same symptoms. Enabling the debug() statements in the code, and the problem goes away, then disabling/enabling individual statements change behaviour, even if the changes are made in parts of the code that isn't being executed. I assume there are NO other running processes on the system (no network, etc), right? This will matter with the rescheduling, which I discussed previously. If you have various filesystems mounted, it is possible that could be a problem also, as they could possibly be using an L1 buffer for each of them. It is entirely possible we are just running out of L1 buffers due to too many kernel resources being used (like active mounts, for instance). (If that's the case, and the problem goes away without the 4 mounted filesystems, then we can possibly correct that by having the filesystem re-grab the superblock for each, rather than keeping them mapped into L1). I haven't actually checked that they are locked in L1 space, just a guess.

The system is idle and other processes don't affect the problem. I have networking running most of the time, stopping the network and all gettys/logins don't change anything. Neither do the mounts, but the mount operation will sometimes hang the system. I haven't been able to make this one entirely predictable.- At first it seemed like the 3rd mount would always hang, then it was a specific mount (i.e. a file system), but the path is an unreliable (and slow) source because the crashes frequently screw up either the file system or the bootblock or both. BTW I have an updated version of make boot that helps - checks for the magic number at the end of the source bootblock and copies only the bootblock if requested. I don't think we're running out of L1 buffers since the system is idle most of the time and sometimes running only 5 processes. Also, a potential shortage of L1 buffers should not be affected by the choice of XMS or EXT L2 buffers, right?

I keep looking for local (stack allocated) pointers begin accidentally used as variables (frequent casting make them hard to catch for the compiler) and long/int mixups.

Thanks for looking into this.

—M

ghaerr commented 2 years ago

Hi @Mellvik,

I'm getting confused again on this issue, we might possibly be talking about different things - is the problem that rm hangs, or something else? My long response above is only to do with why a hang may be occurring, nothing else.

the mount operation will sometimes hang the system.

You're saying running mount hangs the system? Is this a separate problem, or to do with the issue of rm hanging?

Are all these duplicating regularly only on the one "bad" machine you have?

I understand you're saying that there is a separate problem relating to adding printk's that result in displaying garbage. When that happens, can you run that version on QEMU with the same result, or does this also only happen on the "bad" system?

a potential shortage of L1 buffers should not be affected by the choice of XMS or EXT L2 buffers, right?

Correct.

I don't think we're running out of L1 buffers since the system is idle most of the time and sometimes running only 5 processes.

From your log above - we are most definitely running out of L1 buffers, that is the reason truncate is looping. I explained that map_buffer will sleep the process, and some other process is issuing wakeup. This is the reason why we need to eliminate ALL other processes to track this down. (That is, if the problem is that of rm hanging, versus the other 3 problems mentioned in this issue).

Mellvik commented 2 years ago

Thank you @ghaerr -

yes, there is plenty of room for confusion and I've been at this one for so long that a lot of the parameters are easily taken for granted. So a few things (my perspective):

There is only one problem - stack corruption
rm hang is a symptom, a consequence of the above. There seem to be no problems in truncate or unlink.
the printks demonstrate this: Their presence, content (# of parameters) and placement change the layout of the stack in (say) truncate and thus the behaviour.
this can be repeated in buffer.c too. Leaving truncate out of the equation (i.e. no debug printk's in truncate routines), instead adding DEBUG to buffer.c and rm works just fine. Until we remove a couple of the debug statements - and we have a new variant of symptoms.
IOW, the nature of rm's failure varies with the placement of the debug statements: Some times small files can be deleted, some times all files can be deleted, some times the process loops, some times the kernel hangs silently, some times it panics (say, in brelse) - or like this:

elks17# rm /linux; mv linux /

panic: Process -31913 exceeded kernel stack limit! magic c483

SYSTEM HALTED - Press CTRL-ALT-DEL to reboot:

this behaviour is on the SX (ALI) system only, and goes away when XMS buffers are used.
The mount problem has (AFAICT) the same origin: A result of the stack corruption occurring if the 'right' debug statements are in the 'right' places.
The mount problem is not affected by XMS buffers, it's always the same and reproducible on the Compaq too, but not on QEMU.
Like with rm, the nature of the mount problem changes with the amount of debug stuff active - and placement/params etc. Initially, mount would hang predictably when mounting (say) /dev/hda3 but not /dev/hda2 (both minix file systems). Some changes later it will not hang at all when mounting any file system, but hang when running the command without arguments to show available space - iff a FAT file systems is mounted.

At one point I was thinking the reason might be related to A20, memory mapping and the 1M wraparound 'mechanism' - since it seems to be somewhat hw/system dependent. But since ELKS is not using the HMA, that cannot be it. I did notice though, that while both the compaq and the SX, when running with XMS buffers, report 'A20 was off, now on' while QEMU reports 'A20 was on, now on'. A curiosity - and I haven't pulled out the 286 to get a 3rd platform to test on - yet.

So, having chased the problem through every statement the truncate routines and seen some really head scratching behaviour, I'm back to buffer.c and the buffer management routines - not really expecting to find the problem, but to find a pointer to it.

Thanks for clarifying the L1 mechanism. You're right - wait/schedule may indeed have to do with it.

BTW - truncate is looping only when its stack is corrupted. When working normally, V1_trunc_direct loops through all (7) blocks of the zone even when the file size is smaller than the first block, then exits. Room for improvement, but then again, unlink or truncate performance was never an issue.

—M

aug. 2022 kl. 17:02 skrev Gregory Haerr @.***>:

Hi @Mellvik https://github.com/Mellvik,

I'm getting confused again on this issue, we might possibly be talking about different things - is the problem that rm hangs, or something else? My long response above is only to do with why a hang may be occurring, nothing else.

the mount operation will sometimes hang the system.

You're saying running mount hangs the system? Is this a separate problem, or to do with the issue of rm hanging?

Are all these duplicating regularly only on the one "bad" machine you have?

I understand you're saying that there is a separate problem relating to adding printk's that result in displaying garbage. When that happens, can you run that version on QEMU with the same result, or does this also only happen on the "bad" system?

a potential shortage of L1 buffers should not be affected by the choice of XMS or EXT L2 buffers, right?

Correct.

I don't think we're running out of L1 buffers since the system is idle most of the time and sometimes running only 5 processes.

From your log above - we are most definitely running out of L1 buffers, that is the reason truncate is looping. I explained that map_buffer will sleep the process, and some other process is issuing wakeup. This is the reason why we need to eliminate ALL other processes to track this down. (That is, if the problem is that of rm hanging, versus the other 3 problems mentioned in this issue).

— Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/1367#issuecomment-1202764132, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOGXSQN26BH3RXCSOHDVXEZ6RANCNFSM53KDGGZQ. You are receiving this because you were mentioned.

ghaerr commented 2 years ago

So basically, all of these issues (except mount) occur only on the SX/ALI system, right?

If so, we need to consider BIOS or hardware issues (including possible I/O delays only seen on that computer). It is possible that the SX BIOS is using too much stack, or that the stack is being trashed, but that this is seen only under the following circumstances:

Because truncate will call reschedule, and that L1 buffers can cause a hang (remember, I can duplicate the rm hang issue on QEMU), then what possibly is happening is reschedule is trashing the stack somehow. The scenario we are discussing is one of the few cases coded in the kernel where a reschedule is coded, that may return directly to the calling application, rather than sleeping which is the result in almost all other reschedules. The reason for this on the SX system could be the speed of its I/O (rather than a BIOS stack issue), that just happens to cause a reschedule() bug to show.

To test more along these lines, perhaps include more (or less) schedule() calls in various places. Technically, there should be no issue calling schedule() from the kernel at any point, but perhaps that is the problem here: schedule will switch stacks, then switch back, corrupted. All other applications end up calling wait() which will cause a call to schedule AFTER that process tries to return from the system call. See what I mean?

Another possibility would be to comment out the schedule() call and see if that changes anything. Worst case, we might end up with a busy-loop or hang problem, but there would not be any stack corruption, and printk's should work.

Mellvik commented 2 years ago

Thank you @ghaerr - That's a couple of very interesting angles! I didn't realize that schedule() would switch stacks in a case like this, thanks - that's very useful.

And I was under the impression that the rm hang situation on QEMU wasn't reproducible- is that right?

So basically, all of these issues (except mount) occur only on the SX/ALI system, right?

If so, we need to consider BIOS or hardware issues (including possible I/O delays only seen on that computer). It is possible that the SX BIOS is using too much stack, or that the stack is being trashed, but that this is seen only under the following circumstances:

Because truncate will call reschedule, and that L1 buffers can cause a hang (remember, I can duplicate the rm hang issue on QEMU), then what possibly is happening is reschedule is trashing the stack somehow. The scenario we are discussing is one of the few cases coded in the kernel where a reschedule is coded, that may return directly to the calling application, rather than sleeping which is the result in almost all other reschedules. The reason for this on the SX system could be the speed of its I/O (rather than a BIOS stack issue), that just happens to cause a reschedule() bug to show.

To test more along these lines, perhaps include more (or less) schedule() calls in various places. Technically, there should be no issue calling schedule() from the kernel at any point, but perhaps that is the problem here: schedule will switch stacks, then switch back, corrupted. All other applications end up calling wait() which will cause a call to schedule AFTER that process tries to return from the system call. See what I mean?

Another possibility would be to comment out the schedule() call and see if that changes anything. Worst case, we might end up with a busy-loop or hang problem, but there would not be any stack corruption, and printk's should work.

Yes, this is a fresh start, I'll start off with the latter and take it from there. BTW: the speed is indeed very different between the SX and the compaq. The SX feels an order of magnitude or more faster, almost like QEMU except when booting from floppy.

Thanks.

--M

ghaerr commented 2 years ago

panic: Process -31913 exceeded kernel stack limit! magic c483

This message could mean old-fashioned kernel stack overflow.

Lets increase the kernel stack size, just in case the SX system BIOS happens to be using a lot. Change KSTACK_BYTES in include/linuxmt/limits.h from 640 to 740, and lets see what happens.

I've been tracing through the truncate code, and its quite complicated, with all sorts of routines called, many of which grab more L1 buffers.

ghaerr commented 2 years ago

(btw truncate is called from only one place in the code and '__u32 i_size' is always 0, so the DIRECT_BLOCK macro is superfluous for now).

Yes, inode->i_size is set to 0 by iput which indicates that the kernel wants the size truncated to 0 (this is the path taken when minix_unlink is called, after decrementing the nlink count).

However, minix_truncate can also be (indirectly) called through the iop->truncate function pointer. This occurs in a couple places, in the truncate system call, where the i_size value is set to the end result truncated size requested, and also on the open system call when O_TRUNC is specified (which happens during creat). Thus, the DIRECT_BLOCK macro is correct in its use.

ghaerr commented 2 years ago

I didn't realize that schedule() would switch stacks in a case like this, thanks - that's very useful.

It may not switch stacks, depends on what else is running. tswtch is the function that actually switches stacks, called from within schedule().

I was under the impression that the rm hang situation on QEMU wasn't reproducible- is that right?

None of us can get a "basic" (removing a single file) rm hang on QEMU, but I have an old TODO list on a more complex rm -rf hang which, because it uses up all the L1 buffers, and there are no other processes running, will produce a hang every time in QEMU.

Mellvik commented 2 years ago

panic: Process -31913 exceeded kernel stack limit! magic c483

This message could mean old-fashioned kernel stack overflow.

Lets increase the kernel stack size, just in case the SX system BIOS happens to be using a lot. Change KSTACK_BYTES in include/linuxmt/limits.h from 640 to 740, and lets see what happens.

Good idea. I don't think we really have a stack problem since this happens only when there is an erroneous loop (presumably) eating the stack, and it's good to be able to rule it in or out. I've been tracing through the truncate code, and its quite complicated, with all sorts of routines called, many of which grab more L1 buffers.

Indeed it is. And this may be a good time to take a close look for possible improvements - not the least reducing the pressure on L1 buffers. After all, when deleting, the only need for buffers is for the inodes - and for small files in particluar ... etc etc. Let's have a close look after we get the stack issue resolved.

Thank you.

-M

Mellvik commented 2 years ago

(btw truncate is called from only one place in the code and '__u32 i_size' is always 0, so the DIRECT_BLOCK macro is superfluous for now).

Yes, inode->i_size is set to 0 by iput which indicates that the kernel wants the size truncated to 0 (this is the path taken when minix_unlink is called, after decrementing the nlink count).

However, minix_truncate can also be (indirectly) called through the iop->truncate function pointer. This occurs in a couple places, in the truncate system call, where the i_size value is set to the end result truncated size requested, and also on the open system call when O_TRUNC is specified (which happens during creat). Thus, the DIRECT_BLOCK macro is correct in its use.

Aaaah, of course. Thanks.

Now, here's the thing with the DIRECT_BLOCK macro: it returns an __u32 - which in the subsequent 2 macros is being cast to int. However, in V1_trunc_direct() it's being used twice as if it were an int. This seems to me like a bug that will go unnoticed as long as we're deleting files (value 0), but kick in when truncating a file.

—M

Mellvik commented 2 years ago

I was under the impression that the rm hang situation on QEMU wasn't reproducible- is that right?

None of us can get a "basic" (removing a single file) rm hang on QEMU, but I have an old TODO list on a more complex rm -rf hang which, because it uses up all the L1 buffers, and there are no other processes running, will produce a hang every time in QEMU.

OK; that's something to look into while were at it. If this hangs in QEMU it should do so on the real thing too (somewhat depending on what else is going on obviously).

Talking about QEMU - I'm seeing interesting errors that are different from the ones on other platforms, but likely related, such as on the attached screenshot. This is from a freshly mounted (and fsck'd) HD file system, actually an identical copy of the one I'm using on the SX. It may be interesting to enable DEBUG_BLK and see what's going on. This is repeatable.

—M

ghaerr commented 2 years ago

Hello @Mellvik, can you (re)post the QEMU screenshot showing the associated error?

Now, here's the thing with the DIRECT_BLOCK macro: However, in V1_trunc_direct() it's being used twice as if it were an int

The macro, which converts a disk file size into a MINIX block number by shifting right 10 bits, will be automatically converted to an 'int' if stored, or, when compared to an int, the other operand will be automatically converted to a long. In the latter case, the gcc compiler is very good about generating code, and likely doesn't actually compare the upper 16 bits. I see your point that this macro could likely use an (int) cast, however. We can look at the code generated and see whether an (int) casting helps, after solving this issue. The reason it isn't buggy is that the max block number for any MINIX file will always fit in 16 bits (65K), which, when shift left back to a long, is the max filesystem size: 65M.

Mellvik commented 2 years ago

aug. 2022 kl. 16:14 skrev Gregory Haerr @.***>:

Hello @Mellvik https://github.com/Mellvik, can you (re)post the QEMU screenshot showing the associated error?

I edited the post … let me know if it doesn't work.

—m

ghaerr commented 2 years ago

Hello @Mellvik,

I have duplicated the QEMU-staged "mnb: still zero bit!8192" bug on my system. I'm not sure if this is a relic of just having run out of disk space, or whether this is related to a problem keeping up with the MINIX allocated-block bitmap. I have noted this, and will treat this bug separately from this issue. I plan on looking further into it shortly. This is very likely not related to this open issue at all, but I'll test a bit more and report back.

In the meantime, definitely lets plan on continued testing as you've described above, separate fro this new issue, thanks!

Mellvik commented 2 years ago

Thank you @ghaerr.

Agreed - and as the mount commands show, there is ample space on the fs.

Also I agree it's likely a separate issue, it would be interesting to see if it goes away if XMS buffers are compiled in. I haven't checked that. When we find the bug this thread is focusing on, we'll know whether they are related.

Apropos - I've looked at calls to schedule() as a possible part in the scenario and realised I've been down that path before, after we talked about it a month or so ago. The thing is - and as far as I can tell from the tracings, schedule() is never called in the regular path of an unlink, neither is sleep_on. wait_on_buffer() is called a lot but the buffer is never locked in any of my cases. I still have some work to do to understand the L1 operation in this.

—M

aug. 2022 kl. 19:06 skrev Gregory Haerr @.***>:

Hello @Mellvik https://github.com/Mellvik,

I have duplicated the QEMU-staged "mnb: still zero bit!8192" bug on my system. I'm not sure if this is a relic of just having run out of disk space, or whether this is related to a problem keeping up with the MINIX allocated-block bitmap. I have noted this, and will treat this bug separately from this issue. I plan on looking further into it shortly. This is very likely not related to this open issue at all, but I'll test a bit more and report back.

In the meantime, definitely lets plan on continued testing as you've described above, separate fro this new issue, thanks!

— Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/1367#issuecomment-1204238608, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOHAUHJUMYASVHNAWE3VXKRH7ANCNFSM53KDGGZQ. You are receiving this because you were mentioned.

ghaerr commented 2 years ago

With regards to the "new" bug, lets call it "out of disk space" bug:

Also I agree it's likely a separate issue, it would be interesting to see if it goes away if XMS buffers are compiled in. I haven't checked that.

I'm testing with XMS, and the bug is present. I know what it is: MINIX keeps an 8192-byte bitmap, which equals 8192*8 = 64K bits, where each bit is a 1K block in the max 64K minix V1 filesystem size. There are actually two bugs here: the first bug is that the system isn't checking when the bitmap is >= 8192 properly, and gives the incorrect error "mnb: bit still set!8192" in that case. I am working on fixing that now.

as the mount commands show, there is ample space on the fs.

That's the second bug, I have just realized. I am testing on a 32Mb filesystem, and it is running out of space and giving the first error message. I have not been able to (yet) duplicate the second bug, which is saying there's no more disk space when there should be.

I will fix the "first" portion of the out of disk space bug in a PR, then you can try to repeat the "second" out of disk space bug.

as far as I can tell from the tracings, schedule() is never called in the regular path of an unlink,

Please comment out schedule() in the truncate code, and let me know what happens.

neither is sleep_on

Please add a printk before the "sleep_on" in map_buffer to ensure the process is never put to sleep. This will let us know for sure whether you are correct in the process never sleeping (which means never running out of L1 buffers).

I still have some work to do to understand the L1 operation in this.

L1 operation is actually quite simple - map_buffer and unmap_buffer take the buffer, and copy its data contents to an L1 region in the kernel data segment. The bh_data is then directly addressable from the other kernel routines, without using a far pointer. If there are no L1 buffers left, the process is put to sleep. Its about that simple. We just need to prove that we're not ever sleeping, to eliminate a bunch of variables for your issue. (When the buffer is later unmapped, the buffer contents are copied back to the "real" L2 buffer).

If you comment out schedule and printk sleep_on, we'll know much more and be able to more rapidly squash this bug!

Mellvik commented 2 years ago

Finally some (real) progress on this issue:

The problem is a ‘memory leak’. 4 bytes at a fixed address in the kernel text segment are being overwritten at some point during the boot & startup process. Thus the erratic symptoms. Changing one line of code somewhere (before that address) changes the behaviour completely. (BTW - memory error has been ruled out - again, via extensive testing).
system-nm.map and the hd command were instrumental in finally locating the problem.

The location and content are always the same, c7 ef 00 f0 @ 0x16238 in the kernel text segment. The following example shows uncorrupted (first) then corrupted data.

elks17# hd d0:16238#4  <— floppy booted
006f38: c2 b8 fe ff eb cc 56 57  55 89 e5 83 ec 06 8d 46  ......VWU......F
elks17# hd d0:16238#4.  <—— HD booted
006f38: c7 ef 00 f0 eb cc 56 57  55 89 e5 83 ec 06 8d 46  ......VWU......F

This happens only on the SX machine, never on Compaq (386dx) or qemu. The Compaq has a hw config that is very similar to the SX.
Changing from EXT to XMS buffers only changes the layout of the code. The overwrite is at exactly the same place, now corrupting some code I haven’t triggered yet.

Now the good news, that may explain why the problem occurs only on this particular piece of hardware:

When booting from floppy, the problem is gone. Mounting hd file systems thereafter is no problem. Booting from floppy with root on HD (via bootopts) is no problem either.
The size of the hd minix fs doesn’t matter (I’ve tried only 2)

So the problem appears ONLY when booting the SX from HD, which should limit the scope of the continued hunt. So the boot device matters, the root device does not. Even at this level, the difference between the Compaq and the SX is limited. Both use an IDE disk channel, both use the same IDEtoCF adapter and the same CF card (512MB). Possibly something DPT/DPB related or data returned from BIOS calls?

I’m looking at that right now. In any case - as it turns out, the name of this issue is wrong - this is not a buffer problem, and most unlikely a fs problem. We're getting there.

To be continued.

-M

ghaerr commented 2 years ago

Hello @Mellvik,

Wow, you really have made some progress on this!! I'm interested in how you were able to pinpoint the memory locations that are being overwritten?

I'd like to understand a bit more about exactly what this location is in the kernel, and whether it has any special meaning. In doing so, we need to "normalize" our segment:offset pairs for use with hd, and in testing, I've already ran into a possible problem with hd that I'm going to look into.

With regards to the normalizing, let's look at the special address "d0:16238". The offset value is 64k (=10000) greater than will fit in an 8086 offset, so this is the same as "d0:6238". What we want to do is look at the kernel startup banner and use the kernel code segment value from the following line:

Kernel text at 2d0:0000, ftext 1210:0000, data 1512:0000, top 9fc0:0, 490K free

So, using 2d0, we should be able to (re)write the d0:6238 value as 2d0:\. (Technically, to get the 20-bit memory address, the segment value is shifted left 4 bits and the offset added: d0:6238 = D00 + 6238 = 0:6F38). Thus, d0:6238 is the same as 2d0:4238.

If I run the following to get a sorted kernel map file:

cat system-nm.map | sed -e '/&/d; /!/d' | sort > file

Looking in the file for somewhere near '00014238' shows the affected location as in the middle of the sock_read code:

0001418d t sock_write
000141eb t sock_read
00014249 t sock_release
0001427b t sock_close

Can you check with the normalized (likely 2D0) segment and see what the normalized offset is, and its position in the kernel map file?

If you find the actual memory leak is at 2D0:6238, which corresponds to the kernel map file address 00016238 (which may be hard to do given the possible hd bug below), then, looking at the kernel map file, 00016238 is (guess what?) right in the middle of minix_unlink, which fits with why the kernel became unstable after an unlink syscall:

00015fba T minix_rmdir
0001613a T minix_unlink
0001629d T minix_symlink

On another note, when trying hd, I try to rewrite the d0:6238 address with segment 2d0: I find that hd seems to have a bug:

# hd d0:6238#4
006f38: 12 92 83 c4 08 eb de ba  f7 ff eb d9 ba ea ff eb  ................
# hd 2d0:6638#4
006f38: 04 ff 76 fc 8b 5e 08 ff  77 20 e8 a2 ed 83 c4 04  ..v..^..w ......

Note that both addresses passed to hd give the same 20-bit 006f38 memory address (which is incorrect), but it is showing the contents differently, as expected. I'll report on this bug shortly.

Thank you!

Mellvik commented 2 years ago

Hi @ghaerr, I just noticed the omission in my report, please ignore the '2d0' vs 'd0' segment difference.

I forgot to correct that - they’re from two different runs, different builds. Some times the kernel txt segment is 2d0, sometimes it’s d0.

So I did this the hard way. The address is in the middle of kernel code, more specifically (depends on the build) in V1_trunc_direct unless we build with XMS buffers, then it’s a bit further out if I remember correctly. The hard way is as follows: V1_trunc_direct() is acting up so get the address from system-nm.map (I’ve been studying disassembly listings of truncate.o for weeks, changing one statement at a time to try to understand what’s been going on, so I’m more familiar with the low level code than I ever wanted to. BTW - the complier does an incredible job optimizing, that’s different story) Dump the memory region using hd, compare with the obj dump output (mind you, on paper - yes, I know …) Find the same area in the kernel file to make sure we’re not looking at relocation and linking differences. Identify differences. After rigorous verification we have the verdict, which corresponds to the behavior. Including the weird bitmap errors we got the other day.

What I was on my way to look at when your mail arrived, was bioshd.c and how it probes drives, in order to see if the BIOS returns odd values for the DiskOnChip which the system supports. The DOC has not been installed during testing, but it may affect how the BIOS responds to probes.

—M

aug. 2022 kl. 18:01 skrev Gregory Haerr @.***>:

Hello @Mellvik https://github.com/Mellvik,

Wow, you really have made some progress on this!! I'm interested in how you were able to pinpoint the memory locations that are being overwritten?

I'd like to understand a bit more about exactly what this location is in the kernel, and whether it has any special meaning. In doing so, we need to "normalize" our segment:offset pairs for use with hd, and in testing, I've already ran into a possible problem with hd that I'm going to look into.

With regards to the normalizing, let's look at the special address "d0:16238". The offset value is 64k (=10000) greater than will fit in an 8086 offset, so this is the same as "d0:6238". What we want to do is look at the kernel startup banner and use the kernel code segment value from the following line:

Kernel text at 2d0:0000, ftext 1210:0000, data 1512:0000, top 9fc0:0, 490K free So, using 2d0, we should be able to (re)write the d0:6238 value as 2d0:. (Technically, to get the 20-bit memory address, the segment value is shifted left 4 bits and the offset added: d0:6238 = D00 + 6238 = 0:6F38). Thus, d0:6238 is the same as 2d0:4238.

If I run the following to get a sorted kernel map file:

cat system-nm.map | sed -e '/&/d; /!/d' | sort > file Looking in the file for somewhere near '00014238' shows the affected location as in the middle of the sock_read code:

0001418d t sock_write 000141eb t sock_read 00014249 t sock_release 0001427b t sock_close Can you check with the normalized (likely 2D0) segment and see what the normalized offset is, and its position in the kernel map file?

If you find the actual memory leak is at 2D0:6238, which corresponds to the kernel map file address 00016238 (which may be hard to do given the possible hd bug below), then, looking at the kernel map file, 00016238 is (guess what?) right in the middle of minix_unlink, which fits with why the kernel became unstable after an unlink syscall:

00015fba T minix_rmdir 0001613a T minix_unlink 0001629d T minix_symlink On another note, when trying hd, I try to rewrite the d0:6238 address with segment 2d0: I find that hd seems to have a bug:

hd d0:6238#4

006f38: 12 92 83 c4 08 eb de ba f7 ff eb d9 ba ea ff eb ................

hd 2d0:6638#4

006f38: 04 ff 76 fc 8b 5e 08 ff 77 20 e8 a2 ed 83 c4 04 ..v..^..w ...... Note that both addresses passed to hd give the same 20-bit 006f38 memory address (which is incorrect), but it is showing the contents differently, as expected. I'll report on this bug shortly.

Thank you!

— Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/1367#issuecomment-1210913480, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOFU626DNTW266NVKPTVYPG4RANCNFSM53KDGGZQ. You are receiving this because you were mentioned.

ghaerr commented 2 years ago

With regards to the origin of the memory corruption:

elks17# hd d0:16238#4.  <—— HD booted
006f38: c7 ef 00 f0 eb cc 56 57  55 89 e5 83 ec 06 8d 46  ......VWU......F

The four-byte value looks suspiciously like a possible BIOS ROM address (retrieved from memory would be): F0 00 EF C7. If its an address, this would be BIOS ROM at segment F000, address EFC7. I'm wondering whether it is a strange BIOS bug possibly to do with our bioshd driver setting a custom DDPT, or something like that.

I am just about to release my next-level debug tools, which feature a stack tracer and disassembler, both translating addresses to symbol names from the application's symbol file. I don't yet have it working inside the kernel. It currently allows for an application to disassemble portions of itself or perform a stack backtrace.

I think I'll add a seperate disasm utility that works like hd, where any segment:offset address can be given, and have it able to take a .map file, which could then allow for arbitrary disassembly of (kernel) memory, complete with kernel symbols. This should be able to help with this problem. I'll also fix the hd bug above.

Back the the corruption problem: I'm wondering if perhaps the low-core DDPT vector isn't being set properly for HD, but is for FD, for instance. This is the 1Eh vector, so its low core address would be 1E<<2 = 0:0078. Perhaps you should check to see what might be happening at that address. Also, in bioshd.c, you might turn on DEBUG_BIOS in debug.h, which will turn on some printks for DDPT copying (see copy_ddpt). Looking at this code, however, shows ELKS only sets the 1E low core vector to point at a new DDPT.

ghaerr commented 2 years ago

was bioshd.c and how it probes drives, in order to see if the BIOS returns odd values for the DiskOnChip which the system supports.

You may be on to something here: lets take a closer look at the probing routines used by bioshd.c for the devices, in particular bioshd_gethdinfo (and possibly bioshd_getfdinfo). It's a little ugly in there, since we're temporarily sharing code for IBM and PC98, but there are notes like the following, which are a bit suspicious:

    BD_AX = BIOSHD_DRIVE_PARMS;
    BD_DX = 0x80;               /* query hard drives only*/
    BD_ES = BD_DI = BD_SI = 0;  /* guard against BIOS bugs*/ // <---!!!
    if (!call_bios(&bdt))
        ndrives = BD_DX & 0xff;
    else
        debug_bios("bioshd: get_drive_parms fail on hd\n");

I'm thinking perhaps just commenting out the call_bios entirely (possibly just fake filling in the results, but not making the BIOS call). It is definitely possible that the SX BIOS is trashing a location as a result of one of these calls.

Probably turn off CONFIG_IDE_PROBE as well.

Another possibility is to look very closely at the boot screen display differences for bioshd: output when booting from FD vs HD.

Mellvik commented 2 years ago

Thank you @ghaerr -

With regards to the origin of the memory corruption:

elks17# hd d0:16238#4. <—— HD booted 006f38: c7 ef 00 f0 eb cc 56 57 55 89 e5 83 ec 06 8d 46 ......VWU......F The four-byte value looks suspiciously like a possible BIOS ROM address (retrieved from memory would be): F0 00 EF C7. If its an address, this would be BIOS ROM at segment F000, address EFC7. I'm wondering whether it is a strange BIOS bug possibly to do with our bioshd driver setting a custom DDPT, or something like that.

I was thinking along those lines too, but I didn’t follow the discussions closely when this stuff was updated, so I’m not familiar with it. I believe it’s important to keep in mind that the corruption doesn’t happen if we boot from floppy, then use HD as root. So whatever happens, it’s very early. I am just about to release my next-level debug tools, which feature a stack tracer and disassembler, both translating addresses to symbol names from the application's symbol file. I don't yet have it working inside the kernel. It currently allows for an application to disassemble portions of itself or perform a stack backtrace.

I think I'll add a seperate disasm utility that works like hd, where any segment:offset address can be given, and have it able to take a .map file, which could then allow for arbitrary disassembly of (kernel) memory, complete with kernel symbols. This should be able to help with this problem. I'll also fix the hd bug above.

Great idea - that would be very welcome. Hd-bug, thanks - it’s on my list too, I’m deleting it now that you’re on it. Back the the corruption problem: I'm wondering if perhaps the low-core DDPT vector isn't being set properly for HD, but is for FD, for instance. This is the 1Eh vector, so its low core address would be 1E<<2 = 0:0078. Perhaps you should check to see what might be happening at that address.

I don’t know what to look for yet, but I’ll compare the values on floppy booted and HD booted systems. Here’s the HD booted: elks17# hd 2d0:16238#4 006f38: c7 ef 00 f0 fa 52 ff 74 20 e8 a9 ed 83 c4 04 eb .....R.t ....... elks17# elks17# hd 0:78#10 000078: b4 b9 9c 13 e0 5a 00 c0 a0 87 00 f0 a0 87 00 f0 .....Z.......... elks17# Also, in bioshd.c, you might turn on DEBUG_BIOS in debug.h, which will turn on some printks for DDPT copying (see copy_ddpt). Looking at this code, however, shows ELKS only sets the 1E low core vector to point at a new DDPT.

OK, I’ll be back with more …

—M

tkchia commented 2 years ago

Hello @ghaerr, hello @Mellvik,

You may be on to something here: lets take a closer look at the probing routines used by bioshd.c for the devices, in particular bioshd_gethdinfo (and possibly bioshd_getfdinfo). It's a little ugly in there, since we're temporarily sharing code for IBM and PC98, but there are notes like the following, which are a bit suspicious:
    BD_AX = BIOSHD_DRIVE_PARMS;
    BD_DX = 0x80;               /* query hard drives only*/
    BD_ES = BD_DI = BD_SI = 0;  /* guard against BIOS bugs*/ // <---!!!
    if (!call_bios(&bdt))
        ndrives = BD_DX & 0xff;
    else
        debug_bios("bioshd: get_drive_parms fail on hd\n");

The "fix" follows a suggestion from Ralf Brown's Interrupt List:

--------B-1308-------------------------------
INT 13 - DISK - GET DRIVE PARAMETERS (PC,XT286,CONV,PS,ESDI,SCSI)
        AH = 08h
        DL = drive (bit 7 set for hard disk)
        ES:DI = 0000h:0000h to guard against BIOS bugs
Return: CF set on error
            AH = status (07h) (see #00234)
        CF clear if successful
            AH = 00h
            AL = 00h on at least some BIOSes
            BL = drive type (AT/PS2 floppies only) (see #00242)
            CH = low eight bits of maximum cylinder number
            CL = maximum sector number (bits 5-0)
                 high two bits of maximum cylinder number (bits 7-6)
            DH = maximum head number
            DL = number of drives
            ES:DI -> drive parameter table (floppies only)

...

        some Toshiba BIOSes (at least before 1995, maybe some laptops???
          with 1.44 MB floppies) have a bug where they do not set the ES:DI
          vector even for floppy drives. Hence these registers should be
          preset with zero before the call and checked to be non-zero on
          return before using them.  Also it seems these BIOSes can return
          wrong info in BL and CX, as S/DOS 1.0 can be configured to preset
          these registers as for an 1.44 MB floppy.

Also, the DDPT — as returned via es:di, and which should be pointed to by the int 0x1e vector — is only relevant for floppy drives, not for hard drives. My understanding is that parameters that govern hard drive I/O are specified by a different set of vectors, namely int 0x41, int 0x46, etc.

Thank you!

tkchia commented 2 years ago

Hello @Mellvik,

I don’t know what to look for yet, but I’ll compare the values on floppy booted and HD booted systems. Here’s the HD booted: elks17# hd 2d0:16238#4 006f38: c7 ef 00 f0 fa 52 ff 74 20 e8 a9 ed 83 c4 04 eb .....R.t ....... elks17# elks17# hd 0:78#10 000078: b4 b9 9c 13 e0 5a 00 c0 a0 87 00 f0 a0 87 00 f0 .....Z.......... elks17#

As a sanity check, perhaps try dumping the bytes that these pointers are pointing to? c7 ef 00 f0 would be 0xf000:0xefc7 (as @ghaerr pointed out). And b4 b9 9c 13 would be 0x139c:0xb9b4. These should point to 11-byte (or 14-byte) DDPTs and should be mostly identical. So... maybe try

# hd f000:efc7#e
# hd 139c:b9b4#e  # or whatever address you read off from 0:78

(And some other possible sanity checks:

Is e.g. 0x139c:0xb9b4 the runtime address of the DDPT[] variable in the ELKS kernel?
Does the vec1E variable always read 78 00 00 00?)

Thank you!

ghaerr commented 2 years ago

Hello @Mellvik, Hello @tkchia,

I agree with all of @tkchia's suggestions, they are very good ideas. I think we can get to the bottom of this, but I want to point out that with all the segment:offset addresses being presented, it is important that two things be done after each new kernel build, so that we can more easily understand the results of each test (and yes, I am working to automate this):

Run cat system-nm.map | sed -e '/&/d; /!/d' | sort > linux.map to generate a readable kernel symbol table.
Record the text, fartext, and data segment values from the kernel boot line:
```
Kernel text at 2d0:0000, ftext 1210:0000, data 1512:0000
```
The above would show 2d0: mapping to symbols in the linux.map 0x0001XXXX section (.text) of the kernel.map file, 1210: to 0x0002XXXX section (.fartext), and 1512: mapping to 0x0003XXXX (.data/.bss) section.

Then, only use one of the three segment values in hd for display, and rewrite any longword values into segment:offset. For instance, with @tkchia's suggestion above to look at 0x139c:0xb9b4, this makes sense when the kernel .data segment is 0x139c, and would then allow to search the kernel map file for 0x0003b9b4. (In my mapfile, this is not the DDPT[] variable, but much higher, in the .bss area).

I plan on adding a couple more capabilities and files to the distribution: the ability to generate a human readable sorted .map file for the kernel or any application, as well as a .sym file (binary format) of the same, for use in the hd and disasm tool enhancements. The kernel files could be called linux.map and linux.sym and could be placed in the ELKS /lib directory for use by the tools or humans for real-time debugging. (There is still the problem of determining the runtime kernel .text, .fartext and .data segment values to map to the appropriate sections, any thoughts on how to do that?)

Thank you!