robert-w-gries / rxinu

Rust implementation of Xinu educational operating system
Apache License 2.0

Possible PIT implementation. #43

Closed: toor closed this issue 6 years ago

toor commented 6 years ago

I have forked rxinu and implemented a basic PIT driver. Under this model, the PIT operates in Mode 3, using the lobyte/hibyte access mode. I have experimented with moving the resched() call into the timer irq handler and saw a couple of test processes run, printing to the screen. However, after that it stopped, which is something I still need to debug. Anyway, I thought you might like to check out my most recent few commits and give your opinion on the code so far.
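
For reference, here's a rough sketch of the kind of setup I mean, using the x86_64 crate's Port type; the constants and the init function name here are illustrative, not the exact code in my branch:

```rust
use x86_64::instructions::port::Port;

const PIT_BASE_HZ: u32 = 1_193_182; // standard PIT input clock (assumed)

/// Illustrative sketch: program channel 0 for Mode 3 (square wave generator)
/// with lobyte/hibyte access and the given target frequency.
pub fn init_pit(frequency_hz: u32) {
    let divisor = (PIT_BASE_HZ / frequency_hz) as u16;

    let mut command: Port<u8> = Port::new(0x43);
    let mut channel0: Port<u8> = Port::new(0x40);

    unsafe {
        // 0b00_11_011_0: channel 0, lobyte/hibyte access, mode 3, binary counting
        command.write(0b0011_0110);
        channel0.write((divisor & 0xFF) as u8); // low byte first
        channel0.write((divisor >> 8) as u8);   // then high byte
    }
}
```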

Screenshot of rxinu running with the new model: [image]

robert-w-gries commented 6 years ago

Thanks @too-r! It looks like the PIT driver is working as intended.

Regarding the problem you saw, it looks like there were a few issues involving deadlocks in resched. The resched method will deadlock if any scheduling-related locks are still held when it is called.

Problem 1: Incorrect locking of ready_list in resched

Debugging

I found out where the kernel was deadlocking by running gdb, hitting the deadlock, and running info stack.

Call #2 in the stack is the culprit:

```
#2  0x000000000010cdb2 in rxinu::scheduling::cooperative_scheduler::{{impl}}::resched (
    self=0x146538 <<rxinu::scheduling::SCHEDULER as core::ops::deref::Deref>::deref::__stability::LAZY+8>) at src/scheduling/cooperative_scheduler.rs:102
```

It points to the following block:

```rust
// skip expensive locks if possible
if self.ready_list.read().is_empty() {
    return;
}
```
Click to expand debugging output

```
(gdb) info stack
#0  0x000000000012542b in core::sync::atomic::AtomicUsize::load (
    self=0x146568 <::deref::__stability::LAZY+56>, order=core::sync::atomic::Ordering::Relaxed)
    at /home/rob/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/sync/atomic.rs:1081
#1  0x000000000012bc33 in spin::rw_lock::RwLock>::read> (
    self=0x146568 <::deref::__stability::LAZY+56>)
    at /home/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/spin-0.4.6/src/rw_lock.rs:176
#2  0x000000000010cdb2 in rxinu::scheduling::cooperative_scheduler::{{impl}}::resched (
    self=0x146538 <::deref::__stability::LAZY+8>)
    at src/scheduling/cooperative_scheduler.rs:102
#3  0x00000000001232f4 in rxinu::arch::x86::interrupts::irq::timer::{{closure}} ()
    at src/arch/x86/interrupts/irq.rs:28
#4  0x0000000000115dcb in rxinu::arch::x86::interrupts::disable_interrupts_then (
    uninterrupted_fn=...) at src/arch/x86/interrupts/mod.rs:15
#5  0x00000000001232a7 in rxinu::arch::x86::interrupts::irq::timer (_stack_frame=0x14bbc8)
    at src/arch/x86/interrupts/irq.rs:27
#6  0x0000000000128571 in alloc::vec_deque::VecDeque::grow_if_necessary (
    self=0x146570 <::deref::__stability::LAZY+64>)
    at /home/rob/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/liballoc/vec_deque.rs:1755
#7  0x00000000001289e1 in alloc::vec_deque::VecDeque::push_back (
    self=0x146570 <::deref::__stability::LAZY+64>, value=...)
    at /home/rob/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/liballoc/vec_deque.rs:1120
#8  0x000000000010cd61 in rxinu::scheduling::cooperative_scheduler::{{impl}}::ready (
    self=0x146538 <::deref::__stability::LAZY+8>, id=...)
    at src/scheduling/cooperative_scheduler.rs:96
#9  0x000000000010d5e1 in rxinu::syscall::process::create (new_proc=0x10dc20 , name=...)
    at src/syscall/process.rs:9
#10 0x000000000010daea in rxinu::rust_main (multiboot_information_address=2585288)
    at src/lib.rs:58
#11 0x000000000010a0b3 in long_mode_start ()
Backtrace stopped: Cannot access memory at address 0x14c000
```

Background

The reason that resched deadlocks when locks are held is that our context switch to a new process is an inner call in resched. Here's what the call stack looks like when we run a new process:

```
new_process
rxinu::arch::context::switch_to
rxinu::scheduling::cooperative_scheduler::resched
rxinu::interrupts::irq::timer
rxinu::rust_main
```

Note that resched will always remain in the call stack until we get back to the null process.

The next question is: why are locks held during the call to resched? In Rust, a lock is held until drop is called on its guard. There are two ways to make sure drop is called for a lock guard: 1) reach the end of the guard's scope, or 2) call drop on the guard manually.

I debated whether to use solution 1) or 2) and ultimately decided to stick with solution 1). Rust as a language tries to limit manual resource management as much as possible; that's why it leans on scoping and lifetimes by default. So I decided that rXinu will use scoping to release held locks.
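
Here's a minimal sketch of the two options, using a spin::RwLock like the one guarding ready_list (the function names and the Vec payload are placeholders, not rxinu's actual scheduler code):

```rust
use spin::RwLock;

// Option 1: rely on scoping. The read guard is dropped at the closing
// brace, so the lock is released before we would context switch.
fn resched_scoped(ready_list: &RwLock<Vec<usize>>) {
    {
        if ready_list.read().is_empty() {
            return;
        }
    } // read guard dropped here

    // ...safe to switch to another process: no guard is held...
}

// Option 2: drop the guard manually. Equivalent, but easy to forget.
fn resched_manual_drop(ready_list: &RwLock<Vec<usize>>) {
    let guard = ready_list.read();
    let is_empty = guard.is_empty();
    drop(guard); // lock released explicitly

    if is_empty {
        return;
    }

    // ...also safe to switch to another process here...
}
```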

Solution

So the issue is that I didn't properly scope the check of the ready_list in resched. Because the ready_list read guard is still held when we call context::switch_to, there will be a deadlock the next time we try to read from or write to ready_list.

Here's my fix:

```diff
-        // skip expensive locks if possible
-        if self.ready_list.read().is_empty() {
-            return;
-        }
+        {
+            // skip expensive locks if possible
+            if self.ready_list.read().is_empty() {
+                return;
+            }
+        }
```

Previously, we avoided this issue purely by chance. Thanks to your PIT driver, bugs in my scheduling code will be triggered more easily and can be fixed sooner!

Problem 2: Interrupting critical scheduling code

Now that we're using an irq handler to call resched, we need to ensure that the PIT irq handler doesn't invoke resched at incorrect times in our kernel.

Right now, critical code, such as scheduling code that holds locks, can be interrupted. We need to disable interrupts while holding locks and re-enable them when we are done.

My solution was to wrap the important scheduling components in the disable_interrupts_then function.
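
For context, here's a rough sketch of what a helper like disable_interrupts_then can look like. This version uses the x86_64 crate's interrupt controls and illustrates the pattern; it is not necessarily rxinu's exact implementation:

```rust
use x86_64::instructions::interrupts;

/// Run `uninterrupted_fn` with interrupts masked, restoring the previous
/// interrupt state afterwards. Sketch only; rxinu's version may differ.
pub fn disable_interrupts_then<F, T>(uninterrupted_fn: F) -> T
where
    F: FnOnce() -> T,
{
    let were_enabled = interrupts::are_enabled();
    interrupts::disable();

    let result = uninterrupted_fn();

    // Only re-enable if interrupts were enabled when we entered.
    if were_enabled {
        interrupts::enable();
    }
    result
}
```

The x86_64 crate also provides interrupts::without_interrupts, which implements the same save/disable/restore pattern.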

Problem 3: Resetting the PIT_TICKS value after resched

Currently, your PIT irq handler increments PIT_TICKS after every timer interrupt. It then checks whether we've had 10 ticks and, if so, calls resched.

The issue is that we need to reset PIT_TICKS to 0 after calling resched. Otherwise, PIT_TICKS will stay at or above 10 and we will call resched on every clock tick.

```diff
diff --git a/src/arch/x86/interrupts/irq.rs b/src/arch/x86/interrupts/irq.rs
index 295108d..8d97885 100644
--- a/src/arch/x86/interrupts/irq.rs
+++ b/src/arch/x86/interrupts/irq.rs
@@ -18,12 +18,17 @@ fn trigger(irq: u8) {
 }

 pub extern "x86-interrupt" fn timer(_stack_frame: &mut ExceptionStack) {
+    use arch::x86::interrupts;
+
     pic::MASTER.lock().ack();

     if PIT_TICKS.fetch_add(1, Ordering::SeqCst) >= 10 {
+        PIT_TICKS.store(0, Ordering::SeqCst);
         unsafe {
-            SCHEDULER.resched();
-        } 
+            interrupts::disable_interrupts_then(|| {
+                SCHEDULER.resched();
+            });
+        }
     }
 }
```
robert-w-gries commented 6 years ago

Once I made the above changes, the scheduler worked as intended. I stress tested the scheduling code by creating a process cycle where process_a creates process_b, which in turn creates process_a, and so on until memory runs out. We are able to create and run 12 thousand processes before running out of memory.
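
For illustration, the cycle looks roughly like this; the create signature is inferred from the backtrace above (rxinu::syscall::process::create takes a function pointer and a name), so treat the exact types as placeholders:

```rust
use alloc::string::String;

// Placeholder with a signature inferred from the backtrace;
// the real rxinu syscall API may differ.
fn create(new_proc: extern "C" fn(), name: String) {
    let _ = (new_proc, name); // stub for illustration only
}

pub extern "C" fn process_a() {
    // Each process spawns its counterpart, so the cycle continues
    // until the kernel can no longer allocate memory.
    create(process_b, String::from("process_b"));
}

pub extern "C" fn process_b() {
    create(process_a, String::from("process_a"));
}
```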

However, there seems to be a significant performance hit from introducing the PIT interrupt handler. It takes 2-3 times as long to create and run the 12 thousand processes as it did when we ran resched in the null process.

It's not a serious issue since rXinu doesn't emphasize performance, but it is definitely something I will track in the issue tracker once your code is in. I believe it's related to wasted cycles either in the PIT driver or somewhere in the scheduling code.

toor commented 6 years ago

Yep, the changes around resetting the tick count to 0 and wrapping resched calls in disable_interrupts_then() were things I'd implemented in my own system but not in my commits to rxinu, oh well. I'll make the changes you've noted above and then probably open a PR, since PIT functionality is critical to progress on the usermode PR. The performance issue is something I've noticed with my own code as well, and it is definitely worth looking into.

robert-w-gries commented 6 years ago

Sounds good 👍 Thanks again for the help!

I haven't researched PIT drivers too much, so I'm interested in finding out how frequently your PIT driver triggers an interrupt. Is it every millisecond? Or every 0.5 seconds? This answer could explain why context switching takes so long.

toor commented 6 years ago

It triggers roughly every 2ms. It is certainly possible to make the PIT interrupt more frequently; however, I've read that OSes generally use a divisor that gives a frequency of roughly 100Hz, i.e. a 10ms period, which is a longer interval than what I'm using.
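
For concreteness, assuming the standard ~1.193182 MHz PIT base clock, the divisor arithmetic works out roughly like this:

```
1,193,182 Hz / 2,386  ≈ 500 Hz, i.e. ~2 ms per tick  (about what I'm using now)
1,193,182 Hz / 11,931 ≈ 100 Hz, i.e. ~10 ms per tick (the common choice)
```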