Some questions about Using Klee for Rust for Linux

StevenJiang1110 commented 3 years ago

I have read your post Using klee on Rust for Linux(part 1). I also keep notice on the project and your post inspires me a lot. But I still have some questions. Since I can found no discussion areas for the post. I put my questions here.

Based on my understanding, the Rust for Linux project is aimed to mitigate the gap between the current situation of Rust and the requirements of kernel development. As I learned from your post, the main problem is aroused by dynamic checks(usually, automatically added by Rust compiler) that will panic the program if the check is not passed. The safety of Rust programs partly relies on these dynamic checks. Verification tools try to eliminate the effects of dynamic checks by two ways: fix check that may fail and remove check that cannot fail. This can be done by an independent verification step before release. Have I got your ideas correctly?
In the subsection Safety checks of section What properties would we like to verify?. You said "if we try to catch the panic, we will risk leaving data in an inconsistent way". I do not know a lot about panic safety. But I guess that leaving data in an inconsistent way indicates a bug that we can avoid if we write codes carefully. Do you mean that this kind of bugs can be easily introduced if we try to handle panics? Therefore trying to handle panics is not a good practice.

Thanks a lot!

alastairreid commented 3 years ago

Hi Jianfeng Jiang,

Yes, (1) is absolutely right.

The problem referred to in (2) is what to do in a piece of code like this if the code panics on line 5

unlock(L);
f = open("foo");
p = malloc(100);
q = malloc(100);
if (size >= 65536) panic();
// do something with p and q
free(p);
free(q);
close(f);
lock(L);

At the point where the panic occurs, the lock "L" is unlocked, the file "f" is open, p and q are allocated.

Our options for what to do when the panic occurs are

1) Do a proper cleanup by not panicking and, instead, jumping to the first "free" on line 7. In other words, do not call panic() or use code that calls panic().

2) Somewhere higher up the call stack, catch the panic, perhaps print an error message and continue. The problem with this is that code higher up the call stack does not have enough information to clean up the problem before continuing. It does not have the values of the local variables f, p and q and it does not know where the panic occurred. It will leak the objects p and q, leave L unlocked (leading to a possible deadlock later) and will leave "f" open and therefore inaccessible. Worse, if the code was part of the way through inserting something into a linked list (say), the linked list might be corrupted. The result is that, sometime later, the system will either corrupt data or deadlock or crash or something like that.

3) When a panic occurs, reset the entire machine This avoids the data corruption of option (2) but at the cost of data loss and loss of service. Any data not yet saved to disk will be lost. If another machine relies on the one that you reset, the other machine may be affected. (eg if my desktop machine fails, I can't play music in my house) If the computer is a medical device or controls the car brake system there is a potential threat to life. etc.

4) If the panic is inserted by the compiler (eg array bounds checks) and we want to avoid (2) and (3), we could disable those checks in the compiler. This is not a great option: omitting array bounds checks opens up security vulnerabilities.

5) If the panic is inserted by the compiler (eg array bounds checks) and we want to avoid (2) and (3), we could do something to convince ourselves that the panic can never occur. It would then be ok to leave the checks enabled or to disable the checks because the checks can never fail.

We can use testing, fuzzing or some kind of formal verification to

convince ourselves that the panic cannot occur.

 What is most likely to happen though is that, out of 1000 checks

inserted by the compiler,

It is clear to the compiler that 900 of them cannot fail - the compiler deletes the checks
Testing, fuzzing or formal verification finds that 10 of them can fail - we fix those bugs
A formal verification tool proves that 80 of them cannot fail
We are left with 10 checks that we are unsure about. For these, we either
- rewrite the code to be simpler
- or we insert explicit checks with proper cleanup code around it (this is often enough to persuade the compiler or verification tool that the compiler-inserted check cannot fail).
- or we increase the amount of code review and testing used on that part of the code. In this case, it is useful to document these checks very carefully to explain why we think the check cannot fail.

In short, the only safe options are (1) and (5): don't use panic in code we write and work very hard to show that any compiler-inserted checks cannot fail.

I hope that makes sense?

-- Alastair

On Thu, Sep 2, 2021 at 6:23 AM Jianfeng Jiang @.***> wrote:

I have read your post Using klee on Rust for Linux(part 1). I also keep notice on the project and your post inspires me a lot. But I still have some questions. Since I can found no discussion areas for the post. I put my questions here.

Based on my understanding, the Rust for Linux project is aimed to mitigate the gap between the current situation of Rust and the requirements of kernel development. As I learned from your post, the main problem is aroused by dynamic checks(usually, automatically added by Rust compiler) that will panic the program if the check is not passed. The safety of Rust programs partly relies on these dynamic checks. Verification tools try to eliminate the effects of dynamic checks by two ways: fix check that may fail and remove check that cannot fail. This can be done by an independent verification step before release. Have I got your ideas correctly?

In the subsection Safety checks of section What properties would we like to verify?. You said "if we try to catch the panic, we will risk leaving data in an inconsistent way". I do not know a lot about panic safety. But I guess that leaving data in an inconsistent way indicates a bug that we can avoid if we write codes carefully. Do you mean that this kind of bugs can be easily introduced if we try to handle panics? Therefore trying to handle panics is not a good practice.

Thanks a lot!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/project-oak/rust-verification-tools/issues/152, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACXRJSM7ZZX4BFMQLS6AGSTT74C4TANCNFSM5DIGPZFQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

StevenJiang1110 commented 3 years ago

Thanks a lot for your detailed response. Now I can understand why panic will not be a wise choice in reliability-focused situations(where reset is not an option at most times, like linux kernel). It seems implicit panics introduced by Rust compiler is not a good practice in these situations. These panics are always overlooked by developers and will cause panic safety problems. For efficiency-focused situations, eliminating useless dynamic checks will also help improve efficiency. This sounds like an interesting research area.

Thanks again！

project-oak / rust-verification-tools

Some questions about Using Klee for Rust for Linux #152