fix out-of-bounds access bug

dkuehr commented 2 years ago

Unsafe access to the first vector element when the vector was empty was causing a segmentation fault when using fuzzcheck. This patch fixes it.

Thread 3.1 "action_fuzz-b00" received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7ffff7a453c0 (LWP 574366)]
0x0000555555bb1c7c in <fuzzcheck::code_coverage_sensor::CodeCoverageSensor as fuzzcheck::traits::Sensor>::iterate_over_observations ()
(gdb) bt
#0  0x0000555555bb1c7c in <fuzzcheck::code_coverage_sensor::CodeCoverageSensor as fuzzcheck::traits::Sensor>::iterate_over_observations ()
#1  0x00005555557d487a in <fuzzcheck::sensors_and_pools::and_sensor_and_pool::AndPool<fuzzcheck::sensors_and_pools::simplest_to_activate_counter_pool::SimplestToActivateCounterPool, fuzzcheck::sensors_and_pools::and_sensor_and_pool::AndPool<fuzzcheck::sensors_and_pools::and_sensor_and_pool::AndPool<fuzzcheck::sensors_and_pools::most_n_diverse_pool::MostNDiversePool, fuzzcheck::sensors_and_pools::optimise_aggregate_stat_pool::OptimiseAggregateStatPool<fuzzcheck::sensors_and_pools::optimise_aggregate_stat_pool::NumberOfActivatedCounters>>, fuzzcheck::sensors_and_pools::and_sensor_and_pool::AndPool<fuzzcheck::sensors_and_pools::maximise_counter_value_pool::MaximiseCounterValuePool, fuzzcheck::sensors_and_pools::optimise_aggregate_stat_pool::OptimiseAggregateStatPool<fuzzcheck::sensors_and_pools::optimise_aggregate_stat_pool::SumOfCounterValues>>>> as fuzzcheck::traits::CompatibleWithSensor<fuzzcheck::code_coverage_sensor::CodeCoverageSensor>>::process ()

loiclec commented 2 years ago

Oh, that's interesting. I shouldn't have been accessing the first element without checks. On the other hand, I am very curious to find out why that vector was empty, as it breaks some of my assumptions. Is there a way for me to reproduce it?

In this code, there are two loops.

The first loop iterates over some actual, physical counters (somewhat badly named single_counters, they are those set up by rustc with the -Z instrument-coverage option).
The other loop iterates over “derived/computed” counters (named expression_counters). Those counters are given by expressions over the single_counters.

It is much clearer with an example. With the following function:

fn foo() {
    // c0
    if x() {
        // c1
    } else {
        // c2
    }
}

We have a total of 3 instrumented code regions: c0, c1, and c2. The Rust compiler creates an array of u64 somewhere in memory to store the coverage hits for each code region. But this array only contains two elements:

[c0, c1]

And the instrumented function foo is compiled to something equivalent to:

fn foo() {
    array[c0] += 1;
    if x() {
        array[c1] += 1;
    } else {

    }
}

And somewhere in the executable file, it stores a message describing how to obtain the hit count for the c2 code region, which essentially says:

c2 = c0 - c1

When the CodeCoverageSensor is created, it looks for the “physical” array of counters (i.e. c0 and c1) as well as the message in the executable that gives the “derived” counters (i.e. c2). The former one is stored in coverage.single_counters and the latter one is given by coverage.expression_counters.

Anyway, what seems to be happening here is that there is an instrumented function which has no physical counters at all, which surprises me a little bit. It might be that there is a bug in the construction of CodeCoverageSensor, or it may be that this is actually perfectly normal.

But if it is indeed normal, then I think it would make more sense to filter out the function entirely during the construction of CodeCoverageSensor, so that the fuzzcheck never needs to iterate over it at all.

dkuehr commented 2 years ago

Thank you for the detailed explanation :)

The issue was triggered while fuzzing functionality from TezEdge, but I wasn't able to produce a standalone PoC program that reproduces the issue.

The patch I provided was enough to avoid the issue, but as you mention it isn't the best solution.

loiclec commented 2 years ago

That's ok, thanks for identifying the issue! I'll release an update soon with a fix, whether I understand the root cause or not :)

Would you mind adding a panic statement right before the out-of-bounds access:

let mut index = 0;
for coverage in coverage {
    if coverage.single_counters.is_empty() {
        panic!(
            "{} {} {:?} {:?}",
            index,
            coverage.expression_counters.len(),
            coverage.function_record.name_function,
            coverage.function_record.filenames,
        )
    }
     // ...

to see if it panics, and tell me what is printed?

dkuehr commented 2 years ago

Here are the results:

thread 'main' panicked at '19683 1 "_RNCINvNtNtCs5i7zDO3ioyK_14tezos_messages3p2p14binary_message28all_consuming_complete_inputNtNtNtB6_8encoding10connection17ConnectionMessageNvYB1q_NtNtCs33WiA1oqOVT_14tezos_encoding3nom9NomReader8nom_readE0B8_" ["tezos/messages/src/p2p/binary_message.rs"]', /home/dan/.cargo/git/checkouts/fuzzcheck-rs-3f36d9a671138ac4/7abcc13/fuzzcheck/src/code_coverage_sensor/mod.rs:113:21

It seems there is at least 1 expression counter. That code is derived by procedural macros, but if you really need it I could dig out what is the final code generated.

This is an interesting issue because macro-generated code has been problematic with lcov coverage reports, and for debugging itself. Basically at "symbol" level everything gets hidden behind macros instead of showing the expanded code. Furthermore attempting something like cargo-expand and rebuilding the expanded code is not possible: https://github.com/dtolnay/cargo-expand#disclaimer

loiclec commented 2 years ago

Thanks a lot! I had a quick look and It seems to be a problem caused by debug_assert!. This function also has the same problem:

fn foo(x: &Option<bool>) {
    x.map(|x| {
        debug_assert!(x, "x is true");
        x
    });
}

The optimised MIR output is:

fn foo::{closure#0}(_1: [closure@src/lib.rs:5:11: 8:6], _2: bool) -> bool {
    debug x => _2;                       // in scope 0 at src/lib.rs:5:12: 5:13
    let mut _0: bool;                    // return place in scope 0 at src/lib.rs:5:15: 5:15

    bb0: {
        Coverage::Expression(4294967294) = 3 - 1; // scope 0 at /rustc/6d246f0c8d3063fea86abbb65a824362709541ba/library/core/src/macros/mod.rs:215:26: 215:56
        Coverage::Counter(2);            // scope 0 at /rustc/6d246f0c8d3063fea86abbb65a824362709541ba/library/core/src/macros/mod.rs:215:23: 215:87
        Coverage::Expression(4294967295) = 1 + 2 for src/lib.rs:7:9 - 8:6; // scope 0 at src/lib.rs:8:6: 8:6
        _0 = _2;                         // scope 0 at src/lib.rs:7:9: 7:10
        Coverage::Unreachable for src/lib.rs:6:23 - 6:24; // scope 0 at /rustc/6d246f0c8d3063fea86abbb65a824362709541ba/library/core/src/macros/mod.rs:215:59: 215:84
        return;                          // scope 0 at src/lib.rs:8:6: 8:6
}

There are two things that catch my attention:

Coverage::Expression(4294967294) is an expression of counter 3 and 1, which don't seem to exist in this MIR.
I don't remember taking Coverage::Unreachable into consideration, and I don't know how it translates to LLVM's coverage mapping, will need to look into it.

So it is either a bug in rustc, or a bug in fuzzcheck, or both. I also think it may cause other silent bugs and possible out-of-bounds accesses.

I will post updates here when I get to it, thank you! :)

dkuehr commented 2 years ago

Great job thanks! Could it be that other counters are removed by a DCE pass? If my interpretation is correct, expressions reference counters just their u32 identifier but can the compiler actually know a counter is referenced by an expression just by that value to avoid elimination?

loiclec commented 2 years ago

Yes, I think that's what's happening! I had a look at the code again yesterday. When a counter is unreachable. its “kind” changes to Zero.

But Fuzzcheck already handles that, as all counters of kind Zero are just ignored. What I didn't foresee are some odd interactions between physical, expression, and Zero counters. Going back to the same function:

fn foo() {
    // c0
    if x() {
        // c1
    } else {
        // c2
    }
}

Imagine that, for some reason, c0 and c1 become unreachable. Then that could be encoded in the executable as:

physical_counters = [0, 0]
counters_with_code_regions = [
    { kind: Zero, region: "..." },
    { kind: Zero, region: "..."}, 
    { kind: Expression { op: Subtract, idx: 0 }, region: "..." }
]
expressions = [
    { lhs = physical_counters[0], rhs = physical_counters[1] }
]

The problem is that fuzzcheck gets its information about what counters matter from the counters_with_code_regions array. When it reads that, it sees that there are no physical counters to observe. Which is why the single_counters field was empty.

But I am a bit surprised that the physical_counters array is not empty, and that the Expression still exists. It seems to me that the compiler should have replaced it with Zero as well, since it only references other Zeros. As far as I can tell, there is no way for me to deduce that reliably, because nowhere in this message does it say that physical_counters[0] and physical_counters[1] are always 0 (no index is attached to Zero counters). It is not a bug in the compiler per se, but it is slightly annoying.

But anyway, it is pretty easy to fix, since fuzzcheck doesn't do anything fundamentally wrong. It's just an edge case to consider. I'll try and release an update by next week!

And sorry for all the unnecessary details, I am using these issues and pull requests to flesh out my thoughts sometimes :)

Side note:

Since you also mentioned that you had problems with procedural macros, I dug into that and found that fuzzcheck does indeed ignore a lot of code inside macros. However, it appears to be fixable.

While there is no information that links counters to specific code regions, the counters exist and can be used for fuzzing.

If I stop looking at the counters_with_code_regions array to know which counters matter and instead consider all physical counters and all (sub)expressions, then macros can be instrumented. This comes, however, at the cost of an explosion in the number of observed counters. So I have been working on a solution to minimise the set of observed counters without losing important feedback. In my tests, that can reduce the number of counters in pulldown-cmark from ~2700 to ~1500, which is great for performance. I am hoping to ship this at the same time as this bug fix :-)

loiclec commented 2 years ago

I realised I forgot to confirm that the bug has been fixed. Thanks for reporting it in such a detailed way!

loiclec / fuzzcheck-rs

fix out-of-bounds access bug #20