model-checking / kani

Kani Rust Verifier
https://model-checking.github.io/kani
Apache License 2.0
2.11k stars 83 forks source link

Valid value checks does not check `char` surrogate values #3241

Open celinval opened 3 months ago

celinval commented 3 months ago

I tried this code:

#[kani::proof]
fn transmute_surrogate_ub() {
    unsafe {
        let val: u32 = kani::any();
        kani::assume(val < char::MAX.into());
        let c: char = std::mem::transmute::<u32, char>(val) as char;
        match val {
            0..0xD800 | 0xE000..0x110000 => assert!(char::from_u32(val).is_some()),
            _ => unreachable!(),
        }
    }
}

using the following command line invocation:

kani -Zvalid-value-checks char.rs

with Kani version: 0.52.0

I expected to see this happen: A valid value check failure

Instead, this happened: An unreachable block being reached due to UB

Failed Checks: internal error: entered unreachable code: 
 File: "char.rs", line 31, in transmute_surrogate_ub

VERIFICATION:- FAILED
Verification Time: 0.12348909s
adpaco-aws commented 1 month ago

I've been investigating this issue and here's what I've found out.

First, the ValidValueReq we process for char specifies a valid_range that doesn't take into account the surrogate gap in the middle: [ValidValueReq { offset: 0, size: MachineSize { num_bits: 32 }, valid_range: 0..=1114111 }]

Note: 1114111 is 0x10FFFF (i.e., char::MAX)

I tried to special-case the ty_req closure in ty_validity_per_offset to return two valid ranges when a char is found:

diff --git a/kani-compiler/src/kani_middle/transform/check_values.rs b/kani-compiler/src/kani_middle/transform/check_values.rs
index a7d0f14d270..253963a4ef6 100644
--- a/kani-compiler/src/kani_middle/transform/check_values.rs
+++ b/kani-compiler/src/kani_middle/transform/check_values.rs
@@ -839,8 +839,21 @@ pub fn ty_validity_per_offset(
     ty: Ty,
     current_offset: usize,
 ) -> Result<Vec<ValidValueReq>, String> {
+    debug!(typ=?ty, kind=?ty.kind(), "ty_validity_per_offset");
     let layout = ty.layout().unwrap().shape();
     let ty_req = || {
+        if ty.kind() == TyKind::RigidTy(RigidTy::Char) {
+            let shape = ty.layout().unwrap().shape();
+            let value_size = match shape.abi {
+                ValueAbi::Scalar(Scalar::Initialized { value, .. })
+                | ValueAbi::ScalarPair(Scalar::Initialized { value, .. }, _) => {
+                    Some(value.size(machine_info))
+                }
+                _ => None,
+            };
+            return vec![ValidValueReq { offset: 0, size: value_size.unwrap(), valid_range: WrappingRange { start: 0, end: 0xD7FF }},
+            ValidValueReq { offset: 0, size: value_size.unwrap(), valid_range: WrappingRange { start: 0xE000, end: 0x10FFFF }}]
+        };
         if let Some(mut req) = ValidValueReq::try_from_ty(machine_info, ty)

However, that didn't have the intended effect as each range check is being encoded separately. In other words, this change will generate two separate checks as follows:

assert!((0..=0xD7FF).contains(value));
assert!((0xE000..0x10FFFF).contains(value));

which doesn't work well because it's requiring value to be in BOTH ranges.

It's not clear to me that the check generation should be done this way for multiple ranges. For example, in this code

                SourceOp::BytesValidity { ranges, target_ty, rvalue } => {
                    let value = body.new_assignment(rvalue, &mut source, InsertPosition::Before);
                    let rvalue_ptr = Rvalue::AddressOf(Mutability::Not, Place::from(value));
                    for range in ranges {
                        let result = build_limits(body, &range, rvalue_ptr.clone(), &mut source);
                        let msg =
                            format!("Undefined Behavior: Invalid value of type `{target_ty}`",);
                        body.add_check(
                            tcx,
                            &self.check_type,
                            &mut source,
                            InsertPosition::Before,
                            result,
                            &msg,
                        );
                    }
                }

in which cases does it make sense to generate more than one check?

Another option is to extend ValidValueReq to be able to work with multiple ranges. This seems to be nontracked pending work according to the comments in the source code. Maybe replacing/wrapping it with another structure like RangeSet could save us work here.

adpaco-aws commented 1 month ago

Unfortunately, the option of extending ValidValueReq is more involved than I initially expected for a couple reasons:

  1. It's based on WrappingRange which comes from the ABI API of StableMIR.
  2. WrappingRange doesn't consider gaps within a range like the ones that we'd need for char surrogate values.

It's not clear to me such changes would pay off if char is the only type with a gap.

Would it be OK to change the range loop above so that build_limits constructs a single check that ensures the value falls within one of the ranges? I've just realized this wouldn't be OK if the ranges have different offsets, but it doesn't make sense to generate multiple checks for a particular offset.