googlefonts / fontations

Reading and writing font files
Apache License 2.0
375 stars 23 forks source link

Soundness of read_ref_unchecked #436

Open danakj opened 1 year ago

danakj commented 1 year ago

read_ref_unchecked will create a reference into a memory buffer, changing the representation from &[u8] to &T.

The number of bytes used for this &[u8] is T::RAW_BYTE_LEN which is defined to be the number of bytes required to represent the type in a font file, which may be smaller than the size of the native type when the native type has tail padding (for alignment purposes).

There are 2 possible problematic outcomes:

Overlapping refs

Two consecutive T elements in the buffer may overlap in memory. Let's say T has size_of equal to 2, but T::RAW_BYTE_LEN is 1.

| T1 | T2 | T3 | ... |

If we were to *read_ref_unchecked(0) = another_t, we'd expect size_of<T> bytes to be memcpy'd which would clobber T2.

This is happily mitigated by the fact that there is no mutable reference version of this function. I think we should document why there is no mutable reference version somewhere in the code (rustdocs) so that one does not appear in the future.

Invalid refs

The last T element in the buffer actually goes off the end of the buffer. Again, let's say T has size_of equal to 2, but T::RAW_BYTE_LEN is 1.

| T1 | T2 | ... | T9 |

A call to read_ref_unchecked(8) will return &T that points to | T9 | invalid memory |. If that &T is moved (which it can't be, see Overlapping refs) or copied, we'd see two bytes of memory copied, which would involve a read out of bounds, which is a memory safety bug.

I think this could be mitigated by requiring T to not be Copy, or by requiring size_of<T>() - T::RAW_BYTE_LEN bytes of padding to exist at the back of the buffer?

rsheeter commented 1 year ago

Analysis much appreciated. QQ, what was overlapping refs meant to link to? - currently it takes me back here.

cmyr commented 1 year ago

wonderful, happy to have someone more comfortable than I am with the soundness rules looking at this stuff.

For the overlapping refs issue: I think the problem here is that I am overloading the FixedSize trait; it is both communicating the size required to encode a given type, but I'm also using it here as a kind of marker trait. In practice it is only used with types that are #[repr(packed)] and contain only fields that are BigEndian<T>, which is essentially a type annotation on top of a fixed-size byte array.

For this set of types, sizeof:: and T::RAW_BYTE_LEN are always equal.

So in summary, my understanding is really that if I can guarantee that sizeof:: and T::RAW_BYTE_LEN are equal for all input types, then we should be okay?

dfrg commented 1 year ago

This is a great catch. I think we’re safe here by convention but it would be nice to use the type system instead. Can we just get rid of RAW_BYTE_LEN and use size_of instead? The Uint24 type is the one case I’m aware of with a discrepancy but we can just change the payload to [u8; 3]. Are there others?

danakj commented 1 year ago

QQ, what was overlapping refs meant to link to?

Ah, meant to link the first header on this issue :) My bad, forgot to test it.

cmyr commented 1 year ago

@danakj I've tightened up enforcement of our invariants, and I believe your concerns here should be addressed.

tldr:

I believe that our constraints are strictly narrower than are required (for instance, we could allow padding) however the reality of the data we are working with is that it is always packed, so it seems reasonable to just enforce that.

Some further thoughts:

In any case ptal when you have a free moment, and let me know if you've any questions or concerns.

Thanks again for this, very reassuring to have another pair of eyes on this stuff.

danakj commented 1 year ago

The FromBytes trait says it should only be used from generated code, which I believe is also all inside the same crate, is that right?

If so, you can actually prevent anything else from implementing it - which also lets you make changes to it without requiring a major version bump.

https://rust-lang.github.io/api-guidelines/future-proofing.html

danakj commented 1 year ago

Why is impl FromBytes for BigEndian<T> not conditional on T: FromBytes ?

cmyr commented 1 year ago

The FromBytes trait says it should only be used from generated code, which I believe is also all inside the same crate, is that right?

If so, you can actually prevent anything else from implementing it - which also lets you make changes to it without requiring a major version bump.

https://rust-lang.github.io/api-guidelines/future-proofing.html

The issue here, as I understand it, is that we would need to implement the sealed trait for each of the generated types, but the generated types are declared in various different modules which are not the module where the trait is declared.

Maybe I'm just taking the examples too literally, though, and we could do something like,

// in crate::read::FromBytes
trait FromBytes: sealed::Sealed {}
pub(crate) mod sealed {
    pub trait Sealed {}
}

// in generated code
/// SAFETY: see the [`FromBytes`] trait documentation.
impl crate::read::sealed::Sealed for MyGeneratedType {}
unsafe impl FromBytes for MyGeneratedType {}

Does that make sense? If so I'll get a CL together.

Why is impl FromBytes for BigEndian<T> not conditional on T: FromBytes ?

Because T doesn't need to be FromBytes; BigEndian<T> does not store T. Instead it is a transparent wrapper around [u8; N], where T is essentially a marker.

This is enforced in a slightly roundabout way:

Essentially BigEndian<T> is a way of saying "I am raw bytes, and I can be converted to and from T". The rationale for this design was largely ergonomics: it lets us use normal native scalars in most places, instead of needing to be generic over byte order.

Another property of this implementation is that we don't need T to be FromBytes itself. For instance we have a UInt24 type that is internally a u32, but the BigEndian<UInt32> is a 3-byte array; we just ignore the extra byte when reading/writing.

danakj commented 1 year ago

Does that make sense? If so I'll get a CL together.

Yeah, that sealed code makes sense to me. That way nothing outside the crate gets to implement it, which is the intent. I think the doc example could include a pub(crate) to make that more clear.

Thanks for the BigEndian explanation, that makes sense.

danakj commented 1 year ago

https://github.com/googlefonts/fontations/blob/368ca473c5868a0d527038bf1b96abda708dc741/read-fonts/src/font_data.rs#L120

Could you add a comment here explaining why this is able to check for internal padding? I believe it because of codegen reasons and the addition of type sizes recursively but it's not obvious from here.

danakj commented 1 year ago

https://github.com/googlefonts/fontations/blob/368ca473c5868a0d527038bf1b96abda708dc741/read-fonts/src/font_data.rs#L159 This could use the same

danakj commented 1 year ago

I believe the soundness holes have been plugged from safe code, thank you!

danakj commented 1 year ago

read_array_unchecked is never actually used in this codebase, presumably it's important to some future code?

cmyr commented 1 year ago

Does that make sense? If so I'll get a CL together.

Yeah, that sealed code makes sense to me. That way nothing outside the crate gets to implement it, which is the intent. I think the doc example could include a pub(crate) to make that more clear.

sounds good, will put a patch together.

https://github.com/googlefonts/fontations/blob/368ca473c5868a0d527038bf1b96abda708dc741/read-fonts/src/font_data.rs#L120

Could you add a comment here explaining why this is able to check for internal padding? I believe it because of codegen reasons and the addition of type sizes recursively but it's not obvious from here.

Your reasoning is correct, but I will add a comment.

read_array_unchecked is never actually used in this codebase, presumably it's important to some future code?

Correct, I believe none of our unsafe fns are currently called directly; they are only called from inside the safe versions.

Whether this changes is ultimately a matter of profiling. The design of this crate is such that for all of these raw-bytes types, we do a single validation pass when we first read them; if this pass succeeds, we know that any subsequent member access is in bounds, which means we should soundly be able to use the unsafe versions of these methods, if we wish. In practice we haven't bothered playing with this yet, since we have not gotten to the point where we are trying to squeeze out small performance gains, but I can definitely imagine a case where we might (for instance if we end up using this code as the foundation for a new shaping engine.)

In the meantime, though, I think we can at least mark the unsafe fns as not-pub, which should avoid any confusion.

danakj commented 1 year ago

Honestly leaving them pub is fine, I wanted to make sure I was not missing review of any unsafe code (the callers of those). They will be visible when they exist thanks to the unsafe block required.