googlefonts / oxidize

Notes on moving tools and libraries to Rust.

Initial benchmark results and subsequent thinking #14

Open cmyr opened 2 years ago

cmyr commented 2 years ago

I spent a good bit of time last week experimenting with different implementation approaches, and writing up some simple benchmarks. A summary of the benchmark results is available here; the benchmarks themselves can be run locally by checking out the benchmarks branch of the test repo and running cargo bench.

I found myself getting lost in the details when trying to write up my conclusions, so I want to pull back from that and focus on the big picture, which is basically a table-description macro along these lines:

toy_types_derive::tables! {
    Maxp05 {
        #[version(0x00005000)]
        version: Version16Dot16,
        num_glyphs: uint16,
    }

    Maxp10 {
        #[version(0x00010000)]
        version: Version16Dot16,
        num_glyphs: uint16,
        max_points: uint16,
        max_contours: uint16,
        max_composite_points: uint16,
        max_composite_contours: uint16,
    }

    // this creates an enum
    #[versioned(Version16Dot16)]
    table Maxp {
        Version0_5(Maxp05),
        Version1_0(Maxp10),
    }

    table Cmap {
        version: uint16,
        num_tables: uint16,
        #[count = "num_tables"]
        encoding_records: [EncodingRecord],
    }
}

From here we would build up descriptions of the various tables and their fields, and we could use that to generate different concrete implementations. This will be reusable, and will also make it easier for us to iterate.
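
Purely as an illustrative sketch (not actual or proposed generated output), the versioned Maxp description above might expand into something roughly like the following; the view-type names and the accessor body here are made up:

#[allow(non_camel_case_types)]
enum Maxp<'a> {
    Version0_5(Maxp05View<'a>),
    Version1_0(Maxp10View<'a>),
}

struct Maxp05View<'a>(&'a [u8]);
struct Maxp10View<'a>(&'a [u8]);

impl<'a> Maxp<'a> {
    fn read(data: &'a [u8]) -> Option<Self> {
        // dispatch on the leading Version16Dot16 value
        let version = u32::from_be_bytes(data.get(0..4)?.try_into().ok()?);
        match version {
            0x00005000 => Some(Maxp::Version0_5(Maxp05View(data))),
            0x00010000 => Some(Maxp::Version1_0(Maxp10View(data))),
            _ => None,
        }
    }
}

impl<'a> Maxp05View<'a> {
    // generated accessor: num_glyphs immediately follows the 4-byte version
    fn num_glyphs(&self) -> u16 {
        u16::from_be_bytes(self.0[4..6].try_into().unwrap())
    }
}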

tl;dr: I'm going to keep experimenting, leaning on zerocopy as much as possible. Let's see what that looks like.

More detail

Over the past week or so I have been experimenting with various approaches to accessing and representing data in font tables, as well as with using Rust's procedural macro system to generate code to do the actual accessing.

Concretely, I have experimented with three different approaches: native types, view types, and zero-copy casts.

Overall, these approaches are quite similar: they differ only in how they handle scalar values and types composed entirely of scalar values. For much of the rest of their implementations, they are identical.

I want to evaluate these options along our key axes: safety/performance and ease of use, for both end users and future maintainers.

Notes on the implementations

I really did end up focusing on microbenchmarks. This means that we're largely answering questions like, "if we know the offset of a given structure, how long does it take us to safely access some part of that structure?".

These implementations aren't optimal. I'm using a single derive macro to describe the shape of the plain-old-data version of each structure, and generating a view version of that structure at the same time. There are various ways the code generation could be improved, but a major one is that the views in particular are not bounds-checked when they are created, which means each field access of a view is bounds-checked; as a result, all the view cases in these benchmarks are slower than they could be.

The zerocopy versions were not generated by a macro; I wrote these by hand.

Findings

I'm going to provide a quick summary of the main takeaways, and then I'll go into a bit more detail around some specific topics below.

zerocopy is great for anything that is all scalars.

I think we should have a core set of 'raw' types that all work with zerocopy. This will be a lot like HarfBuzz: behind the scenes we will be dealing with fixed-size arrays of bytes.
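
To make "fixed-size arrays of bytes" concrete, a raw big-endian scalar could look roughly like this sketch (the uint16 name and the get method are illustrative, not the actual oxidize types, and this assumes zerocopy's derive support is enabled):

use zerocopy::{AsBytes, FromBytes, Unaligned};

/// A raw big-endian u16: just two bytes, with no alignment requirement.
#[derive(FromBytes, AsBytes, Unaligned, Clone, Copy, Debug)]
#[repr(transparent)]
#[allow(non_camel_case_types)]
struct uint16([u8; 2]);

impl uint16 {
    /// Convert to a native-endian u16 when we actually need the value.
    fn get(self) -> u16 {
        u16::from_be_bytes(self.0)
    }
}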

The interface for the zerocopy crate is based around a number of traits, most importantly FromBytes (for casting from bytes) and AsBytes (for casting to bytes). Implementing these traits is trivial (through derive) but they are only available for types that have a constant, known size.

So: I think that for anything that is all-scalar, using zerocopy directly makes sense.

This works great:

#[derive(zerocopy::FromBytes)]
#[repr(C)]
struct EncodingRecord {
    platform_id: uint16, // our raw types: fixed-size BE byte arrays (uint16 is essentially a [u8; 2])
    encoding_id: uint16,
    subtable_offset: Offset32,
}

Things that are variable-sized are trickier.

Let's take a simple example, the cmap header.

Type             Name
uint16           version
uint16           numTables
EncodingRecord   encodingRecords[numTables]

We would love to be able to create something like the following:

struct Cmap {
    version: uint16,
    num_tables: uint16,
    encoding_records: SomeArrayThing<EncodingRecord>, // Something??
}

The problem is with the array, where the length depends on num_tables. This means that it cannot work with zerocopy directly; zerocopy will not transmute into something with an unknown size.

We have a number of options for how to handle these cases.

view

We could just do the 'view' strategy, and expose the fields exclusively through methods, which would look something like this:

struct Cmap<'a>(&'a [u8]);

impl<'a> Cmap<'a> {
    // we make sure that the data makes sense at construction time
    fn new(table_data: &'a [u8]) -> Option<Self> {
        // assuming `read` is a helper that reads a raw BE value from the given byte range
        let num_tables: u16 = table_data.read(2..4)?.into();
        let expected_len = 4 + num_tables as usize * std::mem::size_of::<EncodingRecord>();
        if table_data.len() < expected_len {
            None
        } else {
            Some(Self(table_data))
        }
    }

    fn version(&self) -> uint16 {
        // can't fail, length checked in constructor
        // we could unsafe here if we want
        self.0.read(0..2).unwrap()
    }

    fn num_tables(&self) -> uint16 {
        self.0.read(2..4).unwrap()
    }

    fn encoding_records(&self) -> &'a [EncodingRecord] {
        let len = self.num_tables() as usize * std::mem::size_of::<EncodingRecord>();
        let record_bytes = self.0.get(4..4 + len).unwrap();
        zerocopy::LayoutVerified::new_slice_unaligned(record_bytes).unwrap().into_slice()
    }
}

The downside here is that if we want to avoid the unsafe keyword, we'll be doing a bunch of bounds checking; an additional downside is that you don't have a nice struct that just lists all of its fields.
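
For illustration, an accessor that accepted unsafe instead (leaning on the length check done in the constructor) might look like the following sketch; it returns a native u16 just to keep the example self-contained:

impl<'a> Cmap<'a> {
    fn version_unchecked(&self) -> u16 {
        // SAFETY: `new` verified the slice is at least `expected_len` bytes long,
        // and `expected_len` is always >= 4, so bytes 0..2 are in bounds.
        let bytes = unsafe { self.0.get_unchecked(0..2) };
        u16::from_be_bytes(bytes.try_into().unwrap())
    }
}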

doing more parsing up front

In that last example, the encoding_records method has to do bounds checking each time it's called. It would be nice if we could just do this once, up front; this would pay for itself quite quickly.

For the other fields, we could either copy them over at the same time, or we could continue to expose them via methods. I'll illustrate the first option, since the second option looks the same as the previous example:

struct Cmap<'a> {
    version: uint16,
    num_tables: uint16,
    encoding_records: &'a [EncodingRecord],
    subtable_data: &'a [u8],
}
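
As a sketch of how the up-front parsing could work (reusing zerocopy's LayoutVerified API from the earlier example; the CmapHeader helper is hypothetical, and this assumes EncodingRecord and the raw uint16 type also derive Unaligned and that uint16 exposes a get() conversion):

use zerocopy::{FromBytes, LayoutVerified, Unaligned};

// hypothetical fixed-size header covering just the scalar fields
#[derive(FromBytes, Unaligned)]
#[repr(C)]
struct CmapHeader {
    version: uint16,
    num_tables: uint16,
}

impl<'a> Cmap<'a> {
    fn parse(table_data: &'a [u8]) -> Option<Self> {
        // split off the fixed-size header; bounds are checked here, once
        let (header, rest) =
            LayoutVerified::<_, CmapHeader>::new_unaligned_from_prefix(table_data)?;
        let num_tables = header.num_tables;
        // split off exactly `num_tables` encoding records, leaving the subtable data
        let (records, subtable_data) =
            LayoutVerified::<_, [EncodingRecord]>::new_slice_unaligned_from_prefix(
                rest,
                num_tables.get() as usize,
            )?;
        Some(Cmap {
            version: header.version,
            num_tables,
            encoding_records: records.into_slice(),
            subtable_data,
        })
    }
}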

doing codegen

For this set of experiments, I used 'derive-style' macros. These have a major limitation: they cannot modify their input; they can only generate additional items (implementation blocks, structs) as output. The problem is that if we are generating view types, derive doesn't really work (at least not idiomatically), since we don't actually want the types we're describing to have fields at all.

I think that the next approach is to try function-like macros, which basically let us describe an arbitrary DSL. We can use this to implement a sort of table-description language; we can then take these table descriptions and use them to generate whatever code we need. Importantly we'll be able to reuse the same descriptions to generate types for the parsing case as well as owned types that can be used during compilation.

rsheeter commented 2 years ago

One small point: zerocopy doesn't have to mean the crate named zerocopy.

My intuition is that procmacros will significantly widen what we can automate parsing for, in particular extending to reading things where some field says the number of things to find in another field. Doesn't get us 100% but perhaps it gets us pretty far :)

QQ, in this example why aren't you using zerocopy's BE types?

// Example from this issue
#[derive(zerocopy::FromBytes)]
#[repr(C)]
struct EncodingRecord {
    platform_id: uint16, // our raw types: fixed-size BE byte arrays (uint16 is essentially a [u8; 2])
    encoding_id: uint16,
    subtable_offset: Offset32,
}

// When I tried Zerocopy I ended up with the following
// Ref https://github.com/rsheeter/rust_read_font/blob/main/mk1/src/main.rs
#[repr(C)]
#[derive(FromBytes, AsBytes, Unaligned)]
#[derive(Debug, Clone)]
struct cmapEncodingRecord {
    platform_id: U16<BigEndian>,
    encoding_id: U16<BigEndian>,
    subtable_offset: U32<BigEndian>,
}

generate types for the parsing case as well as owned types that can be used during compilation.

Aren't the zerocopy structs capable of mutation and conversion to bytes as well, meaning the compiler can create and manipulate them rather than using entirely different types?

the reader cannot easily interpret from the macro what code is going to be generated

We can document the commands to materialize the proc macro output, and perhaps that substantially mitigates the problem, particularly if we are somewhat careful to emit legible code?

cmyr commented 2 years ago

One small point: zerocopy doesn't have to mean the crate named zerocopy.

Yes, good thing to be clear on. That said, the zerocopy crate seems (so far) to be the best fit for us in terms of implementing the zerocopy pattern.

My intuition is that procmacros will significantly widen what we can automate parsing for, in particular extending to reading things where some field says the number of things to find in another field. Doesn't get us 100% but perhaps it gets us pretty far :)

In my exploration so far, this is working fine with derive-style macros, but moving to function-like proc macros will definitely give us more or less arbitrary control.

QQ, in this example why aren't you using zerocopies BE types?

Oops, yes. To clarify: in all these examples, all scalar types are assumed to be raw BE bytes.

Aren't the zerocopy structs capable of mutation and conversion to bytes as well, meaning the compiler can create and manipulate them rather than using entirely different types?

Mutation is possible in theory, although given how complicated the structure of the graph is, this might start to bump up against Rust's aliasing rules. I haven't investigated this yet.

Assuming it is possible, though, it will still be limited: for instance you won't be able to add a new item to an array. That sort of construction will require something else.

the reader cannot easily interpret from the macro what code is going to be generated

We can document the commands to materialize the proc macro output, and perhaps that substantially mitigates the problem, particularly if we are somewhat careful to emit legible code?

Yes, I definitely think documentation will be important.

For legibility: the code that we generate will be hidden from the user, except to the extent that it will show up in the documentation.

rsheeter commented 2 years ago

all scalar types are assumed to be raw BE bytes

I suggest this would be far more obvious if the type were very explicit about being BE. U16<BigEndian> is rather clearly BE; with just a comment and a bare uint16, I'm bad enough at reading to miss the comment, interpret the field as uint16_t (yes I know, wrong language), and then become confused when skimming quickly.

Assuming it is possible, though, it will still be limited: for instance you won't be able to add a new item to an array.

Certainly. Still a pretty powerful building block for both a subsetter and a compiler. For example, I could make a mutable vector of a zerocopy type or accumulate refs or slices, etc.
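
For example, a sketch of that pattern, written generically so it isn't tied to any particular record type:

use zerocopy::AsBytes;

/// Sketch: accumulate fixed-size zerocopy records in an ordinary Vec while
/// compiling, then view the whole run of records as raw bytes when writing
/// the table out (a slice of AsBytes types is itself AsBytes).
fn flatten_records<T: AsBytes>(records: &[T]) -> Vec<u8> {
    records.as_bytes().to_vec()
}

A compiler could push EncodingRecord values into an ordinary Vec and call something like this when assembling the final table.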

the code that we generate will be hidden from the user

To be slightly more specific, I want to be able to cargo expand (or equivalent if there are other tools to do that job) and have the code our macros produce - particularly proc macros - be human readable.