A hybrid, safe approach

cmyr commented 2 years ago

the important question part

How much, exactly, do we dislike copying? Below I am going to outline an approach that copies/converts scalar fields greedily, as objects are accessed. With this approach, we would still avoid a lot of copying (we would only ever copy bytes on tables that we accessed, and there still quite minimally) and in exchange for spending a few cycles up front, we end up with a much more natural API.

the explanation part

So far, we have been discussing two different take on a zerocopy API: first is the HarfBuzz model, where we take some chunk of bytes and directly reinterpret them as some type T that has fields we can access; and second is the pinot model, where instead of a struct with fields we have a 'view' type that generates methods for each field, and those methods reach into the bytes and interpret them as a given scalar value.

I like different parts of each of these approaches. For the HarfBuzz approach, I like that we end up with a struct that acts like any other (modulo the fact that you can't directly read the fields; you need to call getters) and which also is a more-or-less literal transcription of the items in the spec. For the pinot version, I like that there is no transmute (which is more heavily restricted in Rust than in cpp, as I understand it) and perhaps also that it lets the overall API (perhaps) be more consistent, because some things will probably make more sense as methods anyway.

I've been thinking about an approach I'll call 'copy leaves'. In this approach, in a given object, we distinguish between fields that are scalars and those that are collections/references. We always copy scalars, and we always create views into references.

In this world, a declaration of fvar would look something like,

#[derive(FontThing)]
pub struct Fvar<'a> {
    blob: Blob<'a>,
    // header
    pub major_version: u16,
    pub minor_version: u16,
    pub axes_offset: Offset16,
    reserved: u16,
    axis_count: u16,
    axis_size: u16,
    pub instance_count: u16,
    pub instance_size: u16,
    // in general you can specify with annotations where the data should come from
    #[font_thing(count(axis_count), offset(axes_offset))]
    pub axes: Array<'a, VariationAxisRecord>,
    // in fancy cases  you can throw up your hands and just name arguments to
    // be passed in at construction time?
    #[font_thing(args(instance_count, axis_count, instance_size))]
    pub instances: FvarInstancesArray<'a>,
}

maybe this wasn't the best example; I'll include a bit more code below, but I mainly want to highlight the basics:

axes and instances are 'array' types; they wrap a blob of data, and provide an array-like API for random access. Most of the time, there can be shared types to do this (the Array, type, here) but in the specific case of the instances field on fvar we bail and use a custom type; we specify in the macro some arguments that should be passed to that type's constructor.
when a type is instantiated, any scalars that are fields of that type are copied & converted
then they act just like normal fields, and are native rust data types.
alternatively, we could generate methods instead of fields for the non-scalars, which would be easier in some ways.

Is this a non-starter?

For the interested, this is just me sketching out how this might work implementation-wise:

/// Variation axis record.
#[derive(FontThing)]
pub struct VariationAxisRecord {
    // ... omitted for brevity
}

// no derive, because we construct this by hand?
pub struct InstanceRecord<'a> {
    pub subfamily_name_id: u16,
    pub flags: u16,
    pub coordinates: Array<'a, Fixed>,
    pub postscript_name_id: Option<u16>,
}

// no derive, we construct this by hand
pub struct FvarInstancesArray<'a> {
    blob: Blob<'a>,
    count: u16,
    instance_size: u16,
    axis_count: u16,
}

impl<'a> FvarInstancesArray<'a> {
    fn get(&self, ix: usize) -> Option<InstanceRecord<'a>> {
        let start = ix * self.instance_size as usize;
        let subfamily_name_id = blob.read(start)?;
        let flags = blob.read(start + 2)?;
        let coords_len = Fixed::SIZE * (self.axis_count as usize);
        let coords_blob = blob.get(start + 4..start + 4 + coords_len)?;
        let coordinates = Array::new(coords_blob);
        let has_ps_name = instance_size - (axis_count * Fixed::SIZE) == 6;
        let postscript_name_id: Option<u16> = if has_ps_name {
            blob.read(start + 4 + coords_len)
        } else {
            None
        };
        InstanceRecord {
            subfamily_name_id,
            flags,
            coordinates,
            postscript_name_id
        }
    }
}

rsheeter commented 2 years ago

Couple of immediate questions:

In what relevant way(s) is transmute more restricted than C++?
Am I right in reading this as when you read one field of an object with N you pay for all of them? - I'm picturing wanting a field here and a field there starting to add up (I need upem and
Is the idea we are hand-writing the FvarInstancesArray::get method? - I'd really like to avoid hand-written read at offset code

rsheeter commented 2 years ago

Aside: I was initially really liking the zerocopy idea (it's just like HarfBuzz and we know that works) but after the discussion today I began to think I undervalued the safety (no reinterpret_cast) of the view approach. Both still seem like good options to me.

Do we have any insight into whether https://github.com/rust-lang/project-safe-transmute is likely to come through?

cmyr commented 2 years ago

Couple of immediate questions:

In what relevant way(s) is transmute more restricted than C++? The one way I was thinking of, in particular, has to do with the strict aliasing rules rust has around &mut; for instance in rust transmuting &T to &mut T is insta UB, even if you never read or write to the new reference. I'm not familiar enough with the C++ standard (nor really with unsafe rust) to speak authoritatively here, but there are lots of other ways to get UB out of transmute: screwing up alignment or padding, trasmuting to a type with non-representable bit-patterns, etc.

Am I right in reading this as when you read one field of an object with N you pay for all of them? - I'm picturing wanting a field here and a field there starting to add up (I need upem and

yes, in this case if you need one scalar you would pay for all scalars on that structure. What I'm curious about is how often, in practice, this is a problem. Most structures are small: a handful of scalars. We only pay this cost once, and we can reuse the result. Perhaps at the leaves we end up needing to hit most fields anyway, and doing the reads all at once keeps us in L1 more? Maybe if we want to have a 100% safe API, doing the reads all at once lets us do less bounds checking? I don't know if any of these are good arguments, I'm just thinking out loud.

Is the idea we are hand-writing the FvarInstancesArray::get method? - I'd really like to avoid hand-written read at offset code

That last fvar field is particularly hairy, and maybe wasn't the best example. It will need to be special-cased somehow. I think my preferred implementation would be something where you just provide a custom function that calculates the parameters that describe the array (start offset, n_items, item size) and then it's a common array impl that handles calculating offsets for individual members.

Aside: I was initially really liking the zerocopy idea (it's just like HarfBuzz and we know that works) but after the discussion today I began to think I undervalued the safety (no reinterpret_cast) of the view approach. Both still seem like good options to me.

I agree that there's a lot to like about the zerocopy struct model, but the various little paper cuts are pushing me back towards the view approach. It's feasible to have a no-unsafe version of the API, and we should end up with essentially the same ASM as we'd get from current HarfBuzz; modulo bounds checking, (and that is a dial we can turn) it's all going to end up being array access and BE/LE conversions.

Do we have any insight into whether https://github.com/rust-lang/project-safe-transmute is likely to come through?

I don't have any special insight, but I've poked around the discussion and it seems like there is some interest and a basic design, and some initial PRs have landed. It doesn't look like there are any huge blockers, but it's definitely a ways away from stabilizing.

rsheeter commented 2 years ago

IMO it makes sense to prototype both view and zerocopy style. Hopefully a macro (or script or whatever generator ... just exploration for now) can be made to spit out both so we can play with them a little? One could even do a hybrid with zerocopy struct access and methods to access the fields in friendly types.

IMHO the appeal of the zerocopy style goes up if a blessed implementation seems likely to come.

We only pay this cost once, and we can reuse the result

Isn't that only true if we hang onto the result? - for example, iiuc if I read ... idk, upem, ... repeatedly in the most naive way I might be surprised how many bytes I'm accessing

dfrg commented 2 years ago

yes, in this case if you need one scalar you would pay for all scalars on that structure. What I'm curious about is how often, in practice, this is a problem.

With my perf cap on, I have no real objection to this. The largest offender here is the OS/2 table at 100 bytes for version 5. In reality, any high performance rasterizer or shaper is going to read and cache the pertinent values up front. Even if you don't persist the values across rasterization/shaping operations, the cost of reading these tables will be dominated by that of decoding outlines or applying layout subtables.

I should also say that fast shaping absolutely requires auxiliary data structures such as the accelerators in HarfBuzz (swash does something different but similar; I perhaps should have described some of this during the meeting and I'm happy to do so next time).

Isn't that only true if we hang onto the result? - for example, iiuc if I read ... idk, upem, ... repeatedly in the most naive way I might be surprised how many bytes I'm accessing

I feel okay classifying something like font.head().units_per_em in an inner loop as user error. Even in the case where the head table itself is zerocopy, this still requires a binary search to locate the table record, which is likely more expensive than reading the collection of fields.

dfrg commented 2 years ago

I feel like writing is the big open question here. HarfBuzz and pinot (+swash) are functional examples of the zerocopy and view methods of reading, respectively, and I think both provide nice APIs and good performance.

How does writing look with this hybrid method? I assume the FontThing trait would provide some method that can perhaps write into any type that impls Extend<u8>?

rsheeter commented 2 years ago

I agree re reading upem in a tight loop. If my font root object bsearches for head every access ... maybe it needs an accelerator :D

IIUC the implications are:

We're making a allocation, which isn't free even if it's fast (that struct stack allocs right?)
- actually ... isn't this true of the view approach too? To dance through a graph you'd allocate lots of slices whereas zerocopy access is mostly zeroalloc, even on stack?
We copy and reorder a bunch of extra fields every time we read a structured object and I would expect we read structured objects quite a lot, which again isn't free even if it's fast

Is it fair to say that implies a starting preference of zerocopy > view > hybrid, and while we can argue the cost isn't enough to matter it's not fair to claim it's nothing at all?

I could see choosing view anyway for safety but I'm failing to grasp an advantage that would lead us to favor this hybrid?

cmyr commented 2 years ago

I feel like writing is the big open question here. HarfBuzz and pinot (+swash) are functional examples of the zerocopy and view methods of reading, respectively, and I think both provide nice APIs and good performance.

How does writing look with this hybrid method? I assume the FontThing trait would provide some method that can perhaps write into any type that impls Extend<u8>?

Definitely one advantage of the 'just structs' model that I've been underselling (and this applies in theory both to the idea sketched out here as well as the general HarfBuzz approach) is that you have a single set of types that you can (hopefully) use when both reading and writing.

Although there are still parts I'm not at all sure about (what is an Array backed by, when we're writing? does each array get its own buffer, and can it be mutated after creation, or do you just create it once and then it's immutable?) I imagine some kind of write_to(&self, buf: &mut Sink) method, where Sink is... something. Maybe just a std::io::Write? maybe a &mut [u8] (which has been pre-sized, based on the reported size of the object?) Extend<u8> is another option, although it isn't a pattern I see as often?

IMO it makes sense to prototype both view and zerocopy style. Hopefully a macro (or script or whatever generator ... just exploration for now) can be made to spit out both so we can play with them a little?

The only slight complication here is that there are two different types of macro we might use: derive macros are limited to only working on type declarations, but they have clearer semantics, are more idiomatic, and a bit easier to write. Free-form 'function-like' macros have no rules, your input is an arbitrary token stream and you can output whatever you want, e.g. you can use them to create an entire DSL.

IMO it makes sense to prototype both view and zerocopy style. Hopefully a macro (or script or whatever generator ... just exploration for now) can be made to spit out both so we can play with them a little? One could even do a hybrid with zerocopy struct access and methods to access the fields in friendly types.

IMHO the appeal of the zerocopy style goes up if a blessed implementation seems likely to come.

Agreed, although it's an unknown how that core implementation will relate to existing implementations, or what the timeframe would be.

I agree re reading upem in a tight loop. If my font root object bsearches for head every access ... maybe it needs an accelerator :D

IIUC the implications are:

We're making a allocation, which isn't free even if it's fast (that struct stack allocs right?)

the struct is on the stack, so no allocation in the sense of no malloc, but there will be reserved space in the generated code.

actually ... isn't this true of the view approach too? To dance through a graph you'd allocate lots of slices whereas zerocopy access is also zeroalloc?

Slices are generally a pointer + a length, so two words. In practice that may get optimized away, but holding on to a particular view into some part of the font bytes should not be more expensive than holding on to the font bytes directly, I don't think?

We copy and reorder a bunch of extra fields every time we read a structured object and I would expect we read structured objects quite a lot, which again isn't free even if it's fast

This is true, and maybe I just need to actually do some measuring here. My hunch is that this cost is going to be low relative to overall runtime costs, but how about I try to get some real data and get back to you. :)

Is it fair to say that implies a starting preference of zerocopy > view > hybrid, and while we can argue the cost isn't enough to matter it's not fair to claim it's nothing at all?

definitely don't want to make that claim! I'm just trying to tune my intuition for exactly how much weight this cost should be given.

I could see choosing view anyway for safety but I'm failing to grasp an advantage that would lead us to favor this hybrid?

The main advantage is it gives us the simplest API, the most idiomatic code (declaring a struct using native types and slapping a #[derive(..)]` on it) and also that it opens the possibility to reusing most types for reading and writing. Similarly the types that we generate this way are a 1:1 representation of the types in the spec, hopefully.

There's one issue with zerocopy/bytemuck that we haven't discussed that much, and that's that they won't play nicely with our array types. Traits like zerocopy::FromBytes don't have support for variable length fields: we can't say "calculate a length N based on these previous fields, and then the next N bytes are actually an array. This probably isn't a strict limitation of the language, but it is not a case that these crates have considered, which means we can't really use their derive macros, and so we lose their safety guarantees.

One partial solution here is what chad did in https://github.com/dfrg/pinot/blob/eff5239018ca50290fb890a84da3dd51505da364/src/fvar1.rs, and split this up into zerocopy structs (like the header type) and then having more view-like methods for accessing collections. This means you don't have the full "this struct maps exactly to the spec" benefit, but it's closer than the view-only approach?

rsheeter commented 2 years ago

some real data

It would be interesting to see in godbolt. My impression is you'd get quite a few more instructions in some models vs others. I do agree it's likely to be low relative to total ... but we're trying to beat a C++ codebase that was tuned for 10+ years so I suspect - but have not yet shown - this could prove to matter.

it opens the possibility to reusing most types for reading and writing

I was imagining this to pretty much be true for all 3 options, is that not true? - for example, I would think I can allocate a vec of some zerocopy type (perhaps when compiling) or a vec of pointers to a zerocopy type (perhaps when subsetting) and memcpy their bytes into an output buffer.

dfrg commented 2 years ago

I agree re reading upem in a tight loop. If my font root object bsearches for head every access ... maybe it needs an accelerator :D

+1

but we're trying to beat a C++ codebase that was tuned for 10+ years so I suspect - but have not yet shown - this could prove to matter.

FWIW, swash is competitive with HarfBuzz here, with big wins in some cases. See: https://github.com/harfbuzz/harfbuzz/issues/3221 It is not there yet wrt correctness, but I don’t foresee any necessary changes that will adversely affect perf.

It might be useful to enumerate priorities to make sure we’re on the same page? As I see it:

Correct + safe
Fast
Hackable
Ergonomic

I specifically put ergonomics last because I don’t think the additional accessor calls on the endian swapping types should discredit the zerocopy model. I assume most developers will be interacting with this library through a higher level interface: compiler, subsetter, sanitizer, shaper, etc.

dfrg commented 2 years ago

I was imagining this to pretty much be true for all 3 options, is that not true? - for example, I would think I can allocate a vec of some zerocopy type (perhaps when compiling) or a vec of pointers to a zerocopy type (perhaps when subsetting) and memcpy their bytes into an output buffer.

zerocopy is the only one that gives us this for free for aggregates with purely blittable elements but the others can be handled by code generation.

rsheeter commented 2 years ago

Wrt priorities I don't think it's quite that clean, we're balancing tradeoffs and they don't get a crisp absolute priority like that. Here's a first attempt to sketch how I see it, higher priority items listed first:

What	Why
Safe, fast, memory-efficient	If it's not fast enough or it hogs memory nobody will ship it. We would accept unsafe - that we can convince ourselves to be safe - if it got us substantial perf. However, if we can have safety AND speed then it's a pretty easy choice to take that path.
Ergonomic	We need mortals to be able to work on it and with it
Hackable	I waffled on putting this sibling with ergonomic

I will note that my hope is that Rust will prove to offer us a path where we don't have to make all that much of a tradeoff, we can have it all! Both zerocopy and views strike me as capable of delivering all of this .

I don’t think the additional accessor calls on the endian swapping types should discredit the zerocopy model

Shouldn't we penalize view for this too? - calling .somefield() is an accessor. A zerocopy struct with pinot-style accessors that get inlined kind of appeals to me.

dfrg commented 2 years ago

Here's a first attempt to sketch how I see it, higher priority items listed first

Your more nuanced take looks good to me. Although I was defining hackability as being able to modify the code and ergonomics as a sort of fluency when consuming the code. In my mind, I'm drawing a sharp distinction between "I want the (possibly variable) advance width for character 'x'" and "I want to pick at the data in the hmtx/HVAR tables." IMO the former should be provided by a higher level library that presents a more comprehensive model of the font and the scope of this library should be limited to the latter. I'm not sure if we're all in agreement on this.

I will note that my hope is that Rust will prove to offer us a path where we don't have to make all that much of a tradeoff, we can have it all! Both zerocopy and views strike me as capable of delivering all of this .

Agreed! I think we can get there.

Shouldn't we penalize view for this too? - calling .somefield() is an accessor. A zerocopy struct with pinot-style accessors that get inlined kind of appeals to me.

My thought is that we shouldn't penalize either for this. Coincidentally, Rust permits having a field and method with the same name, so we can have our cake and eat it too.

cmyr commented 2 years ago

edit: didn't see the later replies.

Glad to see the priorities expressed clearly, this is helpful. The take away for me at this point is that I need to write some code. :)

Raph also mentioned potentially wanting the ability to run in 'safe mode' which I'd think of as basically 'zero unsafe code'. Do we want this to be a design goal? It does handcuff us somewhat: it rules out zerocopy, and it also means, in the view case, that each field access would likely need to be bounds checked.

In any case I'm going to write up some code that generates a few different implementations, and we can compare them.

One QQ: anyone have some nice ideas for some access patterns / toy programs that i can try with various implementations, to get a sense for actual perf/codegen? preferably something that is realistic, part of the hot path, and doesn't involve implementing all of GPOS/GSUB. I was thinking of maybe something like walking a string and getting the bbox for each glyph? so hitting cmap -> loca -> glyf. Does anyone have a better idea?

Shouldn't we penalize view for this too? - calling .somefield() is an accessor. A zerocopy struct with pinot-style accessors that get inlined kind of appeals to me.

My thought is that we shouldn't penalize either for this. Coincidentally, Rust permits having a field and method with the same name, so we can have our cake and eat it too.

In terms of generated code, I think that the zerocopy struct model and the view model are very nearly equivalent, so how we weigh them against each other is going to come down to other differences like readability, reusability (for compilation, say) safety, and other secondary things like that.

rsheeter commented 2 years ago

the former should be provided by a higher level library that presents a more comprehensive model of the font and the scope of this library should be limited to the latter

Makes sense to me, barring some compelling reason to do otherwise.

rsheeter commented 2 years ago

Filed #11 to track finding out how safe transmute is going. https://github.com/jswrenn/typic, a proof of concept implementation that led to https://github.com/jswrenn/project-safe-transmute/blob/rfc/rfcs/0000-safe-transmute.md#safe-transmute-rfc, looks kind of promising.

dfrg commented 2 years ago

Raph also mentioned potentially wanting the ability to run in 'safe mode' which I'd think of as basically 'zero unsafe code'. Do we want this to be a design goal? It does handcuff us somewhat: it rules out zerocopy, and it also means, in the view case, that each field access would likely need to be bounds checked.

If we’re using macros, we should be able to feature gate on this and have the macros emit the appropriate code. This also gives us the ability to measure the performance difference between the two and potentially drop the unsafe entirely.

I was thinking of maybe something like walking a string and getting the bbox for each glyph?

This sounds good to me. The format switching in cmap is a good test as I think it might be a pain point with zerocopy.

googlefonts / oxidize

A hybrid, safe approach #10

the important question part

the explanation part