Other types to support?

smessmer / binary-layout

The binary-layout library allows type-safe, inplace, zero-copy access to structured binary data. You define a custom data layout and give it a slice of binary data, and it will allow you to read and write the fields defined in the layout from the binary data without having to copy any of the data. It's similar to transmuting to/from a #[repr(packed)] struct, but much safer.

Apache License 2.0

Other types to support? #13

Closed ckaran closed 5 months ago

ckaran commented 2 years ago

Not really an issue, but something to think about for the future of this project. Do you want to support any of the types in libcore? Or is that out of scope for this project? I'm thinking about things like the atomic types, which have simple mappings to other primitive types. The issue for the atomics in particular is that you need to specify the access ordering, which means that we'd be making statements regarding ordering that end users don't want applied.

Alternatively, you could declare anything in any library as out of scope for the project. That is probably the simplest option, and it would let you polish and finish the project. Updates would only include any new language primitives that are added (e.g., never, should it ever stabilize).

smessmer commented 2 years ago

Atomics are difficult since people would probably assume that accesses to the underlying memory location would be synchronized. If we find a way to implement this, then adding it may be a good idea but I don't think that would be easy to implement for us, if at all possible.

There are a couple of types mentioned in the project README that may make sense adding, like boolean or slices over primitive types larger than u8.

For types from other libraries, we'd have to decide on a case by case basis but they'd have to be common enough to justify the dependency. I currently can't think of any that would fit that requirement.

There's also always the option to have a separate crate that users can depend on additionally if they wish and that crate is using LayoutAs to extend the types we support. In the current LayoutAs design, they'd have to use newtype wrappers if they implement it for types not in their own crate, but maybe we can fix that and find a way to allow them to implement support for types without newtype wrappers

ckaran commented 2 years ago

Atomics are difficult since people would probably assume that accesses to the underlying memory location would be synchronized. If we find a way to implement this, then adding it may be a good idea but I don't think that would be easy to implement for us, if at all possible.

Yup, exactly what I was thinking as well. Since we lack knowledge that others have, we'd have to make a choice about the synchronization ordering, which for safety's sake would likely be the most conservative choice.

There are a couple of types mentioned in the project README that may make sense adding, like boolean or slices over primitive types larger than u8.

Slices of primitive types larger than u8 would probably be OK. Booleans are going to be a headache. Sometimes you're willing to waste a whole byte to store a single bit, and sometimes you're not. If it were possible to make binary-layout aware of the storage it was writing into, then its behavior could be tailored based on whether it was writing into a [u8] or something like bitvec. Hmmm... I think I've just convinced myself that booleans cannot be directly supported as a primitive type...

For types from other libraries, we'd have to decide on a case by case basis but they'd have to be common enough to justify the dependency. I currently can't think of any that would fit that requirement.

Sorry, I was being unclear. I wasn't thinking about any libraries other than core and std. So, basically I was saying that you could declare that this crate is not only #![no_std], but also #![no_core]. The result would be that only those items that are defined by the language itself could have this trait implemented on them.

There's also always the option to have a separate crate that users can depend on additionally if they wish and that crate is using LayoutAs to extend the types we support. In the current LayoutAs design, they'd have to use newtype wrappers if they implement it for types not in their own crate, but maybe we can fix that and find a way to allow them to implement support for types without newtype wrappers

I think that this is going to be the best idea. Maybe something similar to the following:

pub trait TryLayoutAs {
    type Higher;
    type Lower;
    type Error: std::error::Error;

    fn try_read(&self, v: Self::Lower) -> Result<Self::Higher, Self::Error>;
    fn try_write(&self, v: Self::Higher) -> Result<Self::Lower, Self::Error>;
}

This will let users write crates with types that can be piped from one type to another relatively easily. As an alternative, it may be useful to consider something like the pipe trait, which will make it easier to move between different higher and lower levels as needed.

smessmer commented 2 years ago

Atomics are difficult since people would probably assume that accesses to the underlying memory location would be synchronized. If we find a way to implement this, then adding it may be a good idea but I don't think that would be easy to implement for us, if at all possible.

Yup, exactly what I was thinking as well. Since we lack knowledge that others have, we'd have to make a choice about the synchronization ordering, which for safety's sake would likely be the most conservative choice.

I was more worried about direct access to the underlying storage. The binary-layout View is a view over a &[u8] or similar storage, but it doesn't necessarily have exclusive access to that storage. We can implement ordering of our own accesses, but then some other user code may just directly access the storage and ignore the synchronization entirely.

There are a couple of types mentioned in the project README that may make sense adding, like boolean or slices over primitive types larger than u8.

Slices of primitive types larger than u8 would probably be OK. Booleans are going to be a headache. Sometimes you're willing to waste a whole byte to store a single bit, and sometimes you're not. If it were possible to make binary-layout aware of the storage it was writing into, then its behavior could be tailored based on whether it was writing into a [u8] or something like bitvec. Hmmm... I think I've just convinced myself that booleans cannot be directly supported as a primitive type...

I think boolean can be supported. We need some syntax to distinguish between bytes and bits. bool traditionally means byte, so we'd have to implement something for bits. Maybe C's struct syntax of bool:1 or maybe just bit as a type. It becomes more difficult when thinking about alignment, what if there isn't a multiple of 8 bits? Do we pad to a full byte? This goes somewhat against the crate philosophy of using packed layout. And things like network packets may actually want to use half a byte as flags and then an unaligned full byte for something else. However, independent from all that, I think supporting a full-byte bool should be pretty easy.

There's also always the option to have a separate crate that users can depend on additionally if they wish and that crate is using LayoutAs to extend the types we support. In the current LayoutAs design, they'd have to use newtype wrappers if they implement it for types not in their own crate, but maybe we can fix that and find a way to allow them to implement support for types without newtype wrappers

I think that this is going to be the best idea. Maybe something similar to the following:

pub trait TryLayoutAs {
    type Higher;
    type Lower;
    type Error: std::error::Error;

    fn try_read(&self, v: Self::Lower) -> Result<Self::Higher, Self::Error>;
    fn try_write(&self, v: Self::Higher) -> Result<Self::Lower, Self::Error>;
}

This will let users write crates with types that can be piped from one type to another relatively easily. As an alternative, it may be useful to consider something like the pipe trait, which will make it easier to move between different higher and lower levels as needed.

Yes, something like that may work. We'd have to figure out how the binary-layout crate finds the right implementations of these traits.

ckaran commented 2 years ago

I was more worried about direct access to the underlying storage. The binary-layout View is a view over a &[u8] or similar storage, but it doesn't necessarily have exclusive access to that storage. We can implement ordering of our own accesses, but then some other user code may just directly access the storage and ignore the synchronization entirely.

In short, atomics are off the table! :)

I think boolean can be supported. We need some syntax to distinguish between bytes and bits. bool traditionally means byte, so we'd have to implement something for bits. Maybe C's struct syntax of bool:1 or maybe just bit as a type. It becomes more difficult when thinking about alignment, what if there isn't a multiple of 8 bits? Do we pad to a full byte? This goes somewhat against the crate philosophy of using packed layout. And things like network packets may actually want to use half a byte as flags and then an unaligned full byte for something else. However, independent from all that, I think supporting a full-byte bool should be pretty easy.

For a 1-byte bool, we need to make some decisions as to what the extra bits mean. In C, 0 means false, and all other values mean true. What about in Rust? If I read in something that has 'extra' bits set, do I ignore them, or throw an error? What happens if the bit that the bool is stored in is 0, but at least some of the higher-order bits are 1? My concern is that those extra bits could have meaning, but aren't able to be correctly accessed via this crate. That was why I was thinking it would be better to let other crates define how to do the layout.

As for packing and misaligning... that makes life MUCH more difficult. If you really, really want to go that route, look at bitvec::slice, and in particular bitvec::slice::BitSlice which will enable what you're looking for, with a minimum of headaches.

Yes, something like that may work. We'd have to figure out how the binary-layout crate finds the right implementations of these traits.

I actually wasn't thinking of binary-layout finding the crates at all, that's the end user's job. After all, there may be many different ways of laying the data out, each depending on a different use case. For example, laying out a boolean to a u8 may require that the most significant bit be set to 1, and all others to 0. Or maybe that'll be the least significant bit. Or... there can be a lot of choices, and in some cases the choices may change depending on which field is being laid out. We can't know that, so let the user select one of many crates that implement TryLayoutAs, and trust that they know which layout they want.
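To make the "many possible layouts" point concrete, here is a minimal sketch of two interchangeable bool adapters built on the TryLayoutAs trait proposed earlier in this thread. The trait itself is only a proposal, and BoolInBit, UnexpectedBits, MSB and LSB are hypothetical names invented for this illustration; none of this is binary-layout API.

```rust
use std::fmt;

// The trait as proposed in this thread (hypothetical, not part of binary-layout).
pub trait TryLayoutAs {
    type Higher;
    type Lower;
    type Error: std::error::Error;

    fn try_read(&self, v: Self::Lower) -> Result<Self::Higher, Self::Error>;
    fn try_write(&self, v: Self::Higher) -> Result<Self::Lower, Self::Error>;
}

/// Error for a byte that has bits set outside the single flag bit.
#[derive(Debug, PartialEq, Eq)]
pub struct UnexpectedBits(pub u8);

impl fmt::Display for UnexpectedBits {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "byte {:#04x} has bits set outside the flag bit", self.0)
    }
}

impl std::error::Error for UnexpectedBits {}

/// Stores a bool in exactly one bit of a byte; every other bit must be zero.
pub struct BoolInBit {
    mask: u8,
}

impl BoolInBit {
    /// Layout that uses the most significant bit.
    pub const MSB: BoolInBit = BoolInBit { mask: 0b1000_0000 };
    /// Layout that uses the least significant bit.
    pub const LSB: BoolInBit = BoolInBit { mask: 0b0000_0001 };
}

impl TryLayoutAs for BoolInBit {
    type Higher = bool;
    type Lower = u8;
    type Error = UnexpectedBits;

    fn try_read(&self, v: u8) -> Result<bool, UnexpectedBits> {
        // Reject any byte that sets bits other than the flag bit.
        if (v & !self.mask) != 0 {
            return Err(UnexpectedBits(v));
        }
        Ok((v & self.mask) != 0)
    }

    fn try_write(&self, v: bool) -> Result<u8, UnexpectedBits> {
        // Writing never fails here; the Result only satisfies the trait.
        Ok(if v { self.mask } else { 0 })
    }
}
```

With this sketch the same byte 0x80 reads as true through BoolInBit::MSB and as an error through BoolInBit::LSB, which is the sort of per-field choice that only the end user can make.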

smessmer commented 2 years ago

I think boolean can be supported. We need some syntax to distinguish between bytes and bits. bool traditionally means byte, so we'd have to implement something for bits. Maybe C's struct syntax of bool:1 or maybe just bit as a type. It becomes more difficult when thinking about alignment, what if there isn't a multiple of 8 bits? Do we pad to a full byte? This goes somewhat against the crate philosophy of using packed layout. And things like network packets may actually want to use half a byte as flags and then an unaligned full byte for something else. However, independent from all that, I think supporting a full-byte bool should be pretty easy.

For a 1-byte bool, we need to make some decisions as to what the extra bits mean. In C, 0 means false, and all other values mean true. What about in Rust? If I read in something that has 'extra' bits set, do I ignore them, or throw an error? What happens if the bit that the bool is stored in is 0, but at least some of the higher-order bits are 1? My concern is that those extra bits could have meaning, but aren't able to be correctly accessed via this crate. That was why I was thinking it would be better to let other crates define how to do the layout.

The most conservative choice would probably be to write/read as 0_u8 and 1_u8 and throw an error for any other value. If we later want to give meaning to other values, it's more backwards compatible to later change the crate so an error condition isn't an error anymore than redefining interpretation of a non-error case.
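As a rough illustration of that conservative rule, the sketch below accepts only 0 and 1 when reading and always writes exactly 0 or 1. The names InvalidBoolByte, read_bool and write_bool are made up for this example and are not taken from the crate.

```rust
/// Error returned when a stored byte is neither 0 nor 1 (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct InvalidBoolByte(pub u8);

/// Strict decoding: 0 is false, 1 is true, everything else is an error.
pub fn read_bool(byte: u8) -> Result<bool, InvalidBoolByte> {
    match byte {
        0 => Ok(false),
        1 => Ok(true),
        other => Err(InvalidBoolByte(other)),
    }
}

/// Encoding only ever produces 0 or 1, so it cannot fail.
pub fn write_bool(value: bool) -> u8 {
    u8::from(value)
}
```

Because every byte other than 0 and 1 is rejected today, a later version could assign meaning to some of those values without changing how any currently accepted input is interpreted.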

As for packing and misaligning... that makes life MUCH more difficult. If you really, really want to go that route, look at bitvec::slice, and in particular bitvec::slice::BitSlice which will enable what you're looking for, with a minimum of headaches.

Yes, it does make things harder. It would mean that all our types need to eventually support byte-unaligned layouts. However, we could probably have a first version where we don't support it for all types yet and throw a compiler error if users try to write an unaligned layout for a type we haven't implemented it for yet.

Yes, something like that may work. We'd have to figure out how the binary-layout crate finds the right implementations of these traits.

I actually wasn't thinking of binary-layout finding the crates at all, that's the end user's job. After all, there may be many different ways of laying the data out, each depending on a different use case. For example, laying out a boolean to a u8 may require that the most significant bit be set to 1, and all others to 0. Or maybe that'll be the least significant bit. Or... there can be a lot of choices, and in some cases the choices may change depending on which field is being laid out. We can't know that, so let the user select one of many crates that implement TryLayoutAs, and trust that they know which layout they want.

I meant binary-layout finding them in the same way it finds the LayoutAs implementations today. It just checks if a given type implements LayoutAs, and if it does, then it found it. In the traits you proposed, they may be more difficult to find because the trait isn't directly implemented for a type.

ckaran commented 2 years ago

The most conservative choice would probably be to write/read as 0_u8 and 1_u8 and throw an error for any other value. If we later want to give meaning to other values, it's more backwards compatible to later change the crate so an error condition isn't an error anymore than redefining interpretation of a non-error case.

When you say 'throw an error', do you literally mean panicking? Or do you mean you want to change the APIs so that they return Result instead? Just want to make sure we're both on the same page as to what is being done here.

Yes, it does make things harder. It would mean that all our types need to eventually support byte-unaligned layouts. However, we could probably have a first version where we don't support it for all types yet and throw a compiler error if users try to write an unaligned layout for a type we haven't implemented it for yet.

I think that with judicious use of bitvec::slice::BitSlice, we may be able to create a universal adapter. So, kind of like PrimitiveField<T: ?Sized, E: Endianness, const OFFSET_: usize, const BIT_OFFSET: usize>, with some compiler macro magic to generate versions of this for each of the possible bit offsets. Once that's done, I think that the rest of the macros can be tweaked over relatively easily. This will require quite a bit of thought and effort to get right though...
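For a feel of what a BIT_OFFSET const parameter would have to do, here is a small sketch that reads one u8 starting at an arbitrary bit position using only the standard library. It deliberately does not use bitvec, and read_u8_at is a hypothetical helper, not something binary-layout provides. Bits are counted from the most significant bit of each byte, which is itself an assumption.

```rust
/// Reads a `u8` that starts `BIT_OFFSET` bits (0..=7) into the byte at index
/// `OFFSET`, counting bits from the most significant end.
/// Returns `None` if the storage is too short for the requested field.
pub fn read_u8_at<const OFFSET: usize, const BIT_OFFSET: u32>(storage: &[u8]) -> Option<u8> {
    assert!(BIT_OFFSET < 8, "BIT_OFFSET must be in 0..=7");

    let first = *storage.get(OFFSET)?;
    if BIT_OFFSET == 0 {
        // Byte-aligned case: no neighbouring byte is needed.
        return Some(first);
    }

    // Unaligned case: the field straddles two bytes.
    let second = *storage.get(OFFSET + 1)?;
    let joined = u16::from_be_bytes([first, second]);

    // Shift the eight wanted bits down into the low byte and truncate.
    Some((joined >> (8 - BIT_OFFSET)) as u8)
}
```

This covers the network-packet case mentioned earlier: with a 4-bit flags field followed by an unaligned byte, read_u8_at::<0, 4>(&packet) returns that byte. A real implementation would also need writes, wider integers and endianness handling, which is where most of the effort would go.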

I meant binary-layout finding them in the same way it finds the LayoutAs implementations today. It just checks if a given type implements LayoutAs, and if it does, then it found it. In the traits you proposed, they may be more difficult to find because the trait isn't directly implemented for a type.

OK, so if you have to mix endianness, you'd use something like the following, correct?:

define_layout!(little_u128, LittleEndian, {
  field1: u128
});

define_layout!(big_u128, BigEndian, {
  field1: u128
});

define_layout!(mixed_u128, LittleEndian, {
  a: little_u128,
  b: big_u128
});

smessmer commented 2 years ago

The most conservative choice would probably be to write/read as 0_u8 and 1_u8 and throw an error for any other value. If we later want to give meaning to other values, it's more backwards compatible to later change the crate so an error condition isn't an error anymore than redefining interpretation of a non-error case.

When you say 'throw an error', do you literally mean panicking? Or do you mean you want to change the APIs so that they return Result instead? Just want to make sure we're both on the same page as to what is being done here.

Panicking for invalid inputs doesn't sound like a good idea. Returning Results may be better though not ideal. Maybe my idea of returning an error isn't that great after all.

Yes, it does make things harder. It would mean that all our types need to eventually support byte-unaligned layouts. However, we could probably have a first version where we don't support it for all types yet and throw a compiler error if users try to write an unaligned layout for a type we haven't implemented it for yet.

I think that with judicious use of bitvec::slice::BitSlice, we may be able to create a universal adapter. So, kind of like PrimitiveField<T: ?Sized, E: Endianness, const OFFSET_: usize, const BIT_OFFSET: usize>, with some compiler macro magic to generate versions of this for each of the possible bit offsets. Once that's done, I think that the rest of the macros can be tweaked over relatively easily. This will require quite a bit of thought and effort to get right though...

Agreed

I meant binary-layout finding them in the same way it finds the LayoutAs implementations today. It just checks if a given type implements LayoutAs, and if it does, then it found it. In the traits you proposed, they may be more difficult to find because the trait isn't directly implemented for a type.

OK, so if you have to mix endianness, you'd use something like the following, correct?:

define_layout!(little_u128, LittleEndian, {
  field1: u128
});

define_layout!(big_u128, BigEndian, {
  field1: u128
});

define_layout!(mixed_u128, LittleEndian, {
  a: little_u128,
  b: big_u128
});

Haven't put much thought into mixed endianness yet. Not sure if we need to support that, it seems rare that a package layout isn't consistent in endianness. If we do need to support it, we can think about any DSL syntax, doesn't need to be in the type. We can add a modified syntax similar to how the "as" keyword in a layout modifies a type today. Maybe

define_layout!(little_u128, LittleEndian, {
   a: u128 LittleEndian,
   b: u128 BigEndian,
 });
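The per-field syntax above is only a proposal, so the following is a hand-written sketch of what accessors for such a layout would need to do, not what the macro generates. MixedView and its methods are hypothetical names; the only real API used is the standard library's u128::from_le_bytes and u128::from_be_bytes.

```rust
/// Hand-written stand-in for a 32-byte layout with a little-endian field `a`
/// followed by a big-endian field `b` (hypothetical, for illustration only).
pub struct MixedView<'a> {
    storage: &'a [u8; 32],
}

impl<'a> MixedView<'a> {
    pub fn new(storage: &'a [u8; 32]) -> Self {
        Self { storage }
    }

    /// Bytes 0..16, interpreted as little-endian.
    pub fn a(&self) -> u128 {
        let mut bytes = [0u8; 16];
        bytes.copy_from_slice(&self.storage[0..16]);
        u128::from_le_bytes(bytes)
    }

    /// Bytes 16..32, interpreted as big-endian.
    pub fn b(&self) -> u128 {
        let mut bytes = [0u8; 16];
        bytes.copy_from_slice(&self.storage[16..32]);
        u128::from_be_bytes(bytes)
    }
}
```

The only difference between the two accessors is which conversion function they call, which suggests a per-field endianness marker could be a fairly local change to the generated code.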

ckaran commented 2 years ago

Panicking for invalid inputs doesn't sound like a good idea. Returning Results may be better though not ideal. Maybe my idea of returning an error isn't that great after all.

Well, that's the advantage of the current set of supported types; there are no invalid bit patterns, so the current APIs can remain infallible. Booleans will force us to make choices that I'm uncomfortable with because either we're wasting bits (and making choices as to what to set those bits to), or we're misaligning and requiring more work to get it all correct. Implementing something like TryLayoutAs makes it possible to have multiple adapter types, one for each way that you might want to lay out a boolean. It also lets you have a non-panicking way to report errors, which is always a nice-to-have feature.

Haven't put much thought into mixed endianness yet. Not sure if we need to support that, it seems rare that a package layout isn't consistent in endianness. If we do need to support it, we can think about any DSL syntax, doesn't need to be in the type. We can add a modified syntax similar to how the "as" keyword in a layout modifies a type today. Maybe

define_layout!(little_u128, LittleEndian, {
   a: u128 LittleEndian,
   b: u128 BigEndian,
 });

I'm assuming that the lowest-level object's endianness attribute is the 'winner', correct? If so, I like this. It will allow composing types together in a very natural way, e.g.:

define_layout!(Header, BigEndian, {
    IPv6_address: u128
});

define_layout!(Datum, LittleEndian, {
    header: Header,
    contents: [u8; 1024],
    trailer: u128 BigEndian
});

It may also be possible to convert define_layout! into a derive macro, which (IIUC) will handle recursion into composed types correctly. I don't know much about writing macros though, so I'm only guessing at that last part...

The main advantage of turning this into a derive macro is that the syntax will be more familiar to end users. They'll know that #[LittleEndian] and #[BigEndian] are attributes that they can add to any field, and that if they don't add it, then the attribute attached to the type as a whole wins (#[derive(..., BinaryLayout(LittleEndian), ...)])

smessmer commented 2 years ago

Panicking for invalid inputs doesn't sound like a good idea. Returning Results may be better though not ideal. Maybe my idea of returning an error isn't that great after all.

Well, that's the advantage of the current set of supported types; there are no invalid bit patterns, so the current APIs can remain infallible. Booleans will force us to make choices that I'm uncomfortable with because either we're wasting bits (and making choices as to what to set those bits to), or we're misaligning and requiring more work to get it all correct. Implementing something like TryLayoutAs makes it possible to have multiple adapter types, one for each way that you might want to lay out a boolean. It also lets you have a non-panicking way to report errors, which is always a nice-to-have feature.

TryLayoutAs sounds great and it solves it for user defined types and when somebody writes "bool as u8", but it wouldn't allow users to just write "bool". Would TryLayoutAs allow failures on both reading and writing? LayoutAs<Result<_>> may cover some of the use cases.

Haven't put much thought into mixed endianness yet. Not sure if we need to support that, it seems rare that a package layout isn't consistent in endianness. If we do need to support it, we can think about any DSL syntax, doesn't need to be in the type. We can add a modified syntax similar to how the "as" keyword in a layout modifies a type today. Maybe

define_layout!(little_u128, LittleEndian, {
   a: u128 LittleEndian,
   b: u128 BigEndian,
 });

I'm assuming that the lowest-level object's endianness attribute is the 'winner', correct? If so, I like this. It will allow composing types together in a very natural way, e.g.:

define_layout!(Header, BigEndian, {
    IPv6_address: u128
});

define_layout!(Datum, LittleEndian, {
    header: Header,
    contents: [u8; 1024],
    trailer: u128 BigEndian
});

It may also be possible to convert define_layout! into a derive macro, which (IIUC) will handle recursion into composed types correctly. I don't know much about writing macros though, so I'm only guessing at that last part...

The main advantage of turning this into a derive macro is that the syntax will be more familiar to end users. They'll know that #[LittleEndian] and #[BigEndian] are attributes that they can add to any field, and that if they don't add it, then the attribute attached to the type as a whole wins (#[derive(..., BinaryLayout(LittleEndian), ...)])

Nested layouts are an interesting idea. I think they can be implemented without a lot of macro wizardry by returning a View when the nested field is accessed.
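A minimal hand-written sketch of that idea: the outer view's accessor for the nested field simply returns another view over the matching sub-slice, so no data is copied. HeaderView, DatumView, the field names and the sizes are all hypothetical and much simpler than what the crate would generate.

```rust
// Hypothetical sizes: a 16-byte header followed by a 4-byte length field.
const HEADER_SIZE: usize = 16;
const DATUM_SIZE: usize = HEADER_SIZE + 4;

/// View over the nested header: one big-endian u128 address.
pub struct HeaderView<'a> {
    storage: &'a [u8],
}

impl<'a> HeaderView<'a> {
    pub fn ipv6_address(&self) -> u128 {
        let mut bytes = [0u8; 16];
        bytes.copy_from_slice(&self.storage[..HEADER_SIZE]);
        u128::from_be_bytes(bytes)
    }
}

/// View over the outer layout: the header plus a little-endian u32 length.
pub struct DatumView<'a> {
    storage: &'a [u8],
}

impl<'a> DatumView<'a> {
    /// Returns `None` if the storage is too small for the layout.
    pub fn new(storage: &'a [u8]) -> Option<Self> {
        if storage.len() < DATUM_SIZE {
            None
        } else {
            Some(Self { storage })
        }
    }

    /// Accessing the nested field hands out a view over its sub-slice.
    pub fn header(&self) -> HeaderView<'a> {
        HeaderView {
            storage: &self.storage[..HEADER_SIZE],
        }
    }

    pub fn payload_len(&self) -> u32 {
        let mut bytes = [0u8; 4];
        bytes.copy_from_slice(&self.storage[HEADER_SIZE..DATUM_SIZE]);
        u32::from_le_bytes(bytes)
    }
}
```

Note that the nested view carries its own endianness here (big-endian inside a little-endian outer layout), which matches the "lowest-level object wins" reading discussed above.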

ckaran commented 2 years ago

TryLayoutAs sounds great and it solves it for user defined types and when somebody writes "bool as u8", but it wouldn't allow users to just write "bool".

Well, I guess if you know that reading/writing is infallible, then you can keep using LayoutAs.

Would TryLayoutAs allow failures on both reading and writing? LayoutAs<Result<_>> may cover some of the use cases.

Yeah, the idea of it is slightly different. LayoutAs depends on the newtype pattern. TryLayoutAs is more like a pipeline object. In theory, you can hook together a series of trait objects, swapping them out as needed at runtime, to produce a final output layout. Both approaches work, it's up to you as to which you'd like to use. The main advantage is that you don't need to create newtypes for each boolean layout you want to use, the main disadvantage is that the newtype pattern helps protect you at compile time by leveraging the type system.

Nested layouts are an interesting idea. I think they can be implemented without a lot of macro wizardry by returning a View when the nested field is accessed.

Good! It will certainly make things easier to implement.

smessmer commented 2 years ago

I have a first experimental implementation of nesting in the feature/nesting branch. That branch also has an example in tests/nested.rs.

ckaran commented 2 years ago

I like where you're going with it! I haven't tested it, and given how work looks for the next few weeks, I probably won't be able to, but I'll try to pull periodically to see how you're doing. Thank you!

smessmer commented 2 years ago

Nested layouts are implemented in https://github.com/smessmer/binary-layout/commit/3b67f467c7d390125376ba5d2430136e47da6232 and will be part of the 3.0 release

smessmer commented 5 months ago

The 4.0 release now supports types that can throw errors when reading, e.g. 1-byte booleans.

smessmer commented 5 months ago

That also implements TryLayoutAs, but I actually found a way to combine it into the LayoutAs trait without requiring a separate trait. LayoutAs is now always fallible, but if implementors use std::convert::Infallible as the error type, then we will still generate the infallible read/write API for them.
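The general technique of "always fallible, but infallible when the error type is std::convert::Infallible" can be sketched as follows. This is not the crate's actual trait definition: FallibleLayoutAs, its associated types, Meters, infallible and read are hypothetical names chosen for the illustration, and the real 4.0 API may be shaped differently.

```rust
use std::convert::Infallible;

/// Hypothetical stand-in for a conversion trait that is always fallible.
pub trait FallibleLayoutAs<U>: Sized {
    type ReadError;
    type WriteError;

    fn try_read(v: U) -> Result<Self, Self::ReadError>;
    fn try_write(v: Self) -> Result<U, Self::WriteError>;
}

/// Unwraps a result whose error type has no values, without any panic path.
pub fn infallible<T>(result: Result<T, Infallible>) -> T {
    match result {
        Ok(value) => value,
        // `Infallible` has no variants, so this arm can never run.
        Err(never) => match never {},
    }
}

/// Example newtype whose conversions cannot fail.
#[derive(Debug, PartialEq, Eq)]
pub struct Meters(pub u32);

impl FallibleLayoutAs<u32> for Meters {
    type ReadError = Infallible;
    type WriteError = Infallible;

    fn try_read(v: u32) -> Result<Self, Infallible> {
        Ok(Meters(v))
    }

    fn try_write(v: Self) -> Result<u32, Infallible> {
        Ok(v.0)
    }
}

/// Infallible read, available only when the implementor's read error is `Infallible`.
pub fn read<T, U>(v: U) -> T
where
    T: FallibleLayoutAs<U, ReadError = Infallible>,
{
    infallible(T::try_read(v))
}
```

The where clause is what gates the infallible API: a type whose ReadError is anything other than Infallible cannot be passed to read and has to go through try_read.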

Remaining unimplemented ideas from this discussion:

bools as 1 bit
slices for other types than just u8

Both are mentioned in the project README. I'm closing this issue for now.