Custom vtable API - Githubissues

tokio-rs / bytes

Utilities for working with bytes

MIT License

1.87k stars 278 forks source link

Custom vtable API #567

Open HyeonuPark opened 2 years ago

HyeonuPark commented 2 years ago

This is my suggestion about constructing Bytes with custom vtable. Since it's not a small modification, I want to share my direction to maintainers before writing more docs or bikeshedding etc. You can consider this PR as an executable RFC.

The base idea is to expose a trait purely consists of associated functions. It's not a breaking change to add associated function with default impl to existing trait. Default impl being mandatory can be a problem, but it's an inherent problem of exposing custom vtable.

At this point this PR is consists of 5 commits. They're quite independent that I think each deserves its own PR after the consensus is made. It's possible that some of them will be advanced while others are rejected. First 2 commits are about improving current vtable signatures. it's hard to do so after stabilizing them. Next 2 commits implements custom vtable API. The last one implements downcasting.

The first commit is about changing the name of the vtable entry to_vec to into_vec to match its semantics. Changes in its signature also reduces atomic operations, though performance wasn't a goal here. I hardly expect it actually improve performance.

The second commit is to add will_truncate function to vtable. Currently promotable buffers can't discard bytes from the last without promotion, so Bytes::trucate specialize them by equality checking vtable address with 2 promotable vtable static variables. It doesn't seems like a good idea with open set of vtables.

The third commit implements the BytesImpl trait and the Bytes::with_impl constructor. It would be the main commit of this PR.

The fourth commit occupies majority of diffs this PR have, but all it does it to move functions into impl blocks. It should not add or change any behavior.

The fifth commit is about downcasting. I think round tripping between custom buffers and the Bytes is the next step people would expect. Potentially try_unsplit can also be built on top of it.

Thanks for reading. Any feedback is welcome.

HyeonuPark commented 2 years ago

Fixed minor CI issues.

Replace mem::take with mem::replace
Replace AtomicPtr::get_mut with with_mut for the loom test

ipoupaille commented 2 years ago

There is another related request: #558

Darksonn commented 2 years ago

The implementation seems ok at a cursory glance. To me the main question is whether we want this or not, and if so, how should the public API of it look.

HyeonuPark commented 2 years ago

To clarify, I need this feature to run hyper with zero copy networking. The hyper uses types from the http crate which uses Bytes type everywhere, which means the HTTP headers are copied at least once unless the OS gives you incoming bytes as Bytes type. But to use io_uring optimally you need to pre-register buffers and keep reusing them over and over again, which is not possible with current Bytes interface as it deallocates buffer after the last use. With AF_XDP buffer reuse is mandatory and it has more restrictions like all the IO buffers should be equally sized slices of single large pre-registered allocation.

With custom vtables for Bytes, low level IO wrapper crates can provide its own buffer type with proper restrictions and user of that crate can convert it into Bytes. That's why impl BytesImpl requires tons of unsafe but using such types is totally safe with this PR.

I'm still not sure about those methods of the BytesImpl trait. Currently they're mostly copied as-is from the existing Vtable struct with some fixes.

rrichardson commented 1 year ago

@HyeonuPark - This looks great. One concern I have is to ensure that 3rd party container implementors have everything they need to convert their containers to and from Bytes. Other than the clone method in the BytesImpl trait, I think that they should be able to. I'll try to create an example/test that demonstrates this.

Regarding the Doc comments, I am implementing a branch against your branch that implements documentation that is more closely aligned with Tokio's conventions, and is clippy compliant. I'll make a PR to your repo and you're welcome to merge them in if you want.

@seanmonstar - Is there a way that we can put this implementation in a feature gate, or maybe even a major version RC so that we can incorporate it into projects for better exposure and vetting?

rrichardson commented 1 year ago

Thinking more about fn clone in the impl... Maybe there should be a default implementation of clone, so that the implementor doesn't have to implement their own refcounting.

Or maybe we could implement BytesImpl for Arc where T meets certain criteria?

*edit* I think that most basic buffer types could just leverage the SharedImpl type to do the heavy lifting. It'd be nice to wrap all of that in some slightly more ergonomic types..

rrichardson commented 1 year ago

One more question: Should we specify that from_bytes_parts should panic if refcount > 1, or should it return a Result/Option?

rrichardson commented 1 year ago

For the AtomicPtr bits, we might need to re-export the relevant types fromloom::sync::atomic or core::sync::atomic to be safe. ... and AtomicMut.

I was able to make a working impl of BytesImpl for mmap::Mmap rather easily. I just needed to add a from_parts method to Bytes. (see below)

All that said, I would definitely consider this the low-level and error-prone and certainly not DRY approach to incorporating a buffer type into a Bytes container.

For basic buffer types, (both mutable and immutable) we should be able to make a much simpler interface. clone, into_vec and from_bytes_parts could have default implementations. We'd just need a simple intermediate wrapper type like SharedImpl.

I also think that downcast_impl and from_bytes_parts can be a footgun. If you convert a Bytes instance back into T, but there are still clones of Bytes around, then you create a scenario in which two different objects are responsible for dropping the buffer owned by T. It can only work correctly if T itself had its own internal refcounting that BytesImpl uses, but if it did, why would they need Bytes?

      pub unsafe fn from_parts<T: BytesImpl>(data: &AtomicPtr<()>, ptr: *const u8, len: usize) -> Self {
          let dptr = data.load(Ordering::Acquire);

          Bytes {
              ptr,
              len,
              data: AtomicPtr::new(dptr.cast()),
              vtable: &Vtable {
                  type_id: TypeId::of::<T>,
                  clone: T::clone,
                  will_truncate: T::will_truncate,
                  into_vec: T::into_vec,
                  drop: T::drop,
              },
          }
      }

HyeonuPark commented 1 year ago

It would be nice to make SharedImpl generic over impl AsRef<[u8]> not just for Vec<u8>. Note that it should be one from the bytes_mut.rs since the one with same name in bytes.rs is heavily optimized for Box<[u8]>. Though I'm not sure it's much beneficial since if you have custom buffer type you may want to implement refcount directly on it. It's not that hard when you have working examples on std::sync::Arc and bytes::Bytes code and you don't need to handle weak handles.

refcount > 1 on from_bytes_parts is expected on downcasting. For example if you make fresh struct MyBuffer(Arc<[u8]>, Range<usize>);, convert it into Bytes, clone it 2 times and downcast it back its refcount will be 3 on from_bytes_parts.

I haven't concerned exposing loom AtomicPtr. Maybe we need to newtype it like tokio::time::Instant?

I also think that downcast_impl and from_bytes_parts can be a footgun. If you convert a Bytes instance back into T, but there are still clones of Bytes around, then you create a scenario in which two different objects are responsible for dropping the buffer owned by T. It can only work correctly if T itself had its own internal refcounting that BytesImpl uses, but if it did, why would they need Bytes?

I really should have document if more, that multiple Bytes cloned from the same Bytes doesn't share same instance T: BytesImpl. It's like if you have #[derive(Clone)] struct Wrapper(Arc<[u8]>); multiple clones of it would share [u8] but each have its own instance of Arc<[u8]>.

The reasons why clone and drop exist as own trait methods not just implied from supertrait and automatic drop glue is to allow users to create their own StaticImpl and PromotableImpl. It's possible to make helper struct and trait to remove those methods and all that AtomicPtr and just use &self / &mut self instead. Or, maybe it would be better to make that helper trait as a sole public API, reducing API complexity a lot by sacrificing custom static/promotable buffers.

Edit: StaticImpl still is possible with that API. It can't be made in const context but it's impossible without const traits anyway. So the question is all about custom promotable buffers.

rrichardson commented 1 year ago

I really should have documented if more, that multiple Bytes cloned from the same Bytes doesn't share same instance T: BytesImpl. It's like if you have #[derive(Clone)] struct Wrapper(Arc<[u8]>); multiple clones of it would share [u8] but each have its own instance of Arc<[u8]>.

Right. When you convert one of the instances of Bytes back to T, but other instances of Bytes remain, you now have more than one instance that owns the inner buffer.

Let's say you have:


struct MyBuf {
    inner: Box<[u8]>, 
}

When we convert MyBuf to a Bytes instance, the new instance takes control of the [u8] from inner. When we clone the Bytes instance, we still have only 1 copy of [u8]. This is good.

Now let's convert one of those Bytes instances back into MyBuf We break down the Bytes instance into its parts, then reconstruct MyBuf, giving the ownership of [u8] back to MyBuf. So it controls the lifetime of the inner [u8]].

However, there is still an instance of Bytes, and its copy of SharedImpl also owns the [u8].
So we have a dangling pointer if either MyBuf or that instance of Bytes are destroyed.

We need to find a way to design the API such that it is impossible to convert from Bytes to T if the refcount is > 1, similar to the design of Rc and RefCell. Also, we would need to disallow conversion from Bytes to Bytes Mut if refcount > 1. We can use the same underlying mechanism to enforce both invariants.

If we just built and published the "Inner" Container States API: Shared, SharedMut Promotable, PromotableMut Static, StaticMut (or whatever we end up needing to cover all use cases), then we control the transitions that are possible from each state, using a Type State.

The 3rd party implementors would be able to easily adapt their implementation without having to worry about the internals of memory management, alignment, thread-safety, etc.

Perhaps all of those above types implement the same type, e.g. ManagedBuffer, then Bytes and BytesMut could just store a dyn ManagedBuffer, we wouldn't need to manage the VTable manually. Is this sort of what you were referring to when you said:

It's possible to make helper struct and trait to remove those methods and all that AtomicPtr and just use &self / &mut self instead.

rrichardson commented 1 year ago

At a high level, we'd need a 4d matrix of functionality. These wouldn't necessarily all be separate buffer types, but we need to ensure that we can handle the use-cases.

(de)Allocation Style:

Global Allocator (Global Free)
3rd Party Managed (User Supplied Drop)
Static (Null Drop)

Mutability:

Immutable
Mutable

Immutable -> Mutable Conversion Method

Copy on Write: Copy Buffer to Convert
Mutate Self: Assert Refcount == 1 to Convert

Clone Operation:

Naive/Optimized: Convert to into Refcounted and Increment
Refcounted: Just Increment
Static: No refcounting needed.

I think that we'd also need to add a method to the vtable to accomodate the potential of converting to mutable.
It'd need to be able to communicate whether or not it could mutate without copying, then the actual operation to convert to mutable.

Converting from Mutable to immutable shouldn't need to access the VTable.. I don't think..

rrichardson commented 1 year ago

I'm going to attempt a version that formalizes this lower-level API, and also uses more a Typestate scheme and enum based dispatch, which, among other things, is going to be faster than any vtable. (not that the current impl is an actual ptr based vtable) It should be quite a bit safer, too. Certain state transitions will be inexpressible.

HyeonuPark commented 1 year ago

Let's say you have:

struct MyBuf { inner: Box<[u8]>, }

When we convert MyBuf to a Bytes instance, the new instance takes control of the [u8] from inner. When we clone the Bytes instance, we still have only 1 copy of [u8]. This is good.

MyBuf should not be converted to a Bytes since it doesn't satisfy requirements of the Bytes implementations.

All Bytes implementations must fulfill the following requirements:

They are cheaply cloneable and thereby shareable between an unlimited amount of components, for example by modifying a reference count.

Instances can be sliced to refer to a subset of the the original buffer.

Bytes doesn't have builtin refcounting mechanism. SharedImpl provides one and PromotableImpl relies on it by storing it on one if its (logical) enum variant. If MyBuf is converted to Bytes the Bytes::clone() should clone the stored MyBuf itself. Note that I used struct MyBuffer(Arc<[u8]>, Range<usize>); as an example. It provides cheap cloning via refcounting and subslicing on itself.

Now let's convert one of those Bytes instances back into MyBuf We break down the Bytes instance into its parts, then reconstruct MyBuf, giving the ownership of [u8] back to MyBuf. So it controls the lifetime of the inner [u8]].

However, there is still an instance of Bytes, and its copy of SharedImpl also owns the [u8]. So we have a dangling pointer if either MyBuf or that instance of Bytes are destroyed.

Maybe you want to convert SharedImpl<MyBuf> into Bytes instead? SharedImpl already provides Arc<Vec<u8>>-ish functionality and it's desirable to exploit it but replace the storage type. But in this case downcasting will returns SharedImpl<MyBuf> not MyBuf and like Arc<MyBuf> you should not be able to take ownership of inner type while other shared handle to it is still alive.

We need to find a way to design the API such that it is impossible to convert from Bytes to T if the refcount is > 1, similar to the design of Rc and RefCell. Also, we would need to disallow conversion from Bytes to Bytes Mut if refcount > 1. We can use the same underlying mechanism to enforce both invariants.

BytesMut can have refcount > 1 by ensuring each instance owns disjoint slice of the shared storage. I don't think it's possible to support un-freeze Bytes back to BytesMut without changing the BytesMut's structure. But disallowing refcount > 1 means BytesMut::split() allocate every time.

At this point I think those promotable impls are special enough that 3rd party promotables are not something we need. I think its purpose, besides historical reasons, is to make round trip between Vec<u8> and Bytes as cheap as possible for interoperability. If we remove it from the custom vtable's goal, the BytesImpl trait can be something like this:

unsafe trait BytesImpl: Clone + 'static where Vec<u8>: From<Self> {
    fn into_bytes_parts(this: Self) -> (*mut (), *const u8, usize);
    unsafe fn from_bytes_parts(data: *mut (), ptr: *const u8, len: usize) -> Self;
}

I don't think we need abstraction on mutable conversion. BytesMut doesn't seems for unified mutable buffer type with custom implementations. Each BytesImpl implementations can handle it after downcasting.

HyeonuPark commented 1 year ago

It wasn't intended to make generic SharedImpl a first class citizen. If we decide to make it, it would be a helper type on top of the BytesImpl API for users who don't want to reinvent whole refcounting stuff on their own since we already have one. It at least doesn't cover my own use case.

Let me describe my own use case. First user creates a context. The context mmap a large memory, do some mlock stuff, split them into aligned chunks and push them all into its own queue. Each chunk stores inline refcount and pointer to the context on its first few bytes. User acquire new mutable buffer from context by pop the queue, write into it, and freeze it to construct clonable, sliceable and immutable buffer type. On drop with refcount 0 it pushes itself into the context queue. And I want to make this immutable buffer BytesImpl.

I don't think that the users of MMap and others care if they get an optimization that skips refcounting. They're converting their buffer to Bytes instance expecting it to get the Atomic refcounting baggage.

I'm not sure about it. Single thread runtime or runtime per thread users may want to have hybrid refcount and optimistically use unsynchronized field on original thread. Will it actually boost performance? I don't know but I don't want to prevent it. Also my own use case is an example to manage refcount internally to not have separate allocation for refcount. If user don't want to handle atomic refcounting baggage they may use helper libraries, like generic SharedImpl.

It is perfectly okay for the result of try_downcast() to be Err(Error("refcount > 1")) similar to arc/rc/refcell.

Semantically it's perfectly ok, but is it useful? One of the major use case of BytesMut is to reduce allcations on handling unsized data. To do so each allocations should be large enough to host about 4 to dozen expected size of BytesMut handles. Which means they rarely have refcount 1.

rrichardson commented 1 year ago

It wasn't intended to make generic SharedImpl a first class citizen. If we decide to make it, it would be a helper type on top of the BytesImpl API for users who don't want to reinvent whole refcounting stuff on their own since we already have one. It at least doesn't cover my own use case.

I was hoping that we could offer some intermediate types to make it easier for implementors to integrate with Bytes, but I think you're right, it's not possible in most cases. Later, we could offer a public version that is similar to SharedImpl to make it easy for people with trivial buffer types (that can be freed with the global allocator). It should not be in scope right now.

After thinking about it more, I think the fundamental dilemma of 3rd party integration comes down to the drop() impl. We'll need to call the correct drop method in order to return the buffer back to its pool or whatnot. This means we can't lose the type, or we need to be able to reconstruct it easily on a drop. This, IMO is the single best reason to make the BytesImpl trait public, and most other bits private (as private as possible)

You also make a great case regarding refcounting baggage, etc. That should also be up to the implementor.

I think I'm finally convinced that your design is the best approach.

That said, there are a few things we still need to nail down..

It is perfectly okay for the result of try_downcast() to be Err(Error("refcount > 1")) similar to arc/rc/refcell.

Semantically it's perfectly ok, but is it useful? One of the major use case of BytesMut is to reduce allcations on handling unsized data. To do so each allocations should be large enough to host about 4 to dozen expected size of BytesMut handles. Which means they rarely have refcount 1.

The use-case for try_unfreeze is almost always to reduce allocations in the BytesMut::with_capacity or from_vec_mut case.

They construct a BytesMut that performs its own allocation.
They copy their data into it.
Pass it to their socket::write call (which drops its copy when it's done).
Now they still have this allocated buffer in their (single) copy of Bytes. Why allocate a new one? Just convert that Bytes instance back into BytesMut and re-use it. There are 4 or 5 issues that requested that the Bytes -> BytesMut functionality be re-created after it was removed.

Understanding that it is a draft, I'd say that a final implementation should include the following: (in order of priority)

Any method that could cause an allocation should take a force: bool parameter, and return the newly constructed type or an Err(Error::WouldAllocate)
- The allocation requirement could be due to refcount > 1, or it requires some sort of conversion, etc.
  - The BytesImpl trait should have a method for Immutable -> Mutable conversion. (try_mut ?)
  - Don't let types convert themselves in place (like Promotable)
  - We can improve the efficiency, safety, and clarity of Promotable* by actually converting into SharedImpl on mutate/clone. (the BytesImpl contract will have to change slightly to reflect this)
  - This means that we would have to alter methods in the BytesImpl trait to accomodate for the alteration of types.
  - will_truncate
  - clone
  - to_mut(?)
- This should also help with Bytes <--> BytesMut conversions in 3rd party implementations.
Module Structure: It should change a bit to conform to API layering best practices.
- The trait would be consumed by Bytes/Mut. So Bytes and Bytes mute would be at the top of the layers, and the trait would live below.
- The trait impls would be siblings of Bytes/Mut, they could theoretically reference the Bytes/Mut types, but I'm not sure if they'd need to.
- This means that BytesImpl and implementors would know nothing about the Bytes/Mut types.
Naming:
- rename BytesImpl into SharedBuf or ManagedBuf or something more descriptive than BytesImpl
- "Impl" usually has a specific meaning in Rust (and other languages) and I don't think that this qualifies
Edit It would also be great to add support for try_unsplit as per #287 - This relates back to your use case, actually. If you split a buffer into a bunch of mutable buffers to pass to multiple writers, but now they're all done writing, and you want to recycle the buffer. try_unsplit was also explicitly mentioned in the scope of @carllerche 's meta ticket for VTable exposure.

rrichardson commented 1 year ago

You mention the downcasting approach for Bytes -> ByteMut conversion. I guess we should try to implement that for SharedImpl and see how it works.

rrichardson commented 1 year ago

I made a PR with my proposed module and naming changes: https://github.com/HyeonuPark/bytes/pull/2

I will make an additional PR following that with some takes at implementing try_resize and try_into_mut to support the unfreeze and unsplit methods.

rrichardson commented 1 year ago

One other side effect of this module refactoring is that it makes it very simple to put any allocations behind a feature flag.

zeevm commented 6 months ago

What's the status of this change? Seems stale for a year now, I'm waiting on this, is this abandoned?