scylladb / seastar

High performance server-side application framework
http://seastar.io
Apache License 2.0

Sharing temporary_buffer across shards? #2450

Open · travisdowns opened this issue 2 months ago

travisdowns commented 2 months ago

Currently, it is not safe to use distinct temporary_buffer objects which share any underlying buffer across shards (aka "hidden sharing"). In practice, this generally means that it is unsafe even to move a temporary_buffer received from seastar to another shard, since it will often share underlying storage with other temporary_buffers (e.g., if they are derived from the same input stream). The only safe approach is to copy the underlying bytes in order to pass them to another shard.

This makes it very difficult to do "zero copy" manipulation of such buffers. A typical use case: a connection is processing requests bound for various shards (depending on the contents of the request, e.g., which partition is involved), and we want to pass the entire payload of each request to its owning shard. This should be easy to implement in a zero-copy, thread-safe way: there is no concurrent use of the buffer at all; it is examined on one shard, then the entire buffer is moved to another shard where it is accessed (ownership could stay with the original shard or not; the key is that there is no concurrent access). However, the hidden sharing inside temporary_buffer makes this impossible as far as I can tell.
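For illustration, the hand-off we would like to do looks roughly like the sketch below (names are ours; unsafe today for the reasons above):

    // the desired zero-copy hand-off: examine a request on the connection's
    // shard, then move the whole payload to its owning shard. Unsafe today,
    // because buf may share deleter::impl (and its refcount) with other
    // buffers still being manipulated on this shard.
    seastar::future<> dispatch_one(seastar::input_stream<char>& in, unsigned owner_shard) {
        return in.read().then([owner_shard] (seastar::temporary_buffer<char> buf) {
            return seastar::smp::submit_to(owner_shard, [buf = std::move(buf)] () mutable {
                // process the request payload on its owning shard
            });
        });
    }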

The only definite reason we can find that temporary_buffer has unsafe hidden sharing is that deleter::impl::refs is accessed non-atomically.

Two questions:

1) Am I missing something about how to use distinct temporary_buffer objects across threads "zero-copy" without hitting this problem? For example, I feel like ScyllaDB would run into the same unnecessary copy described above unless it somehow manages to guarantee that the "connection" shard is the same shard where the operation will be processed.

2) Is there any willingness to make deleter::impl::refs a std::atomic<unsigned>? It seems to us this would solve the cross-shard hidden sharing problem, with the possible exception of deleters of type lambda_deleter_impl whose non-trivial lambdas (or lambda destructors) expect to run on the shard they were created on. Evidently, this would have a performance impact on all temporary_buffer use. It would look something like this: https://github.com/scylladb/seastar/pull/2451.
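For reference, the shape of that change is roughly the following sketch (the real diff is in the PR; this is just the idea):

    // sketch only: deleter::impl's refcount becomes atomic, so that
    // increments/decrements from different shards no longer race
    struct impl {                          // seastar::deleter::impl
        std::atomic<unsigned> refs = 1;    // was: unsigned refs = 1;
        deleter next;
        // sharing:  refs.fetch_add(1, std::memory_order_relaxed)
        // release:  refs.fetch_sub(1, std::memory_order_acq_rel) == 1 -> delete
    };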

Maybe there's a better solution out there than (2).

avikivity commented 2 months ago

temporary_buffer is too low-level to directly support cross-shard traffic. That is, a request is likely to require many such objects and making all of them pay the price is too much.

I think it's better to re-architect so sharing happens at the request level rather than buffer level. That is, have some object that owns the buffers, and share it with make_foreign.
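Something like this sketch (request is an illustrative name, not an existing type):

    // sketch: a request object owns the buffers; only a foreign_ptr to it
    // crosses the shard boundary, so all refcounts stay shard-local
    struct request {
        std::vector<seastar::temporary_buffer<char>> buffers;
    };

    seastar::future<> send_to_owner(unsigned owner_shard, std::unique_ptr<request> req) {
        return seastar::smp::submit_to(owner_shard,
                [req = seastar::make_foreign(std::move(req))] () mutable {
            // inspect req->buffers[i].get()/.size() here, avoiding operations
            // that touch refcounts; destroying the foreign_ptr posts the
            // request back to the origin shard, so the buffers are freed
            // where they were created
        });
    }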

If you do want to share a temporary buffer, you can make a new temporary_buffer pointing at the same storage, but with a deleter that foreign_ptr-points to the original.

On the original shard: the original temporary_buffer.
On the new shard: a new temporary_buffer + foreign_ptr as the deleter.

travisdowns commented 2 months ago

We considered something like this, but I had thought it was unsafe, since foreign_ptr only helps with deletion while some other temporary_buffer operations (e.g., .share()) also manipulate the reference count. However, I guess your suggestion of a new deleter with an embedded foreign_ptr may work.

Effectively, each shard has a unique copy of a "child" reference count, and when a shard's local reference count goes to zero it uses the foreign_ptr mechanism to update the "parent" reference count on the original shard.

It doesn't make sharing transparent (you still need to do the dance at the shard boundary) and it has some other downsides, but maybe it's better than making all temporary_buffers transparently shareable.

travisdowns commented 1 month ago

@avikivity wrote:

I think it's better to re-architect so sharing happens at the request level rather than buffer level. That is, have some object that owns the buffers, and share it with make_foreign.

This seems a bit hard to pull off, because it's not only about destruction: even const operations on temporary_buffer may manipulate the reference counts. So as soon as you pass a temporary_buffer to another thread, you are at risk that some operation you perform on it will manipulate the reference count in an unsafe way (and innocent changes to some methods may introduce this in the future). Furthermore, you definitely cannot let the temporary_buffer leak out of your request object, because then it will almost certainly have its reference count manipulated at some point.
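For example (placeholder names, not a complete program):

    // buf, offset and len are placeholders. This looks like read-only use,
    // but it bumps deleter::impl::refs, which is exactly the field that must
    // not be touched from two shards at once:
    seastar::temporary_buffer<char> view = buf.share(offset, len);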

tgrabiec commented 1 month ago

@avikivity wrote:

I think it's better to re-architect so sharing happens at the request level rather than buffer level. That is, have some object that owns the buffers, and share it with make_foreign.

This seems a bit hard to pull off, because it's not only about destruction: even const operations on temporary_buffer may manipulate the reference counts. So as soon as you pass a temporary_buffer to another thread, you are at risk that some operation you perform on it will manipulate the reference count in an unsafe way (and innocent changes to some methods may introduce this in the future). Furthermore, you definitely cannot let the temporary_buffer leak out of your request object, because then it will almost certainly have its reference count manipulated at some point.

You mean deleter::impl::refs? When migrating a temporary_buffer to another shard, you create a new chain of deleters which is only manipulated on the other shard, e.g.:

    // migrating buf_orig (owned by the current shard, must be non-const) to other_shard
    seastar::future<> migrate(seastar::temporary_buffer<char>& buf_orig, unsigned other_shard) {
        // take a share locally and keep only its deleter
        auto del = buf_orig.share().release();
        // wrap it in a foreign_ptr (destroyed back on this shard) and build a
        // new buffer over the same bytes; its deleter chain is only ever
        // manipulated on the destination shard
        seastar::temporary_buffer<char> buf_on_other_shard(
            buf_orig.get_write(), buf_orig.size(),
            seastar::make_object_deleter(
                seastar::make_foreign(std::make_unique<seastar::deleter>(std::move(del)))));
        return seastar::smp::submit_to(other_shard, [buf = std::move(buf_on_other_shard)] () mutable {
            // use buf freely here
        });
    }
avikivity commented 1 month ago

@avikivity wrote:

I think it's better to re-architect so sharing happens at the request level rather than buffer level. That is, have some object that owns the buffers, and share it with make_foreign.

This seems a bit hard to pull off, because it's not only about destruction: even const operations on temporary_buffer may manipulate the reference counts. So as soon as you pass a temporary_buffer to another thread, you are at risk that some operation you perform on it will manipulate the reference count in an unsafe way (and innocent changes to some methods may introduce this in the future). Furthermore, you definitely cannot let the temporary_buffer leak out of your request object, because then it will almost certainly have its reference count manipulated at some point.

My suggestion works if the caller halts work on the data while the callee is processing. If there is concurrent work on shared temporary_buffers, it's unsafe.

You could do something like this:

  1. caller calls temporary_buffer::release() and holds on to the deleter
  2. caller constructs a new temporary_buffer referring to the same data, with a custom deleter that uses atomics or magic or whatever
  3. caller goes wild sending the new temporary_buffer to random shards
  4. callees manipulate the temporary_buffer copies with extreme prejudice
  5. caller waits for callees to calm down
  6. caller invokes the deleter (lets it get destroyed)

If the caller cannot hold on to the deleter for some reason, it could chain it to the custom deleter in step 2.
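A hedged sketch of steps 1, 2 and 6 (cross_shard_state and make_shard_view are illustrative names, not Seastar API):

    // illustrative only: one atomic count shared by all shards, and a
    // separate deleter object per shard copy, so no deleter::impl::refs
    // field is ever shared across shards
    struct cross_shard_state {
        std::atomic<unsigned> refs{1};     // the caller's own reference (step 1)
        seastar::deleter original;         // the deleter released from the buffer
    };

    // steps 2-3: make a per-shard buffer over the same bytes
    seastar::temporary_buffer<char>
    make_shard_view(char* data, size_t size, cross_shard_state* st) {
        st->refs.fetch_add(1, std::memory_order_relaxed);
        return seastar::temporary_buffer<char>(data, size, seastar::make_deleter([st] {
            // the last reference dropped (from any shard) runs the original
            // deleter by destroying st
            if (st->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
                delete st;
            }
        }));
    }

In step 6 the caller drops its own initial reference with the same fetch_sub.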

I discourage such games; they are likely to end in bad performance on large machines if you call many shards with the same temporary_buffer.

travisdowns commented 1 month ago

@tgrabiec wrote:

You mean deleter::impl::refs?

Yes.

When migrating a temporary_buffer to another shard, you create a new chain of deleters which is only manipulated on the other shard, e.g.:

Yes, I understand that suggestion, but this deleter trick was Avi's "if you really need to... but probably don't" suggestion. In my comment I was referring to the preferred approach of sharing a higher-level "request" object, as Avi had suggested. In that case, the implication was that there is no need to mess with the deleters.

travisdowns commented 1 month ago

My suggestion works if the caller halts work on the data while the callee is processing.

Yes, but in the scenario we are considering this is almost impossible: the temporary buffers originate in the seastar network stack / input_stream machinery. So we don't know the extent of the sharing, and these buffers are manipulated concurrently in the reactor loop due to things like more data arriving on the network. So once we get a buffer in userspace, even if we send it to another shard and take no further action until that is done, it seems unsafe because seastar internals may also manipulate it.

Even if there were some kind of strong guarantee about exactly when the reference counts will be manipulated by seastar, and we could somehow rule it out, a common pattern would be to read, say, one buffer off the wire, send it to a shard for processing, then read a second buffer without waiting. Note that at the application layer we may always wait for the other shard to finish processing a request/buffer before manipulating it, but here even manipulating different buffers for different requests which originated on the same socket is unsafe, because the buffers are already shared when they pop out of the input_stream.
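A coroutine sketch of that pipelining pattern (shard 1 chosen arbitrarily):

    // unsafe even though each buffer is used by only one shard at a time:
    // buf1 and buf2 can share one deleter chain, so operations here race
    // with the remote shard destroying its buffer
    seastar::future<> serve(seastar::input_stream<char>& in) {
        auto buf1 = co_await in.read();
        auto done = seastar::smp::submit_to(1, [b = std::move(buf1)] () mutable {
            // process and eventually destroy b on shard 1
        });
        auto buf2 = co_await in.read();   // may touch the same refcount chain
        co_await std::move(done);
    }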

This is not a complaint; I don't see an obvious solution. It is an obvious win for input_stream to work as it does and share buffers, and doing that in a shard-local way makes sense for most uses.

avikivity commented 1 month ago

My suggestion works if the caller halts work on the data while the callee is processing.

Yes, but in the scenario we are considering this is almost impossible: the temporary buffers originate in the seastar network stack / input_stream machinery.

The native one? Or posix?

In either case, my second suggestion will work (and is indeed safer: you keep the refcounts always local).

In fact, to be completely correct, you'd cross the submit_to gap using a std::span, and create the new temporary_buffer on the other side. Or even keep using the std::span.
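E.g., roughly (process_on is an illustrative name):

    // only a std::span crosses the shard boundary; the caller must keep buf
    // alive (and all refcounting shard-local) until the returned future
    // resolves, i.e. lifetimes are nested
    seastar::future<> process_on(unsigned shard, seastar::temporary_buffer<char>& buf) {
        return seastar::smp::submit_to(shard,
                [view = std::span<const char>(buf.get(), buf.size())] {
            // read-only work on view; no temporary_buffer on this side
        });
    }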

So we don't know the extent of the sharing, and these buffers are manipulated concurrently in the reactor loop due to things like more data arriving on the network. So once we get a buffer in userspace, even if we send it to another shard and take no further action until that is done, it seems unsafe because seastar internals may also manipulate it.

Even if there were some kind of strong guarantee about exactly when the reference counts will be manipulated by seastar, and we could somehow rule it out, a common pattern would be to read, say, one buffer off the wire, send it to a shard for processing, then read a second buffer without waiting. Note that at the application layer we may always wait for the other shard to finish processing a request/buffer before manipulating it, but here even manipulating different buffers for different requests which originated on the same socket is unsafe, because the buffers are already shared when they pop out of the input_stream.

I can give a guarantee that it's not safe to manipulate shard-crossing temporary_buffers.

This is not a complaint; I don't see an obvious solution. It is an obvious win for input_stream to work as it does and share buffers, and doing that in a shard-local way makes sense for most uses.

I think the recipe in https://github.com/scylladb/seastar/issues/2450#issuecomment-2402744335 should work (with or without constructing a temporary_buffer on the callee side), as long as lifetimes are nested.

An alternative solution: make the buffer_allocator interface public, make sure it can be passed down to your data source, and customize it so it generates cross-shard-safe temporary_buffers. I don't recommend it, though, and it probably doesn't work: the buffer chains will still be unsafe.

travisdowns commented 1 month ago

The native one? Or posix?

posix

In either case, my second suggestion will work (and indeed is safer, you keep the refcounts always local).

Yes, agreed. I just wanted to make sure I wasn't missing anything about the first suggestion.

I think the recipe in https://github.com/scylladb/seastar/issues/2450#issuecomment-2402744335 should work (with or without constructing a temporary_buffer on the callee side), as long as lifetimes are nested.

Agreed.

An alternative solution ...

I'd be pessimistic about the chances of upstreaming this.