google / tarpc

An RPC framework for Rust with a focus on ease of use.
MIT License

Transfer data using tarpc #299


zy1994-lab commented 4 years ago

tarpc provides a nice interface to program against, so I want to use it to transfer data among different machines in a distributed cluster. Apart from the "normal" RPC use case, which looks like the following:

```rust
type GetFut = Ready<Vec<u32>>;
fn get(self, _: context::Context) -> Self::GetFut {
    // something here
}
```

I would like to also have the following method:

```rust
type PutFut = Ready<()>;
fn put(self, _: context::Context, data: Vec<u32>) -> Self::PutFut {
    // something here
}
```

In this way, I suppose I can both get data from and send data to the server. The vector in my use case can be very large. My questions are: 1) Will the upload and download streams cause congestion in the same channel, given that the data size can be very large in both directions? 2) How would this put method compare to, say, a handwritten TcpStream or something like MPI?

I have used tarpc before, but I don't really understand how it works under the hood. Can anyone kindly help me? Thanks so much!

tikue commented 4 years ago

Hey! Sorry, I lost track of this. So in general, blob transfer is a hard problem that unary RPCs are not well suited to:

  1. It is expected that a single request is small enough to fit in memory. For very large blobs, this is not true: RAM usage could very easily spike if you're, say, transferring a Blu-ray video in a single request.
  2. Load shedding is on a per-request basis: in tarpc, message deserialization typically happens before load shedding. Blobs can be arbitrarily large, which effectively breaks load shedding. A sufficiently smart transport layer could handle this better, perhaps by only reading X bytes before yielding back to the scheduler (tokio just had a blog post about stuff like this).
  3. If you send only a small chunk in each RPC, you won't have the above problems, but then you won't necessarily have a guarantee that the chunks are hitting the same backend, e.g. if your client round-robins to multiple backends. This could be a problem if you're streaming to a file on a specific server; it's less of a problem if you know there's only one backend you could be talking to.
zy1994-lab commented 4 years ago

Thanks Tim. Right now I'm splitting a large file into small chunks and sending them in multiple RPC requests, since in my understanding this won't cause too much trouble as long as the size of each RPC is reasonable. Some systems, like Timely Dataflow, automatically break large payloads into small batches during data transfer; would it be possible to add this feature to tarpc?

zy1994-lab commented 4 years ago

BTW, what's the maximum payload/frame size I can send using tarpc? Can I configure this number?

tikue commented 4 years ago

The max payload/frame size is up to the transport to decide. For example, if you're using an in-memory channel that doesn't serialize requests or responses, you probably don't want to enforce a payload size at all. Many serde serializers, like bincode, support a maximum serialization size.