Duplicate memory of write data

liyinshubyte commented 2 years ago

When I use Ganesha to write data and the backend writes slowly, Ganesha exhausts its worker threads and memory usage climbs, mainly because of buffered write data. For example, if the maximum thread count is 10000 and each write request carries 1MB of data, the write data should occupy 10GB when all threads are busy. But I find that the write data occupies more than 20GB.
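The arithmetic behind those numbers can be sketched as below. This is only an illustration: the function and parameter names are hypothetical (they are not Ganesha configuration options), and 1MB is taken as 1,000,000 bytes to match the round figures above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Worst-case bytes held by in-flight WRITE payloads.
 * Hypothetical helper for illustration only; none of these names
 * come from Ganesha or ntirpc.
 */
static uint64_t write_payload_bytes(uint64_t busy_threads,
                                    uint64_t bytes_per_write,
                                    uint64_t copies_per_write)
{
    /* Every in-flight request holds at least one copy of its payload. */
    assert(copies_per_write >= 1);
    return busy_threads * bytes_per_write * copies_per_write;
}
```

With 10000 busy threads and 1MB per request, one copy per request gives 10GB and two copies give 20GB, which matches the observed usage.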

The reason is that there are two copies of the write data in memory. The first comes from: svc_rqst_xprt_task_recv->svc_vc_recv->xdr_ioq_uv_create->gsh_malloc; the second comes from: svc_rqst_xprt_task_recv->svc_request->nfs_rpc_process_request->xdr_COMPOUND4args->xdr_array_decode->xdr_nfs_argop4->xdr_WRITE_SAME4args->xdr_bytes_decode->gsh_malloc. Both copies are freed when the compound finishes, but I think the first copy could be freed as soon as xdr_bytes_decode finishes, which would save about 10GB of memory when write requests are slow.

ffilz commented 2 years ago

Yes, that's another thing we need to work on. Ideally we would just point the write data at the original buffer.

And at some point, we need to support a vector of buffers (the FSAL I/O code handles vectors).
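A minimal sketch of "pointing the write data at the original buffer", assuming the whole request sits in one contiguous receive buffer. It decodes an XDR variable-length opaque (a 4-byte big-endian length followed by data padded to a multiple of 4 bytes) and returns a pointer into the receive buffer instead of allocating and copying. `struct opaque_ref` and `decode_opaque_ref` are hypothetical names, not ntirpc or Ganesha APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical zero-copy view of decoded opaque data. */
struct opaque_ref {
    const uint8_t *data; /* points into the receive buffer, not a copy */
    uint32_t len;        /* payload length without XDR padding */
};

/*
 * Decode an XDR opaque<> starting at *off in buf.
 * On success, fills *out, advances *off past the padded data, returns 0.
 * Returns -1 if the buffer is too short.
 */
static int decode_opaque_ref(const uint8_t *buf, size_t buflen,
                             size_t *off, struct opaque_ref *out)
{
    assert(buf != NULL && off != NULL && out != NULL);

    if (*off > buflen || buflen - *off < 4)
        return -1;

    const uint8_t *p = buf + *off;
    uint32_t len = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                   ((uint32_t)p[2] << 8) | (uint32_t)p[3];
    size_t remaining = buflen - *off - 4;

    if (len > remaining)
        return -1;

    /* XDR pads opaque data to a 4-byte boundary. */
    size_t padded = ((size_t)len + 3) & ~(size_t)3;
    if (padded > remaining)
        return -1;

    out->data = p + 4; /* no malloc, no memcpy */
    out->len = len;
    *off += 4 + padded;
    return 0;
}
```

The trade-off is lifetime: the receive buffer has to stay alive until the back end has finished with the data, so it cannot be released early the way a private copy would allow.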

liyinshubyte commented 2 years ago

@ffilz Thanks, I will work on it.

ffilz commented 1 year ago

We're going to prioritize working on this for V6, have you done anything with it?

liyinshubyte commented 1 year ago

@ffilz Sorry, I still have no time available for this. Please go ahead and work on it.

ffilz commented 1 year ago

There are also extraneous data copies in the READ path.

For the READ path, we could re-structure the READ response to allow incorporating buffers passed from the back end filesystem. Doing so would require care that the buffers not be modified (which means a data copy WOULD become necessary if we are doing krb5p since we can't then encrypt in place). But the solution should make it clear if the buffer passed to RPC can be modified or not, so krb5p only copies if the buffer is read-only. Ultimately, the XDR encoded response would become an iov that includes the back end filesystem buffer for the READ data.
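One way to picture the "modifiable or not" marker for the READ path is a flag on each response buffer, with the privacy layer copying only when the flag says the buffer is read-only. This is a sketch under that assumption; `struct resp_buf` and `make_encryptable` are hypothetical names and do not describe how Ganesha or ntirpc implement it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical response buffer descriptor. */
struct resp_buf {
    uint8_t *base;
    size_t len;
    bool read_only; /* true: owned by the back end, must not be modified */
    bool owned;     /* true: allocated here, caller must free(base) */
};

/*
 * Produce a buffer that is safe to encrypt in place.
 * Writable input is passed through untouched; read-only input is
 * copied once. Returns 0 on success, -1 on allocation failure.
 */
static int make_encryptable(const struct resp_buf *in, struct resp_buf *out)
{
    assert(in != NULL && out != NULL);

    if (!in->read_only) {
        *out = *in;
        out->owned = false; /* still the caller's original buffer */
        return 0;
    }

    uint8_t *copy = malloc(in->len ? in->len : 1);
    if (copy == NULL)
        return -1;
    if (in->len)
        memcpy(copy, in->base, in->len);

    out->base = copy;
    out->len = in->len;
    out->read_only = false;
    out->owned = true;
    return 0;
}
```

Without encryption the read-only buffer can be placed straight into the response iov, so the copy cost falls only on the krb5p case.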

On the WRITE side, we will receive the request into an iov and should decode it by filling in the request structure while creating an iov that maps the WRITE data chunks onto the original iov in which RPC received the request. These buffers should then be passed all the way to the back end filesystem as an iov, preferably with no copy on the way there.
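The mapping step on the WRITE side could look roughly like the sketch below: given the iov the request was received into, plus the offset and length of the WRITE payload within it, build a second iov whose entries point into the original buffers. `iov_map_range` is a hypothetical helper using the POSIX `struct iovec`; it is not an existing ntirpc or FSAL function.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Map the byte range [off, off + len) of the concatenated src iov
 * onto dst without copying any data.
 * Returns the number of dst entries filled, or -1 if the range does
 * not fit inside src or dst has too few entries.
 */
static int iov_map_range(const struct iovec *src, int src_cnt,
                         size_t off, size_t len,
                         struct iovec *dst, int dst_max)
{
    assert(src != NULL && dst != NULL);

    int n = 0;
    for (int i = 0; i < src_cnt && len > 0; i++) {
        /* Skip source chunks that end before the payload starts. */
        if (off >= src[i].iov_len) {
            off -= src[i].iov_len;
            continue;
        }

        size_t take = src[i].iov_len - off;
        if (take > len)
            take = len;
        if (n == dst_max)
            return -1;

        dst[n].iov_base = (char *)src[i].iov_base + off;
        dst[n].iov_len = take;
        n++;

        len -= take;
        off = 0; /* later chunks are consumed from their start */
    }
    return len == 0 ? n : -1;
}
```

The resulting array could then be handed to a vectored write such as `pwritev()`, or to a back end that accepts an iov, so the payload is never gathered into a second contiguous buffer.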

The one challenge is that if the physical I/O on either end requires using hardware buffers, we might have to do data copies, and it may be tricky and undesirable to allow Ganesha to "own" the hardware buffers during processing. But apart from copies to/from hardware buffers and copies needed for encryption, we should be able to eliminate all other data copies (fortunately we are already structured such that integrity with krb5i does not require a copy: we checksum in place and put the checksum in a separate buffer in the iov).

ffilz commented 3 months ago

Closing as done with 6.0 release.

nfs-ganesha / nfs-ganesha

Duplicate memory of write data #861