jacobsa / fuse

A Go package for implementing a FUSE file system.
Apache License 2.0

Add a config to allocate the read buffer when vectored reads are enabled #148

Closed sbauersfeld closed 9 months ago

sbauersfeld commented 1 year ago

Currently, when vectored reads are enabled, this FUSE library stops providing the pre-allocated read buffer. We would like to use vectored reads, but the pre-allocated read buffer is still very useful for us.

Our application prefetches data from the cloud, but sometimes the kernel asks for data faster than we can prefetch it, in which case we need a buffer into which to read the data from the network. Without vectored reads, we can use the pre-allocated read buffer. This is very efficient because the pre-allocated read buffer reuses spare space from the buffer the library already uses to read operations from the kernel.

This change proposes adding a config option that allocates the ReadFileOp.Dst buffer even when vectored reads are enabled. The file system implementation can then read data into ReadFileOp.Dst if it wishes, and it must append ReadFileOp.Dst to ReadFileOp.Data if it wants that data to be included in the response to the kernel.
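A minimal sketch of the proposed flow. The `ReadFileOp` struct here is a simplified stand-in for the library's `fuseops.ReadFileOp` (only the `Dst` and `Data` fields mentioned in this thread), and `readFromNetwork` is a hypothetical helper, not part of jacobsa/fuse:

```go
package main

import "fmt"

// ReadFileOp is a simplified stand-in for fuseops.ReadFileOp;
// the real struct has many more fields.
type ReadFileOp struct {
	Dst  []byte   // pre-allocated by the library (proposed: also when vectored reads are on)
	Data [][]byte // response segments used when vectored reads are enabled
}

// readFromNetwork is a hypothetical helper that fills buf from the
// cloud and returns the number of bytes written.
func readFromNetwork(buf []byte) int {
	return copy(buf, "hello")
}

// handleRead sketches the proposed pattern: read into the
// library-provided Dst buffer, then append it to Data so the
// bytes are included in the reply to the kernel.
func handleRead(op *ReadFileOp) {
	n := readFromNetwork(op.Dst)
	op.Data = append(op.Data, op.Dst[:n])
}

func main() {
	op := &ReadFileOp{Dst: make([]byte, 16)}
	handleRead(op)
	fmt.Printf("%s\n", op.Data[0])
}
```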

sbauersfeld commented 1 year ago

Hi @stapelberg , I would really appreciate it if you could take a look at this PR when you get a chance!

stapelberg commented 1 year ago

Currently, when vectored reads are enabled, this FUSE library stops providing the pre-allocated read buffer. We would like to use vectored reads, but the pre-allocated read buffer is still very useful for us.

I thought the whole point of using vectored reads was to avoid using library-provided buffers. Are you saying vectored reads are still faster, even without the buffer management difference?

cc @vitalif who contributed vectored read support

sbauersfeld commented 1 year ago

@stapelberg The benefit of vectored reads is that file system implementations are not forced to copy data into the library-provided buffer.

In our case, we sometimes already have the data in memory, and enabling vectored reads lets us skip a data copy. But sometimes we don't have the data in memory, and with vectored reads enabled we have to allocate our own memory to hold it. That is somewhat wasteful and slow compared to reusing the library-provided buffer, which the library has already allocated.
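The two cases described above can be sketched as follows. `ReadFileOp` is again a simplified stand-in for the library type, and the prefetch `cache` and `fetchFromCloud` helper are hypothetical:

```go
package main

import "fmt"

// ReadFileOp is a simplified stand-in for fuseops.ReadFileOp.
type ReadFileOp struct {
	Data [][]byte // segments returned to the kernel (vectored reads)
}

// cache is a hypothetical in-memory prefetch cache keyed by offset.
var cache = map[int64][]byte{0: []byte("cached bytes")}

// fetchFromCloud is a hypothetical cache-miss path. With vectored
// reads the library provides no buffer, so the implementation must
// allocate one itself here.
func fetchFromCloud(size int) []byte {
	buf := make([]byte, size)
	n := copy(buf, "network bytes")
	return buf[:n]
}

func handleRead(op *ReadFileOp, offset int64, size int) {
	if seg, ok := cache[offset]; ok {
		// Cache hit: append a reference to memory we already
		// hold. This is the zero-copy win of vectored reads.
		op.Data = append(op.Data, seg)
		return
	}
	// Cache miss: self-allocated buffer, used for one read only.
	op.Data = append(op.Data, fetchFromCloud(size))
}

func main() {
	op := &ReadFileOp{}
	handleRead(op, 0, 1024)    // hit: zero copy
	handleRead(op, 4096, 1024) // miss: allocate
	fmt.Println(string(op.Data[0]), "/", string(op.Data[1]))
}
```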

vitalif commented 1 year ago

I'd also say it's a strange idea. Why can't you just keep a pool of allocated buffers if you want to avoid the memory allocation cost? And in fact, why do you want to avoid the allocation cost at all if you're reading from the network? The network read will be slower anyway. And you won't be able to reuse this buffer; it will only be used for one read...
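The buffer pool vitalif suggests could be sketched with `sync.Pool`. `ReadFileOp` and `readFromNetwork` are the same simplified stand-ins as above, not the library's actual API; the caveat in the comments is that with vectored reads the buffer stays referenced until the kernel reply has been written, so it can only be recycled afterwards:

```go
package main

import (
	"fmt"
	"sync"
)

// ReadFileOp is a simplified stand-in for fuseops.ReadFileOp.
type ReadFileOp struct {
	Data [][]byte
}

// bufPool holds 1 MiB read buffers for reuse across cache-miss
// reads, the user-side alternative to a library-provided buffer.
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 1<<20) },
}

// readFromNetwork is a hypothetical helper that fills buf and
// returns the number of bytes written.
func readFromNetwork(buf []byte) int {
	return copy(buf, "network bytes")
}

// handleRead takes a buffer from the pool and appends it to the
// reply. The returned release func must only be called after the
// kernel reply has been written, since Data still references buf.
func handleRead(op *ReadFileOp) (release func()) {
	buf := bufPool.Get().([]byte)
	n := readFromNetwork(buf)
	op.Data = append(op.Data, buf[:n])
	return func() { bufPool.Put(buf) }
}

func main() {
	op := &ReadFileOp{}
	release := handleRead(op)
	fmt.Println(string(op.Data[0]))
	release() // safe now; reply is done
}
```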

sbauersfeld commented 1 year ago

Thanks for the reply @vitalif. It is not just about the memory allocation cost; it's also about the total memory usage of our system. The library-provided buffer has already been allocated, and it is not actually released: it is kept around and later reinserted into the freelists so it can be reused for the next operation. This buffer is also always at least 1 MB, since it is allocated large enough to hold the largest possible message from the kernel.

Yes, we certainly can keep our own pool of 1 MB buffers, or just accept the performance hit of frequently allocating and releasing buffers (which we have found to be non-negligible even with the network latency).

However, if we can simply reuse the library-provided buffer, we avoid duplicate allocations, which reduces the overall memory usage and complexity of our system and improves performance.

sbauersfeld commented 1 year ago

Hi @stapelberg any more comments on this?

stapelberg commented 9 months ago

I’m willing to accept patches that result in jacobsa/fuse releasing memory earlier if possible, or to introduce size classes within the freelist to rightsize overall memory consumption.

But, the pull request in its current form seems too specific of a performance optimization to me.

Sharing buffers between the jacobsa/fuse package and its users purely for performance reasons (as opposed to exchanging data) increases the complexity of the code and API surface too much for too little gain.

Therefore, I’ll close this PR for now. Do feel free to suggest different performance optimizations, though.

Thanks.


(Note for myself: I vaguely remembered Go code that assigned function arguments to local variables in order for the garbage collector to reclaim the associated memory more quickly, and I was wondering if we should pursue something like that here. However, that technique seems to be not required with Go 1.8+, so nevermind that.)