[request] scatter gather writes

liamstask commented 6 years ago

breaking out as distinct from #107

A scatter gather write interface (iovec or similar) is required in order to retain zero-copy functionality for some message formats (primarily considering Cap'n Proto).

gdamore commented 6 years ago

So under the hood, nanomsg has a scatter gather facility, but it isn't exposed in the message API, and nng_msg is not scatter/gather.

I'd like to understand more about this particular need -- usually scatter/gather is needed because headers are separate from message bodies. To that end, nng's messages are allocated with extra space at the front "headroom", and you can prepend data to them usually with no copy penalty.
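To make the headroom idea concrete, here is a minimal sketch in plain C. The `hr_buf` type and its functions are hypothetical and do not reflect nng's internal layout; in nng itself the prepend operation is exposed through `nng_msg_insert()`. The point is only that a header can be written into spare bytes in front of the body, so the body is never moved.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer type for illustration only; NOT nng's real layout. */
typedef struct {
    unsigned char *base; /* start of the allocation */
    unsigned char *data; /* start of the live bytes */
    size_t         len;  /* number of live bytes */
    size_t         cap;  /* total allocation size */
} hr_buf;

/* Allocate a buffer holding `body`, with `headroom` spare bytes in front. */
static int hr_init(hr_buf *b, const void *body, size_t len, size_t headroom)
{
    assert(b != NULL && body != NULL);
    b->base = malloc(headroom + len);
    if (b->base == NULL) {
        return -1;
    }
    b->cap  = headroom + len;
    b->data = b->base + headroom;
    b->len  = len;
    memcpy(b->data, body, len); /* the only time the body is copied */
    return 0;
}

/* Prepend a header. Returns 0 if it fit in the headroom (body untouched),
 * or -1 if the headroom is exhausted (a real implementation would
 * reallocate and copy at that point). */
static int hr_prepend(hr_buf *b, const void *hdr, size_t hlen)
{
    if ((size_t) (b->data - b->base) < hlen) {
        return -1;
    }
    b->data -= hlen;
    memcpy(b->data, hdr, hlen);
    b->len += hlen;
    return 0;
}

static void hr_free(hr_buf *b)
{
    free(b->base);
    b->base = NULL;
}
```

This works well for small protocol headers, but it does not help when the payload itself already lives in several separate buffers, which is the case raised in this issue.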

I can imagine creating a richer data type (extended message?) with an actual iovec, but this will entail a bit of complexity, and is something I'd just as soon not do unless there is demonstrated need.

Note that currently message structures in nng are not especially zero copy friendly (meaning you cannot just assign a large buffer you have from some other source to an nng message). Fixing this is something I'd like to undertake, but it's a little tricky -- especially for the receive side where we have to preallocate the buffers to receive into. For most protocols involving smaller message formats, this is not useful.

liamstask commented 6 years ago

My motivation is based on the ability to send Cap'n Proto messages zero-copy, but I assume this functionality could be generally valuable. I'll keep my description related to the requirements for Cap'n Proto.

Each message is backed by an arena allocator - as the message grows beyond the size of the currently allocated segment, additional segments are allocated as needed. At write time, a collection of one or more segments is available, so the writer needs to transmit both the content of those segments and some header information describing the number of segments and each of their sizes.

If scatter gather is not available, users must condense all segments into a contiguous buffer for transmission, incurring a copy.

Assuming all segments can fit into a unit of transmission of a given transport, receivers shouldn't need to be aware of the fact that the segments weren't originally allocated contiguously, as long as they arrive that way.
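As a rough illustration of what a gather write buys here, the sketch below sends a segment table plus the segments with a single POSIX `writev()` call, so the segments are never coalesced in user code. It assumes the Cap'n Proto stream framing as I understand it (a little-endian 32-bit segment count minus one, one 32-bit size in 8-byte words per segment, then padding to an 8-byte boundary). The names `send_segments`, `build_seg_table`, and `MAX_SEGS` are made up for this example and are not part of nng or Cap'n Proto.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define MAX_SEGS 15 /* 1 header iovec + 15 segments = 16 entries */

static void put_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char) (v & 0xff);
    p[1] = (unsigned char) ((v >> 8) & 0xff);
    p[2] = (unsigned char) ((v >> 16) & 0xff);
    p[3] = (unsigned char) ((v >> 24) & 0xff);
}

/* Write the segment table into hdr and return its length in bytes.
 * Assumes every segment length is a multiple of 8 bytes. */
static size_t build_seg_table(unsigned char *hdr, const struct iovec *segs,
    uint32_t nsegs)
{
    size_t off = 0;
    put_le32(hdr + off, nsegs - 1);
    off += 4;
    for (uint32_t i = 0; i < nsegs; i++) {
        put_le32(hdr + off, (uint32_t) (segs[i].iov_len / 8));
        off += 4;
    }
    if (off % 8 != 0) { /* pad the table to a word boundary */
        put_le32(hdr + off, 0);
        off += 4;
    }
    return off;
}

/* Gather-write the table and all segments in one call. Returns the number
 * of bytes written or -1. A short write is possible on real sockets and
 * would need to be resumed by the caller. */
static ssize_t send_segments(int fd, const struct iovec *segs, uint32_t nsegs)
{
    unsigned char hdr[4 + 4 * MAX_SEGS + 4];
    struct iovec  iov[1 + MAX_SEGS];

    assert(segs != NULL);
    if (nsegs == 0 || nsegs > MAX_SEGS) {
        return -1;
    }
    iov[0].iov_base = hdr;
    iov[0].iov_len  = build_seg_table(hdr, segs, nsegs);
    memcpy(&iov[1], segs, nsegs * sizeof(segs[0])); /* copies descriptors only */
    return writev(fd, iov, (int) (1 + nsegs));
}
```

The receiver reads one contiguous byte stream and does not need to know the segments were allocated separately, which matches the expectation described above.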

If there's a way to achieve this within the current nng_msg framework, that would be great!

gdamore commented 6 years ago

I will need to think about this a bit more. Right now, the message consists of two components -- a header and a body. Each of those components can have extra space at the front ("headroom") or back ("tailroom") where the message can grow without reallocation or copying. For modest additions this is generally adequate -- but if you need to collect up large amounts of data in separate pieces, then you're going to really want a scatter/gather that is more than what I have implemented.

Unfortunately, the nng protocol layer stuff makes some rather blithe assumptions about the structure of messages, and the message API itself lacks the kind of richness you'd really want. Redesigning to accommodate true scatter/gather may be a lot more work.

(Also, as a little note -- we currently have a fixed limit of only 4 segments in the underlying AIO's iovecs. This is because the iovs are statically allocated. With some protocol combinations, we do wind up using all or nearly all four of those. Extending the AIO to accommodate more (for example 8) would not be hard, but making it much larger or unbounded would be rather challenging.)

I'm also thinking about your "arena allocator" -- if you're talking about send (you must be), then you should know that at present we don't really support true "zero copy" -- meaning we copy the data anyway. (Doing true zero copy would be interesting, but it's also a "hard" problem, because of message ownership challenges -- we would need a way to call the user code back to actually free the message when we are done with it. For most cases this is sufficiently complex -- usually including extra locking -- that simply copying the data into the contiguous message is faster and simpler.)

gdamore commented 6 years ago

Perhaps the message can carry an iovec. The challenge in that case will be figuring out how to manage the lifetime of both the iovec and the memory pointed to. Right now messages "own" their own data, but in this case that would not be true.

If the callbacks can free up the iovecs, then it gets simpler.
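One possible shape for callback-based lifetime management is sketched below. Everything here (`owned_chunk`, `chunk_release_fn`, `chunks_release`) is hypothetical and not part of the nng API: each chunk carries an optional release callback, and the transport invokes it once it no longer needs the memory. A NULL callback marks static or borrowed data that must not be freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical types; nothing here exists in nng. */
typedef void (*chunk_release_fn)(void *buf, void *arg);

typedef struct {
    void            *buf;
    size_t           len;
    chunk_release_fn release; /* NULL: static/borrowed data, nothing to free */
    void            *arg;     /* opaque pointer handed back to `release` */
} owned_chunk;

/* Called by the (imagined) transport when it is done with the chunks.
 * Returns how many chunks had a release callback invoked. */
static size_t chunks_release(owned_chunk *chunks, size_t n)
{
    size_t released = 0;

    assert(chunks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (chunks[i].release != NULL) {
            chunks[i].release(chunks[i].buf, chunks[i].arg);
            released++;
        }
        chunks[i].buf = NULL; /* the chunk must not be touched again */
        chunks[i].len = 0;
    }
    return released;
}

/* Example callback: free a malloc'd chunk and bump a counter passed
 * through `arg`, standing in for whatever bookkeeping the user needs. */
static void release_and_count(void *buf, void *arg)
{
    free(buf);
    if (arg != NULL) {
        ++*(int *) arg;
    }
}
```

The callback may run on a library thread, which is where the extra locking mentioned earlier in the thread would come in; this sketch ignores threading entirely.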

liamstask commented 6 years ago

I think an implementation that involves transfer of ownership to nng upon calling write would work for the case I have in mind 👍

- user messages can be constructed in chunks, ownership of which could be transferred to nng when writing
- perhaps it would be most flexible to allow for some flags that indicate ownership of the data in the iovecs, to handle truly static data, but i think an implementation that assumes taking ownership would still provide more flexibility than the current api
- perhaps nng_msg could statically provide some number of iovecs (4? 8?) for user data to avoid allocating the list of iovecs itself for shorter messages, and fall back to heap allocation of a larger list if necessary
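The last point, a small inline array with a heap fallback, could look roughly like the sketch below. The `iov_list` type and its functions are invented for illustration, and the inline size of 4 is just one of the values floated above. Short messages never allocate; longer ones spill to the heap and grow by doubling.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

#define INLINE_IOVS 4 /* illustrative; the thread floats 4 or 8 */

/* Hypothetical list type. It holds a pointer into itself, so it must be
 * passed by address and never copied by value. */
typedef struct {
    struct iovec  inline_iov[INLINE_IOVS];
    struct iovec *iov;   /* points at inline_iov or at a heap block */
    size_t        count; /* entries in use */
    size_t        cap;   /* entries available */
} iov_list;

static void iov_list_init(iov_list *l)
{
    l->iov   = l->inline_iov;
    l->count = 0;
    l->cap   = INLINE_IOVS;
}

static int iov_list_on_heap(const iov_list *l)
{
    return l->iov != l->inline_iov;
}

/* Append one buffer descriptor. Returns 0 on success, -1 on allocation
 * failure (the list is left unchanged in that case). */
static int iov_list_push(iov_list *l, void *buf, size_t len)
{
    assert(l != NULL);
    if (l->count == l->cap) {
        size_t        ncap = l->cap * 2;
        struct iovec *n;

        if (iov_list_on_heap(l)) {
            n = realloc(l->iov, ncap * sizeof(*n));
            if (n == NULL) {
                return -1;
            }
        } else {
            n = malloc(ncap * sizeof(*n));
            if (n == NULL) {
                return -1;
            }
            memcpy(n, l->inline_iov, l->count * sizeof(*n));
        }
        l->iov = n;
        l->cap = ncap;
    }
    l->iov[l->count].iov_base = buf;
    l->iov[l->count].iov_len  = len;
    l->count++;
    return 0;
}

static void iov_list_fini(iov_list *l)
{
    if (iov_list_on_heap(l)) {
        free(l->iov);
    }
    iov_list_init(l);
}
```

Only the descriptors are stored here; ownership of the buffers they point to is a separate question, as discussed above.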

and yea, re: 'zero-copy', agreed that it would be a lot of work to ensure zero copies all the way through transmission from the kernel, but mainly what i have in mind is whether nng's api supports zero copy functionality from user code - ie, does the data need to be copied in order to be coalesced into the single 'body' buffer, or can it be submitted in its original form?

probably best to discuss as a separate issue, but if nng takes responsibility for freeing messages, it might also be interesting to consider providing some message pooling functionality to help avoid hitting new/malloc when building messages.

gdamore commented 6 years ago

I’ve been thinking quite a bit more about this. Right now the iovec list in aios is only 4 elements long, but I am considering making it heap allocated, or an alternative like you mentioned.

I’m still contemplating ideas around data member ownership.

gdamore commented 6 years ago

So aios now have a "long" iov list (at least 16, on some platforms up to 64). We have to apply some limits because on some systems the underlying scatter/gather I/O doesn't support longer lists. (POSIX mandates support for at least 16 for example.) Also, we use stack allocations in a few places to avoid having to allocate larger numbers...
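For callers that hold more pieces than a single gather call accepts, one workaround is to submit the list in batches. The sketch below uses plain POSIX `writev()` rather than nng, with a batch size of 16 (the minimum POSIX guarantees). `writev_batched` and `IOV_BATCH` are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>
#include <unistd.h>

#define IOV_BATCH 16 /* the floor POSIX guarantees for one writev() call */

/* Write `n` iovecs to `fd`, at most `batch` per writev() call.
 * Returns the total bytes written, or -1 on error. Stops early after a
 * short write, in which case the caller must resume from the returned
 * byte offset. */
static ssize_t writev_batched(int fd, const struct iovec *iov, size_t n,
    size_t batch)
{
    ssize_t total = 0;

    assert(iov != NULL || n == 0);
    if (batch == 0) {
        return -1;
    }
    while (n > 0) {
        size_t  k    = n < batch ? n : batch;
        size_t  want = 0;
        ssize_t got;

        for (size_t i = 0; i < k; i++) {
            want += iov[i].iov_len;
        }
        got = writev(fd, iov, (int) k);
        if (got < 0) {
            return -1;
        }
        total += got;
        if ((size_t) got != want) {
            break; /* short write: do not skip ahead */
        }
        iov += k;
        n -= k;
    }
    return total;
}
```

Batching keeps user code copy-free, but each batch is a separate system call, so on a message-oriented transport the pieces would still have to be framed as one message by the layer above.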

gdamore commented 4 years ago

And now, I've eliminated the long (up to 64) iov support -- we support up to 16 everywhere. We still need to figure out how to expose this nicely.

gdamore commented 4 years ago

Btw, when using the lower level stream APIs, IOVs are available. nng_stream_send() and nng_stream_recv() each use an iov in the AIO.

[request] scatter gather writes #167