The number of streams is probably the thing that peers will impose a limit on, not the number of pending streams. So the "pending stream budget" seems like the wrong thing to measure.
A global stream limit is not that useful to begin with. What we need in libp2p is a limit per protocol. This would require a coordination mechanism running on top of a stream muxer.
> The number of streams is probably the thing that peers will impose a limit on, not the number of pending streams. So the "pending stream budget" seems like the wrong thing to measure.
Thanks for the input!
At least from an implementation PoV, I wouldn't immediately know what a useful starting point for the max number of streams is, and without an extension to the messages in yamux, I don't think we can communicate a limit directly.
In case both nodes are equally fast in processing streams, it seems okay to me to have them open more streams. That is already the situation today. Also see https://github.com/libp2p/rust-libp2p/discussions/3039 which started this thought process.
Setting a limit on the pending streams only triggers if the receiving node slows down in processing.
A downside of this proposal is that I think we can never increase the budget, so limiting the total number of streams seems too much.
> A global stream limit is not that useful to begin with. What we need in libp2p is a limit per protocol. This would require a coordination mechanism running on top of a stream muxer.
To clarify my intention: I am looking at this from a bugfix point of view. What is the best value for money solution so we can get some kind of backpressure into yamux without investing heaps of time? It is a legacy muxer from a spec PoV but still widely used at this point (i.e. every rust-libp2p deployment unless they are using mplex for some reason).
I’m not yet convinced of adding some sort of back-pressure to yamux in such a fashion: existing nodes will still translate the back-pressure signal as UnsupportedProtocols, leading to false failures in the existing node, thus breaking network compatibility. This looks like a bug that cannot be fixed without switching to a new muxer (which could be a rebranding of yamux, if need be).
> I’m not yet convinced of adding some sort of back-pressure to yamux in such a fashion: existing nodes will still translate the back-pressure signal as UnsupportedProtocols, leading to false failures in the existing node, thus breaking network compatibility. This looks like a bug that cannot be fixed without switching to a new muxer (which could be a rebranding of yamux, if need be).
The above proposal does not intentionally drop streams.
What it does is to stop opening new streams in case the other party starts to drop streams. Once https://github.com/libp2p/rust-libp2p/pull/3071 is fixed, no streams will be dropped unless one node can actually not keep up with the load (i.e. is working at 100% accepting new streams).
This would be shipped completely independently of any buffering implementation. With https://github.com/libp2p/rust-yamux/pull/142, we are getting a `Poll`-based interface for opening new streams. What this spec extension does is make use of such an interface to signal back-pressure to the caller, i.e. that no more streams can be opened at the moment.
We seem to have a different perspective on network compatibility. A new version could, of course, stop opening new streams once it concludes that that is a good idea, but existing versions won’t do that — they will just understand “unsupported protocol” and draw incorrect conclusions from that. There is nothing you can do to change that.
> We seem to have a different perspective on network compatibility. A new version could, of course, stop opening new streams once it concludes that that is a good idea, but existing versions won’t do that — they will just understand “unsupported protocol” and draw incorrect conclusions from that. There is nothing you can do to change that.
An implementation that draws conclusions about dropped streams by the remote (i.e. this proposal) does not necessarily have to drop streams itself. That is orthogonal.
I originally intended to ship this in the following order:
This by itself is already a fairly good measure, assuming your peer does not apply any explicit backpressure but just tries to process streams as fast as possible. If your node is faster than theirs, the ACKs for existing streams will naturally pile up and you will eventually stop opening streams until they have answered them. My suggestion for implementing this would be to wait for all streams to be ACK'd before we start opening new ones.
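For illustration, a minimal sketch (with hypothetical names, not the rust-yamux API) of the "wait until everything is ACK'd" gating described above:

```rust
/// Hypothetical book-keeping for streams we opened (SYN sent)
/// but for which the peer has not yet sent an ACK.
#[derive(Default)]
struct PendingAcks {
    unacknowledged: usize,
}

impl PendingAcks {
    /// Called when we send a SYN for a new outbound stream.
    fn on_stream_opened(&mut self) {
        self.unacknowledged += 1;
    }

    /// Called when the peer ACKs one of our streams.
    fn on_ack_received(&mut self) {
        self.unacknowledged = self.unacknowledged.saturating_sub(1);
    }

    /// "Wait for all streams to be ACK'd before we start opening new ones."
    fn may_open_new_stream(&self) -> bool {
        self.unacknowledged == 0
    }
}

fn main() {
    let mut pending = PendingAcks::default();
    pending.on_stream_opened();
    assert!(!pending.may_open_new_stream()); // back-pressure: wait for the ACK
    pending.on_ack_received();
    assert!(pending.may_open_new_stream());
}
```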
Yamux has a `version` header field that is set to 0. Unfortunately, we can't just set that to something else because existing implementations will just fail to parse that. Instead, we can express the version via multistream-select:

- `/yamux/1.0.0` implies yamux version 0.
- `/yamux/2.0.0` would imply yamux version 1.
- yamux version 1 is yamux version 0 with an additional `type`: "Stream Buffer Update".

We can roll this out in a backwards-compatible way by first trying to negotiate `/yamux/2.0.0` and then falling back to `/yamux/1.0.0` in case the other peer does not support it.
[^1]: Actual number to be agreed upon.
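As a rough sketch of that rollout, assuming a multistream-select-style negotiation where protocols are offered in order of preference; `negotiate` and `Muxer` are hypothetical stand-ins, not actual libp2p APIs:

```rust
/// Hypothetical negotiation result.
enum Muxer {
    YamuxV2, // yamux version 1 on the wire, negotiated as /yamux/2.0.0
    YamuxV1, // yamux version 0 on the wire, negotiated as /yamux/1.0.0
}

/// Hypothetical stand-in for a multistream-select negotiation:
/// returns the first protocol from `offered` that the remote supports.
fn negotiate<'a>(offered: &[&'a str], supported_by_remote: &[&str]) -> Option<&'a str> {
    offered
        .iter()
        .copied()
        .find(|p| supported_by_remote.contains(p))
}

fn select_muxer(supported_by_remote: &[&str]) -> Option<Muxer> {
    // Try the new version first, fall back to the old one.
    match negotiate(&["/yamux/2.0.0", "/yamux/1.0.0"], supported_by_remote)? {
        "/yamux/2.0.0" => Some(Muxer::YamuxV2),
        "/yamux/1.0.0" => Some(Muxer::YamuxV1),
        _ => None,
    }
}

fn main() {
    // An old node that only speaks /yamux/1.0.0 still gets a muxer.
    assert!(matches!(
        select_muxer(&["/yamux/1.0.0"]),
        Some(Muxer::YamuxV1)
    ));
}
```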
Yes, this matches what I meant above by “rebranding of yamux”. I haven’t yet thought through all the details — if you consider error handling then the scheme to be implemented will likely be similar to Reactive Streams (considering substream requests as `onNext` payload).
> Yes, this matches what I meant above by “rebranding of yamux”. I haven’t yet thought through all the details — if you consider error handling then the scheme to be implemented will likely be similar to Reactive Streams (considering substream requests as `onNext` payload).
Thank you for linking this. I think our current interface almost matches this:
- `onSubscribe`: Implied by opening a yamux connection.
- `onNext`: We return `Ok(stream)` from `poll_inbound`.
- `onError`: We return `Err(e)` from `poll_inbound`. Any `StreamMuxer` implementation is considered to be unrecoverably failed as soon as an error is returned.
- `onComplete`: Not represented at the moment; this is usually expressed in the form of `Option` in the `Stream` interface, so we could change the return value of `poll_inbound` to `Result<Option>`.

The backpressure bit (returning `Poll::Pending`) is already supported. Do you see any limitations in the current interface for why we couldn't implement the above proposal? From my PoV, we mostly need to figure out the network layer bits.
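For reference, a sketch of how that mapping could look on a poll-based interface; the trait below is illustrative only and not the actual `StreamMuxer` trait from rust-libp2p:

```rust
use std::io;
use std::task::{Context, Poll};

/// Illustrative only; not the actual `StreamMuxer` trait from rust-libp2p.
trait InboundStreams {
    type Substream;

    /// - `Poll::Pending`            -> back-pressure towards the caller
    /// - `Poll::Ready(Ok(Some(s)))` -> `onNext`
    /// - `Poll::Ready(Ok(None))`    -> `onComplete` (the proposed `Result<Option>` change)
    /// - `Poll::Ready(Err(e))`      -> `onError`, the muxer is unrecoverably failed
    fn poll_inbound(
        &mut self,
        cx: &mut Context<'_>,
    ) -> Poll<io::Result<Option<Self::Substream>>>;
}
```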
If I understand correctly, this proposal has now changed to rolling out a new yamux version. That makes sense, since it’s the only way to solve the problem that doesn’t rely on extrapolating from incomplete information.
However, rolling out a new muxer version is really no different from rolling out a new muxer. I can think of a lot more things I’d like to change about yamux than just adding the “Stream Buffer Update” (and I’m not convinced it’s the easiest / most elegant solution for the problem). In fact, we already have a proposal for a more well-behaved stream muxer: qmux (#394).
Operationally, I’d consider #446 a prerequisite for rolling out a new muxer. Paying a 1 RTT penalty if the peer doesn’t support the muxer is probably prohibitive, and updating the network to a state where you can just assume that the new muxer (version) is supported by the vast majority of nodes takes a long time.
Priority-wise, I’m a bit skeptical of both approaches (that’s why progress on #394 has stalled). Our resource control problem would much better be solved by introducing per-protocol stream limits than by a global stream limit. Furthermore, the vast majority of our connections are QUIC connections anyway, so any improvement on the muxer side would only apply to a small minority of connections.
I agree with @marten-seemann, those are several good points. One aspect I see differently concerns the layering of back-pressure: I think there is a place for limiting new substreams regardless of protocol, purely at the muxer level. When a new substream is accepted and its protocol negotiated, the respective handler will need to allocate resources for it, so it makes sense to add a back-pressure signal from handler to muxer as well. However, this is a bigger change in the ecosystem, since now a protocol may be temporarily unavailable on a connection. This deserves a different response code than a permanent negotiation failure to allow the other side to come back later. Communicating a per-protocol budget for new substreams between peers seems excessive to me, I’ll come back to this point.
@thomaseizinger The crucial point of Reactive Streams is not in the semantics of onNext/onError/onComplete (those are present already in the non-back-pressured Rx.Net ancestor), it is the semantics of `Subscription.request()`. To make the system work in the intended dynamic push–pull fashion, the consumer needs to dynamically update the provider’s budget for sending more messages — communicating a buffer size is not enough. While messages flow downstream, demand needs to flow upstream. We opted for batching demand (i.e. making it into an integer instead of sending an Ack per message) to make this efficient also across communication channels with higher latency.
Coming back to the last point of the first paragraph: this is why I think this principle should be employed on the muxer level for all new substreams. Propagating demand for all possible protocols is too costly, complex, and in some cases not even possible (since e.g. two protocol names may share the same providing resources).
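A minimal sketch of the batched-demand idea in isolation (names are hypothetical and not tied to yamux): the consumer grants demand in batches via `request(n)`, the producer decrements it per message and stops once it hits zero:

```rust
/// Hypothetical demand counter for one connection. Demand flows upstream,
/// messages (e.g. new substream requests) flow downstream.
struct Demand {
    remaining: u64,
}

impl Demand {
    fn new() -> Self {
        Demand { remaining: 0 }
    }

    /// Consumer side: `request(n)` commits to accepting `n` more messages.
    /// Demand is batched (an integer) instead of one Ack per message.
    fn request(&mut self, n: u64) {
        self.remaining += n;
    }

    /// Producer side: may only send while demand is left.
    fn try_send(&mut self) -> bool {
        if self.remaining == 0 {
            return false; // back-pressure: wait for more demand
        }
        self.remaining -= 1;
        true
    }
}

fn main() {
    let mut demand = Demand::new();
    assert!(!demand.try_send()); // no budget yet

    demand.request(2); // consumer grants a batch of 2
    assert!(demand.try_send());
    assert!(demand.try_send());
    assert!(!demand.try_send()); // budget used up, wait for the next `request`
}
```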
@marten-seemann
> However, rolling out a new muxer version is really no different from rolling out a new muxer.
There is a difference in how much work you need to invest in creating that new muxer (in each language) and agreeing what its functionality should be etc. That is likely a timeline of months.
> updating the network to a state where you can just assume that the new muxer (version) is supported by the vast majority of nodes takes a long time.
That depends entirely on which network you are looking at. IPFS? Probably. A smaller deployment that consists entirely of rust-libp2p based nodes[^1]? Likely easier.
> Furthermore, the vast majority of our connections are QUIC connections anyway, so any improvement on the muxer side would only apply to a small minority of connections.
This is only true for `go-libp2p`. Anything big that is rust-libp2p based, like Polkadot and Ethereum 2, is almost entirely TCP + yamux based, as we only just released an alpha of QUIC a few days ago.
Like I said above, my intention here is a low-effort, high-impact solution. The suggested changes are fairly trivial to implement on the `rust-yamux` side and would benefit users that have control over their deployment, plus any user that is happy to accept a potential 1 RTT penalty in exchange for back-pressure on the muxer level.
I think there is value in doing that, perhaps for now only on the Rust side of things if it benefits some of our users.
[^1]: If I remember correctly, this is your situation @rkuhn, right?
> @thomaseizinger The crucial point of Reactive Streams is not in the semantics of onNext/onError/onComplete (those are present already in the non-back-pressured Rx.Net ancestor), it is the semantics of `Subscription.request()`. To make the system work in the intended dynamic push–pull fashion, the consumer needs to dynamically update the provider’s budget for sending more messages — communicating a buffer size is not enough. While messages flow downstream, demand needs to flow upstream. We opted for batching demand (i.e. making it into an integer instead of sending an Ack per message) to make this efficient also across communication channels with higher latency.
Can you explain this again mapped to our problem domain? I am struggling with the abstract description around reactive streams.
Here is my proposal again, as an example:
Assuming peer A opens 10 streams. The number of pending (no ACK received yet) streams is 10. Peer B processes 3 streams and thus sends 3 ACK frames back. It read all 10 SYN frames but because the upper application has only pulled 3 streams, 7 remain in an internal buffer.
To exercise back-pressure, we need to limit the number of pending streams that peer A can have on their end. Say we configure our yamux client with a maximum of 20 pending streams. In the above example, peer A would then have a budget of 13 more streams (7 are still pending out of a maximum of 20). Every time a stream is consumed on peer B, an ACK is sent and thus the budget is increased again.
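The accounting from this example, as a small illustrative sketch (the type and the numbers are placeholders):

```rust
/// Hypothetical sender-side view of the pending-stream budget.
struct StreamBudget {
    max_pending: u32, // e.g. communicated by the peer, 20 in the example
    pending: u32,     // streams opened but not yet ACKed
}

impl StreamBudget {
    fn remaining(&self) -> u32 {
        self.max_pending - self.pending
    }

    fn on_stream_opened(&mut self) {
        assert!(self.pending < self.max_pending, "would violate the budget");
        self.pending += 1;
    }

    fn on_ack_received(&mut self) {
        self.pending -= 1; // peer consumed a stream, budget increases again
    }
}

fn main() {
    // Peer A opened 10 streams, peer B has ACKed 3 of them.
    let mut budget = StreamBudget {
        max_pending: 20,
        pending: 7,
    };
    assert_eq!(budget.remaining(), 13); // budget of 13 more streams

    budget.on_stream_opened();
    budget.on_ack_received();
    assert_eq!(budget.remaining(), 13);
}
```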
My proposal allows for this number to be dynamic. We could also hard-code some value on the producer side (or let the user configure it). It would result in largely the same outcome except that we have to "trust" the other party that they are honoring the limit. If we send the limit to the other party, we can detect a protocol violation and close the connection.
Perhaps this would be an even better change to make, at least to start with. It is fully backwards-compatible if we use an unbounded buffer on the receiver side.
Disclaimer: for the following I assume that `/yamux/2.0.0` is the way to go.
What you describe is similar to a particular way of using Reactive Streams: have an initial budget (implied) and piggy-back the interpretation of `request(1)` onto each ACK frame. This could work in a static setting, i.e. without stream buffer updates. Since the network messages are sent already for other reasons, it would also be efficient.
Dynamic updates would only work in one direction, though, because a `request(N)` signals the commitment to accept N more messages — going back on that promise is not possible without further costly coordination between the parties. If for example one side were to send a negative stream buffer update while the other side has recently sent yet unreceived requests to use up the previous budget, then a conflict occurs where either side has a right to claim protocol conformance and accuse the other of breaking it.
I agree that it is interesting to consider compatible improvements. As we seem to agree on, this can only work by voluntarily limiting the request sender (as no quota or enforcement thereof has been built into existing nodes). Without a means to communicate limits, the sender would have to assume some default limit and stick to it. Meanwhile the receiver needs to be prepared to receive an unbounded rate of new substream requests — which will eventually lead to failures (that should be confined to one connection if possible, by using quotas).
It is crucial to remember that in this world there is no back-pressure per se, just polite clients and failures. This will lead to resource quotas differing substantially from back-pressured stream buffer sizes.
> What you describe is similar to a particular way of using Reactive Streams: have an initial budget (implied) and piggy-back the interpretation of `request(1)` onto each ACK frame. This could work in a static setting, i.e. without stream buffer updates. Since the network messages are sent already for other reasons, it would also be efficient.
>
> Dynamic updates would only work in one direction, though, because a `request(N)` signals the commitment to accept N more messages — going back on that promise is not possible without further costly coordination between the parties. If for example one side were to send a negative stream buffer update while the other side has recently sent yet unreceived requests to use up the previous budget, then a conflict occurs where either side has a right to claim protocol conformance and accuse the other of breaking it.
This race condition exists but I'd argue that only a naive implementation would be triggered by it. A stream buffer update can only come into effect (and thus trigger a protocol violation) after the number of pending streams has dropped below the new value (or maybe even 0).
I'd consider this an edge-case though: I wouldn't know under which condition a client changes their configured stream buffer size at runtime. Really, the only reason why we even send this value is because we can't assume one from the other party (with 0 being the only safe exception). I'd expect implementations to send this value only once and have it remain constant throughout the session.
> I agree that it is interesting to consider compatible improvements. As we seem to agree on, this can only work by voluntarily limiting the request sender (as no quota or enforcement thereof has been built into existing nodes). Without a means to communicate limits, the sender would have to assume some default limit and stick to it. Meanwhile the receiver needs to be prepared to receive an unbounded rate of new substream requests — which will eventually lead to failures (that should be confined to one connection if possible, by using quotas).
>
> It is crucial to remember that in this world there is no back-pressure per se, just polite clients and failures. This will lead to resource quotas differing substantially from back-pressured stream buffer sizes.
Perhaps that is good enough though? I wonder whether that would add delay and if so, how much.
My takeaway from this is that going for `/yamux/2.0.0` is not worth it. Instead, I want to experiment with making the `rust-yamux` implementation behave nicely by only opening a new stream once the last one has been ACK'd. It would be interesting to try and measure this but I couldn't yet think of a useful metric here. Perhaps the "backpressure time" could be useful, i.e. for how long we prevent our user from opening a new stream until they can finally open it. I am imagining this as:
1. The user calls `open_outbound`.
2. We return `Poll::Pending` because the last stream hasn't been ACK'd yet (start the timer).
3. The user calls `open_outbound` again (likely because they got woken).
4. We return `Poll::Ready` and stop the timer.

> Instead, I want to experiment with making the rust-yamux implementation behave nicely by only opening a new stream once the last one has been ACK'd. It would be interesting to try and measure this but I couldn't yet think of a useful metric here.
That will cause a huge performance regression, depending on the application protocol. Imagine HTTP layered on top of libp2p, and the user loading a website with 50 sub-resources. This would now take 50 RTTs instead of a single RTT. As a point of reference, go-yamux uses an accept backlog of 256 in its default configuration.
> As a point of reference, go-yamux uses an accept backlog of 256 in its default configuration.
What happens when this limit is reached?
Yes, that is a good question; also: do you @marten-seemann mean a backlog in the receiver or a quota in the sender?
@thomaseizinger I don’t think we’re on the same page regarding what’d be broken with dynamically lowering buffer size, but that is irrelevant if we take the compatibility route. In that latter case I don’t think we need to dial up the politeness level to ask for permission for every single substream. Having any fixed pending ack budget would be an improvement already, so the bound can also be greater than one.
The accept backlog is the number of streams that we have opened but haven't received an ACK for. Once that limit is reached, we won't allow opening new streams until an ACK is received.
Quick comment on terminology: substreams are not a thing (when it comes to libp2p specs). It's called streams. :)
> The accept backlog is the number of streams that we have opened but haven't received an ACK for. Once that limit is reached, we won't allow opening new streams until an ACK is received.
Okay, so that is what I called "pending streams" above. Good to know the go implementation has this feature. I am planning to introduce the same on the Rust side, will also go with a default of 256 then!
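For illustration, a sketch of such an outbound ACK-backlog gate in a poll-based API; the type, constant and methods are placeholders rather than the actual go-yamux or rust-yamux implementation (the `main` uses `futures::task::noop_waker` only to construct a `Context`):

```rust
use std::task::{Context, Poll, Waker};

const MAX_ACK_BACKLOG: usize = 256;

/// Placeholder for the outbound side of the ACK backlog.
struct AckBacklog {
    /// Streams we have opened (SYN sent) that the remote has not ACKed yet.
    unacknowledged: usize,
    /// Caller waiting to open a stream once the backlog shrinks.
    waker: Option<Waker>,
}

impl AckBacklog {
    /// Poll-based gate for opening a new outbound stream.
    fn poll_new_outbound(&mut self, cx: &mut Context<'_>) -> Poll<()> {
        if self.unacknowledged >= MAX_ACK_BACKLOG {
            // Back-pressure: remember the caller and wake it on the next ACK.
            self.waker = Some(cx.waker().clone());
            return Poll::Pending;
        }
        self.unacknowledged += 1;
        Poll::Ready(())
    }

    /// Called when the remote ACKs one of our streams.
    fn on_ack(&mut self) {
        self.unacknowledged -= 1;
        if let Some(waker) = self.waker.take() {
            waker.wake();
        }
    }
}

fn main() {
    let waker = futures::task::noop_waker();
    let mut cx = Context::from_waker(&waker);

    let mut backlog = AckBacklog { unacknowledged: MAX_ACK_BACKLOG, waker: None };
    assert!(backlog.poll_new_outbound(&mut cx).is_pending()); // back-pressure kicks in
    backlog.on_ack();
    assert!(backlog.poll_new_outbound(&mut cx).is_ready()); // one slot freed up
}
```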
> Quick comment on terminology: substreams are not a thing (when it comes to libp2p specs). It's called streams. :)
Yeah, `rust-libp2p` uses substream in certain places, particularly in the context of muxers 🙈
"Stream" means something very specific in Rust: https://docs.rs/futures/latest/futures/prelude/trait.Stream.html
I'd love to be consistent across the ecosystem here but I am not sure we can transition to that fully without causing a lot of ambiguity :/
Side note: I think it would be worth writing a "spec" document in here that outlines how we use yamux specifically and f.e. recommends an accept backlog of 256 etc.
I've now implemented this in rust-yamux: https://github.com/libp2p/rust-yamux/pull/153
tl;dr: I think I have a solution for back-pressure.
In the above PR, I implemented two tests:
The combination of the above two solves back-pressure assuming all implementations agree on the same magic number 256 and each implementation has a buffer of 256 streams. In case node A is slower than node B, the buffer on node A will eventually be full. Because the application hasn't written to any of the streams yet, they are not acknowledged. Once it catches up and processes a new stream, it will acknowledge it to node B, which gives B another credit for opening a new stream whilst at the same time freeing up a buffer slot on node A.
This hits a sweet spot in my eyes. We don't need to roll out a new muxer to networks but still manage to not drop streams between compliant implementations. The only downside is that the limit isn't configurable but I think that is acceptable for the benefit of introducing this in a backwards-compatible way.
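A sketch of the receiving side of that scheme, assuming (as described above) that an inbound stream is only acknowledged once the application actually uses it; the types are placeholders, not the implementation from the PR:

```rust
use std::collections::VecDeque;

const MAX_BUFFERED_INBOUND: usize = 256;

/// Placeholder for an inbound stream; only the yamux stream id is kept here.
struct InboundStream {
    id: u32,
}

/// Hypothetical receiver-side buffer of streams the remote has opened
/// (SYN received) but which the application has not yet accepted.
#[derive(Default)]
struct InboundBuffer {
    buffered: VecDeque<InboundStream>,
}

impl InboundBuffer {
    /// The remote opened a stream. With matching ACK backlogs of 256 on both
    /// sides, a compliant remote can never overflow this buffer.
    fn on_syn(&mut self, id: u32) -> Result<(), &'static str> {
        if self.buffered.len() >= MAX_BUFFERED_INBOUND {
            return Err("remote exceeded the agreed ACK backlog");
        }
        self.buffered.push_back(InboundStream { id });
        Ok(())
    }

    /// The application accepts the next stream; only now would we send the
    /// ACK frame, which hands the remote a credit for opening another stream.
    fn accept_next(&mut self) -> Option<u32> {
        self.buffered.pop_front().map(|stream| stream.id)
    }
}

fn main() {
    let mut inbound = InboundBuffer::default();
    inbound.on_syn(1).unwrap();
    inbound.on_syn(3).unwrap();
    assert_eq!(inbound.accept_next(), Some(1)); // ACK for stream 1 goes out here
}
```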
I'd consider this issue to be solved if we extend the spec with:
> **Buffering inbound streams**
>
> Implementations MUST buffer at least 256 unacknowledged inbound streams. Implementations MUST NOT implicitly acknowledge streams but wait for the application to send the first `DATA` frame.
@marten-seemann Let me know what you think!
I'm surprised that this is still considered a problem worth solving. I thought we had agreed to not pour more engineering time into yamux.
> The combination of the above two solves back-pressure assuming all implementations agree on the same magic number 256 and each implementation has a buffer of 256 streams.
That's not a valid assumption. 256 is a pretty high number, and I'd be opposed to requiring all implementations to support that many.
> I'm surprised that this is still considered a problem worth solving. I thought we had agreed to not pour more engineering time into yamux.
It is a problem on the `rust-libp2p` end because it means protocol implementations like kademlia cannot rely on the underlying muxer to support back-pressure and have to come up with their own magic constants like only allowing a maximum of 16 concurrent streams.
Regardless, most of the work (which wasn't much) was to make yamux follow the go implementation to also support the ACK backlog of 256.
> > The combination of the above two solves back-pressure assuming all implementations agree on the same magic number 256 and each implementation has a buffer of 256 streams.
>
> That's not a valid assumption. 256 is a pretty high number, and I'd be opposed to requiring all implementations to support that many.
I went off your comment here: https://github.com/libp2p/specs/issues/471#issuecomment-1345970310
I guess with the default window size of 256KB, this could amount to 256KB * 256 which is 64MB of buffered data. However, aren't you implicitly requiring this already anyway with an ACK backlog of 256? What is the point of opening up to 256 streams if you don't expect the other party to buffer them until it is ready to process them?
> > I'm surprised that this is still considered a problem worth solving. I thought we had agreed to not pour more engineering time into yamux.
>
> It is a problem on the `rust-libp2p` end because it means protocol implementations like kademlia cannot rely on the underlying muxer to support back-pressure and have to come up with their own magic constants like only allowing a maximum of 16 concurrent streams.
Yes, you can't, and you shouldn't. Relying on your muxer to provide backpressure is the wrong thing to hardcode without https://github.com/libp2p/specs/issues/520 anyway.
> > The combination of the above two solves back-pressure assuming all implementations agree on the same magic number 256 and each implementation has a buffer of 256 streams.
> >
> > That's not a valid assumption. 256 is a pretty high number, and I'd be opposed to requiring all implementations to support that many.
>
> I went off your comment here: #471 (comment)
I'm not opposed to that number in go-libp2p, but making it a requirement for all implementations is a completely different story.
> > The combination of the above two solves back-pressure assuming all implementations agree on the same magic number 256 and each implementation has a buffer of 256 streams.
> >
> > That's not a valid assumption. 256 is a pretty high number, and I'd be opposed to requiring all implementations to support that many.
> >
> > I went off your comment here: #471 (comment)
>
> I'm not opposed to that number in go-libp2p, but making it a requirement for all implementations is a completely different story.
I am happy to change the wording to SHOULD instead of MUST. Does that work for you?
Documenting an out-of-band discussion with @marten-seemann here. Unless I am missing something https://github.com/libp2p/rust-yamux/pull/153 only implements what go-libp2p-yamux does already today, that is tracking a stream accept backlog and adding an upper limit on the outbound side. It does not introduce a new mechanism, nor does it introduce an incompatibility with any existing implementation. Please confirm @thomaseizinger.
I'm with @marten-seemann on not including new requirements (like MUST and MUST NOT) into the Yamux specification. Even though a multi-stage rollout strategy would theoretically be possible, it seems like too much effort. In my opinion, our time is better spent focusing on QUIC. It seems like we're all on the same page here, @thomaseizinger and @marten-seemann.
> I am happy to change the wording to SHOULD instead of MUST. Does that work for you?
Adding a recommendation (not requirement) to the specification sounds good to me. As far as I understand this would just update the specification to reflect the status-quo, that is what go-libp2p already does today. Similar to how https://github.com/libp2p/specs/pull/519 documented the outbound side of the ACK-backlog, we can document the inbound side of the ACK-backlog.
Replacing MUST and MUST NOT with SHOULD and SHOULD NOT in @thomaseizinger's suggestion above works for me:
> Implementations SHOULD buffer at least 256 unacknowledged inbound streams. Implementations SHOULD NOT implicitly acknowledge streams but wait for the application to send the first `DATA` frame.
The entire ACK-backlog section would then read as follows:

> **Implementation considerations**
>
> **ACK backlog**
>
> Yamux allows for a stream to be opened (and used) before it is acknowledged by the remote. The ACK backlog is defined as the number of streams that a peer has opened which have not yet been acknowledged. Implementations SHOULD at most allow an ACK backlog of 256 streams. Implementations SHOULD buffer at least 256 unacknowledged inbound streams. Implementations SHOULD NOT implicitly acknowledge streams but wait for the application to send the first `DATA` frame.
Yamux does not offer any form of control over how many streams can be opened by a remote.
This has been an annoyance for a while and transports like QUIC are useful because they do support managing the number of open streams.
I'd like to propose what I believe is a backwards-compatible extension of the yamux spec that gives us some amount of back-pressure for streams.
The yamux specification states for opening of a new stream:
My proposal is as follows:
I think this is backwards-compatible because existing implementations already need to handle receiving `RST` for a stream at any point. What we are doing here is adding meaning to receiving `RST` whilst we have several pending substreams. This allows us to implement a bounded buffer on the receiver side where any streams exceeding that buffer will send a `RST` and be dropped. This means we will have 1 failed stream per physical connection that allows us to learn the other party's buffer size.

cc @mxinden @marten-seemann @rkuhn
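For illustration, a sketch of the receiver-side behaviour described in this proposal, with a bounded buffer whose overflow is signalled via `RST`; the names and the buffer size are placeholders:

```rust
/// What the receiver does with an incoming SYN in this proposal.
enum SynOutcome {
    /// The stream is buffered until the application accepts it.
    Buffered,
    /// The buffer is full: send RST and drop the stream. The sender can
    /// interpret "RST while streams are pending" as hitting the buffer limit.
    ResetAndDrop,
}

struct BoundedInbound {
    buffered: usize,
    capacity: usize, // placeholder value, to be agreed upon
}

impl BoundedInbound {
    fn on_syn(&mut self) -> SynOutcome {
        if self.buffered >= self.capacity {
            return SynOutcome::ResetAndDrop;
        }
        self.buffered += 1;
        SynOutcome::Buffered
    }
}

fn main() {
    let mut inbound = BoundedInbound { buffered: 0, capacity: 2 };
    assert!(matches!(inbound.on_syn(), SynOutcome::Buffered));
    assert!(matches!(inbound.on_syn(), SynOutcome::Buffered));
    // The third stream exceeds the buffer: one failed stream tells the
    // sender where the limit is.
    assert!(matches!(inbound.on_syn(), SynOutcome::ResetAndDrop));
}
```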