w3c / webtransport

WebTransport is a web API for flexible data transport
https://w3c.github.io/webtransport/

Question: Sharing a WebTransport connection in multiple workers? #420

Closed aboba closed 2 years ago

aboba commented 2 years ago

There are applications where WebTransport may be used along with other APIs that can consume significant CPU. For example, a send or receive pipeline can include machine learning operations or encode/decode.

In these scenarios, it may be helpful to have the send pipeline in one worker and the receive pipeline in another.

What ways might be available to utilize a WebTransport connection from multiple workers? Would any of these be considered "best practice"?

aboba commented 2 years ago

A scenario where multiple workers might be needed is transmission of high resolution/high framerate video encoded with a complex codec (e.g. AV1). In this scenario, the encoder may drive CPU utilization close to 100 percent in the worker, causing high latency throughout the pipeline. Splitting the sending and receiving pipelines into separate workers might help.

jan-ivar commented 2 years ago

it may be helpful to have the send pipeline in one worker and the receive pipeline in another.

The app can use transferable streams which have "the writable side in one realm and the readable side in another realm".

This should work for simple pipelines at least, e.g.

const wt = new WebTransport(url);
const {readable, writable} = await wt.createBidirectionalStream();
// Transfer each end to the worker that will use it; the transfer moves
// the endpoint into that worker's realm.
sendWorker.postMessage({writable}, [writable]);
receiveWorker.postMessage({readable}, [readable]);

If I read this tweet from @domenic right, this should (in theory) suffice to let user agents optimize things so that bytes never need to touch the JS thread that created the connection.
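On the worker side, something like this sketch would consume the transferred end directly (produceChunks() is a stand-in for whatever the app uses to produce Uint8Array chunks, not part of any API):

// send_worker.js — minimal sketch of consuming a transferred writable.
onmessage = async ({data: {writable}}) => {
  const writer = writable.getWriter();
  for await (const chunk of produceChunks()) { // assumed app-side source
    await writer.write(chunk);
  }
  await writer.close();
};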

A scenario where multiple workers might be needed is transmission of high resolution/high framerate video

But for our (now popular) "partial reliability" pattern, where an app creates anywhere from 1 to 30+ short-lived streams per second, one for each media segment or frame it wishes to send, doing this transfer dance per frame or segment seems sub-optimal (see the sketch below). A stream-of-streams wouldn't work any better either AFAIK, or at least wouldn't change the semantics to what I think we'd want (which would be closer to a stream of fully-buffered byte chunks, not the hooking up of 30+ little data tunnels per second).
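To make the transfer dance concrete, a hedged sketch (encodedFrames is an assumed app-side async iterable of Uint8Array frames):

for await (const frame of encodedFrames) {
  // One short-lived stream per frame ("partial reliability" pattern)...
  const writable = await wt.createUnidirectionalStream();
  // ...and one postMessage transfer per frame; at 30+ frames per second,
  // this per-frame setup is the overhead in question.
  sendWorker.postMessage({frame, writable}, [frame.buffer, writable]);
}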

It makes me question A) whether our stream-of-streams remains the best way to expose fully-received frames or media segments, or B) whether implementations are allowed to optimize by treating a stream-of-fully-received-and-closed-byte-streams as a stream-of-byte-chunks, for the sake of transfer.

aboba commented 2 years ago

Yes, "simple pipelines" (e.g. bidirectional or unidirectional streams) are tractable, but the message/stream case is more complicated. As you say, transferring 30+ streams/second seems suboptimal. I am looking at workarounds, such as transferring a "stream of streams" (for reading) or transferring constructed send or receive streams.

aboba commented 2 years ago

Related to Issue https://github.com/w3c/webtransport/issues/424

aboba commented 2 years ago

I tried constructing readable and writable streams on the main thread and then transferring them to the receiveWorker and sendWorker, respectively. The performance is bad (frame latencies of 1700 ms, compared with 100 ms with a single worker). Also, the tab crashes.

I'm not surprised that this doesn't work well, because the receive and send streams are inextricably tied to the WebTransport, which isn't being transferred. I also tried creating and transferring "stub" transform streams, so I didn't have to transfer the sendStream and receiveStream. If I start the pipeline before the transfer, the following error results: Failed to execute 'postMessage' on 'Worker': A TransformStream could not be cloned because it was locked. If I start the pipeline after the transfer, I get a different error.

The sample (which requires Chrome Canary 108+) is here.
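For reference, the stub-stream pattern looks roughly like this (a sketch; wtWritable is a stand-in name for the transport's outgoing writable stream):

// Use an identity TransformStream as a channel between realms.
const stub = new TransformStream();
// Transfer before anything locks the stream; once a pipe has started,
// postMessage fails with the "locked" error quoted above.
sendWorker.postMessage({writable: stub.writable}, [stub.writable]);
// Keep the readable end here and feed the transport with it.
stub.readable.pipeTo(wtWritable);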

jan-ivar commented 2 years ago

The performance is bad

Are you observing send or receive? Let's look at send first:

the receive and send streams are inextricably tied to the WebTransport, which isn't being transferred.

The write algorithm says: "6. Return promise and run the remaining steps in parallel. 7. Send bytes on stream.[[InternalStream]] and wait for the operation to complete. This sending MAY be interleaved with sending of previously queued streams and datagrams, as well as streams and datagrams yet to be queued to be sent over this transport."

IOW, network sending happens in parallel (on a different thread) already, which means it isn't inextricably tied to main.

The next question is what thread the write algorithm itself happens on. The create algorithm says: "Set up stream with writeAlgorithm set to writeAlgorithm, closeAlgorithm set to closeAlgorithm, abortAlgorithm set to abortAlgorithm."

IOW, it is called from the streams algorithms, which I think means that if the immediate stream is transferred, OR if pipeTo is used to provide it data from a stream that was transferred, then this shouldn't need to touch the main thread either.
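For example, in the worker, a sketch like this (the decode step is a placeholder) would keep the data path off the main thread, if implementations optimize as described:

// receive_worker.js — pipe from the transferred readable; the stream
// machinery now runs in this worker's realm.
onmessage = ({data: {readable}}) => {
  readable.pipeTo(new WritableStream({
    write(chunk) {
      // decode/process chunk here, off the main thread
    }
  }));
};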

And just because user agents can optimize this doesn't mean they are, so we should be careful inferring what can and cannot be done from existing implementations.

But regardless, even if this were optimized, it would matter most for a long-lived stream. With one such setup per frame, it's not clear to me that the per-frame setup cost is offset by delivering a single frame to another thread.

aboba commented 2 years ago

I am constructing the WebTransport on the main thread. Then I'm calling createSendStream on the main thread and transferring the returned writable stream to the send_worker. After that I'm calling createReceiveStream on the main thread and transferring the returned readable stream to the receive_worker.

Presently, I don't have metrics on the send and receive pipelines individually, only observed glass-to-glass latency and measured frame RTT. Both metrics were worsened by putting the send and receive pipelines into their own threads, with the WebTransport created on the main thread. Plus the tab became unstable.

As you say, the complex "send" stream creates lots of streams, and the complex "read" stream receives lots of streams, while the transport remains on the main thread, so there is a lot of transfer overhead involved. So perhaps the bad performance should have been expected. But I was a bit surprised by the instability. I probably need to understand better how the transfer is implemented. For example, there are places where the complex streams read and write to buffers. This seems related to the instability, because after I added a write, the tab went down immediately instead of after a few seconds. So perhaps I'm doing something that causes an illegal memory operation.

The next thing I am trying is keeping the complex "send" and "receive" streams on the main thread, but transferring other parts of the pipeline, such as serialization (which is piped to sendStream) and deserialization (which receiveStream pipes to); a sketch of the receive side is below. So far I'm not having much luck with that. If I start the pipes on the main thread and then attempt a transfer, I get an error that I can't transfer a locked stream. If I transfer first, I get another error.
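The receive side of what I'm attempting, as a sketch (wtReadable is my stand-in for the transport's incoming readable stream):

// Identity TransformStream as the channel; transfer the readable end
// to the deserializing worker before any pipe locks the stream.
const stub = new TransformStream();
receiveWorker.postMessage({readable: stub.readable}, [stub.readable]);
// This pipeTo locks stub.writable, so it must come after the transfer.
wtReadable.pipeTo(stub.writable);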

aboba commented 2 years ago

Summary from the October 11, 2022 meeting: wt.incomingUnidirectionalStreams (a "stream of streams") is not transferable. It is not surprising that transferring complex streams performs poorly, since there is a lot of overhead involved in creating a stream for each outgoing frame, or receiving an incoming unidirectional stream for each incoming frame. However, transferring simple unidirectional streams, or the readable or writable portions of a bidirectional stream, should be possible. See: https://github.com/w3c/webtransport/issues/424