piscinajs / piscina

A fast, efficient Node.js Worker Thread Pool implementation
https://piscinajs.github.io/piscina/

Feature request: task result streams or AsyncIterators #108

Closed airhorns closed 2 years ago

airhorns commented 3 years ago

Hi folks -- thanks for putting together an awesome piece of software! I'd like to bring up a feature request and get an assessment of how hard it'd be to build into piscina: ReadableStream responses sent from worker threads to the calling thread. Instead of producing one Transferable result at the end of a task, it'd be great to produce a stream or an AsyncIterator of results that the calling thread could access before the whole task is completed.

The concrete use case is using piscina to implement a server side renderer for client side React applications. Piscina is perfect for this: React SSR is a synchronous, CPU intensive task that usually blocks the main node thread for surprisingly long, and it'd be great to offload to a worker thread to keep the other work on the main event loop unblocked.

For maximum performance and the ideal user experience, React has built a streaming server side renderer that renders out chunks of HTML that can be streamed to the browser as they are produced, instead of having to wait until the entire render is complete before sending the first byte. This means the browser can start rendering, start prefetching other resources, etc etc, and is no extra work for the people building the React apps to support. See https://reactjs.org/docs/react-dom-server.html#rendertonodestream for more information.

I think architecturally this'd be a real sweet spot for piscina. To me, this library isn't the same as something like bullmq or pg-boss, where tasks are persisted in some other system and worked by some other process. Piscina seems best at ephemeral, short-lived tasks where you need the result quickly and want to pay as little of a network or serialization cost as possible. It's closer to a function call than those other things, and it'd be awesome if it supported the same return values that other function calls do, like Streams or AsyncIterators.

mcollina commented 3 years ago

For maximum performance and the ideal user experience, React has built a streaming server side renderer that renders out chunks of HTML that can be streamed to the browser as they are produced, instead of having to wait until the entire render is complete before sending the first byte. This means the browser can start rendering, start prefetching other resources, etc etc, and is no extra work for the people building the React apps to support. See https://reactjs.org/docs/react-dom-server.html#rendertonodestream for more information.

As far as I know, React renderToNodeStream worsens the performance of React SSR at scale. It's an antipattern that's best avoided. renderToNodeStream uses more CPU than renderToString - trading latency for scalability is never a great choice in Node.js environments.

The problem with React SSR is not initial latency; it is the fact that you can easily engulf your event loop with more tasks than it can process in the given time allocation. As with any CPU-bound activity, you want to lower the amount of CPU being used, not increase it.


Supporting a stream or async iterator as a result would be great nevertheless.

airhorns commented 3 years ago

We can talk this out in a different channel if you like, as it isn't super relevant to Piscina, but I do think renderToNodeStream justifies this feature request nonetheless. Here's why:

As far as I know, React renderToNodeStream worsens the performance of React SSR at scale. It's an antipattern that's best avoided.

Can you find any documentation that supports this? I am most interested in the best experience for the user, and the best experience for the user would allow streaming a result out to them before it has completed rendering. I prioritize user experience over CPU efficiency -- that's what React SSR is explicitly about. Also -- why would the React team add it to their API and support it if it was such a bad idea?

Especially for big renders: if the render takes 100ms, you've already blown your human-perception-of-instant time budget before even dealing with network latency. When sending JS bundles built with a bundler, there might be link preloads or other split chunks that need to be hinted to the browser to get a good TTI, so you need to get the browser downloading the assets as soon as possible. You also don't always know exactly what assets to send ahead of time because of React.lazy.

renderToNodeStream uses more CPU than renderToString - trading latency for scalability is never a great choice in Node.js environments. The problem with React SSR is not initial latency; it is the fact that you can easily engulf your event loop with more tasks than it can process in the given time allocation. As with any CPU-bound activity, you want to lower the amount of CPU being used, not increase it.

Is the extra CPU just because there are a few more promises and stack pops? That seems like pretty small overhead. For big pages, minimizing TTFB makes a lot of sense, and for small pages would it really dominate? It would also let more than one render run at once on a worker thread, which I think wouldn't increase throughput but would increase resource utilization as you wouldn't be forced to have a gazillion threads and kernel level context switching.

Also, as React adds SSR support for Suspense and Concurrent mode, there are going to be stack pops and promise awaits anyways, and the implementation of renderToNodeStream may improve. React doesn't yet have suspense support during SSR, but all the existing solutions like react-ssr-prepass are async at least, and I think could be streaming if they wanted to be. Libraries like styled-components also include server side streaming renderers themselves so they can stream out CSS blocks as they are rendered and style the halfway sent documents too! Also, I know node's high watermark for stream chunk sending is 15kb which many pages will be under the size of, but for the pages that are over that size, which I am building many of, I would like my users to be happy!

jasnell commented 3 years ago

So this is something that I've had on my list for a bit and have played around with some. There is some complexity: to achieve the highest performance it would need a native addon and a custom transferable object implementation (currently only possible if done within Node.js core). It is possible to implement a streaming model on top of MessagePort and transferable typed arrays, though. See the example here: https://github.com/piscinajs/piscina/blob/current/examples/stream/stream.mjs

airhorns commented 3 years ago

Do you think it'd be worth building support for streams / async iterators over a message port into the library itself? It might be nice to have a battle-tested, shared implementation that bakes in the experts' performance knowledge for the not-super-fast way, while we figure out whether Node might add the bits needed to make it really fast.

mcollina commented 3 years ago

We can talk this out in a different channel if you like, as it isn't super relevant to Piscina, but I do think renderToNodeStream justifies this feature request nonetheless. Here's why:

As far as I know, React renderToNodeStream worsens the performance of React SSR at scale. It's an antipattern that's best avoided.

Can you find any documentation that supports this? I am most interested in the best experience for the user, and the best experience for the user would allow streaming a result out to them before it has completed rendering. I prioritize user experience over CPU efficiency -- that's what React SSR is explicitly about. Also -- why would the React team add it to their API and support it if it was such a bad idea?

I cannot speak for the React team. But as somebody working on and developing Node.js core, it does not make any sense to me. The benefit of React SSR comes from caching: I can render some HTML and place it in a CDN forever. That's the sweet spot for React SSR.

Especially for big renders: if the render takes 100ms, you've already blown your human-perception-of-instant time budget before even dealing with network latency. When sending JS bundles built with a bundler, there might be link preloads or other split chunks that need to be hinted to the browser to get a good TTI, so you need to get the browser downloading the assets as soon as possible. You also don't always know exactly what assets to send ahead of time because of React.lazy.

renderToNodeStream uses more CPU than renderToString - trading latency for scalability is never a great choice in Node.js environments.

The problem with React SSR is not initial latency; it is the fact that you can easily engulf your event loop with more tasks than it can process in the given time allocation. As with any CPU-bound activity, you want to lower the amount of CPU being used, not increase it.

Is the extra CPU just because there are a few more promises and stack pops? That seems like pretty small overhead. For big pages, minimizing TTFB makes a lot of sense, and for small pages would it really dominate? It would also let more than one render run at once on a worker thread, which I think wouldn't increase throughput but would increase resource utilization as you wouldn't be forced to have a gazillion threads and kernel level context switching.

Unfortunately the biggest hit comes from the garbage collector. A large React page allocates a lot of data because there are a lot of DOM nodes. Because the rendering is spread over time, most of that data is retained until the page has finished rendering: it will move to "old space", quickly ballooning memory usage. Data in old space is expensive to collect from a CPU point of view.

As most Node.js applications are deployed with 1 core available, the GC will "steal" CPU cycles from rendering other pages. The result is that very often it will take more server time to send the full page via a stream rather than as a string. You will need 1.5x-2x the number of servers to use streaming.

The best approach for getting low-latency to the users is to have the pages cached in a CDN.

airhorns commented 3 years ago

Because the rendering is spread over time, most of that data is retained until the page has finished rendering: it will move to "old space", quickly balooning memory usage. Data in old space is expensive to collect from a CPU point of view.

The GC can still move things to the old generation during synchronous calls, can't it? From what I know, minor collection cycles are triggered by growth in new space allocations, not by time. If streaming rendering vs synchronous rendering generate roughly the same amount of garbage plus or minus some promises and stream chunks, then they should promote roughly the same amount to the old space and cause roughly the same amount of slow major collections. What am I missing? Streaming rendering should also let the heap bloat less because chunks can be flushed and GC'd before the whole render is complete?

The best approach for getting low-latency to the users is to have the pages cached in a CDN.

Sure -- if you can do that though, why not pre-render using next.js or something like that? Even if you can't pre-render and are just using read through caching by being the origin for a CDN, the extra cost of server side rendering is amortized across all the other cache hit requests to the CDN. I am most worried about the situations where you can't pre-render and where the cache is ineffective because that generates the highest volume of server side rendering, like applications behind auth walls that include some user details in the response, or beta flag state, or any of the bits that make caching web applications hard. I feel like those situations would apply the most pressure to a server side renderer and should govern the design, and in those situations the user experience matters a lot to me.

I also think you are still prioritizing lower server costs over TTFB, which compromises the experience. If that's the case, someone who cares should benchmark and prove that it really does lower server costs before we trade away experience for it.

airhorns commented 3 years ago

Opened https://github.com/airhorns/fastify-renderer/issues/13 for us to continue this discussion and not bother the piscina folks!

jasnell commented 2 years ago

https://github.com/piscinajs/piscina/pull/170 adds an example that uses the new whatwg streams support in node.js to stream data into and out of a running worker. I believe that addresses the need here! :-)

JoshuaWise commented 1 year ago

I had this problem as well, so I created a new thread pool library specifically for this purpose. Its API is heavily inspired by Piscina. In my tests, I found it to perform an order of magnitude faster than using WHATWG ReadableStream. Here it is: https://github.com/WiseLibs/wise-workers

One caveat though: I didn't implement atomics for thread communication, because I didn't see any performance improvement in my use-cases.

Maybe my library can serve as inspiration for Piscina. It wasn't really that complicated to implement support for generators. It might seem complicated at first, because you'd think you'd need a lot of coordination for pausing and resuming the generator across threads, but that behavior isn't actually desirable: the point of using worker threads is to have both threads working at the same time (i.e., one thread producing data and the other consuming it, simultaneously). So in a multi-threaded situation, you never actually want to pause the generator. That's exactly how my library works; it just runs the generator eagerly to maximize parallelism.
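A single-threaded sketch of that eager-drain idea (the names here are invented for illustration and are not the wise-workers API): the producer runs to completion without waiting for the consumer to request each value, buffering results as it goes.

```javascript
// Eagerly drain a sync generator, then expose the buffer as an async
// iterable. In a real thread pool the drain loop runs on the worker thread
// while the caller's thread consumes concurrently; here both happen in one
// thread purely to show the control flow.
function eagerIterator(gen) {
  const buffered = [...gen]; // producer runs ahead, never pauses
  return {
    async *[Symbol.asyncIterator]() {
      for (const value of buffered) yield value;
    },
  };
}

function* produce() {
  for (let i = 1; i <= 3; i++) yield `chunk-${i}`;
}

(async () => {
  const out = [];
  for await (const chunk of eagerIterator(produce())) out.push(chunk);
  console.log(out.join(','));
})();
```

The trade-off versus a pull-based ReadableStream is exactly the one noted below: eager draining maximizes parallelism but gives up backpressure, so a slow consumer can see unbounded buffering.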

It's worth noting that ReadableStream might be more desirable for some use-cases due to its support for backpressure, but in my case it wasn't relevant.