mupperton opened this issue 2 months ago
I'm not familiar with the cluster module for Node, I'll take a look!
Likely this is a side effect of HTTP/2 being used?
This could be the case: a single TCP connection is established, and then invocations are multiplexed within it as HTTP/2 streams.
In the meantime, I'd like to propose some alternatives: if you are running on bare metal, consider deploying multiple Node.js processes with Nginx/Caddy as a reverse proxy in front of them (all on the same box, reverse proxying to localhost); a sketch follows below.
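A minimal sketch of that setup, assuming a hypothetical server.js entry point that reads its port from PORT (the Nginx/Caddy upstream configuration itself is not shown):

```ts
// Hedged sketch: launch several independent Node.js processes, each on its own
// port, so that Nginx/Caddy can round-robin connections across them.
// "server.js", the base port, and the instance count are illustrative.
import { spawn } from "node:child_process";

const basePort = 9080;
const instances = 4;

for (let i = 0; i < instances; i++) {
  // Each child is a full, independent service process listening on basePort + i.
  spawn("node", ["server.js"], {
    env: { ...process.env, PORT: String(basePort + i) },
    stdio: "inherit",
  });
}
// The reverse proxy would then be pointed at localhost:9080..9083,
// giving per-connection load balancing across the processes.
```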
One additional thought: what about using the Worker API (https://nodejs.org/api/worker_threads.html) to isolate the CPU-intensive parts from the business logic? A rough sketch is below.
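This is only an illustrative sketch of that idea (the heavy computation and its input are made up, and it assumes TypeScript compiled to CommonJS so __filename is available):

```ts
// Offload a CPU-heavy step to a worker thread so the main event loop stays
// free for handling requests. Names here are illustrative, not Restate APIs.
import { Worker, isMainThread, parentPort, workerData } from "node:worker_threads";

if (isMainThread) {
  // Called from the business logic / request handler.
  const runHeavyTask = (input: number): Promise<number> =>
    new Promise((resolve, reject) => {
      const worker = new Worker(__filename, { workerData: input });
      worker.once("message", resolve);
      worker.once("error", reject);
    });

  runHeavyTask(42).then((result) => console.log("result:", result));
} else {
  // Worker side: the CPU-intensive part runs off the main thread.
  const heavyComputation = (n: number): number => {
    let acc = 0;
    for (let i = 0; i < 1e7; i++) acc += (i * n) % 7;
    return acc;
  };
  parentPort?.postMessage(heavyComputation(workerData as number));
}
```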
Can confirm that this is indeed due to HTTP/2: the cluster module load balances per physical TCP connection, while HTTP/2 keeps a single TCP connection and multiplexes all streams over it.
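For illustration, a small client sketch of the effect (assuming a hypothetical h2c-capable cluster server on localhost:8080 that replies with its worker pid): every request issued over a single HTTP/2 session lands on the same worker.

```ts
// Hedged demo: all requests below share one HTTP/2 session (one TCP
// connection), so the cluster module routes them all to the same worker.
import http2 from "node:http2";

const session = http2.connect("http://localhost:8080");

for (let i = 0; i < 5; i++) {
  const req = session.request({ ":path": "/" });
  let body = "";
  req.setEncoding("utf8");
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => console.log(`request ${i} handled by:`, body.trim()));
  req.end();
}
// Closing the session once all responses have arrived is omitted for brevity.
```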
I've tried to look at ways to deal with this, and it seems they would require (fairly complicated) application-side load balancing. Let me know if the alternative approaches are enough.
Thanks @igalshilman - my use case is not really CPU-bound; it's more that I want a single Node service with many handlers to handle multiple requests concurrently with better parallelism, since Node is single-threaded, so the Worker API would probably perform worse.
We can try multiple pods and verify that our network load balancing is working.
Our pod scaling and load balancing are working as expected.
I'll leave it up to you whether it's worth keeping this issue open if there is a chance you may support this in the future; otherwise, feel free to close it.
I regularly make use of the Node.js cluster module in "normal" API services: JS is single-threaded, and we want to make use of all available parallelism on servers that have multiple cores/threads.
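For context, a minimal sketch of that usual setup (a hypothetical plain HTTP handler, not the Restate endpoint):

```ts
// Typical cluster setup: fork one worker per core; the primary process
// distributes incoming TCP connections across the workers.
import cluster from "node:cluster";
import http from "node:http";
import os from "node:os";

if (cluster.isPrimary) {
  for (let i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
} else {
  http
    .createServer((req, res) => {
      res.end(`handled by worker ${process.pid}\n`);
    })
    .listen(8080);
}
```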
This, however, does not appear to work as expected for a Restate-registered service.
My observation is that requests from the Restate server to the Node.js service appear to have a "sticky session" or use some kind of keep-alive: requests within a short time span always go to the same worker process, and (from basic testing) it takes roughly ~90 seconds with no requests before another worker process is used instead; that worker then of course becomes the sticky one until another ~90 seconds have passed.
This ultimately defeats the point of the cluster module: it is designed to improve concurrency, yet currently all concurrent requests are handled by the same worker.
Likely this is a side effect of HTTP/2 being used?
I haven't tried this with another runtime like Bun.