roadrunner-server / roadrunner

High-performance PHP application server and process manager written in Go and powered with plugins
https://docs.roadrunner.dev
MIT License

[πŸ› BUG]: Cannot spawn worker while running in Google Cloud Run #1953

Closed Β· ivoabx closed 3 months ago

ivoabx commented 3 months ago

No duplicates πŸ₯².

What happened?

We introduced RoadRunner in our PHP project, which is currently based on Laravel 11. On local machines everything works as expected - much faster than PHP-FPM with Nginx!

However, when running the workload in Google Cloud Run, it stalls from time to time. Once deployed it works fine at first, but after some time (at irregular intervals) it stops processing requests - all of them seem to stall. This state is accompanied by the following error messages:

failed to allocate the worker {"internal_event_name": "EventWorkerError", "error": "worker_watcher_allocate_new: WorkerAllocate: failed to spawn a worker, possible reasons: https://docs.roadrunner.dev/error-codes/allocate-timeout"}

allocate retry attempt failed {"internal_event_name": "EventWorkerError", "error": "failed to spawn a worker, possible reasons: https://docs.roadrunner.dev/error-codes/allocate-timeout"}

Locally, I'm able to reproduce it by setting pool.allocate_timeout to less than 3s. That is not our configuration in Cloud Run, though: we are running the default of 60s. We are also running with ttl: 0 and idle_ttl: 10 so that memory is properly cleaned up. With idle_ttl: 0 the problem obviously does not occur, but memory keeps creeping up.
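For reference, a minimal sketch of the pool settings described above, assuming the HTTP plugin's pool section of a standard .rr.yaml (values taken from this report; not the attached config itself):

```yaml
# Sketch of the pool configuration described above.
http:
  pool:
    allocate_timeout: 60s  # the default; locally the stall reproduces below ~3s
    ttl: 0s                # workers are never restarted based on age
    idle_ttl: 10s          # idle workers are stopped after 10s to free memory
```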

Do you have any thoughts/guidelines on why this might happen and how to fix it?

Thank you!

Regards, Ivo

Version (rr --version)

2024.1.2

How to reproduce the issue?

Deploy to Cloud Run. It will happen at some point when serving requests.

rr.txt

Relevant log output

allocate retry attempt failed   {"internal_event_name": "EventWorkerError", "error": "failed to spawn a worker, possible reasons: https://docs.roadrunner.dev/error-codes/allocate-timeout"}

allocate    {"error": "failed to spawn a worker, possible reasons: https://docs.roadrunner.dev/error-codes/allocate-timeout"}
rustatian commented 3 months ago

Hey @ivoabx πŸ‘‹ Could you please share your configuration as well? As far as I understand, you're using Laravel Octane, am I right?

rustatian commented 3 months ago

Oh, sorry, I see, you attached it.

rustatian commented 3 months ago

Could you please double-check in the env, by executing ./rr workers, that RR has workers running?

ivoabx commented 3 months ago

@rustatian I don't think I can exec into a Cloud Run container, as it is a fully managed environment based on Knative/k8s. I think our workers are running OK, as everything was working fine prior to the stall. In addition, we have logs of workers being spawned successfully: worker is allocated {"pid": 1674, "max_execs": 0, "internal_event_name": "EventWorkerConstruct"}.

The backend is serving traffic until it stops.

rustatian commented 3 months ago

If I understand correctly, you hit a super rare condition; the following happened in your env:

  1. RR started the workers.
  2. Your workers restart after every 10 seconds of inactivity. This option isn't needed; it just spams your env by allocating (and destroying) lots of processes.
  3. RR tried to reallocate workers but failed due to limits on how many child processes can be allocated and destroyed. Here you may read more: link1 RLIMIT_NPROC, RLIMIT_NOFILE, or just PID limits. This is likely to happen in a cloud env where the limits are precisely set.
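As a rough way to inspect such limits from inside a container (a sketch; the cgroup path is an assumption that varies by environment and may not be readable in a managed runtime like Cloud Run):

```shell
# Process and file-descriptor limits that can cap how many workers RR may spawn.
ulimit -u   # max user processes (RLIMIT_NPROC)
ulimit -n   # max open file descriptors (RLIMIT_NOFILE)

# In a cgroup v2 environment, the PID cap (if any) may be exposed here:
cat /sys/fs/cgroup/pids.max 2>/dev/null || true
```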

How to resolve that?

  1. Do not use ttl or idle_ttl. If you're using Octane, use max_worker_memory to let RR determine when to stop and restart the process.
  2. Do not set max_worker_memory to a very low value. It should be the amount of memory your app generally consumes under the projected load, plus ~10% as a buffer.
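The suggestion above can be sketched as a supervisor block in .rr.yaml (the 128 MB threshold is purely illustrative; size it to your app's real footprint plus a buffer):

```yaml
# Sketch of memory-based supervision instead of ttl/idle_ttl churn.
http:
  pool:
    supervisor:
      max_worker_memory: 128  # MB; a worker exceeding this is gracefully restarted
```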
ivoabx commented 3 months ago

Thank you very much for the detailed information - I hadn't thought of the PID limits. We are currently running with ttl: 0 and idle_ttl: 0. However, the memory keeps creeping up. Is there a way to deal with that? Is it caused by the application itself or the runtime? The PHP runtime is new to me; I'm used to Go ;)

rustatian commented 3 months ago

You can safely use max_worker_memory, and RR will gracefully stop such a memory-heavy worker, so you won't consume a lot of memory.

> However, the memory keeps creeping up. Is there a way to deal with that? Is it caused by the application itself or the runtime? The PHP runtime is new to me; I'm used to Go ;)

RR allocates child processes, which are called workers here; your code runs in those processes. RR itself consumes minimal memory, but since you're using Laravel, it is normal to see high memory consumption. To control that, the max_worker_memory parameter exists, which caps memory consumption per process.

ivoabx commented 3 months ago

Sounds great! Thanks again! Closing this, as we have most probably found the solution.

rustatian commented 3 months ago

My pleasure πŸ‘ Please don't hesitate to reopen, or just comment here on how you resolved this case. I guess it would be helpful for others searching for a similar solution πŸ˜ƒ