Closed. Torrion closed this issue 2 years ago.
Hey @Torrion 👋🏻. The `max_jobs` option is used to get PHP-FPM-like behavior (restart a PHP process every n requests). You can safely skip this option if you don't have memory leaks in your PHP code.
Could you please provide me with a sample to reproduce your issue and the `.rr.yaml` configuration you used?
Or, if you do have memory leaks, you may use the `supervisor` option with a memory limit.
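For reference, a minimal `.rr.yaml` sketch combining both options; the address, the values, and the exact nesting are illustrative, not taken from this issue:

```yaml
temporal:
  address: "127.0.0.1:7233"
  activities:
    num_workers: 4      # size of the PHP activity worker pool
    max_jobs: 64        # PHP-FPM-like: restart a worker after 64 executed tasks
    supervisor:
      max_worker_memory: 128   # in MB: restart a worker that exceeds this limit
```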
Thanks @rustatian for your answer
Or, if you do have memory leaks, you may use the `supervisor` option with a memory limit.
There were problems with `supervisor` on RoadRunner v2.4-2.6 in my Symfony setup, but you're right, now it's quite useful.
However, a more recent test with `supervisor` and no `max_jobs` option produces errors like:
PanicError: sync_worker_exec: SoftJobError:
sync_worker_exec_payload: LogicException: Got the response to undefined request 9030 in /app/vendor/temporal/sdk/src/Internal/Transport/Client.php:60
Stack trace:
#0 /app/vendor/temporal/sdk/src/WorkerFactory.php(389): Temporal\Internal\Transport\Client->dispatch(Object(Temporal\Worker\Transport\Command\SuccessResponse))
#1 /app/vendor/temporal/sdk/src/WorkerFactory.php(261): Temporal\WorkerFactory->dispatch('\n'\x08\xC6F*"\n \n\x16\n\x08en...', Array)
#2 /app/vendor/spiral/temporal-bridge/src/Dispatcher.php(61): Temporal\WorkerFactory->run()
#3 /app/vendor/spiral/framework/src/Core/src/ContainerScope.php(46): Spiral\TemporalBridge\Dispatcher->serve()
#4 /app/vendor/spiral/framework/src/Core/src/Container.php(282): Spiral\Core\ContainerScope::runScope(Object(Spiral\Core\Container), Array)
#5 /app/vendor/spiral/framework/src/Boot/src/AbstractKernel.php(212): Spiral\Core\Container->runScope(Array, Array)
#6 /app/app.php(39): Spiral\Boot\AbstractKernel->serve()
#7 {main}
process event for default [panic]:
github.com/temporalio/roadrunner-temporal/aggregatedpool.(*Workflow).OnWorkflowTaskStarted(0xc000081590, 0xc0009f97c0?)
github.com/temporalio/roadrunner-temporal@v1.4.4/aggregatedpool/workflow.go:153 +0x2e8
go.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).ProcessEvent(0xc0006579b0, 0xc0009f9880, 0x60?, 0x1)
go.temporal.io/sdk@v1.14.0/internal/internal_event_handlers.go:815 +0x203
go.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0xc0005f2bd0, 0xc000a8e8a0)
go.temporal.io/sdk@v1.14.0/internal/internal_task_handlers.go:878 +0xca8
go.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0xc000806a50, 0xc000a8e8a0, 0xc000a8ec30)
go.temporal.io/sdk@v1.14.0/internal/internal_task_handlers.go:727 +0x485
go.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0xc000772340, 0xc000a8e8a0)
go.temporal.io/sdk@v1.14.0/internal/internal_task_pollers.go:284 +0x2cd
go.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0xc000772340, {0x15b7080?, 0xc000a8e8a0?})
go.temporal.io/sdk@v1.14.0/internal/internal_task_pollers.go:255 +0x6c
go.temporal.io/sdk/internal.(*baseWorker).processTask(0xc0006e0140, {0x15b6c40?, 0xc0009fb040})
go.temporal.io/sdk@v1.14.0/internal/internal_worker_base.go:398 +0x167
created by go.temporal.io/sdk/internal.(*baseWorker).runTaskDispatcher
go.temporal.io/sdk@v1.14.0/internal/internal_worker_base.go:302 +0xb5
or even simpler:
Error: activity_pool_execute_activity: Workers watcher stopped:
static_pool_exec_with_context:
static_pool_exec_with_context:
worker_watcher_get_free_worker
Could you please provide me with a sample to reproduce your issue and the .rr.yaml configuration you used?
Ok, thanks for the details 👍🏻. You don't have enough workers to serve all your activities; you should either limit the max number of concurrent activities or increase the number of workers.
You may check the PHP worker options (especially `MaxConcurrentActivityExecutionSize`).
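For illustration, a hedged sketch of setting that option through the PHP SDK's `WorkerOptions`; the task queue name and the limit of 10 are placeholder values, not recommendations from this thread:

```php
<?php

use Temporal\Worker\WorkerOptions;
use Temporal\WorkerFactory;

$factory = WorkerFactory::create();

// Cap concurrent activity executions so they cannot outnumber
// the PHP workers available in the RR activity pool.
$worker = $factory->newWorker(
    'default',
    WorkerOptions::new()->withMaxConcurrentActivityExecutionSize(10)
);
```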
You should increase this option: https://github.com/Torrion/temporal-worker-pool-leak-test/blob/main/.rr.yaml#L13 If your tasks are CPU-intensive, you should use at least 1.5 * CPU_CORES workers. If your tasks are IO-bound, you may safely use 10, 50, or even 100 workers.
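As a sketch of that sizing rule (assuming, for example, an 8-core machine; the numbers are illustrations only):

```yaml
temporal:
  activities:
    num_workers: 12    # CPU-bound: ~1.5 * 8 cores
    # num_workers: 50  # IO-bound tasks tolerate far more workers
```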
@Torrion Also, don't use the `-d` option for RR. It is not a debug flag as it was in RR v1; it starts a pprof server that will hugely slow RR down.
I've faced the same problem after upgrading RR to 2.10.3 and the PHP SDK to 1.3.2. Temporal can't find workers for an activity, reporting the same error:
Error: activity_pool_execute_activity: Workers watcher stopped:
static_pool_exec:
static_pool_exec:
worker_watcher_get_free_worker
There were no such problems before the upgrade with the same load.
Our RR activities config:
temporal:
  activities:
    num_workers: 4
    max_jobs: 64
    supervisor:
      ttl: 30m
      max_worker_memory: 256
Ok, guys, thanks for the reports. I'll investigate the regression.
You should increase this option: https://github.com/Torrion/temporal-worker-pool-leak-test/blob/main/.rr.yaml#L13 If your tasks are CPU-intensive, you should use at least 1.5 * CPU_CORES workers. If your tasks are IO-bound, you may safely use 10, 50, or even 100 workers.
Looks like increasing the worker count and reducing the activity count per worker doesn't help. As I understand it, my code produces only one activity at a time, so two workers must be enough, since this is a test stand and only one workflow runs.
Also, don't use the -d option for RR. It is not a debug flag as it was in RR v1; it starts a pprof server that will hugely slow RR down.
Oh. I see, thanks!
Ok, guys, thanks for the reports. I'll investigate the regression.
Thank you!
A couple of error messages from the worker's log; I hope they help:
{"level":"debug","ts":1654603823.9559908,"logger":"temporal","msg":"outgoing message","id":178,"data":"","context":""}
{"level":"error","ts":1654603823.9560623,"logger":"temporal","msg":"Activity error.","Namespace":"schedule_manager_crud","TaskQueue":"default","WorkerID":"default:0503ba48-cc2a-4a44-84a7-9a968b92ad0b","WorkflowID":"29cb8ca5-a797-4d1e-b009-47b7079574ce","RunID":"07fce548-cae3-4de3-9d29-1aae6dcbcefe","ActivityType":"EducationServiceWasChanged.findClassTypeProperties","Attempt":5,"Error":"activity_pool_execute_activity: Workers watcher stopped:\n\tstatic_pool_exec:\n\tstatic_pool_exec:\n\tworker_watcher_get_free_worker"}
{"level":"debug","ts":1654603823.9712448,"logger":"temporal","msg":"workflow task started","time":1}
{"level":"debug","ts":1654603823.9713047,"logger":"temporal","msg":"outgoing message","id":9001,"data":"","context":""}
{"level":"debug","ts":1654603823.9744446,"logger":"temporal","msg":"received message","command":{},"id":9002,"data":"\n\ufffd\u0005\u0008\ufffdF\u0012\u0010CompleteWorkflow\u001a\u0002{}\"\ufffd\u0005\n\u000eactivity error\u0012\u0005GoSDK\u001a\ufffd\u0003/opt/app/workflows_crud/src/ScheduleManagerWorkflowsCrudBundle/Feature/EducationServiceWasChanged/EducationServiceWasChangedWorkflow.php:39\n/opt/app/vendor/temporal/sdk/src/Worker/Worker.php:94\n/opt/app/vendor/temporal/sdk/src/WorkerFactory.php:413\n/opt/app/vendor/temporal/sdk/src/WorkerFactory.php:387\n/opt/app/vendor/temporal/sdk/src/WorkerFactory.php:261\n/opt/app/bin/worker_crud:44\"\ufffd\u0001\n~activity_pool_execute_activity: Workers watcher stopped:\n\tstatic_pool_exec:\n\tstatic_pool_exec:\n\tworker_watcher_get_free_worker\u0012\u0005GoSDK*\u0007\n\u0005ErrorZm\u0008\u0005\u0010\u0006\u001a,default:0503ba48-cc2a-4a44-84a7-9a968b92ad0b\"4\n2EducationServiceWasChanged.findClassTypeProperties*\u000150\u0004*\u0000"}
Thanks, @mzavatsky. Nice, I guess I know where the issue is 👍🏻
@Torrion @mzavatsky Would you guys mind trying this version: https://github.com/roadrunner-server/roadrunner/releases/tag/v2.10.4-rc.1?
@rustatian I've tested it on the test stand from the previous message, and now it suddenly gets stuck after the pool restart.
Here is an example:
app_1 | 2022-06-09T17:20:39.971Z DEBUG temporal worker stopped, restarting pool and temporal workers {"message": "process exited"}
app_1 | 2022-06-09T17:20:39.971Z INFO temporal reset signal received, resetting activity and workflow worker pools
app_1 | 2022-06-09T17:20:39.971Z WARN temporal Failed to poll for task. {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:a569809d-7d86-463a-9284-122a75ef8fba", "WorkerType": "WorkflowWorker", "Error": "worker stopping"}
app_1 | 2022-06-09T17:20:39.971Z DEBUG server idle_ttl {"reason": "idle ttl is reached", "pid": 292, "internal_event_name": "EventTTL"}
app_1 | 2022-06-09T17:20:39.972Z WARN temporal Failed to poll for task. {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:a569809d-7d86-463a-9284-122a75ef8fba", "WorkerType": "ActivityWorker", "Error": "worker stopping"}
app_1 | 2022-06-09T17:20:39.972Z INFO temporal Stopped Worker {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:a569809d-7d86-463a-9284-122a75ef8fba"}
app_1 | 2022-06-09T17:20:39.985Z DEBUG server worker destroyed {"pid": 282, "internal_event_name": "EventWorkerDestruct"}
app_1 | 2022-06-09T17:20:40.113Z DEBUG server worker is allocated {"pid": 330, "internal_event_name": "EventWorkerConstruct"}
app_1 | 2022-06-09T17:20:40.113Z INFO temporal workflow pool restarted
app_1 | 2022-06-09T17:20:40.113Z DEBUG server worker is allocated {"pid": 329, "internal_event_name": "EventWorkerConstruct"}
app_1 | 2022-06-09T17:20:40.132Z DEBUG server worker destroyed {"pid": 296, "internal_event_name": "EventWorkerDestruct"}
app_1 | 2022-06-09T17:20:40.135Z DEBUG server worker destroyed {"pid": 289, "internal_event_name": "EventWorkerDestruct"}
app_1 | 2022-06-09T17:20:40.135Z DEBUG server worker destroyed {"pid": 307, "internal_event_name": "EventWorkerDestruct"}
app_1 | 2022-06-09T17:20:40.135Z DEBUG server worker destroyed {"pid": 294, "internal_event_name": "EventWorkerDestruct"}
app_1 | 2022-06-09T17:20:40.135Z DEBUG server worker destroyed {"pid": 297, "internal_event_name": "EventWorkerDestruct"}
app_1 | 2022-06-09T17:20:40.135Z DEBUG server worker destroyed {"pid": 290, "internal_event_name": "EventWorkerDestruct"}
app_1 | 2022-06-09T17:20:40.136Z DEBUG server worker destroyed {"pid": 293, "internal_event_name": "EventWorkerDestruct"}
app_1 | 2022-06-09T17:20:40.136Z DEBUG server worker destroyed {"pid": 295, "internal_event_name": "EventWorkerDestruct"}
app_1 | 2022-06-09T17:20:40.136Z DEBUG server worker destroyed {"pid": 292, "internal_event_name": "EventWorkerDestruct"}
app_1 | 2022-06-09T17:20:40.137Z DEBUG server worker destroyed {"pid": 329, "internal_event_name": "EventWorkerDestruct"}
app_1 | 2022-06-09T17:20:40.226Z DEBUG server worker is allocated {"pid": 338, "internal_event_name": "EventWorkerConstruct"}
app_1 | 2022-06-09T17:20:40.270Z DEBUG server worker is allocated {"pid": 337, "internal_event_name": "EventWorkerConstruct"}
app_1 | 2022-06-09T17:20:40.274Z DEBUG server worker is allocated {"pid": 339, "internal_event_name": "EventWorkerConstruct"}
app_1 | 2022-06-09T17:20:40.281Z DEBUG server worker is allocated {"pid": 341, "internal_event_name": "EventWorkerConstruct"}
app_1 | 2022-06-09T17:20:40.285Z DEBUG server worker is allocated {"pid": 340, "internal_event_name": "EventWorkerConstruct"}
app_1 | 2022-06-09T17:20:40.285Z DEBUG server worker is allocated {"pid": 342, "internal_event_name": "EventWorkerConstruct"}
app_1 | 2022-06-09T17:20:40.305Z DEBUG server worker is allocated {"pid": 364, "internal_event_name": "EventWorkerConstruct"}
app_1 | 2022-06-09T17:20:40.306Z DEBUG server worker is allocated {"pid": 356, "internal_event_name": "EventWorkerConstruct"}
app_1 | 2022-06-09T17:20:40.337Z DEBUG server worker is allocated {"pid": 349, "internal_event_name": "EventWorkerConstruct"}
app_1 | 2022-06-09T17:20:40.339Z DEBUG server worker is allocated {"pid": 360, "internal_event_name": "EventWorkerConstruct"}
app_1 | 2022-06-09T17:20:40.339Z INFO temporal activity pool restarted
app_1 | 2022-06-09T17:20:40.339Z DEBUG temporal outgoing message {"id": 0, "data": "", "context": ""}
app_1 | 2022-06-09T17:20:40.344Z DEBUG temporal received message {"command": null, "id": 0, "data": "\n\ufffd\u0004*\ufffd\u0004\n\ufffd\u0004\n\u0016\n\u0008encoding\u0012\njson/plain\u0012\ufffd\u0004{\"TaskQueue\":\"default\",\"Options\":{\"MaxConcurrentActivityExecutionSize\":0,\"WorkerActivitiesPerSecond\":0.0,\"MaxConcurrentLocalActivityExecutionSize\":0,\"WorkerLocalActivitiesPerSecond\":0.0,\"TaskQueueActivitiesPerSecond\":0.0,\"MaxConcurrentActivityTaskPollers\":0,\"MaxConcurrentWorkflowTaskExecutionSize\":0,\"MaxConcurrentWorkflowTaskPollers\":0,\"StickyScheduleToStartTimeout\":null,\"WorkerStopTimeout\":null,\"EnableSessionWorker\":false,\"SessionResourceID\":null,\"MaxConcurrentSessionExecutionSize\":1000},\"Workflows\":[{\"Name\":\"many_action_test\",\"Queries\":[],\"Signals\":[]}],\"Activities\":[{\"Name\":\"exec\"}]}"}
app_1 | 2022-06-09T17:20:40.344Z DEBUG temporal worker info {"taskqueue": "default", "optionsError": "json: unsupported type: func(error)"}
app_1 | 2022-06-09T17:20:40.345Z DEBUG temporal workflow registered {"taskqueue": "default", "workflow name": "many_action_test"}
app_1 | 2022-06-09T17:20:40.345Z DEBUG temporal activity registered {"taskqueue": "default", "workflow name": "exec"}
app_1 | 2022-06-09T17:20:40.345Z DEBUG temporal workers initialized {"num_workers": 1}
app_1 | 2022-06-09T17:20:40.346Z INFO temporal Started Worker {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:94d08fa9-7a6f-47d5-ae25-6f8d15dc31f7"}
app_1 | 2022-06-09T17:25:45.003Z DEBUG temporal workflow execute {"runID": "7073a076-b253-46c7-b1f2-5950cafd8f07", "workflow info": {"WorkflowExecution":{"ID":"61ccf3a7-7834-4577-b877-3cb909728ea6","RunID":"7073a076-b253-46c7-b1f2-5950cafd8f07"},"WorkflowType":{"Name":"many_action_test"},"TaskQueueName":"default","WorkflowExecutionTimeout":0,"WorkflowRunTimeout":0,"WorkflowTaskTimeout":10000000000,"Namespace":"default","Attempt":1,"WorkflowStartTime":"2022-06-09T17:19:29.586527157Z","CronSchedule":"","ContinuedExecutionRunID":"","ParentWorkflowNamespace":"","ParentWorkflowExecution":null,"Memo":null,"SearchAttributes":null,"RetryPolicy":{"InitialInterval":1000000000,"BackoffCoefficient":2,"MaximumInterval":100000000000,"MaximumAttempts":2,"NonRetryableErrorTypes":null},"BinaryChecksum":"c4f512bed6295ec6c12bdbcf5957f8ce"}}
app_1 | 2022-06-09T17:25:45.003Z DEBUG temporal sequenceID {"before": 6}
app_1 | 2022-06-09T17:25:45.003Z DEBUG temporal sequenceID {"after": 7}
app_1 | 2022-06-09T17:25:45.003Z DEBUG temporal workflow task started {"time": "1s"}
At 17:20:40, Temporal scheduled an activity task with expiration in 5 minutes. The activity workers just waited for tasks until 17:25:45, when the expiration event came to the workflow worker.
@Torrion Is that from your sample repo?
Is that from your sample repo?
Yes
Hm… Could you please recreate your Docker images? I'm using your sample repo, and everything finished ok.
@Torrion I slightly reduced the loop with the activities from 1 million to 100 to have a chance to finish that workflow (a random exception is hardcoded in the activity).
69, 2022-06-10T09:47:59Z, ActivityTaskScheduled
70, 2022-06-10T09:48:01Z, ActivityTaskStarted
71, 2022-06-10T09:48:02Z, ActivityTaskCompleted
72, 2022-06-10T09:48:02Z, WorkflowTaskScheduled
73, 2022-06-10T09:48:07Z, WorkflowTaskTimedOut
74, 2022-06-10T09:48:07Z, WorkflowTaskScheduled
75, 2022-06-10T09:48:07Z, WorkflowTaskStarted
76, 2022-06-10T09:48:07Z, WorkflowTaskCompleted
77, 2022-06-10T09:48:07Z, ActivityTaskScheduled
78, 2022-06-10T09:48:07Z, ActivityTaskStarted
79, 2022-06-10T09:48:07Z, ActivityTaskCompleted
80, 2022-06-10T09:48:07Z, WorkflowTaskScheduled
81, 2022-06-10T09:48:12Z, WorkflowTaskTimedOut
82, 2022-06-10T09:48:12Z, WorkflowTaskScheduled
83, 2022-06-10T09:48:12Z, WorkflowTaskStarted
84, 2022-06-10T09:48:12Z, WorkflowTaskCompleted
85, 2022-06-10T09:48:12Z, WorkflowExecutionCompleted
Result:
Run Time: 70 seconds
Status: COMPLETED
Output: [nil]
Try the following with the `tctl` tool to start a workflow: `tctl workflow run --taskqueue=default --workflow_type=many_action_test`
I've tested it again. At first I thought it was my mistake and that I had forgotten to recreate the container yesterday, because everything worked fine at first glance, especially with `supervisor`. With `max_jobs` there were some workflow task start timeout errors within 5s of tasks being scheduled (even though the timeout is 10s), but this is not actually a problem. But then I left my workflow running for a long time, and it failed 6 minutes after start.
app_1 | 2022-06-10T10:03:20.255Z DEBUG temporal workflow task started {"time": "1s"}
app_1 | 2022-06-10T10:03:20.255Z DEBUG temporal outgoing message {"id": 9024, "data": "", "context": ""}
app_1 | 2022-06-10T10:03:20.258Z DEBUG temporal received message {"command": {"name":"exec","options":{"ActivityID":"","TaskQueueName":"default","ScheduleToCloseTimeout":300000000000,"ScheduleToStartTimeout":0,"StartToCloseTimeout":0,"HeartbeatTimeout":0,"WaitForCancellation":false,"OriginalTaskQueueName":"","RetryPolicy":{"backoff_coefficient":2,"maximum_attempts":5}}}, "id": 9025, "data": "\n\ufffd\u0002\u0008\ufffdF\u0012\u000fExecuteActivity\u001a\ufffd\u0002{\"name\":\"exec\",\"options\":{\"TaskQueueName\":\"default\",\"ScheduleToCloseTimeout\":300000000000,\"ScheduleToStartTimeout\":0,\"StartToCloseTimeout\":0,\"HeartbeatTimeout\":0,\"WaitForCancellation\":false,\"ActivityID\":\"\",\"RetryPolicy\":{\"initial_interval\":null,\"backoff_coefficient\":2,\"maximum_interval\":null,\"maximum_attempts\":5,\"non_retryable_error_types\":[]}}}*\u0000"}
app_1 | 2022-06-10T10:03:20.258Z DEBUG temporal ExecuteActivity {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:ec69b6c3-a4c4-473c-b6e7-336cefbf6573", "WorkflowType": "many_action_test", "WorkflowID": "9139895b-c7ea-48fa-9cdb-c998ea2593a3", "RunID": "4e5ac690-246f-4ef1-ba14-a2a47094be84", "Attempt": 1, "ActivityID": "151", "ActivityType": "exec"}
app_1 | 2022-06-10T10:03:20.534Z DEBUG temporal worker stopped, restarting pool and temporal workers {"message": "process exited"}
app_1 | 2022-06-10T10:03:20.534Z INFO temporal reset signal received, resetting activity and workflow worker pools
app_1 | 2022-06-10T10:03:20.535Z WARN temporal Failed to poll for task. {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:ec69b6c3-a4c4-473c-b6e7-336cefbf6573", "WorkerType": "WorkflowWorker", "Error": "worker stopping"}
app_1 | 2022-06-10T10:03:20.535Z WARN temporal Failed to poll for task. {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:ec69b6c3-a4c4-473c-b6e7-336cefbf6573", "WorkerType": "ActivityWorker", "Error": "worker stopping"}
app_1 | 2022-06-10T10:03:20.535Z INFO temporal Stopped Worker {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:ec69b6c3-a4c4-473c-b6e7-336cefbf6573"}
app_1 | 2022-06-10T10:03:20.550Z DEBUG server worker destroyed {"pid": 182, "internal_event_name": "EventWorkerDestruct"}
app_1 | 2022-06-10T10:03:20.617Z DEBUG server worker is allocated {"pid": 198, "internal_event_name": "EventWorkerConstruct"}
app_1 | 2022-06-10T10:03:20.620Z DEBUG server worker is allocated {"pid": 199, "internal_event_name": "EventWorkerConstruct"}
app_1 | 2022-06-10T10:03:20.620Z INFO temporal workflow pool restarted
app_1 | 2022-06-10T10:03:20.632Z DEBUG server worker destroyed {"pid": 198, "internal_event_name": "EventWorkerDestruct"}
app_1 | 2022-06-10T10:03:20.632Z DEBUG server worker destroyed {"pid": 188, "internal_event_name": "EventWorkerDestruct"}
app_1 | 2022-06-10T10:03:20.632Z DEBUG server worker destroyed {"pid": 186, "internal_event_name": "EventWorkerDestruct"}
app_1 | 2022-06-10T10:03:20.685Z DEBUG server worker is allocated {"pid": 207, "internal_event_name": "EventWorkerConstruct"}
app_1 | 2022-06-10T10:03:20.686Z DEBUG server worker is allocated {"pid": 206, "internal_event_name": "EventWorkerConstruct"}
app_1 | 2022-06-10T10:03:20.686Z DEBUG server worker is allocated {"pid": 208, "internal_event_name": "EventWorkerConstruct"}
app_1 | 2022-06-10T10:03:20.686Z INFO temporal activity pool restarted
app_1 | 2022-06-10T10:03:20.686Z DEBUG temporal outgoing message {"id": 0, "data": "", "context": ""}
app_1 | 2022-06-10T10:03:20.691Z DEBUG temporal received message {"command": null, "id": 0, "data": "\n\ufffd\u0004*\ufffd\u0004\n\ufffd\u0004\n\u0016\n\u0008encoding\u0012\njson/plain\u0012\ufffd\u0004{\"TaskQueue\":\"default\",\"Options\":{\"MaxConcurrentActivityExecutionSize\":0,\"WorkerActivitiesPerSecond\":0.0,\"MaxConcurrentLocalActivityExecutionSize\":0,\"WorkerLocalActivitiesPerSecond\":0.0,\"TaskQueueActivitiesPerSecond\":0.0,\"MaxConcurrentActivityTaskPollers\":0,\"MaxConcurrentWorkflowTaskExecutionSize\":0,\"MaxConcurrentWorkflowTaskPollers\":0,\"StickyScheduleToStartTimeout\":null,\"WorkerStopTimeout\":null,\"EnableSessionWorker\":false,\"SessionResourceID\":null,\"MaxConcurrentSessionExecutionSize\":1000},\"Workflows\":[{\"Name\":\"many_action_test\",\"Queries\":[],\"Signals\":[]}],\"Activities\":[{\"Name\":\"exec\"}]}"}
app_1 | 2022-06-10T10:03:20.691Z DEBUG temporal worker info {"taskqueue": "default", "optionsError": "json: unsupported type: func(error)"}
app_1 | 2022-06-10T10:03:20.691Z DEBUG temporal workflow registered {"taskqueue": "default", "workflow name": "many_action_test"}
app_1 | 2022-06-10T10:03:20.691Z DEBUG temporal activity registered {"taskqueue": "default", "workflow name": "exec"}
app_1 | 2022-06-10T10:03:20.691Z DEBUG temporal workers initialized {"num_workers": 1}
app_1 | 2022-06-10T10:03:20.694Z INFO temporal Started Worker {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:8c7f3f3b-9716-4e3a-8150-5c62884c7d85"}
app_1 | 2022-06-10T10:08:25.730Z DEBUG temporal workflow execute {"runID": "4e5ac690-246f-4ef1-ba14-a2a47094be84", "workflow info": {"WorkflowExecution":{"ID":"9139895b-c7ea-48fa-9cdb-c998ea2593a3","RunID":"4e5ac690-246f-4ef1-ba14-a2a47094be84"},"WorkflowType":{"Name":"many_action_test"},"TaskQueueName":"default","WorkflowExecutionTimeout":0,"WorkflowRunTimeout":0,"WorkflowTaskTimeout":10000000000,"Namespace":"default","Attempt":1,"WorkflowStartTime":"2022-06-10T10:02:04.471351284Z","CronSchedule":"","ContinuedExecutionRunID":"","ParentWorkflowNamespace":"","ParentWorkflowExecution":null,"Memo":null,"SearchAttributes":null,"RetryPolicy":{"InitialInterval":1000000000,"BackoffCoefficient":2,"MaximumInterval":100000000000,"MaximumAttempts":2,"NonRetryableErrorTypes":null},"BinaryChecksum":"c4f512bed6295ec6c12bdbcf5957f8ce"}}
app_1 | 2022-06-10T10:08:25.730Z DEBUG temporal sequenceID {"before": 4}
app_1 | 2022-06-10T10:08:25.730Z DEBUG temporal sequenceID {"after": 5}
app_1 | 2022-06-10T10:08:25.730Z DEBUG temporal workflow task started {"time": "1s"}
app_1 | 2022-06-10T10:08:25.731Z DEBUG temporal outgoing message {"id": 5, "data": "", "context": ""}
app_1 | 2022-06-10T10:08:25.763Z DEBUG temporal received message {"command": {"name":"exec","options":{"ActivityID":"","TaskQueueName":"default","ScheduleToCloseTimeout":300000000000,"ScheduleToStartTimeout":0,"StartToCloseTimeout":0,"HeartbeatTimeout":0,"WaitForCancellation":false,"OriginalTaskQueueName":"","RetryPolicy":{"backoff_coefficient":2,"maximum_attempts":5}}}, "id": 9001, "data": "\n\ufffd\u0002\u0008\ufffdF\u0012\u000fExecuteActivity\u001a\ufffd\u0002{\"name\":\"exec\",\"options\":{\"TaskQueueName\":\"default\",\"ScheduleToCloseTimeout\":300000000000,\"ScheduleToStartTimeout\":0,\"StartToCloseTimeout\":0,\"HeartbeatTimeout\":0,\"WaitForCancellation\":false,\"ActivityID\":\"\",\"RetryPolicy\":{\"initial_interval\":null,\"backoff_coefficient\":2,\"maximum_interval\":null,\"maximum_attempts\":5,\"non_retryable_error_types\":[]}}}*\u0000\n\u001f\u0008\u0005*\u001b\n\u0019\n\u0017\n\u0008encoding\u0012\u000bbinary/null"}
app_1 | 2022-06-10T10:08:25.763Z DEBUG temporal received message {"command": null, "id": 5, "data": "\n\ufffd\u0002\u0008\ufffdF\u0012\u000fExecuteActivity\u001a\ufffd\u0002{\"name\":\"exec\",\"options\":{\"TaskQueueName\":\"default\",\"ScheduleToCloseTimeout\":300000000000,\"ScheduleToStartTimeout\":0,\"StartToCloseTimeout\":0,\"HeartbeatTimeout\":0,\"WaitForCancellation\":false,\"ActivityID\":\"\",\"RetryPolicy\":{\"initial_interval\":null,\"backoff_coefficient\":2,\"maximum_interval\":null,\"maximum_attempts\":5,\"non_retryable_error_types\":[]}}}*\u0000\n\u001f\u0008\u0005*\u001b\n\u0019\n\u0017\n\u0008encoding\u0012\u000bbinary/null"}
app_1 | 2022-06-10T10:08:25.763Z DEBUG temporal workflow task started {"time": "1s"}
At 10:08:25, Temporal woke the workflow task worker up with a timeout error.
@Torrion Could you send me a wf history?
workflow task worker with timeout error.
You have a test exception in your code and a limited number of retries. A timeout is expected because Temporal can't finish the WF.
Here you are https://gist.github.com/Torrion/d1bab6a7ae41784550635551263154ed
You have a test exception in your code and a limited number of retries. A timeout is expected because Temporal can't finish the WF.
Look at this wf log https://gist.github.com/Torrion/c6de20cd9d53ff93aeb725effa6844a0 Exceptions occurred in events 42, 66, 84, etc., but there was no timeout. Event 116 timed out for no reason, as far as I can see.
@Torrion It's ok to get an exception from PHP; Temporal will then retry the execution if the exception is retryable. But during your execution you got too many of these exceptions (4):
RetryOptions::new()->withMaximumAttempts(5)
And timeout:
->withScheduleToCloseTimeout('5 minute')
You didn't attach a timestamp, but I guess, as you wrote before, that it was 5 minutes from event 1 to 116.
Also, the initial issue contains a report of the `Unable to kill workflow because workflow process` error. Do you see these errors in the logs?
And `Got the response to undefined request 9030` as well.
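For context, those two settings usually combine in the PHP SDK roughly as follows; this is a sketch, and the activity stub wiring is an assumption rather than code from the sample repo:

```php
<?php

use Temporal\Activity\ActivityOptions;
use Temporal\Common\RetryOptions;
use Temporal\Workflow;

// Inside a workflow method: the activity fails for good after 5 failed
// attempts, or once 5 minutes have passed since it was scheduled.
$activity = Workflow::newActivityStub(
    TestActivity::class,
    ActivityOptions::new()
        ->withScheduleToCloseTimeout('5 minutes')
        ->withRetryOptions(RetryOptions::new()->withMaximumAttempts(5))
);
```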
You didn't attach a timestamp
I updated both logs with timestamps.
Also, the initial issue contains a report of the Unable to kill workflow because workflow process error. Do you see these errors in the logs?
No
And Got the response to undefined request 9030 as well.
And also, no
I guess, as you wrote before, that it was 5 minutes from event 1 to 116
I think different activity runs produced these errors, but it needs more investigation, and you're right, it looks like another problem. Also, this behavior occurred only when I used `max_jobs` instead of `supervisor`, which is more applicable for usage with Temporal. I'll open a new issue if I find something.
However, the 5-minute sleep also happened in tests with `supervisor` enabled. So the original problem is only partially solved, imho.
I think different activity runs produced these errors
For sure, they are different, because you have a random error in your code: https://github.com/Torrion/temporal-worker-pool-leak-test/blob/main/app/src/Temporal/TestActivity.php#L18. Thus, your workflow execution has a problem: random errors. As I mentioned before, generally this is ok. But it is not good practice to have such workflows with 100+ activities (and 1 million in your case). It would be better to split that into smaller groups to let Temporal cache your executions.
I can guess that you have bigger limits for the supervisor, so the full resets occurred less often, which let the workflow complete even with the test exception (using retries).
You may achieve the same by increasing `max_jobs` to 2 million, for example.
So, the timeout in your case is entirely expected behavior, and the WF execution errors are expected as well. To finish the WF anyway, use a bigger RR pool limit (`max_jobs: 200000000+`) or a somewhat bigger ScheduleToClose timeout (like 9999 hours).
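The usual Temporal pattern for splitting such a long loop into smaller groups (not prescribed in this thread, but standard) is Continue-As-New; a rough PHP sketch, where the batch size, the timeout, and the class wiring are assumptions:

```php
<?php

use Temporal\Activity\ActivityOptions;
use Temporal\Workflow;

#[Workflow\WorkflowInterface]
class ManyActionWorkflow
{
    #[Workflow\WorkflowMethod(name: 'many_action_test')]
    public function run(int $remaining = 1_000_000)
    {
        $activities = Workflow::newActivityStub(
            TestActivity::class,
            ActivityOptions::new()->withStartToCloseTimeout('10 seconds')
        );

        // Run a bounded batch per workflow run to keep the event history small.
        $batch = min($remaining, 100);
        for ($i = 0; $i < $batch; $i++) {
            yield $activities->exec();
        }

        if ($remaining > $batch) {
            // Restart the workflow with a fresh, empty history.
            return Workflow::continueAsNew('many_action_test', [$remaining - $batch]);
        }
    }
}
```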
Ok. I tried to launch a new test without exceptions but got the same WorkflowTaskTimedOut, and I don't understand why. If I just increase max_jobs/idle_ttl/exec_ttl/whatever, it only means that workers will be reloaded during one of the next workflow executions, so this is not a solution.
I've updated the code in the sample repo.
And here are the logs from when the error occurred at the 20th second of workflow execution: https://gist.github.com/Torrion/61bc6e857dec1fb9160a2ff2a6dfdfaf
16:41:32: a workflow task is scheduled; 16:41:37: it timed out. And there is no activity at all between these points.
Ok. I tried to launch a new test without exceptions but got the same WorkflowTaskTimedOut, and I don't understand why.
Why do you specify restarting workers every 10s (idle)? -> https://github.com/Torrion/temporal-worker-pool-leak-test/blob/main/.rr.yaml#L16
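For reference, a hedged sketch of the supervisor block such a line belongs to (the values are illustrative):

```yaml
temporal:
  activities:
    num_workers: 2
    supervisor:
      idle_ttl: 10s   # stop a worker that has been idle for 10 seconds
```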
Why do you specify restarting workers every 10s (idle)?

For this test it doesn't matter, but in a real app an activity may open socket connections (DB, etc.), so they could become broken after being idle for a long time. Ok, I increased `idle_ttl` to 20s and the errors are gone. But I still don't understand how `idle_ttl` set on the activity workers affects the workflow worker. Shouldn't the workflow worker keep working regardless of what happens in the activity worker pool?
@Torrion I tried your setup and don't get any timeouts.
Ok, I increased idle_ttl to 20s and the errors are gone. Shouldn't the workflow worker keep working regardless of what happens in the activity worker pool?
`idle_ttl` resets the whole execution, because it restarts the RR workers together with the Temporal workers. Thus, some activities might time out during the reset.
idle_ttl resets the whole execution
But the declaration points to activity workers, not workflow workers. If some activity workers are idle but the workflow worker is not, why must it be rebooted? That's a bit confusing; it would be great if you documented this approach somewhere. One more thing, please correct me if I'm wrong: the WorkflowTaskTimedOut error in my logs means not a timeout of the execution itself, but that the workflow worker stopped responding to the Temporal server (because it was rebooting), and the server decided to rerun the workflow task on another available worker.
@Torrion
But the declaration points to activity workers, not workflow workers. If some activity workers are idle but the workflow worker is not, why must it be rebooted?
You can manage only the activity workers from the configuration. The RR workflow worker is created internally and is not available for configuration.
The WF worker is always on; there is no supervisor for it (it's a special type of worker). The only way to restart it is to kill it directly or to reset it via the `reset` command.
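(For reference: in RR v2 that reset can also be triggered from the CLI with `./rr reset`, which resets the plugins' worker pools over RR's RPC interface.)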
the WorkflowTaskTimedOut error in my logs means not a timeout of the execution itself
There is a term called Sticky execution. When RR + Temporal executes your activities, Temporal puts your execution into a sticky queue. But when the `idle_ttl` or `max_jobs` limits are reached, RR purges the sticky WF cache, stops all workers, and starts everything again. At the same time, Temporal still dispatches workflow tasks to the previous sticky queue, which no worker polls anymore, and you see these timeouts. That is not a workflow timeout, but just a workflow task timeout (the task will be rescheduled to the normal queue and executed).
The WF worker is always on; there is no supervisor for it (it's a special type of worker). The only way to restart it is to kill it directly or to reset it via the reset command.
But when the idle_ttl or max_jobs limits are reached, RR purges the sticky WF cache, stops all workers, and starts everything again.
I see a contradiction in these two statements. The question is why the unsupervised WF pool reacts to events in the supervised activity pool; in my view, they should be completely independent of each other. But a discussion about that would be off-topic here.
In sum, the original issue is resolved. Everything works well on the test stand and on my project's staging setup. Production testing is scheduled for next week.
Thank you very much for the quick answers and fixes!
@Torrion
I see a contradiction in these two statements.
Yeah, I'll clarify my statement. I wanted to describe the general approach without going into many details. But, ok: inside the roadrunner-temporal plugin we have three types of workers: the WF worker (only 1), activity workers (configurable by the user), and Temporal workers.
You can't restart these workers independently; that would lead to the errors from the bug description. The RR process supervisor tracks the conditions specified in the configuration (`supervisor`: `idle_ttl`, `exec_ttl`, `max_memory`, `ttl`) and sends signals to the `worker_watcher`. The worker watcher gracefully stops the process and sends an event about that. The roadrunner-temporal plugin listens for these process-stopped events and calls a `Reset` internally, because we should keep the state synced. So, I guess you understand now that the phrases:
The only way to restart it is to kill it directly or reset it via the reset command.
when the idle_ttl or max_jobs limits are reached, RR purges the sticky WF cache, stops all workers
describe different parts of the system. We will cover this in the docs, thank you for your questions.
If everything works well, feel free to close the bug 👍🏻
No duplicates 🥲.
What happened?
Greetings. I was trying to investigate the reasons for some errors in processing my workflows with a lot of activities (e.g. 100+) and found something strange when worker pools were recreated during workflow execution, especially when the `max_jobs` option for activities in the `.rr.yaml` file is used, because it causes the pool to be recreated more often. Sometimes it simply causes an activity execution timeout, but sometimes it causes a workflow task execution error, and as a result the whole workflow fails. On different setups the errors vary slightly, but in both cases the stack trace of the golang panic points to `aggregatedpool/workflow.go:153`. Also, if at least the workflow task worker pool has restarted successfully (e.g. "workflow pool restarted" appears in the log), everything works fine. So for now I suppress exceptions in my workflows and turn off `max_jobs`, but this is imho not a good solution.
Version
PHP 8.1.3
RoadRunner 2.10.3 (temporal plugin version 1.4.4)
Temporal PHP SDK 1.3.2
Relevant log output