temporalio / roadrunner-temporal

Temporal PHP-SDK Host Process plugin for Roadrunner
MIT License

[🐛 BUG]: Incorrect worker pool recreating #223

Closed. Torrion closed this issue 2 years ago.

Torrion commented 2 years ago

No duplicates 🥲.

What happened?

Greetings. I was trying to investigate the reasons for some errors in processing my workflows with a lot of activities (e.g. 100+) and found something strange when worker pools were recreated during workflow execution, especially when the max_jobs option for activities in the .rr.yaml file is used, because it causes the pool to be recreated more often. Sometimes it simply causes an activity execution timeout, but sometimes it causes a workflow task execution error and, as a result, the whole workflow fails. On different setups the errors vary slightly, but in both cases the stack trace of the golang panic points to aggregatedpool/workflow.go:153. Also, if at least the workflow task worker pool has restarted successfully (e.g. "workflow pool restarted" appears in the log), everything works fine. So for now I suppress exceptions in my workflows and turn off max_jobs, but this is imho not a good solution.

Version

PHP 8.1.3, RoadRunner 2.10.3 (temporal plugin version 1.4.4), Temporal PHP SDK 1.3.2

Relevant log output

Log 1:

DEBUG   server          worker stopped  {"internal_event_name": "EventWorkerWaitExit", "error": "signal: killed; process_wait: signal: killed", "errorCauses": [{"error": "signal: killed"}, {"error": "process_wait: signal: killed"}]}
DEBUG   temporal        worker stopped, restarting pool and temporal workers    {"message": "process exited"}
INFO    temporal        reset signal received, resetting activity and workflow worker pools
DEBUG   server          worker is allocated {"pid": 473, "internal_event_name": "EventWorkerConstruct"}
DEBUG   server          worker destroyed    {"pid": 461, "internal_event_name": "EventWorkerDestruct"}
DEBUG   server          worker is allocated {"pid": 477, "internal_event_name": "EventWorkerConstruct"}
INFO    temporal        workflow pool restarted
DEBUG   server          worker destroyed    {"pid": 473, "internal_event_name": "EventWorkerDestruct"}
DEBUG   server          worker destroyed    {"pid": 466, "internal_event_name": "EventWorkerDestruct"}
DEBUG   temporal        workflow task started   {"time": "1s"}
DEBUG   temporal        outgoing message    {"id": 9266, "data": "", "context": ""}
DEBUG   server          worker is allocated {"pid": 482, "internal_event_name": "EventWorkerConstruct"}
DEBUG   server          worker is allocated {"pid": 481, "internal_event_name": "EventWorkerConstruct"}
INFO    temporal        activity pool restarted
WARN    temporal        Failed to poll for task.    {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:071a4091-8f8e-4dd9-bf4b-17275481ef76", "WorkerType": "WorkflowWorker", "Error": "worker stopping"}
WARN    server          worker stopped, and will be restarted   {"reason": "worker error", "pid": 477, "internal_event_name": "EventWorkerError", "error": "sync_worker_exec: SoftJobError:\n\tsync_worker_exec_payload: LogicException: Got the response to undefined request 9266 in /app/vendor/temporal/sdk/src/Internal/Transport/Client.php:60\nStack trace:\n#0 /app/vendor/temporal/sdk/src/WorkerFactory.php(389): Temporal\\Internal\\Transport\\Client->dispatch(Object(Temporal\\Worker\\Transport\\Command\\SuccessResponse))\n#1 /app/vendor/temporal/sdk/src/WorkerFactory.php(261): Temporal\\WorkerFactory->dispatch('\\n'\\x08\\xB2H*\"\\n \\n\\x16\\n\\x08en...', Array)\n#2 /app/vendor/spiral/temporal-bridge/src/Dispatcher.php(61): Temporal\\WorkerFactory->run()\n#3 /app/vendor/spiral/framework/src/Core/src/ContainerScope.php(46): Spiral\\TemporalBridge\\Dispatcher->serve()\n#4 /app/vendor/spiral/framework/src/Core/src/Container.php(282): Spiral\\Core\\ContainerScope::runScope(Object(Spiral\\Core\\Container), Array)\n#5 /app/vendor/spiral/framework/src/Boot/src/AbstractKernel.php(212): Spiral\\Core\\Container->runScope(Array, Array)\n#6 /app/app.php(39): Spiral\\Boot\\AbstractKernel->serve()\n#7 {main}"}
ERROR   temporal        Workflow panic  {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:071a4091-8f8e-4dd9-bf4b-17275481ef76", "WorkflowType": "many_action_test", "WorkflowID": "cfbc01e1-6880-48da-95df-b876bcde4165", "RunID": "f280e334-79c3-4c9b-a750-4c7c0064d38a", "Attempt": 1, "Error": "sync_worker_exec: SoftJobError:\n\tsync_worker_exec_payload: LogicException: Got the response to undefined request 9266 in /app/vendor/temporal/sdk/src/Internal/Transport/Client.php:60\nStack trace:\n#0 /app/vendor/temporal/sdk/src/WorkerFactory.php(389): Temporal\\Internal\\Transport\\Client->dispatch(Object(Temporal\\Worker\\Transport\\Command\\SuccessResponse))\n#1 /app/vendor/temporal/sdk/src/WorkerFactory.php(261): Temporal\\WorkerFactory->dispatch('\\n'\\x08\\xB2H*\"\\n \\n\\x16\\n\\x08en...', Array)\n#2 /app/vendor/spiral/temporal-bridge/src/Dispatcher.php(61): Temporal\\WorkerFactory->run()\n#3 /app/vendor/spiral/framework/src/Core/src/ContainerScope.php(46): Spiral\\TemporalBridge\\Dispatcher->serve()\n#4 /app/vendor/spiral/framework/src/Core/src/Container.php(282): Spiral\\Core\\ContainerScope::runScope(Object(Spiral\\Core\\Container), Array)\n#5 /app/vendor/spiral/framework/src/Boot/src/AbstractKernel.php(212): Spiral\\Core\\Container->runScope(Array, Array)\n#6 /app/app.php(39): Spiral\\Boot\\AbstractKernel->serve()\n#7 {main}", "StackTrace": "process event for default [panic]:\ngithub.com/temporalio/roadrunner-temporal/aggregatedpool.(*Workflow).OnWorkflowTaskStarted(0xc00088ad20, 0xc001b5e340?)\n\tgithub.com/temporalio/roadrunner-temporal@v1.4.4/aggregatedpool/workflow.go:153 +0x2e8\ngo.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).ProcessEvent(0xc002059860, 0xc001b5e400, 0x60?, 0x1)\n\tgo.temporal.io/sdk@v1.14.0/internal/internal_event_handlers.go:815 +0x203\ngo.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0xc002041570, 0xc00078b9e0)\n\tgo.temporal.io/sdk@v1.14.0/internal/internal_task_handlers.go:878 +0xca8\ngo.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0xc0007d9130, 0xc00078b9e0, 0xc00078bd70)\n\tgo.temporal.io/sdk@v1.14.0/internal/internal_task_handlers.go:727 +0x485\ngo.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0xc001ff7ad0, 0xc00078b9e0)\n\tgo.temporal.io/sdk@v1.14.0/internal/internal_task_pollers.go:284 +0x2cd\ngo.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0xc001ff7ad0, {0x15b7080?, 0xc00078b9e0?})\n\tgo.temporal.io/sdk@v1.14.0/internal/internal_task_pollers.go:255 +0x6c\ngo.temporal.io/sdk/internal.(*baseWorker).processTask(0xc0006e0f00, {0x15b6c40?, 0xc001aa56c0})\n\tgo.temporal.io/sdk@v1.14.0/internal/internal_worker_base.go:398 +0x167\ncreated by go.temporal.io/sdk/internal.(*baseWorker).runTaskDispatcher\n\tgo.temporal.io/sdk@v1.14.0/internal/internal_worker_base.go:302 +0xb5"}
WARN    temporal        Failed to process workflow task.    {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:071a4091-8f8e-4dd9-bf4b-17275481ef76", "WorkflowType": "many_action_test", "WorkflowID": "cfbc01e1-6880-48da-95df-b876bcde4165", "RunID": "f280e334-79c3-4c9b-a750-4c7c0064d38a", "Attempt": 1, "Error": "sync_worker_exec: SoftJobError:\n\tsync_worker_exec_payload: LogicException: Got the response to undefined request 9266 in /app/vendor/temporal/sdk/src/Internal/Transport/Client.php:60\nStack trace:\n#0 /app/vendor/temporal/sdk/src/WorkerFactory.php(389): Temporal\\Internal\\Transport\\Client->dispatch(Object(Temporal\\Worker\\Transport\\Command\\SuccessResponse))\n#1 /app/vendor/temporal/sdk/src/WorkerFactory.php(261): Temporal\\WorkerFactory->dispatch('\\n'\\x08\\xB2H*\"\\n \\n\\x16\\n\\x08en...', Array)\n#2 /app/vendor/spiral/temporal-bridge/src/Dispatcher.php(61): Temporal\\WorkerFactory->run()\n#3 /app/vendor/spiral/framework/src/Core/src/ContainerScope.php(46): Spiral\\TemporalBridge\\Dispatcher->serve()\n#4 /app/vendor/spiral/framework/src/Core/src/Container.php(282): Spiral\\Core\\ContainerScope::runScope(Object(Spiral\\Core\\Container), Array)\n#5 /app/vendor/spiral/framework/src/Boot/src/AbstractKernel.php(212): Spiral\\Core\\Container->runScope(Array, Array)\n#6 /app/app.php(39): Spiral\\Boot\\AbstractKernel->serve()\n#7 {main}"}
DEBUG   temporal        sequenceID  {"before": 38}
DEBUG   temporal        sequenceID  {"after": 39}
DEBUG   server          worker stopped  {"internal_event_name": "EventWorkerWaitExit", "error": "signal: killed; process_wait: signal: killed", "errorCauses": [{"error": "signal: killed"}, {"error": "process_wait: signal: killed"}]}
DEBUG   temporal        outgoing message    {"id": 39, "data": "", "context": ""}
DEBUG   server          worker is allocated {"pid": 489, "internal_event_name": "EventWorkerConstruct"}
DEBUG   temporal        received message    {"command": null, "id": 39, "data": "\n\ufffd\u000f\u0008'\"\ufffd\u000f\ndUnable to kill workflow because workflow process #f280e334-79c3-4c9b-a750-4c7c0064d38a was not found\u0012\u0007PHP_SDK\u001a\ufffd\u000eInvalidArgumentException: Unable to kill workflow because workflow process #f280e334-79c3-4c9b-a750-4c7c0064d38a was not found in /app/vendor/temporal/sdk/src/Internal/Transport/Router/DestroyWorkflow.php:48\nStack trace:\n#0 /app/vendor/temporal/sdk/src/Internal/Transport/Router/DestroyWorkflow.php(34): Temporal\\Internal\\Transport\\Router\\DestroyWorkflow->kill('f280e334-79c3-4...')\n#1 /app/vendor/temporal/sdk/src/Internal/Transport/Router.php(81): Temporal\\Internal\\Transport\\Router\\DestroyWorkflow->handle(Object(Temporal\\Worker\\Transport\\Command\\Request), Array, Object(React\\Promise\\Deferred))\n#2 /app/vendor/temporal/sdk/src/Worker/Worker.php(94): Temporal\\Internal\\Transport\\Router->dispatch(Object(Temporal\\Worker\\Transport\\Command\\Request), Array)\n#3 /app/vendor/temporal/sdk/src/WorkerFactory.php(413): Temporal\\Worker\\Worker->dispatch(Object(Temporal\\Worker\\Transport\\Command\\Request), Array)\n#4 /app/vendor/temporal/sdk/src/Internal/Transport/Server.php(60): Temporal\\WorkerFactory->onRequest(Object(Temporal\\Worker\\Transport\\Command\\Request), Array)\n#5 /app/vendor/temporal/sdk/src/WorkerFactory.php(387): Temporal\\Internal\\Transport\\Server->dispatch(Object(Temporal\\Worker\\Transport\\Command\\Request), Array)\n#6 /app/vendor/temporal/sdk/src/WorkerFactory.php(261): Temporal\\WorkerFactory->dispatch('\\nE\\x08'\\x12\\x0FDestroyWo...', Array)\n#7 /app/vendor/spiral/temporal-bridge/src/Dispatcher.php(61): Temporal\\WorkerFactory->run()\n#8 /app/vendor/spiral/framework/src/Core/src/ContainerScope.php(46): Spiral\\TemporalBridge\\Dispatcher->serve()\n#9 /app/vendor/spiral/framework/src/Core/src/Container.php(282): Spiral\\Core\\ContainerScope::runScope(Object(Spiral\\Core\\Container), Array)\n#10 /app/vendor/spiral/framework/src/Boot/src/AbstractKernel.php(212): Spiral\\Core\\Container->runScope(Array, Array)\n#11 /app/app.php(39): Spiral\\Boot\\AbstractKernel->serve()\n#12 {main}*\u001a\n\u0018InvalidArgumentException"}
WARN    temporal        Failed to poll for task.    {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:071a4091-8f8e-4dd9-bf4b-17275481ef76", "WorkerType": "ActivityWorker", "Error": "worker stopping"}
INFO    temporal        Stopped Worker  {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:071a4091-8f8e-4dd9-bf4b-17275481ef76"}
DEBUG   temporal        outgoing message    {"id": 0, "data": "", "context": ""}
DEBUG   temporal        received message    {"command": null, "id": 0, "data": "\n\ufffd\u0004*\ufffd\u0004\n\ufffd\u0004\n\u0016\n\u0008encoding\u0012\njson/plain\u0012\ufffd\u0004{\"TaskQueue\":\"default\",\"Options\":{\"MaxConcurrentActivityExecutionSize\":0,\"WorkerActivitiesPerSecond\":0.0,\"MaxConcurrentLocalActivityExecutionSize\":0,\"WorkerLocalActivitiesPerSecond\":0.0,\"TaskQueueActivitiesPerSecond\":0.0,\"MaxConcurrentActivityTaskPollers\":0,\"MaxConcurrentWorkflowTaskExecutionSize\":0,\"MaxConcurrentWorkflowTaskPollers\":0,\"StickyScheduleToStartTimeout\":null,\"WorkerStopTimeout\":null,\"EnableSessionWorker\":false,\"SessionResourceID\":null,\"MaxConcurrentSessionExecutionSize\":1000},\"Workflows\":[{\"Name\":\"many_action_test\",\"Queries\":[],\"Signals\":[]}],\"Activities\":[{\"Name\":\"exec\"}]}"}
DEBUG   temporal        worker info {"taskqueue": "default", "options": {"MaxConcurrentActivityExecutionSize":0,"WorkerActivitiesPerSecond":0,"MaxConcurrentLocalActivityExecutionSize":0,"WorkerLocalActivitiesPerSecond":0,"TaskQueueActivitiesPerSecond":0,"MaxConcurrentActivityTaskPollers":0,"MaxConcurrentWorkflowTaskExecutionSize":0,"MaxConcurrentWorkflowTaskPollers":0,"EnableLoggingInReplay":false,"DisableStickyExecution":false,"StickyScheduleToStartTimeout":0,"BackgroundActivityContext":null,"WorkflowPanicPolicy":0,"WorkerStopTimeout":0,"EnableSessionWorker":false,"MaxConcurrentSessionExecutionSize":1000,"DisableWorkflowWorker":false,"LocalActivityWorkerOnly":false,"Identity":"","DeadlockDetectionTimeout":0,"MaxHeartbeatThrottleInterval":0,"DefaultHeartbeatThrottleInterval":0,"Interceptors":null}}
DEBUG   temporal        workflow registered {"taskqueue": "default", "workflow name": "many_action_test"}
DEBUG   temporal        activity registered {"taskqueue": "default", "workflow name": "exec"}
DEBUG   temporal        workers initialized {"num_workers": 1}
INFO    temporal        Started Worker  {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:ef206c5e-a1d0-4c3b-8404-fe99e29fa9fc"}
DEBUG   temporal        worker stopped, restarting pool and temporal workers    {"message": "process exited"}
INFO    temporal        reset signal received, resetting activity and workflow worker pools
DEBUG   server          worker destroyed    {"pid": 489, "internal_event_name": "EventWorkerDestruct"}
DEBUG   server          worker is allocated {"pid": 493, "internal_event_name": "EventWorkerConstruct"}
INFO    temporal        workflow pool restarted

Log 2:

static_pool_exec: Workers watcher stopped:\n\tstatic_pool_exec:\n\tworker_watcher_get_free_worker", "StackTrace": "process event for default [panic]:\ngithub.com/temporalio/roadrunner-temporal/aggregatedpool.(*Workflow).OnWorkflowTaskStarted(0xc0008914a0, 0xc0011ac5c0?)\n\tgithub.com/temporalio/roadrunner-temporal@v1.4.4/aggregatedpool/workflow.go:153 +0x2e8\ngo.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).ProcessEvent(0xc000f12600, 0xc0011ac680, 0x60?, 0x1)\n\tgo.temporal.io/sdk@v1.14.0/internal/internal_event_handlers.go:815 +0x203\ngo.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0xc00019a5b0, 0xc0011b4d80)\n\tgo.temporal.io/sdk@v1.14.0/internal/internal_task_handlers.go:878 +0xca8\ngo.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0xc000fd49a0, 0xc0011b4d80, 0xc0011b5110)\n\tgo.temporal.io/sdk@v1.14.0/internal/internal_task_handlers.go:727 +0x485\ngo.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0xc00071a5b0, 0xc0011b4d80)\n\tgo.temporal.io/sdk@v1.14.0/internal/internal_task_pollers.go:284 +0x2cd\ngo.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0xc00071a5b0, {0x15b7080?, 0xc0011b4d80?})\n\tgo.temporal.io/sdk@v1.14.0/internal/internal_task_pollers.go:255 +0x6c\ngo.temporal.io/sdk/internal.(*baseWorker).processTask(0xc000544280, {0x15b6c40?, 0xc000efa260})\n\tgo.temporal.io/sdk@v1.14.0/internal/internal_worker_base.go:398 +0x167\ncreated by go.temporal.io/sdk/internal.(*baseWorker).runTaskDispatcher\n\tgo.temporal.io/sdk@v1.14.0/internal/internal_worker_base.go:302 +0xb5
rustatian commented 2 years ago

Hey @Torrion 👋🏻. The max_jobs option is used to get a PHP-FPM-like behavior (restart a PHP process every n requests). You can safely skip this option if you don't have memory leaks in your PHP code. Could you please provide me with a sample to reproduce your issue and the .rr.yaml configuration you used?

rustatian commented 2 years ago

Or, if you do have memory leaks, you may use the supervisor option with a memory limit.
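
For reference, here is a minimal .rr.yaml sketch of that approach. The temporal.activities option names mirror the config shared later in this thread; the values are illustrative only, not recommendations:

temporal:
  activities:
    num_workers: 4
    # no max_jobs: workers are recycled only when the supervisor conditions below are hit
    supervisor:
      ttl: 30m                 # restart a worker once it has lived for 30 minutes
      max_worker_memory: 256   # restart a worker that exceeds 256 MB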

Torrion commented 2 years ago

Thanks @rustatian for your answer

Or, if you do have memory leaks, you may use the supervisor option with a memory limit.

There were problems with supervisor on RoadRunner v2.4-2.6 on my Symfony setup, but you're right, now it's quite useful. However, a more recent test with supervisor and no max_jobs option produces errors like:

PanicError: sync_worker_exec: SoftJobError:
    sync_worker_exec_payload: LogicException: Got the response to undefined request 9030 in /app/vendor/temporal/sdk/src/Internal/Transport/Client.php:60
Stack trace:
#0 /app/vendor/temporal/sdk/src/WorkerFactory.php(389): Temporal\Internal\Transport\Client->dispatch(Object(Temporal\Worker\Transport\Command\SuccessResponse))
#1 /app/vendor/temporal/sdk/src/WorkerFactory.php(261): Temporal\WorkerFactory->dispatch('\n'\x08\xC6F*"\n \n\x16\n\x08en...', Array)
#2 /app/vendor/spiral/temporal-bridge/src/Dispatcher.php(61): Temporal\WorkerFactory->run()
#3 /app/vendor/spiral/framework/src/Core/src/ContainerScope.php(46): Spiral\TemporalBridge\Dispatcher->serve()
#4 /app/vendor/spiral/framework/src/Core/src/Container.php(282): Spiral\Core\ContainerScope::runScope(Object(Spiral\Core\Container), Array)
#5 /app/vendor/spiral/framework/src/Boot/src/AbstractKernel.php(212): Spiral\Core\Container->runScope(Array, Array)
#6 /app/app.php(39): Spiral\Boot\AbstractKernel->serve()
#7 {main} 
process event for default [panic]:
github.com/temporalio/roadrunner-temporal/aggregatedpool.(*Workflow).OnWorkflowTaskStarted(0xc000081590, 0xc0009f97c0?)
    github.com/temporalio/roadrunner-temporal@v1.4.4/aggregatedpool/workflow.go:153 +0x2e8
go.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).ProcessEvent(0xc0006579b0, 0xc0009f9880, 0x60?, 0x1)
    go.temporal.io/sdk@v1.14.0/internal/internal_event_handlers.go:815 +0x203
go.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0xc0005f2bd0, 0xc000a8e8a0)
    go.temporal.io/sdk@v1.14.0/internal/internal_task_handlers.go:878 +0xca8
go.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0xc000806a50, 0xc000a8e8a0, 0xc000a8ec30)
    go.temporal.io/sdk@v1.14.0/internal/internal_task_handlers.go:727 +0x485
go.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0xc000772340, 0xc000a8e8a0)
    go.temporal.io/sdk@v1.14.0/internal/internal_task_pollers.go:284 +0x2cd
go.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0xc000772340, {0x15b7080?, 0xc000a8e8a0?})
    go.temporal.io/sdk@v1.14.0/internal/internal_task_pollers.go:255 +0x6c
go.temporal.io/sdk/internal.(*baseWorker).processTask(0xc0006e0140, {0x15b6c40?, 0xc0009fb040})
    go.temporal.io/sdk@v1.14.0/internal/internal_worker_base.go:398 +0x167
created by go.temporal.io/sdk/internal.(*baseWorker).runTaskDispatcher
    go.temporal.io/sdk@v1.14.0/internal/internal_worker_base.go:302 +0xb5

or even simpler:

Error: activity_pool_execute_activity: Workers watcher stopped:
    static_pool_exec_with_context:
    static_pool_exec_with_context:
    worker_watcher_get_free_worker

Could you please provide me with a sample to reproduce your issue and the .rr.yaml configuration you used?

https://github.com/Torrion/temporal-worker-pool-leak-test

rustatian commented 2 years ago

Ok, thanks for the details 👍🏻. You don't have enough workers to serve all your activities; you should either limit the maximum number of concurrent activities or increase the number of workers. You may check the PHP worker options (especially MaxConcurrentActivityExecutionSize).

rustatian commented 2 years ago

You should increase this option: https://github.com/Torrion/temporal-worker-pool-leak-test/blob/main/.rr.yaml#L13. If your tasks are CPU intensive, you should use at least 1.5*CPU_CORES workers. If your tasks are IO-bound, you may safely use 10-50-100 workers.
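
As a rough sketch only (the right number depends on your workload and hardware; the value here is just an example):

temporal:
  activities:
    # CPU-bound activities: roughly 1.5 * CPU cores
    # IO-bound activities: tens of workers are usually safe
    num_workers: 16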

rustatian commented 2 years ago

@Torrion Also, don't use the -d option for RR. It is not a debug flag as it was in RR v1; it starts a pprof server that will hugely slow down RR.

mzavatsky commented 2 years ago

I've faced the same problem after upgrading RR to 2.10.3 and the PHP SDK to 1.3.2. Temporal can't find workers for activities, reporting the same error:

Error: activity_pool_execute_activity: Workers watcher stopped:
    static_pool_exec:
    static_pool_exec:
    worker_watcher_get_free_worker

There were no such problems before the upgrade with the same load.

mzavatsky commented 2 years ago

Our rr activities config:

temporal:
  activities:
    num_workers: 4
    max_jobs: 64
    supervisor:
      ttl: 30m
      max_worker_memory: 256
rustatian commented 2 years ago

Ok, guys, thanks for the reports. I'll investigate the regression.

Torrion commented 2 years ago

You should increase this option: https://github.com/Torrion/temporal-worker-pool-leak-test/blob/main/.rr.yaml#L13. If your tasks are CPU intensive, you should use at least 1.5*CPU_CORES workers. If your tasks are IO-bound, you may safely use 10-50-100 workers.

Looks like increasing the worker count and reducing the activity count per worker doesn't help. As I understand it, my code produces only one activity at a time, and 2 workers should be enough, since this is a test stand and only one workflow runs.

Also, don't use the -d option for RR. It is not a debug flag as it was in RR v1; it starts a pprof server that will hugely slow down RR.

Oh. I see, thanks!

Ok, guys, thanks for the reports. I'll investigate the regression.

Thank you!

mzavatsky commented 2 years ago

A couple of error messages from the worker's log; hope it helps:

{"level":"debug","ts":1654603823.9559908,"logger":"temporal","msg":"outgoing message","id":178,"data":"","context":""}
{"level":"error","ts":1654603823.9560623,"logger":"temporal","msg":"Activity error.","Namespace":"schedule_manager_crud","TaskQueue":"default","WorkerID":"default:0503ba48-cc2a-4a44-84a7-9a968b92ad0b","WorkflowID":"29cb8ca5-a797-4d1e-b009-47b7079574ce","RunID":"07fce548-cae3-4de3-9d29-1aae6dcbcefe","ActivityType":"EducationServiceWasChanged.findClassTypeProperties","Attempt":5,"Error":"activity_pool_execute_activity: Workers watcher stopped:\n\tstatic_pool_exec:\n\tstatic_pool_exec:\n\tworker_watcher_get_free_worker"}
{"level":"debug","ts":1654603823.9712448,"logger":"temporal","msg":"workflow task started","time":1}
{"level":"debug","ts":1654603823.9713047,"logger":"temporal","msg":"outgoing message","id":9001,"data":"","context":""}
{"level":"debug","ts":1654603823.9744446,"logger":"temporal","msg":"received message","command":{},"id":9002,"data":"\n\ufffd\u0005\u0008\ufffdF\u0012\u0010CompleteWorkflow\u001a\u0002{}\"\ufffd\u0005\n\u000eactivity error\u0012\u0005GoSDK\u001a\ufffd\u0003/opt/app/workflows_crud/src/ScheduleManagerWorkflowsCrudBundle/Feature/EducationServiceWasChanged/EducationServiceWasChangedWorkflow.php:39\n/opt/app/vendor/temporal/sdk/src/Worker/Worker.php:94\n/opt/app/vendor/temporal/sdk/src/WorkerFactory.php:413\n/opt/app/vendor/temporal/sdk/src/WorkerFactory.php:387\n/opt/app/vendor/temporal/sdk/src/WorkerFactory.php:261\n/opt/app/bin/worker_crud:44\"\ufffd\u0001\n~activity_pool_execute_activity: Workers watcher stopped:\n\tstatic_pool_exec:\n\tstatic_pool_exec:\n\tworker_watcher_get_free_worker\u0012\u0005GoSDK*\u0007\n\u0005ErrorZm\u0008\u0005\u0010\u0006\u001a,default:0503ba48-cc2a-4a44-84a7-9a968b92ad0b\"4\n2EducationServiceWasChanged.findClassTypeProperties*\u000150\u0004*\u0000"}
rustatian commented 2 years ago

Thanks, @mzavatsky. Nice, I guess I know where the issue is 👍🏻

rustatian commented 2 years ago

@Torrion @mzavatsky Would you guys mind trying this version: https://github.com/roadrunner-server/roadrunner/releases/tag/v2.10.4-rc.1?

Torrion commented 2 years ago

@rustatian I've tested on the test stand from the previous message, and now it suddenly gets stuck after the pool restart.

Here is an example:

app_1                     | 2022-06-09T17:20:39.971Z   DEBUG   temporal        worker stopped, restarting pool and temporal workers    {"message": "process exited"}
app_1                     | 2022-06-09T17:20:39.971Z   INFO    temporal        reset signal received, resetting activity and workflow worker pools
app_1                     | 2022-06-09T17:20:39.971Z   WARN    temporal        Failed to poll for task.    {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:a569809d-7d86-463a-9284-122a75ef8fba", "WorkerType": "WorkflowWorker", "Error": "worker stopping"}
app_1                     | 2022-06-09T17:20:39.971Z   DEBUG   server          idle_ttl    {"reason": "idle ttl is reached", "pid": 292, "internal_event_name": "EventTTL"}
app_1                     | 2022-06-09T17:20:39.972Z   WARN    temporal        Failed to poll for task.    {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:a569809d-7d86-463a-9284-122a75ef8fba", "WorkerType": "ActivityWorker", "Error": "worker stopping"}
app_1                     | 2022-06-09T17:20:39.972Z   INFO    temporal        Stopped Worker  {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:a569809d-7d86-463a-9284-122a75ef8fba"}
app_1                     | 2022-06-09T17:20:39.985Z   DEBUG   server          worker destroyed    {"pid": 282, "internal_event_name": "EventWorkerDestruct"}
app_1                     | 2022-06-09T17:20:40.113Z   DEBUG   server          worker is allocated {"pid": 330, "internal_event_name": "EventWorkerConstruct"}
app_1                     | 2022-06-09T17:20:40.113Z   INFO    temporal        workflow pool restarted
app_1                     | 2022-06-09T17:20:40.113Z   DEBUG   server          worker is allocated {"pid": 329, "internal_event_name": "EventWorkerConstruct"}
app_1                     | 2022-06-09T17:20:40.132Z   DEBUG   server          worker destroyed    {"pid": 296, "internal_event_name": "EventWorkerDestruct"}
app_1                     | 2022-06-09T17:20:40.135Z   DEBUG   server          worker destroyed    {"pid": 289, "internal_event_name": "EventWorkerDestruct"}
app_1                     | 2022-06-09T17:20:40.135Z   DEBUG   server          worker destroyed    {"pid": 307, "internal_event_name": "EventWorkerDestruct"}
app_1                     | 2022-06-09T17:20:40.135Z   DEBUG   server          worker destroyed    {"pid": 294, "internal_event_name": "EventWorkerDestruct"}
app_1                     | 2022-06-09T17:20:40.135Z   DEBUG   server          worker destroyed    {"pid": 297, "internal_event_name": "EventWorkerDestruct"}
app_1                     | 2022-06-09T17:20:40.135Z   DEBUG   server          worker destroyed    {"pid": 290, "internal_event_name": "EventWorkerDestruct"}
app_1                     | 2022-06-09T17:20:40.136Z   DEBUG   server          worker destroyed    {"pid": 293, "internal_event_name": "EventWorkerDestruct"}
app_1                     | 2022-06-09T17:20:40.136Z   DEBUG   server          worker destroyed    {"pid": 295, "internal_event_name": "EventWorkerDestruct"}
app_1                     | 2022-06-09T17:20:40.136Z   DEBUG   server          worker destroyed    {"pid": 292, "internal_event_name": "EventWorkerDestruct"}
app_1                     | 2022-06-09T17:20:40.137Z   DEBUG   server          worker destroyed    {"pid": 329, "internal_event_name": "EventWorkerDestruct"}
app_1                     | 2022-06-09T17:20:40.226Z   DEBUG   server          worker is allocated {"pid": 338, "internal_event_name": "EventWorkerConstruct"}
app_1                     | 2022-06-09T17:20:40.270Z   DEBUG   server          worker is allocated {"pid": 337, "internal_event_name": "EventWorkerConstruct"}
app_1                     | 2022-06-09T17:20:40.274Z   DEBUG   server          worker is allocated {"pid": 339, "internal_event_name": "EventWorkerConstruct"}
app_1                     | 2022-06-09T17:20:40.281Z   DEBUG   server          worker is allocated {"pid": 341, "internal_event_name": "EventWorkerConstruct"}
app_1                     | 2022-06-09T17:20:40.285Z   DEBUG   server          worker is allocated {"pid": 340, "internal_event_name": "EventWorkerConstruct"}
app_1                     | 2022-06-09T17:20:40.285Z   DEBUG   server          worker is allocated {"pid": 342, "internal_event_name": "EventWorkerConstruct"}
app_1                     | 2022-06-09T17:20:40.305Z   DEBUG   server          worker is allocated {"pid": 364, "internal_event_name": "EventWorkerConstruct"}
app_1                     | 2022-06-09T17:20:40.306Z   DEBUG   server          worker is allocated {"pid": 356, "internal_event_name": "EventWorkerConstruct"}
app_1                     | 2022-06-09T17:20:40.337Z   DEBUG   server          worker is allocated {"pid": 349, "internal_event_name": "EventWorkerConstruct"}
app_1                     | 2022-06-09T17:20:40.339Z   DEBUG   server          worker is allocated {"pid": 360, "internal_event_name": "EventWorkerConstruct"}
app_1                     | 2022-06-09T17:20:40.339Z   INFO    temporal        activity pool restarted
app_1                     | 2022-06-09T17:20:40.339Z   DEBUG   temporal        outgoing message    {"id": 0, "data": "", "context": ""}
app_1                     | 2022-06-09T17:20:40.344Z   DEBUG   temporal        received message    {"command": null, "id": 0, "data": "\n\ufffd\u0004*\ufffd\u0004\n\ufffd\u0004\n\u0016\n\u0008encoding\u0012\njson/plain\u0012\ufffd\u0004{\"TaskQueue\":\"default\",\"Options\":{\"MaxConcurrentActivityExecutionSize\":0,\"WorkerActivitiesPerSecond\":0.0,\"MaxConcurrentLocalActivityExecutionSize\":0,\"WorkerLocalActivitiesPerSecond\":0.0,\"TaskQueueActivitiesPerSecond\":0.0,\"MaxConcurrentActivityTaskPollers\":0,\"MaxConcurrentWorkflowTaskExecutionSize\":0,\"MaxConcurrentWorkflowTaskPollers\":0,\"StickyScheduleToStartTimeout\":null,\"WorkerStopTimeout\":null,\"EnableSessionWorker\":false,\"SessionResourceID\":null,\"MaxConcurrentSessionExecutionSize\":1000},\"Workflows\":[{\"Name\":\"many_action_test\",\"Queries\":[],\"Signals\":[]}],\"Activities\":[{\"Name\":\"exec\"}]}"}
app_1                     | 2022-06-09T17:20:40.344Z   DEBUG   temporal        worker info {"taskqueue": "default", "optionsError": "json: unsupported type: func(error)"}
app_1                     | 2022-06-09T17:20:40.345Z   DEBUG   temporal        workflow registered {"taskqueue": "default", "workflow name": "many_action_test"}
app_1                     | 2022-06-09T17:20:40.345Z   DEBUG   temporal        activity registered {"taskqueue": "default", "workflow name": "exec"}
app_1                     | 2022-06-09T17:20:40.345Z   DEBUG   temporal        workers initialized {"num_workers": 1}
app_1                     | 2022-06-09T17:20:40.346Z   INFO    temporal        Started Worker  {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:94d08fa9-7a6f-47d5-ae25-6f8d15dc31f7"}
app_1                     | 2022-06-09T17:25:45.003Z   DEBUG   temporal        workflow execute    {"runID": "7073a076-b253-46c7-b1f2-5950cafd8f07", "workflow info": {"WorkflowExecution":{"ID":"61ccf3a7-7834-4577-b877-3cb909728ea6","RunID":"7073a076-b253-46c7-b1f2-5950cafd8f07"},"WorkflowType":{"Name":"many_action_test"},"TaskQueueName":"default","WorkflowExecutionTimeout":0,"WorkflowRunTimeout":0,"WorkflowTaskTimeout":10000000000,"Namespace":"default","Attempt":1,"WorkflowStartTime":"2022-06-09T17:19:29.586527157Z","CronSchedule":"","ContinuedExecutionRunID":"","ParentWorkflowNamespace":"","ParentWorkflowExecution":null,"Memo":null,"SearchAttributes":null,"RetryPolicy":{"InitialInterval":1000000000,"BackoffCoefficient":2,"MaximumInterval":100000000000,"MaximumAttempts":2,"NonRetryableErrorTypes":null},"BinaryChecksum":"c4f512bed6295ec6c12bdbcf5957f8ce"}}
app_1                     | 2022-06-09T17:25:45.003Z   DEBUG   temporal        sequenceID  {"before": 6}
app_1                     | 2022-06-09T17:25:45.003Z   DEBUG   temporal        sequenceID  {"after": 7}
app_1                     | 2022-06-09T17:25:45.003Z   DEBUG   temporal        workflow task started   {"time": "1s"}

At 17:20:40 Temporal scheduled an activity task with a 5-minute expiration. The activity workers just kept waiting for tasks until 17:25:45, when the expiration event came to the workflow worker.

rustatian commented 2 years ago

@Torrion Is that from your sample repo?

Torrion commented 2 years ago

Is that from your sample repo?

Yes

rustatian commented 2 years ago

Is that from your sample repo?

Yes

Hm... Could you please recreate your Docker images? I'm using your sample repo, and everything finished OK.

rustatian commented 2 years ago

@Torrion I slightly reduced the activities loop from 1M iterations to 100 to have a chance to finish that workflow (a random exception is hardcoded in the activity).

  69, 2022-06-10T09:47:59Z, ActivityTaskScheduled
  70, 2022-06-10T09:48:01Z, ActivityTaskStarted
  71, 2022-06-10T09:48:02Z, ActivityTaskCompleted
  72, 2022-06-10T09:48:02Z, WorkflowTaskScheduled
  73, 2022-06-10T09:48:07Z, WorkflowTaskTimedOut
  74, 2022-06-10T09:48:07Z, WorkflowTaskScheduled
  75, 2022-06-10T09:48:07Z, WorkflowTaskStarted
  76, 2022-06-10T09:48:07Z, WorkflowTaskCompleted
  77, 2022-06-10T09:48:07Z, ActivityTaskScheduled
  78, 2022-06-10T09:48:07Z, ActivityTaskStarted
  79, 2022-06-10T09:48:07Z, ActivityTaskCompleted
  80, 2022-06-10T09:48:07Z, WorkflowTaskScheduled
  81, 2022-06-10T09:48:12Z, WorkflowTaskTimedOut
  82, 2022-06-10T09:48:12Z, WorkflowTaskScheduled
  83, 2022-06-10T09:48:12Z, WorkflowTaskStarted
  84, 2022-06-10T09:48:12Z, WorkflowTaskCompleted
  85, 2022-06-10T09:48:12Z, WorkflowExecutionCompleted

Result:
  Run Time: 70 seconds
  Status: COMPLETED
  Output: [nil]
rustatian commented 2 years ago

Try to do the following:

  1. Remove (docker-compose down) the previous image with the temporal bundle.
  2. Recreate the RR image.
  3. Re-run your test. I used a tctl tool to start a workflow: tctl workflow run --taskqueue=default --workflow_type=many_action_test
Torrion commented 2 years ago

I've tested it again. At first I thought it was my mistake and that I had forgotten to recreate the container yesterday, since everything seemed to work fine at first glance, especially with supervisor. With max_jobs there were some workflow task start timeout errors within 5s after the tasks had been scheduled (although the timeout is 10s), but that is not actually a problem. But then I left my workflow running for a long time, and it failed 6 minutes after start.

app_1                     | 2022-06-10T10:03:20.255Z   DEBUG   temporal        workflow task started   {"time": "1s"}
app_1                     | 2022-06-10T10:03:20.255Z   DEBUG   temporal        outgoing message    {"id": 9024, "data": "", "context": ""}
app_1                     | 2022-06-10T10:03:20.258Z   DEBUG   temporal        received message    {"command": {"name":"exec","options":{"ActivityID":"","TaskQueueName":"default","ScheduleToCloseTimeout":300000000000,"ScheduleToStartTimeout":0,"StartToCloseTimeout":0,"HeartbeatTimeout":0,"WaitForCancellation":false,"OriginalTaskQueueName":"","RetryPolicy":{"backoff_coefficient":2,"maximum_attempts":5}}}, "id": 9025, "data": "\n\ufffd\u0002\u0008\ufffdF\u0012\u000fExecuteActivity\u001a\ufffd\u0002{\"name\":\"exec\",\"options\":{\"TaskQueueName\":\"default\",\"ScheduleToCloseTimeout\":300000000000,\"ScheduleToStartTimeout\":0,\"StartToCloseTimeout\":0,\"HeartbeatTimeout\":0,\"WaitForCancellation\":false,\"ActivityID\":\"\",\"RetryPolicy\":{\"initial_interval\":null,\"backoff_coefficient\":2,\"maximum_interval\":null,\"maximum_attempts\":5,\"non_retryable_error_types\":[]}}}*\u0000"}
app_1                     | 2022-06-10T10:03:20.258Z   DEBUG   temporal        ExecuteActivity {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:ec69b6c3-a4c4-473c-b6e7-336cefbf6573", "WorkflowType": "many_action_test", "WorkflowID": "9139895b-c7ea-48fa-9cdb-c998ea2593a3", "RunID": "4e5ac690-246f-4ef1-ba14-a2a47094be84", "Attempt": 1, "ActivityID": "151", "ActivityType": "exec"}
app_1                     | 2022-06-10T10:03:20.534Z   DEBUG   temporal        worker stopped, restarting pool and temporal workers    {"message": "process exited"}
app_1                     | 2022-06-10T10:03:20.534Z   INFO    temporal        reset signal received, resetting activity and workflow worker pools
app_1                     | 2022-06-10T10:03:20.535Z   WARN    temporal        Failed to poll for task.    {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:ec69b6c3-a4c4-473c-b6e7-336cefbf6573", "WorkerType": "WorkflowWorker", "Error": "worker stopping"}
app_1                     | 2022-06-10T10:03:20.535Z   WARN    temporal        Failed to poll for task.    {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:ec69b6c3-a4c4-473c-b6e7-336cefbf6573", "WorkerType": "ActivityWorker", "Error": "worker stopping"}
app_1                     | 2022-06-10T10:03:20.535Z   INFO    temporal        Stopped Worker  {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:ec69b6c3-a4c4-473c-b6e7-336cefbf6573"}
app_1                     | 2022-06-10T10:03:20.550Z   DEBUG   server          worker destroyed    {"pid": 182, "internal_event_name": "EventWorkerDestruct"}
app_1                     | 2022-06-10T10:03:20.617Z   DEBUG   server          worker is allocated {"pid": 198, "internal_event_name": "EventWorkerConstruct"}
app_1                     | 2022-06-10T10:03:20.620Z   DEBUG   server          worker is allocated {"pid": 199, "internal_event_name": "EventWorkerConstruct"}
app_1                     | 2022-06-10T10:03:20.620Z   INFO    temporal        workflow pool restarted
app_1                     | 2022-06-10T10:03:20.632Z   DEBUG   server          worker destroyed    {"pid": 198, "internal_event_name": "EventWorkerDestruct"}
app_1                     | 2022-06-10T10:03:20.632Z   DEBUG   server          worker destroyed    {"pid": 188, "internal_event_name": "EventWorkerDestruct"}
app_1                     | 2022-06-10T10:03:20.632Z   DEBUG   server          worker destroyed    {"pid": 186, "internal_event_name": "EventWorkerDestruct"}
app_1                     | 2022-06-10T10:03:20.685Z   DEBUG   server          worker is allocated {"pid": 207, "internal_event_name": "EventWorkerConstruct"}
app_1                     | 2022-06-10T10:03:20.686Z   DEBUG   server          worker is allocated {"pid": 206, "internal_event_name": "EventWorkerConstruct"}
app_1                     | 2022-06-10T10:03:20.686Z   DEBUG   server          worker is allocated {"pid": 208, "internal_event_name": "EventWorkerConstruct"}
app_1                     | 2022-06-10T10:03:20.686Z   INFO    temporal        activity pool restarted
app_1                     | 2022-06-10T10:03:20.686Z   DEBUG   temporal        outgoing message    {"id": 0, "data": "", "context": ""}
app_1                     | 2022-06-10T10:03:20.691Z   DEBUG   temporal        received message    {"command": null, "id": 0, "data": "\n\ufffd\u0004*\ufffd\u0004\n\ufffd\u0004\n\u0016\n\u0008encoding\u0012\njson/plain\u0012\ufffd\u0004{\"TaskQueue\":\"default\",\"Options\":{\"MaxConcurrentActivityExecutionSize\":0,\"WorkerActivitiesPerSecond\":0.0,\"MaxConcurrentLocalActivityExecutionSize\":0,\"WorkerLocalActivitiesPerSecond\":0.0,\"TaskQueueActivitiesPerSecond\":0.0,\"MaxConcurrentActivityTaskPollers\":0,\"MaxConcurrentWorkflowTaskExecutionSize\":0,\"MaxConcurrentWorkflowTaskPollers\":0,\"StickyScheduleToStartTimeout\":null,\"WorkerStopTimeout\":null,\"EnableSessionWorker\":false,\"SessionResourceID\":null,\"MaxConcurrentSessionExecutionSize\":1000},\"Workflows\":[{\"Name\":\"many_action_test\",\"Queries\":[],\"Signals\":[]}],\"Activities\":[{\"Name\":\"exec\"}]}"}
app_1                     | 2022-06-10T10:03:20.691Z   DEBUG   temporal        worker info {"taskqueue": "default", "optionsError": "json: unsupported type: func(error)"}
app_1                     | 2022-06-10T10:03:20.691Z   DEBUG   temporal        workflow registered {"taskqueue": "default", "workflow name": "many_action_test"}
app_1                     | 2022-06-10T10:03:20.691Z   DEBUG   temporal        activity registered {"taskqueue": "default", "workflow name": "exec"}
app_1                     | 2022-06-10T10:03:20.691Z   DEBUG   temporal        workers initialized {"num_workers": 1}
app_1                     | 2022-06-10T10:03:20.694Z   INFO    temporal        Started Worker  {"Namespace": "default", "TaskQueue": "default", "WorkerID": "default:8c7f3f3b-9716-4e3a-8150-5c62884c7d85"}
app_1                     | 2022-06-10T10:08:25.730Z   DEBUG   temporal        workflow execute    {"runID": "4e5ac690-246f-4ef1-ba14-a2a47094be84", "workflow info": {"WorkflowExecution":{"ID":"9139895b-c7ea-48fa-9cdb-c998ea2593a3","RunID":"4e5ac690-246f-4ef1-ba14-a2a47094be84"},"WorkflowType":{"Name":"many_action_test"},"TaskQueueName":"default","WorkflowExecutionTimeout":0,"WorkflowRunTimeout":0,"WorkflowTaskTimeout":10000000000,"Namespace":"default","Attempt":1,"WorkflowStartTime":"2022-06-10T10:02:04.471351284Z","CronSchedule":"","ContinuedExecutionRunID":"","ParentWorkflowNamespace":"","ParentWorkflowExecution":null,"Memo":null,"SearchAttributes":null,"RetryPolicy":{"InitialInterval":1000000000,"BackoffCoefficient":2,"MaximumInterval":100000000000,"MaximumAttempts":2,"NonRetryableErrorTypes":null},"BinaryChecksum":"c4f512bed6295ec6c12bdbcf5957f8ce"}}
app_1                     | 2022-06-10T10:08:25.730Z   DEBUG   temporal        sequenceID  {"before": 4}
app_1                     | 2022-06-10T10:08:25.730Z   DEBUG   temporal        sequenceID  {"after": 5}
app_1                     | 2022-06-10T10:08:25.730Z   DEBUG   temporal        workflow task started   {"time": "1s"}
app_1                     | 2022-06-10T10:08:25.731Z   DEBUG   temporal        outgoing message    {"id": 5, "data": "", "context": ""}
app_1                     | 2022-06-10T10:08:25.763Z   DEBUG   temporal        received message    {"command": {"name":"exec","options":{"ActivityID":"","TaskQueueName":"default","ScheduleToCloseTimeout":300000000000,"ScheduleToStartTimeout":0,"StartToCloseTimeout":0,"HeartbeatTimeout":0,"WaitForCancellation":false,"OriginalTaskQueueName":"","RetryPolicy":{"backoff_coefficient":2,"maximum_attempts":5}}}, "id": 9001, "data": "\n\ufffd\u0002\u0008\ufffdF\u0012\u000fExecuteActivity\u001a\ufffd\u0002{\"name\":\"exec\",\"options\":{\"TaskQueueName\":\"default\",\"ScheduleToCloseTimeout\":300000000000,\"ScheduleToStartTimeout\":0,\"StartToCloseTimeout\":0,\"HeartbeatTimeout\":0,\"WaitForCancellation\":false,\"ActivityID\":\"\",\"RetryPolicy\":{\"initial_interval\":null,\"backoff_coefficient\":2,\"maximum_interval\":null,\"maximum_attempts\":5,\"non_retryable_error_types\":[]}}}*\u0000\n\u001f\u0008\u0005*\u001b\n\u0019\n\u0017\n\u0008encoding\u0012\u000bbinary/null"}
app_1                     | 2022-06-10T10:08:25.763Z   DEBUG   temporal        received message    {"command": null, "id": 5, "data": "\n\ufffd\u0002\u0008\ufffdF\u0012\u000fExecuteActivity\u001a\ufffd\u0002{\"name\":\"exec\",\"options\":{\"TaskQueueName\":\"default\",\"ScheduleToCloseTimeout\":300000000000,\"ScheduleToStartTimeout\":0,\"StartToCloseTimeout\":0,\"HeartbeatTimeout\":0,\"WaitForCancellation\":false,\"ActivityID\":\"\",\"RetryPolicy\":{\"initial_interval\":null,\"backoff_coefficient\":2,\"maximum_interval\":null,\"maximum_attempts\":5,\"non_retryable_error_types\":[]}}}*\u0000\n\u001f\u0008\u0005*\u001b\n\u0019\n\u0017\n\u0008encoding\u0012\u000bbinary/null"}
app_1                     | 2022-06-10T10:08:25.763Z   DEBUG   temporal        workflow task started   {"time": "1s"}

At 10:08:25 Temporal woke up the workflow task worker with a timeout error.

rustatian commented 2 years ago

@Torrion Could you send me a wf history?

rustatian commented 2 years ago

workflow task worker with timeout error.

You have a test exception in your code and a limited number of retries. It is expected to get a timeout, because Temporal can't finish the WF.

Torrion commented 2 years ago

Here you are https://gist.github.com/Torrion/d1bab6a7ae41784550635551263154ed

You have a test exception in your code and a limited number of retries. It is expected to get a timeout, because Temporal can't finish the WF.

Look at this WF log: https://gist.github.com/Torrion/c6de20cd9d53ff93aeb725effa6844a0 There are exceptions in events 42, 66, 84, etc., but no timeout. Event 116 timed out for no reason, as far as I can see.

rustatian commented 2 years ago

@Torrion It's ok to get an exception from PHP. Temporal will then retry the execution if the exception is retryable. But during your execution you got too many of these exceptions (4):

RetryOptions::new()->withMaximumAttempts(5)

And timeout:

->withScheduleToCloseTimeout('5 minute')

You didn't attach a timestamp, but I guess, as you wrote before, it was 5 minutes from event 1 to event 116.

rustatian commented 2 years ago

Also, the initial issue contains a report of the Unable to kill workflow because workflow process was not found error. Do you see these errors in the logs?

rustatian commented 2 years ago

And Got the response to undefined request 9030 as well.

Torrion commented 2 years ago

You didn't attach a timestamp

Updated both logs with time

Also, the initial issue contains a report of the Unable to kill workflow because workflow process was not found error. Do you see these errors in the logs?

No

And Got the response to undefined request 9030 as well.

And also, no

I guess that as you wrote before it was 5 min from the event 1 to 116

I think different activity runs produced these errors, but it needs more investigation, and you're right, it looks like another problem. Also, this behavior occurred only when I used max_jobs instead of supervisor (which is more applicable for use with Temporal). I'll open a new issue if I find something.

However, the 5-minute sleep also occurred in tests with supervisor enabled. So the original problem is only partially solved, imho.

rustatian commented 2 years ago

I think there are different activity runs produced these errors

For sure, they are different, because you have a random error in your code: https://github.com/Torrion/temporal-worker-pool-leak-test/blob/main/app/src/Temporal/TestActivity.php#L18. Thus, your workflow execution has a problem: random errors. As I mentioned before, this is generally ok, but it is not good practice to have workflows with 100+ activities (and 1M in your case). It would be better to split them into smaller groups to let Temporal cache your executions.

I can guess that you have bigger limits for the supervisor, so the full resets did not happen very often, which let the workflow complete even with the test exception (using retries). You may achieve the same by increasing max_jobs to 2M, for example.

rustatian commented 2 years ago

So, the timeout in your case is entirely expected behavior. WF execution errors are expected as well. To finish the WF anyway, use a bigger RR pool limit (max_jobs: 200000000+) or a somewhat bigger ScheduleToClose timeout (like 9999 hours).

Torrion commented 2 years ago

Ok. I tried to launch a new test without exceptions, but got the same WorkflowTaskTimedOut and I don't understand why. If I just increase max_jobs/idle_ttl/exec_ttl/whatever, it only means that the workers will be reloaded during one of the next workflow executions, so this is not a solution.

I've updated code in

https://github.com/Torrion/temporal-worker-pool-leak-test

And here are the logs from when the error occurred on the 20th second of workflow execution: https://gist.github.com/Torrion/61bc6e857dec1fb9160a2ff2a6dfdfaf

16:41:32 - the workflow task is scheduled; 16:41:37 - it timed out. And there is no activity between these points.

rustatian commented 2 years ago

Ok. I tried to launch a new test without exceptions, but got the same WorkflowTaskTimedOut and I don't understand why.

Why do you specify restarting workers every 10s (idle)? -> https://github.com/Torrion/temporal-worker-pool-leak-test/blob/main/.rr.yaml#L16

Torrion commented 2 years ago

Why do you specify restarting workers every 10s (idle)?

For this test it doesn't matter, but in a real app an activity may open socket connections (to a DB, etc.), so they could break after being idle for a long time. Ok, I increased idle_ttl to 20s and the errors are gone. But I still don't understand how idle_ttl set on the activity workers affects the workflow worker. Shouldn't the workflow worker keep working regardless of what happens in the activity worker pool?
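
For reference, a sketch of where that setting would live, assuming idle_ttl is a supervisor option of the activities pool as in the sample repo (values illustrative):

temporal:
  activities:
    num_workers: 2
    supervisor:
      # how long a worker may sit idle before being restarted; a very small value
      # forces frequent pool resets, which is what triggered the timeouts above
      idle_ttl: 20s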

rustatian commented 2 years ago

@Torrion I tried your setup and don't have any timeouts.

rustatian commented 2 years ago

Ok, I increased idle_ttl to 20s and the errors are gone. Shouldn't the workflow worker keep working regardless of what happens in the activity worker pool?

idle_ttl resets the whole execution, because it restarts the RR workers together with the Temporal workers. Thus, some activities might time out during the reset.

Torrion commented 2 years ago

idle_ttl resets the whole execution

But that declaration applies to the activity workers, not the workflow workers. If some activity workers are idle but the workflow worker is not, why must it be rebooted? That's a bit confusing; it would be great if you documented this approach somewhere. One more thing, correct me please if I'm wrong: the WorkflowTaskTimedOut error in my logs means not a timeout of the execution itself, but that the workflow worker stopped responding to the Temporal server (because it is rebooting), and the server decided to rerun the workflow execution on another available worker.

rustatian commented 2 years ago

@Torrion

But that declaration applies to the activity workers, not the workflow workers. If some activity workers are idle but the workflow worker is not, why must it be rebooted?

You can manage only the activity workers from the configuration. The RR workflow worker is created internally and is not available for configuration. The WF worker is always on; there is no supervisor for it (it's a special type of worker). The only way to restart it is to kill it directly or to reset it via the reset command.

the WorkflowTaskTimedOut error in my logs means not a timeout of the execution itself

There is a term called sticky execution. When RR + Temporal executes your activities, Temporal puts your execution in the sticky queue. But when the idle_ttl or max_jobs limits are reached, RR purges the sticky WF cache, stops all workers, and starts everything again. At the same time, Temporal still believes there are workers dedicated to the previous sticky queue, and you see these timeouts. That is not a workflow timeout, but just a workflow task timeout (the task will be rescheduled to the normal queue and executed).

Torrion commented 2 years ago

WF worker is always on, there is no supervisor in it (it's a special type of worker). The only way to restart it - is to kill it directly or to reset it via the reset command.

But, when the idle_ttl or max_jobs limits are reached, RR purges the Sticky WF cache, stops all workers, and starts everything again.

I see a contradiction in these two statements. The question is why the unsupervised WF pool reacts to events in the supervised activity pool. In my view, they should be completely independent of each other. But a discussion about that would be off-topic.

In sum, the original issue is resolved. Everything works well on the test stand and on my project's staging setup. Production testing is scheduled for next week.

Thank you very much for quick answers and fixes!

rustatian commented 2 years ago

@Torrion

I see a contradiction in these two statements.

Yeah, I'll clarify my statement. I wanted to describe the general approach without going into many details. But, ok: inside the roadrunner-temporal plugin, we have three types of workers: the WF worker (only 1), activity workers (configurable by the user), and Temporal workers.

You can't restart these workers independently; that would lead to the errors shown in the bug description. The RR process supervisor tracks the conditions specified in the configuration (supervisor) and sends signals to the worker_watcher (idle_ttl, exec_ttl, max_memory, ttl). The worker watcher gracefully stops the process and sends an event about it. The roadrunner-temporal plugin listens for these process-stopped events and calls a Reset internally, because we need to keep the state synced. So, I guess you can now see why those two statements are not actually contradictory.

We will cover this in the docs, thank you for your questions.

If everything works well, feel free to close the bug 👍🏻