rails / solid_queue

Database-backed Active Job backend
MIT License

Job worker randomly restarts? #213

Closed · krschacht closed this 2 months ago

krschacht commented 2 months ago

I've had Solid Queue set up and working for a while on a low-traffic app. But I sporadically get a job that takes a long time to complete (I just had one that took 7 minutes). When I check the logs, it looks like the job starts, then the worker gets killed and takes a while to restart. After it restarts, it picks up the job again and completes it quickly (in the usual 10 seconds). I'm struggling to figure out the root cause, and my UI depends on this job to update things, so the user is just left sitting there.

Notably, the pattern is the same: my app is warmed up and running, and I'm taking a series of user actions on the front-end, each of which triggers a job. It succeeds many times in a row, which I can tell because the UI updates quickly, and then the UI randomly stops updating. When I check the logs I see this weird restart behaviour.

My database is postgres. I'm deployed to Render. I'm running Solid Queue's supervisor together with Puma (using plugin :solid_queue).
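For reference, running the supervisor inside Puma just means enabling the plugin in config/puma.rb. In the excerpt below, everything except the plugin line is roughly the stock Rails default, shown only for context:

# config/puma.rb (excerpt; only the plugin line is specific to Solid Queue)
threads_count = ENV.fetch("RAILS_MAX_THREADS", 3)
threads threads_count, threads_count

port ENV.fetch("PORT", 3000)

# Run Solid Queue's supervisor inside the Puma process
plugin :solid_queue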

Here is the log output for a recent failure. I tried to delete lines that were definitely unrelated. Also, any lines beginning with "### " are simple puts messages from within my job:

[SolidQueue] Claimed 1 jobs
I, [2024-04-29T18:16:29.093278 #67]  INFO -- : [ActiveJob] [GetNextAIMessageJob] [33930e21-87b8-4173-9c6e-edc5e39ecf4b] Performing GetNextAIMessageJob (Job ID: 33930e21-87b8-4173-9c6e-edc5e39ecf4b) from SolidQueue(default) enqueued at 2024-04-29T18:16:29.024459609Z with arguments: 3357, 3
### GetNextAIMessageJob.perform(3357, 3, 1)
### Wait for reply
[SolidQueue] Restarting fork[67] (status: )
[SolidQueue] Starting Worker(pid=91, hostname=srv-cn7qcf7sc6pc73cbk1ug-64d5657b7b-4kdb6, metadata={:polling_interval=>0.1, :queues=>"*", :thread_pool_size=>5})
[1] - Worker 0 (PID: 103) booted in 0.0s, phase: 0
I, [2024-04-29T18:18:21.404711 #103]  INFO -- : [6ee21674-2654-445a-a387-2c64c24f786c] Started GET "/cable" for 23.112.39.144 at 2024-04-29 18:18:21 +0000
I, [2024-04-29T18:18:21.413428 #103]  INFO -- : [6ee21674-2654-445a-a387-2c64c24f786c] Started GET "/cable" [WebSocket] for 23.112.39.144 at 2024-04-29 18:18:21 +0000
I, [2024-04-29T18:18:21.413564 #103]  INFO -- : [6ee21674-2654-445a-a387-2c64c24f786c] Successfully upgraded to WebSocket (REQUEST_METHOD: GET, HTTP_CONNECTION: Upgrade, HTTP_UPGRADE: websocket)
I, [2024-04-29T18:18:21.514497 #103]  INFO -- : Turbo::StreamsChannel is streaming from Z2lkOi8vaG9zdGVkLWdwdC9Db252ZXJzYXRpb24vMTI1
[SolidQueue] Pruning dead process 360 - {"polling_interval"=>0.1, "queues"=>"*", "thread_pool_size"=>5}
[SolidQueue] Claimed 1 jobs
I, [2024-04-29T18:23:10.309082 #91]  INFO -- : [ActiveJob] [GetNextAIMessageJob] [33930e21-87b8-4173-9c6e-edc5e39ecf4b] Performing GetNextAIMessageJob (Job ID: 33930e21-87b8-4173-9c6e-edc5e39ecf4b) from SolidQueue(default) enqueued at 2024-04-29T18:16:29.024459609Z with arguments: 3357, 3
### GetNextAIMessageJob.perform(3357, 3, 1)
### Wait for reply
### Finished GetNextAIMessageJob.perform(3357, 3)
I, [2024-04-29T18:23:24.835017 #91]  INFO -- : [ActiveJob] [GetNextAIMessageJob] [33930e21-87b8-4173-9c6e-edc5e39ecf4b] Performed GetNextAIMessageJob (Job ID: 33930e21-87b8-4173-9c6e-edc5e39ecf4b) from SolidQueue(default) in 14525.39ms
I, [2024-04-29T18:23:25.344451 #91]  INFO -- : [ActiveJob] Enqueued Turbo::Streams::BroadcastStreamJob (Job ID: b0083a97-908d-4568-85ec-1ebceb374adf) to SolidQueue(default) with arguments: "Z2lkOi8vaG9zdGVkLWdwdC9Db252ZXJzYXRpb24vMTI1", {:content=>"<turbo-stream action=\"refresh\"></turbo-stream>"}
[SolidQueue] Claimed 1 jobs
I, [2024-04-29T18:23:25.365729 #91]  INFO -- : [ActiveJob] [Turbo::Streams::BroadcastStreamJob] [b0083a97-908d-4568-85ec-1ebceb374adf] Performing Turbo::Streams::BroadcastStreamJob (Job ID: b0083a97-908d-4568-85ec-1ebceb374adf) from SolidQueue(default) enqueued at 2024-04-29T18:23:25.335394484Z with arguments: "Z2lkOi8vaG9zdGVkLWdwdC9Db252ZXJzYXRpb24vMTI1", {:content=>"<turbo-stream action=\"refresh\"></turbo-stream>"}
I, [2024-04-29T18:23:25.366418 #91]  INFO -- : [ActiveJob] [Turbo::Streams::BroadcastStreamJob] [b0083a97-908d-4568-85ec-1ebceb374adf] Performed Turbo::Streams::BroadcastStreamJob (Job ID: b0083a97-908d-4568-85ec-1ebceb374adf) from SolidQueue(default) in 0.87ms
I, [2024-04-29T18:23:26.311872 #103]  INFO -- : [5082870c-319c-444c-b71e-c36f29e7804c] Started GET "/conversations/125/messages?version=1" for 108.162.245.166 at 2024-04-29 18:23:26 +0000
I, [2024-04-29T18:23:26.317496 #103]  INFO -- : [5082870c-319c-444c-b71e-c36f29e7804c] Processing by MessagesController#index as HTML
I, [2024-04-29T18:23:26.317594 #103]  INFO -- : [5082870c-319c-444c-b71e-c36f29e7804c]   Parameters: {"version"=>"1", "conversation_id"=>"125"}
[1] - Worker 1 (PID: 124) booted in 0.0s, phase: 0
I, [2024-04-29T18:23:28.218480 #103]  INFO -- : [5082870c-319c-444c-b71e-c36f29e7804c]   Rendered layout layouts/application.html.erb (Duration: 455.2ms | Allocations: 129565)
I, [2024-04-29T18:23:28.219448 #103]  INFO -- : [5082870c-319c-444c-b71e-c36f29e7804c] Completed 200 OK in 1902ms (Views: 448.0ms | ActiveRecord: 43.2ms | Allocations: 179138)
rosa commented 2 months ago

Hey @krschacht, how odd! The restart you see in the logs is the Supervisor realising that one of the workers has terminated and replacing it with another one: https://github.com/rails/solid_queue/blob/15408647f1780033dad223d3198761ea2e1e983e/lib/solid_queue/supervisor.rb#L150-L151. "Restarting" is the wrong word there, TBH, because it's actually replacing a terminated fork; I'll fix that in #208. In any case, what this indicates is that something outside Solid Queue is killing that worker 🤔 The lack of a status code in your logs:

[SolidQueue] Restarting fork[67] (status: )

makes me think this might be a SIGKILL being sent to the worker.
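You can see why with a tiny standalone Ruby snippet (illustrative only, nothing Solid Queue specific): a process killed by a signal reports no exit status at all, which is exactly what an empty "(status: )" looks like.

# Standalone illustration (not Solid Queue code): a fork killed via SIGKILL
# has no exit status, only a terminating signal.
pid = fork { sleep }            # stand-in for a worker process
Process.kill("KILL", pid)       # what an OOM killer or platform limit would do
_, status = Process.wait2(pid)

p status.exitstatus             # => nil (blank, as in your log line)
p status.signaled?              # => true
p status.termsig                # => 9 (SIGKILL)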

Is it possible you're running something that kills processes that go over certain time or memory usage?

krschacht commented 2 months ago

Hi @rosa, thank you so much for the quick response! I recently moved from Heroku over to Render and don't know this platform nearly as well, but I think your hunch is right: the worker got killed because the app exceeded its memory allowance. I have a ticket in with Render support to confirm, but that's my best read of my metrics. If that's the case, then this issue is on me and has nothing to do with Solid Queue. I think it's safe to close it out unless I get new clues that suggest otherwise.

While you're fixing the "Restarting ..." message, it might even be worth adding a suggested cause, and maybe a conditional on a blank exit status too. I'll add a note to the PR with a concrete suggestion.
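Something along these lines is what I have in mind; this is purely a hypothetical sketch (the method name and wording are mine, not what's in the PR):

# Hypothetical sketch: turn a Process::Status into a hint about why the fork died.
def termination_hint(status)
  if status.exitstatus.nil? && status.signaled?
    "terminated by signal #{status.termsig} (possibly killed externally, e.g. by an out-of-memory limit)"
  else
    "exited with status #{status.exitstatus}"
  end
end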

rosa commented 2 months ago

Going to close this one as I already merged https://github.com/rails/solid_queue/pull/208, with improved logging so this is not as obscure. Thanks again for raising this one, @krschacht!

morgoth commented 1 month ago

Just a note that I had to debug out-of-memory problems on k8s pods, and thanks to the improved logging, e.g.:

I, [2024-06-19T07:43:56.830679 #1]  INFO -- : SolidQueue-0.3.3 Replaced terminated Worker (3.7ms)  pid: 19, status: "no exit status set", pid_from_status: 19, signaled: true, stopsig: nil, termsig: 9, hostname: "processing-default-5db597f6bd-7t8cw"
I, [2024-06-19T07:43:56.878024 #58]  INFO -- : SolidQueue-0.3.3 Register Worker (46.3ms)  pid: 58, hostname: "processing-default-5db597f6bd-7t8cw"
I, [2024-06-19T07:43:56.880683 #58]  INFO -- : SolidQueue-0.3.3 Started Worker (49.4ms)  pid: 58, hostname: "processing-default-5db597f6bd-7t8cw", polling_interval: 0.5, queues: "default"

it was easier to find the cause. Thank you!