krschacht closed this issue 2 months ago
Hey @krschacht, how odd! The restart you see in the logs is the Supervisor realising one of the workers has terminated and replacing it with another one: https://github.com/rails/solid_queue/blob/15408647f1780033dad223d3198761ea2e1e983e/lib/solid_queue/supervisor.rb#L150-L151

"Restarting" is the wrong word there, TBH, because it's actually replacing a terminated fork. I'll fix that in #208. In any case, what this indicates is that something outside Solid Queue is killing that worker 🤔 The lack of a status code in your logs:

`[SolidQueue] Restarting fork[67] (status: )`

makes me think this might be a SIGKILL being sent to the worker. Is it possible you're running something that kills processes that exceed a certain time or memory limit?
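To see why a SIGKILL produces that blank status, here is a minimal standalone sketch (not Solid Queue code): fork a child, kill it with SIGKILL, and inspect the `Process::Status` the parent gets back. A signal-terminated process has no `exitstatus`, which is exactly why the interpolated `(status: )` comes out empty.

```ruby
# Simulate a worker being SIGKILLed (e.g. by an OOM killer) and inspect
# the Process::Status the supervising process receives.
pid = fork { sleep } # child just waits
Process.kill("KILL", pid) # external kill, as a platform memory limit would do
_, status = Process.wait2(pid)

puts status.exitstatus.inspect # nil  -- no normal exit status, hence "(status: )"
puts status.signaled?          # true -- terminated by a signal
puts status.termsig            # 9    -- SIGKILL
```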
Hi @rosa, thank you so much for the quick response! I recently moved from Heroku over to Render and I don't know this platform nearly as well, but I think your hunch is right that the app got killed because it exceeded its memory limit. I have a ticket in with Render support to confirm this, but that's my best read of my metrics. And indeed, if that's the case, then this issue is on me and has nothing to do with Solid Queue. I think it's safe to close this issue unless I get new clues that suggest otherwise.
While you're fixing the "Restarting ..." error message, it might be worth adding a suggested cause, and maybe a conditional on a blank `exitstatus` too. I'll add a note to the PR with a concrete suggestion.
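A hypothetical sketch of what such a message could look like (the method name and wording are illustrative, not Solid Queue's actual internals): branch on a blank `exitstatus` and hint at the likely cause.

```ruby
# Illustrative only: build a replacement-log message that distinguishes a
# normal exit from a signal kill, instead of interpolating a nil status.
def replaced_fork_message(pid, status)
  if status.exitstatus
    "Replaced terminated fork #{pid} (exit status: #{status.exitstatus})"
  elsif status.signaled?
    "Replaced terminated fork #{pid} (killed by SIG#{Signal.signame(status.termsig)}; " \
    "possibly an external limit, e.g. an OOM killer)"
  else
    "Replaced terminated fork #{pid} (no exit status)"
  end
end
```

For a worker killed with signal 9 this would log "killed by SIGKILL" rather than an empty "(status: )".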
Going to close this one as I already merged https://github.com/rails/solid_queue/pull/208, with improved logging so this is not as obscure. Thanks again for raising this one, @krschacht!
Just a note that I had to debug problems with not enough memory on the k8s pods. Thanks to the improved logging, i.e.:

```
I, [2024-06-19T07:43:56.830679 #1] INFO -- : SolidQueue-0.3.3 Replaced terminated Worker (3.7ms) pid: 19, status: "no exit status set", pid_from_status: 19, signaled: true, stopsig: nil, termsig: 9, hostname: "processing-default-5db597f6bd-7t8cw"
I, [2024-06-19T07:43:56.878024 #58] INFO -- : SolidQueue-0.3.3 Register Worker (46.3ms) pid: 58, hostname: "processing-default-5db597f6bd-7t8cw"
I, [2024-06-19T07:43:56.880683 #58] INFO -- : SolidQueue-0.3.3 Started Worker (49.4ms) pid: 58, hostname: "processing-default-5db597f6bd-7t8cw", polling_interval: 0.5, queues: "default
```
it was easier to find the cause. Thank you!
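The `termsig: 9` field in logs like those is the key clue; Ruby's `Signal.signame` translates the signal number into a name:

```ruby
# Decode the termsig number reported in a "Replaced terminated Worker" log line.
puts Signal.signame(9)  # prints "KILL" (SIGKILL -- what an OOM killer typically sends)
puts Signal.signame(15) # prints "TERM" (SIGTERM -- a graceful shutdown request)
```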
I've had Solid Queue set up and working for a while on a low-traffic app. But I sporadically get a job that takes a long time to complete (I just had one that took 7 minutes). When I check the logs, it looks like the job starts, then the worker gets killed and takes a while to restart. After it restarts, it picks up the job again and completes it quickly (the usual 10 seconds). I'm struggling to figure out the root cause of this, since my UI needs this job to update things, so the user is just sitting there waiting.
Notably, the pattern is the same: my app is warmed up and running. I'm taking a series of user actions on the front end, each of which triggers a job. It succeeds many times in a row, which I can tell because the UI updates quickly, and then randomly the UI stops updating. I check the logs and see it's doing this weird restart thing.
My database is Postgres. I'm deployed to Render. I'm running Solid Queue's supervisor together with Puma (using `plugin :solid_queue`). Here is the log output for a recent failure. I tried to delete lines that were definitely unrelated. Also, any lines beginning with "### " are simple `puts` messages from within my job: