Open gczh opened 1 week ago
> Locally, it seems that the recurring job is getting enqueued but never executing.
Do you have local logs for what your Solid Queue worker is doing?
Looks like the workers processing these jobs are crashing or being killed somehow. Can you access logs in that instance to see what it might be?
@rosa I'm trying that now. Looking at the docs to see how I could mute the `solid_queue` logs as they're drowning out my local logs 😆
The `config.solid_queue.silence_polling` option is set to `true` by default, so you shouldn't see the polling queries (which are usually the noisiest). You can also set a separate logger for Solid Queue via `config.solid_queue.logger`, which defaults to `Rails.logger`. You can use a higher log level there, such as:
```ruby
config.solid_queue.logger = ActiveSupport::Logger.new(STDOUT, level: :info)
```
Or if you want to mute the Solid Queue logs completely, you can set:
```ruby
config.solid_queue.logger = ActiveSupport::Logger.new(nil)
```
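For reference, both settings can sit together in an environment config file. This is a sketch, and the file path is an assumption based on a standard Rails layout:

```ruby
# config/environments/development.rb (assumed location)
Rails.application.configure do
  # Polling queries are already silenced by default; shown here for clarity.
  config.solid_queue.silence_polling = true

  # Give Solid Queue its own logger at :info so job lifecycle events stay
  # visible without drowning out the app's local logs.
  config.solid_queue.logger = ActiveSupport::Logger.new(STDOUT, level: :info)
end
```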
@rosa thanks for those! I managed to use the logs to figure out what was happening.
What happened was that the job was trying to find records to sync, but there weren't any (due to a faulty where clause).

However, it's still odd that the job would throw a `SolidQueue::Processes::ProcessExitError` instead of being marked as completed.
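For context, the fix for the faulty query boils down to guarding the enqueue on an empty result set. A minimal plain-Ruby sketch of that control flow (the method name and return values are invented for illustration; the real code would live inside the ActiveJob class):

```ruby
# Sketch: only enqueue a batch when the (previously faulty) query actually
# returns record ids; an empty result should just complete quietly.
def enqueue_sync_batch(record_ids)
  return :nothing_to_sync if record_ids.empty?
  # In the real app this would call SyncBatchJob.perform_later(record_ids).
  :enqueued
end
```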
> However, it's still odd that the job would throw a `SolidQueue::Processes::ProcessExitError` instead of being marked as completed.
Yes, that's what would happen. `SolidQueue::Processes::ProcessExitError` is not thrown by the job. It's raised by Solid Queue when the worker that's processing the job terminates ungracefully. In the case above, the worker terminated with an error exit code (1). Something external to Solid Queue is most likely killing that worker, but you might not see that locally because it might happen only in your Render instance. That's where you need to investigate to find out why the worker is crashing like that.
@rosa When we're deploying a new version of our app, we kill off the queue by running `sudo systemctl restart solid_queue`, which effectively runs a `/bin/kill -TSTP $MAINPID` command that triggers a `SolidQueue::Processes::ProcessExitError` for any jobs that are currently executing.
Is there a more graceful way to restart the queue that we should be using? We have all the processes (supervisor, scheduler, workers, etc.) running on one machine.

I'd appreciate your thoughts on graceful shutdown of executing jobs, especially as we have some long-running ones.
Hey @dglancy! Could you send a `TERM` or `INT` signal to the process as described here? That, combined with `SolidQueue.shutdown_timeout`, will try a graceful termination first: workers won't pick up any more jobs and will try to finish what they're doing. If they can't finish in time, the jobs will be released back to the queue and shouldn't fail with a `ProcessExitError`.
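If the worker runs under systemd, one way to make `systemctl restart` cooperate with this is to have it deliver `SIGTERM` and wait longer than Solid Queue's shutdown timeout before force-killing. A sketch, where the unit name, paths, and numbers are assumptions for illustration:

```ini
# /etc/systemd/system/solid_queue.service (hypothetical unit)
[Service]
Type=simple
WorkingDirectory=/var/www/app/current
ExecStart=/usr/local/bin/bundle exec rake solid_queue:start
# Stop/restart delivers SIGTERM so Solid Queue can attempt a graceful
# shutdown instead of the TSTP-based kill described above.
KillSignal=SIGTERM
# Give in-flight jobs time to finish; keep this above SolidQueue.shutdown_timeout.
TimeoutStopSec=40
```

With something like this in place, `sudo systemctl restart solid_queue` sends `TERM`, workers stop picking up new jobs, and anything still running past the timeout should be released back to the queue rather than failing.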
I set up my `solid_queue.yml` to run a recurring task every 30s that checks for records to sync. The `SyncRecordsJob` does a `SyncBatchJob.perform_later(record_ids)`, and it's expected to take some time to run as it has to process quite a number of records (approx. 30s - 1 min).

This has been deployed to Render on a $25/mo instance with 1 CPU and 2GB RAM. Initially, on deploy, some of the jobs execute successfully; after some time, `SyncRecordsJob` accumulates in the In Progress list and seems to never process. They don't seem to accumulate anymore either.

~~Locally, these jobs run well and don't seem to have any issues.~~ Locally, it seems that the recurring job is getting enqueued but never executing.

Here's what I see from Mission Control in production for the jobs that failed:
Questions
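For reference, a recurring task like the one described above could be declared roughly like this. This is a sketch: the task key and file location are assumptions (recent Solid Queue versions read `config/recurring.yml`, while older setups declare recurring tasks inside the queue config itself):

```yaml
# config/recurring.yml (assumed location and task key)
production:
  sync_records:
    class: SyncRecordsJob
    schedule: every 30 seconds
```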