rails / solid_queue

Database-backed Active Job backend
MIT License

How to recover in-process yet abandoned jobs? #159

Open wflanagan opened 9 months ago

wflanagan commented 9 months ago

I have got some load on SolidQueue in my production app.

Our workers run in an environment where a worker can be killed at any time (cloud-type infrastructure). Because of this, over time we accumulate jobs that show as "in process" even though the worker that was running them has died. So they have been "in process" for 13 days, etc.

I'm able to query the jobs and find the ones that are in process but not assigned to any currently active worker.

current_worker_ids = SolidQueue::Process.where(kind: "Worker").pluck(:id)
SolidQueue::Job.joins(:claimed_execution)
               .where(finished_at: nil)
               .where.not(claimed_execution: { process_id: current_worker_ids })
               .where_assoc_not_exists(:failed_execution)

I've built a method to try to recover the jobs.

    def requeue_abandoned!
      count = 0
      total_to_queue = abandoned_in_progress_jobs_count
      logger.info "Requeuing #{total_to_queue} abandoned jobs"
      abandoned_in_progress_jobs.find_each do |job|
        job.claimed_execution.delete
        schedule = SolidQueue::ScheduledExecution.create_or_find_by!(job_id: job.id)
        schedule.update!(scheduled_at: Time.zone.now)
        count += 1
        logger.info "Requeued #{count} of #{total_to_queue} jobs" if count % 100 == 0
      end
      logger.info "Requeued #{count} of #{total_to_queue} jobs"
      true
    end

As you can see, it deletes the claimed execution. Then, it tries to find the scheduled execution and set its time to now to make it ready.

This seems to work. BUT it throws a nasty error, and now 0 of my workers are working.


2024-02-21 07:28:01.643 | /usr/local/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/connection_adapters/postgresql_adapter.rb:894:in `exec_params': ERROR:  duplicate key value violates unique constraint "index_solid_queue_claimed_executions_on_job_id" (PG::UniqueViolation)
2024-02-21 07:28:01.643 | DETAIL:  Key (job_id)=(109233) already exists.

So I'm wondering whether I've misunderstood, from reading the docs, how these executions work. It COULD be that this error is unrelated to what I did above (which I ran for about 5,000 jobs), but I'd guess it's related.

The documentation on the executions and such is pretty sparse, and I'm not sure I really "got" how this works. Any pointers to the docs would help me get to the bottom of this.

I'd appreciate it.

rosa commented 9 months ago

Hey @wflanagan, we already have support for this situation that you describe:

If processes have no chance of cleaning up before exiting (e.g. if someone pulls a cable somewhere), in-flight jobs might remain claimed by the processes executing them. Processes send heartbeats, and the supervisor checks and prunes processes with expired heartbeats, which will release any claimed jobs back to their queues. You can configure both the frequency of heartbeats and the threshold to consider a process dead. See the section below for this.

When you start a new supervisor, or simply periodically, the supervisor checks for processes that have died and deregisters them. Deregistering a process releases its claimed executions. Also, claimed executions have their own release method. Yours doesn't work because the job is already in the system and you can't just enqueue it again.

You'd need to investigate why workers are getting deleted from the solid_queue_processes table without releasing their claimed executions. If done through the regular flow, that happens in a transaction, so there's no way they can leave the executions behind. Something else might be going on there.
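
For reference, both of those knobs are plain configuration options (defaults shown below; see the README's configuration section for the exact names in your version):

config.solid_queue.process_heartbeat_interval = 60.seconds # how often each process records a heartbeat
config.solid_queue.process_alive_threshold    = 5.minutes  # a process without a heartbeat for this long is considered dead and pruned, releasing its claimed jobs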

wflanagan commented 9 months ago

@rosa Hey, thanks for the response. So, we likely don't get a SIGINT or any other signal before the worker just terminates. This is because the node or the process has violated some constraint, and the infrastructure will just kill the process and respawn it. The link you point to seems to assume a clean termination. When you're running on, for example, Spot instances on AWS, you don't get any sort of clean signal. The process is just gone.

Are you saying it cleans up in this situation when there's not an opportunity for a graceful shutdown?

rosa commented 9 months ago

Are you saying it cleans up in this situation when there's not an opportunity for a graceful shutdown?

Yes, exactly. That's the situation this was designed for, and this is what I quoted from the link I provided, where both graceful and unclean termination are described:

If processes have no chance of cleaning up before exiting (e.g. if someone pulls a cable somewhere), in-flight jobs might remain claimed by the processes executing them. Processes send heartbeats, and the supervisor checks and prunes processes with expired heartbeats, which will release any claimed jobs back to their queues. You can configure both the frequency of heartbeats and the threshold to consider a process dead. See the section below for this.


wflanagan commented 9 months ago

Ok. Maybe it was the version. I had "running jobs" for 18 days that were connected to workers that were no longer present. That's what got me going on this. How often are they "reassigned"?

Other than my query code to find the working/missing ones, we're running a VERY stock SolidQueue setup: 1 queue, expecting it to "just work." I'll re-watch it all and see if it resolves.

rosa commented 9 months ago

Ok. Maybe it was the version.

This has been like this from the first version 😕

I had "running jobs" for 18 days that were connected to workers that were no longer present.

This is what seems wrong, and I think it happened because of something external. For Solid Queue, it's not possible to delete workers without deleting their assigned jobs, because that's done in a transaction via an after_destroy callback. If workers are deleted via destroy, no jobs would be left assigned. My guess is that you might have deleted the workers manually, leaving the claimed executions orphaned 🤔
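
Roughly, the relevant part of the Process model looks like this (a simplified sketch; the actual model in the repo is authoritative):

class SolidQueue::Process < SolidQueue::Record
  has_many :claimed_executions

  # destroy and its callbacks run inside one DB transaction, so the process row
  # and the release of its claimed executions go together or not at all
  after_destroy -> { claimed_executions.release_all }
end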

wflanagan commented 7 months ago

So, I'll throw this out here. I don't know what the cause would be, but the fact is that we get stranded jobs. We're running a pretty stock PostgreSQL.

CleanShot 2024-04-05 at 12 27 03@2x

And when I click one of these workers, I get the following in the UI.

CleanShot 2024-04-05 at 12 30 34@2x

rosa commented 7 months ago

@wflanagan, so, for Solid Queue, it's not possible to delete workers without deleting their assigned jobs because that's done in a transaction, via an after_destroy callback. If workers are deleted via destroy, no jobs would be left assigned. Is it possible for you to investigate what happened with worker with ID 594396? I imagine not, since it's been quite a while.

Also, if you want to clear things up and wait until this happens again so you have logs and can investigate, you can do that in your Rails console as:

SolidQueue::ClaimedExecution.left_joins(:process).where(process: { id: nil }).find_each(&:release)

It's going to take a bit because you have so many, but it should release all those jobs.

Alternatively, if you have more recent orphaned jobs like this, you might be able to find logs for those 🤔
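
If you want to see how many you have before releasing them, the same scope can simply be counted:

orphaned = SolidQueue::ClaimedExecution.left_joins(:process).where(process: { id: nil })
orphaned.count # claimed executions whose worker process record no longer exists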

wflanagan commented 7 months ago

So, when we run in kubernetes, or any sort of cloud infrastructure, the infrastructure can just "go away" in a mostly unceremonious way. We used to have this often in Amazon where we'd run on spot instances. Those just terminate and you're left with your WIP/InProcess jobs hanging out there.

That's what's happening here. The process goes over budget on memory or cpu requests, and when that happens, the process is killed.

It's my understanding from our previous conversation that you handle this case. But 1) I'm not seeing it, and 2) there might be a miscommunication.

Any cloud infrastructure could/would have something like this happen. It's a pretty typical scenario.

But, you're saying this is supposed to be cleaned up?

The count of these, relative to the total jobs performed, is low. BUT it's not 0. And, from our previous conversation, I stopped working on a way to reclaim these jobs.

Is that still the case?


rosa commented 7 months ago

Any cloud infrastructure could/would have something like this happen. It's a pretty typical scenario.

Yes, it can also happen in our case, which is why we handle this via heartbeats and have the supervisor clean up both when it starts and periodically.

But, you're saying this is supposed to be cleaned up?

Yes.

If processes have no chance of cleaning up before exiting (e.g. if someone pulls a cable somewhere), in-flight jobs might remain claimed by the processes executing them. Processes send heartbeats, and the supervisor checks and prunes processes with expired heartbeats, which will release any claimed jobs back to their queues. You can configure both the frequency of heartbeats and the threshold to consider a process dead. See the section below for this.

The processes are being deleted, at least, because you don't have them, so I'd like to know how processes are being deleted for you without their claimed executions being released as well. The process + its claimed executions get deleted together in a DB transaction, so something must be interfering with that.

wflanagan commented 7 months ago

First, let me say I appreciate your efforts on this. I know it can be thankless. I used to work on the Cequel gem.

The processes are being deleted, at least, because you don't have them, so I'd like to know how processes are being deleted for you without their claimed executions being released as well. The process + its claimed executions get deleted together in a DB transaction, so something must be interfering with that.

I don't quite understand what you're asking. We have a node running in our infrastructure that is running a worker process. That worker process is killed via the equivalent of kill -9 ${PID}. It's immediately terminated. Whatever is set in the DB table is still present.

What I understand is that the process 'record' and the execution are together in a transaction. But, the process that is executing that transaction just disappears. I am not a DB expert. But, what would roll this record back? There's nothing to do it. The process that was running it was killed.

The entire process in the worker is run with bundle exec rake solid_queue:start. This entire process gets killed. So, wouldn't that kill the 5 threads and the dispatcher?

FYI, we aren't specifically deleting any jobs or any records in the other tables.

One other thought. We are going through PgBouncer. But, don't know why that would be involved.

rosa commented 7 months ago

Thank you!

The entire process in the worker is run with bundle exec rake solid_queue:start. This entire process gets killed. So, wouldn't that kill the 5 threads and the dispatcher?

Yes, that's right. The process would be killed, and the DB records that correspond to these processes, that is, the rows in the solid_queue_processes table, would remain. Then, when you start the supervisor again via solid_queue:start, this code would run, which in turn runs this, and then this. Ultimately, that would run: https://github.com/rails/solid_queue/blob/108aea98f50b3956d061b8a3c6d7244061f96005/app/models/solid_queue/process.rb#L22-L27 and https://github.com/rails/solid_queue/blob/108aea98f50b3956d061b8a3c6d7244061f96005/app/models/solid_queue/process.rb#L12

So, whenever a process record gets destroyed (independently of when the actual, real process was killed), its associated claimed executions get deleted too, as part of the same DB transaction. In your case, you have process records that have somehow been deleted, but you still have claimed executions associated with those records. I'm not sure how this can happen and have never seen it happen, so this is what I'm trying to figure out; I'm wondering if you have any logs or any way to find out how the missing process records were deleted in the first place.

wflanagan commented 7 months ago

We have PG logs, which COULD be helpful or not. Tracking down the specific logs might be challenging though. I'll see if I can dig something up.

danwetherald commented 6 months ago

We are seeing this same behavior pretty consistently: jobs will get stuck in progress assigned to a worker that no longer exists, and completely restarting does not correct the job-to-worker assignments.

arjun810 commented 6 months ago

We are also seeing this sometimes, after killing the worker in dev. Something with the transaction seems to not be working as expected.

danwetherald commented 6 months ago

I believe this is basically what we are experiencing as well @arjun810

morgoth commented 6 months ago

Hey, just a mention that we see a similar issue on production k8s. The jobs were "in_progress" and assigned to a worker that didn't exist. Manual release worked in this case. solid_queue (0.3.1), ruby 3.3.1.

rosa commented 6 months ago

We've never encountered this, and I haven't been able to reproduce it, so I'm a bit lost on what might be happening. If anyone could grab DB logs from the moment the worker gets deleted, that'd be really useful to find and fix the root cause.

danwetherald commented 6 months ago

When developing, this might be caused by workers dying when your Mac hibernates and resumes. I noticed when coming back to my Mac this morning that the workers tab had a strange state of 1 / 0 in the count. My guess is that the worker was killed, but since it happened while the Mac was hibernating, it did not run any lifecycle events allowing Solid Queue to correct the issue. Restarting solid_queue prior to queuing any new jobs corrects the issue.

Again, just an idea to what is possibly happening.

rosa commented 6 months ago

Thanks a lot @dan003400! That's indeed likely, but in that case, the worker record in the solid_queue_processes table should have remained as well, not just the claimed jobs 😕

I'm trying to reproduce this by doing the following, using this job. First, I enqueue 10 of these, so that they take 1 minute when being performed:

>> 10.times { StoreResultJob.perform_later(42, pause: 60.seconds) }

Then I start Solid Queue with this config:

workers:
  - queues: background
    threads: 3
  - queues: default
    threads: 5

$ bin/rails solid_queue:start
SolidQueue-0.3.1 Register Supervisor (36.0ms)  pid: 63970, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Prune dead processes (19.8ms)  size: 0
SolidQueue-0.3.1 Register Dispatcher (28.7ms)  pid: 63985, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Dispatcher (29.7ms)  pid: 63985, hostname: "Rosas-MacBook-Air.local", polling_interval: 1, batch_size: 500, concurrency_maintenance_interval: 600, recurring_schedule: nil
SolidQueue-0.3.1 Register Worker (28.6ms)  pid: 63986, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Register Worker (28.7ms)  pid: 63987, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Worker (29.3ms)  pid: 63986, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "background", thread_pool_size: 3
SolidQueue-0.3.1 Started Worker (29.4ms)  pid: 63987, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "default", thread_pool_size: 5
SolidQueue-0.3.1 Unblock jobs (2.0ms)  limit: 500, size: 0

And now I kill the workers with a SIGKILL, so there's no chance to do any cleanup, and they're restarted by the supervisor:

$ kill -9 63986
$ kill -9 63987
SolidQueue-0.3.1 Replaced terminated Worker (5.6ms)  pid: 63986, status: "no exit status set", pid_from_status: 63986, signaled: true, stopsig: nil, termsig: 9, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Register Worker (58.1ms)  pid: 66314, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Worker (60.2ms)  pid: 66314, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "background", thread_pool_size: 3
SolidQueue-0.3.1 Replaced terminated Worker (1.1ms)  pid: 63987, status: "no exit status set", pid_from_status: 63987, signaled: true, stopsig: nil, termsig: 9, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Register Worker (39.1ms)  pid: 66653, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Worker (39.9ms)  pid: 66653, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "default", thread_pool_size: 5

Let's check claimed jobs and registered workers:

>> SolidQueue::ClaimedExecution.count
  SolidQueue::ClaimedExecution Count (4.1ms)  SELECT COUNT(*) FROM `solid_queue_claimed_executions`
=> 4

>> SolidQueue::ClaimedExecution.all.all? { |c| c.process.present? }
  SolidQueue::ClaimedExecution Load (2.3ms)  SELECT `solid_queue_claimed_executions`.* FROM `solid_queue_claimed_executions`
  SolidQueue::Process Load (1.4ms)  SELECT `solid_queue_processes`.* FROM `solid_queue_processes` WHERE `solid_queue_processes`.`id` = 7 LIMIT 1
  SolidQueue::Process Load (1.4ms)  SELECT `solid_queue_processes`.* FROM `solid_queue_processes` WHERE `solid_queue_processes`.`id` = 7 LIMIT 1
  SolidQueue::Process Load (0.6ms)  SELECT `solid_queue_processes`.* FROM `solid_queue_processes` WHERE `solid_queue_processes`.`id` = 7 LIMIT 1
  SolidQueue::Process Load (0.5ms)  SELECT `solid_queue_processes`.* FROM `solid_queue_processes` WHERE `solid_queue_processes`.`id` = 9 LIMIT 1
=> true

As expected, all claimed executions have a registered process. Note that 3 of them are still claimed by one of the killed workers (process ID 7):

=>
#<SolidQueue::Process:0x000000011ea90640
 id: 7,
 kind: "Worker",
 last_heartbeat_at: Wed, 29 May 2024 19:04:05.633951000 UTC +00:00,
 supervisor_id: 5,
 pid: 63986,
 hostname: "Rosas-MacBook-Air.local",
 metadata: {"polling_interval"=>0.1, "queues"=>"background", "thread_pool_size"=>3},
 created_at: Wed, 29 May 2024 19:03:05.603323000 UTC +00:00>

Now, let's kill the supervisor by sending a SIGKILL signal:

$ kill -9 63970

The workers and dispatcher realise the supervisor is gone, so they shut down in an orderly way and clean up:

SolidQueue-0.3.1 Deregister Worker (10.3ms)  process_id: 9, pid: 66314, hostname: "Rosas-MacBook-Air.local", last_heartbeat_at: Wed, 29 May 2024 19:07:26.976856000 UTC +00:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 9, kind: "Worker", last_heartbeat_at: "2024-05-29 19:07:26.976856000 +0000", supervisor_id: 5, pid: 66314, hostname: "Rosas-MacBook-Air.local", metadata: {"polling_interval"=>0.1, "queues"=>"background", "thread_pool_size"=>3}, created_at: "2024-05-29 19:04:26.869299000 +0000">, :pruned=>false, :claimed_size=>0}
SolidQueue-0.3.1 Shut down Worker (22.3ms)  pid: 66314, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "background", thread_pool_size: 3
SolidQueue-0.3.1 Release claimed jobs (3.3ms)  size: 0
SolidQueue-0.3.1 Deregister Worker (9.1ms)  process_id: 10, pid: 66653, hostname: "Rosas-MacBook-Air.local", last_heartbeat_at: Wed, 29 May 2024 19:07:35.982878000 UTC +00:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 10, kind: "Worker", last_heartbeat_at: "2024-05-29 19:07:35.982878000 +0000", supervisor_id: 5, pid: 66653, hostname: "Rosas-MacBook-Air.local", metadata: {"polling_interval"=>0.1, "queues"=>"default", "thread_pool_size"=>5}, created_at: "2024-05-29 19:04:35.894628000 +0000">, :pruned=>false, :claimed_size=>0}
SolidQueue-0.3.1 Shut down Worker (25.3ms)  pid: 66653, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "default", thread_pool_size: 5
SolidQueue-0.3.1 Release claimed jobs (9.0ms)  size: 0
SolidQueue-0.3.1 Deregister Dispatcher (25.4ms)  process_id: 6, pid: 63985, hostname: "Rosas-MacBook-Air.local", last_heartbeat_at: Wed, 29 May 2024 19:08:05.709912000 UTC +00:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 6, kind: "Dispatcher", last_heartbeat_at: "2024-05-29 19:08:05.709912000 +0000", supervisor_id: 5, pid: 63985, hostname: "Rosas-MacBook-Air.local", metadata: {"polling_interval"=>1, "batch_size"=>500, "concurrency_maintenance_interval"=>600}, created_at: "2024-05-29 19:03:05.602736000 +0000">, :pruned=>false, :claimed_size=>0}
SolidQueue-0.3.1 Shut down Dispatcher (49.1ms)  pid: 63985, hostname: "Rosas-MacBook-Air.local", polling_interval: 1, batch_size: 500, concurrency_maintenance_interval: 600, recurring_schedule: nil

We still have the 3 claimed executions by the worker that I killed individually:

>> SolidQueue::ClaimedExecution.count
  SolidQueue::ClaimedExecution Count (2.9ms)  SELECT COUNT(*) FROM `solid_queue_claimed_executions`
=> 3

But the worker record is still there as expected:

>> SolidQueue::ClaimedExecution.first.process
  SolidQueue::ClaimedExecution Load (3.6ms)  SELECT `solid_queue_claimed_executions`.* FROM `solid_queue_claimed_executions` ORDER BY `solid_queue_claimed_executions`.`id` ASC LIMIT 1
  SolidQueue::Process Load (1.3ms)  SELECT `solid_queue_processes`.* FROM `solid_queue_processes` WHERE `solid_queue_processes`.`id` = 7 LIMIT 1
=>
#<SolidQueue::Process:0x000000011ea9e240
 id: 7,
 kind: "Worker",
 last_heartbeat_at: Wed, 29 May 2024 19:04:05.633951000 UTC +00:00,
 supervisor_id: 5,
 pid: 63986,
 hostname: "Rosas-MacBook-Air.local",
 metadata: {"polling_interval"=>0.1, "queues"=>"background", "thread_pool_size"=>3},
 created_at: Wed, 29 May 2024 19:03:05.603323000 UTC +00:00>

Now, if I start the supervisor again, one of the first things it does is run the process prune, releasing claimed executions and deregistering the process with ID 7, whose heartbeat has already expired:

$ bin/rails solid_queue:start
SolidQueue-0.3.1 Register Supervisor (39.7ms)  pid: 78841, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Register Worker (28.8ms)  pid: 78857, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Worker (29.5ms)  pid: 78857, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "background", thread_pool_size: 3
SolidQueue-0.3.1 Register Worker (29.0ms)  pid: 78858, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Register Dispatcher (29.8ms)  pid: 78856, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Worker (29.9ms)  pid: 78858, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "default", thread_pool_size: 5
SolidQueue-0.3.1 Started Dispatcher (31.0ms)  pid: 78856, hostname: "Rosas-MacBook-Air.local", polling_interval: 1, batch_size: 500, concurrency_maintenance_interval: 600, recurring_schedule: nil
SolidQueue-0.3.1 Unblock jobs (3.0ms)  limit: 500, size: 0
SolidQueue-0.3.1 Release claimed job (18.7ms)  job_id: 4, process_id: 7
SolidQueue-0.3.1 Release claimed job (15.0ms)  job_id: 5, process_id: 7
SolidQueue-0.3.1 Release claimed job (6.6ms)  job_id: 6, process_id: 7
SolidQueue-0.3.1 Release claimed jobs (63.9ms)  size: 3
SolidQueue-0.3.1 Deregister Worker (70.2ms)  process_id: 7, pid: 63986, hostname: "Rosas-MacBook-Air.local", last_heartbeat_at: Wed, 29 May 2024 19:04:05.633951000 UTC +00:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 7, kind: "Worker", last_heartbeat_at: "2024-05-29 19:04:05.633951000 +0000", supervisor_id: 5, pid: 63986, hostname: "Rosas-MacBook-Air.local", metadata: {"polling_interval"=>0.1, "queues"=>"background", "thread_pool_size"=>3}, created_at: "2024-05-29 19:03:05.603323000 +0000">, :pruned=>true, :claimed_size=>3}
SolidQueue-0.3.1 Release claimed jobs (1.5ms)  size: 0
SolidQueue-0.3.1 Deregister Worker (5.4ms)  process_id: 8, pid: 63987, hostname: "Rosas-MacBook-Air.local", last_heartbeat_at: Wed, 29 May 2024 19:04:05.634434000 UTC +00:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 8, kind: "Worker", last_heartbeat_at: "2024-05-29 19:04:05.634434000 +0000", supervisor_id: 5, pid: 63987, hostname: "Rosas-MacBook-Air.local", metadata: {"polling_interval"=>0.1, "queues"=>"default", "thread_pool_size"=>5}, created_at: "2024-05-29 19:03:05.602737000 +0000">, :pruned=>true, :claimed_size=>0}
SolidQueue-0.3.1 Prune dead processes (111.3ms)  size: 2

Relevant lines:

SolidQueue-0.3.1 Release claimed job (18.7ms)  job_id: 4, process_id: 7
SolidQueue-0.3.1 Release claimed job (15.0ms)  job_id: 5, process_id: 7
SolidQueue-0.3.1 Release claimed job (6.6ms)  job_id: 6, process_id: 7
SolidQueue-0.3.1 Release claimed jobs (63.9ms)  size: 3
SolidQueue-0.3.1 Prune dead processes (111.3ms)  size: 2

I can't manage to leave solid_queue_claimed_executions without a corresponding solid_queue_processes record associated 😕

danwetherald commented 6 months ago

Coming back to my Mac today after it hibernated all night:

There is a job stuck in progress assigned to Worker 304. When you click through to the list of Workers, the only worker present is Worker 339 (PID: 10361). Restarting Solid Queue does not correct this, and the job never gets assigned to the new Worker 339; it will stay in progress forever with its Worker 304 assignment.

The only way I have been able to correct this in the past is by deleting the rows in solid_queue_claimed_executions table.

(screenshots attached)

rosa commented 6 months ago

@dan003400, no clue/logs about how worker 304 got deleted? 😕 Is it possible you have that in your log/development.log file? You could search for DELETE queries to the solid_queue_processes table, to see if you spot when the worker got deleted.
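
If grepping the log is awkward, an ad-hoc notification subscriber along these lines could also surface the query together with a backtrace (just a debugging sketch to drop into an initializer while you investigate, untested):

# config/initializers/debug_solid_queue_deletes.rb
ActiveSupport::Notifications.subscribe("sql.active_record") do |_name, _start, _finish, _id, payload|
  # Log any DELETE touching solid_queue_processes, plus what triggered it
  if payload[:sql].to_s.match?(/DELETE FROM .solid_queue_processes./i)
    Rails.logger.info("[sq-debug] #{payload[:sql]}\n#{caller.join("\n")}")
  end
end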

wflanagan commented 5 months ago

When developing, this might be caused by workers dying when your Mac hibernates and resumes. I noticed when coming back to my Mac this morning that the workers tab had a strange state of 1 / 0 in the count. My guess is that the worker was killed, but since it happened while the Mac was hibernating, it did not run any lifecycle events allowing Solid Queue to correct the issue. Restarting solid_queue prior to queuing any new jobs corrects the issue.

Again, just an idea to what is possibly happening.

This is running in production on Kubernetes. So, it's not a machine hibernating. Good thought though.

wflanagan commented 5 months ago

@rosa I am going to private message you a video of our process. I haven't done anything to clear them up yet, as I am hoping I can get some feedback and advice that can help us troubleshoot this case.

wflanagan commented 5 months ago

@rosa Related to this, I'm wondering: when ANY worker restarts, does it clean up any claimed executions whose worker is now missing and outside of the heartbeat window? Or is there some sort of assumption that a worker has a static identity, and the "rebooted" worker would pick up the same work?

Just brainstorming why it might be happening.

Because, if any worker cleans up any old execution records, then this shouldn't be happening.

rosa commented 5 months ago

Thanks @wflanagan!

when ANY worker restarts, does it clean up any claimed executions that are now missing and outside of the heartbeat window? Is there some sort of assumption that a worker has some sort of static logic, and the "rebooted" worker would pick up the same work?

No, not really. When a worker is deleted from the DB, it releases its claimed executions. A rebooted worker will have a different ID and will know nothing about other executions. The piece I'm missing here, basically, is how a worker record can get deleted without releasing its executions as well, which is what I haven't been able to reproduce in any way. The Process model in Solid Queue (which represents workers and dispatchers) has this callback:

after_destroy -> { claimed_executions.release_all }

So, both releasing claimed executions and deleting the worker should happen within a transaction, which is why I haven't been able to reproduce what's happening here: a worker (process) getting deleted, leaving all executions behind.

A way this could happen is that the worker gets deregistered from the DB while some executions are still being claimed, in a way that claimed_executions in the callback is empty because the claiming is happening in a parallel transaction that hasn't been committed yet. But this would mean the record in the DB is being deleted before the process is actually shut down, which can only happen if you delete these manually somehow, because we'd only deregister the process (delete the DB record) if:

I'm going to make some changes regardless to see if they help with this scenario, but it's tricky without being able to reproduce it or having an explanation why this is happening. If someone could find some logs for the missing workers being deleted, that'd shed some light for sure.

rosa commented 5 months ago

I think #244 should help with this.

wflanagan commented 5 months ago

@rosa You're a trooper. Thank you. I'll try it out and let you know.

wflanagan commented 5 months ago

@rosa I pointed at this repo and updated production. Two things: 1) the existing claimed executions that were previously there are still present, unfortunately, so at least in terms of cleaning up, that didn't seem to work; 2) I tried the command SolidQueue::ClaimedExecution.left_joins(:process).where(process: { id: nil }).find_each(&:release) per your statement back in April. This actually throws an error, which I'll list below.

SolidQueue::ClaimedExecution.left_joins(:process).where(process: { id: nil }).find_each(&:release)
irb(main):002> 
(irb):1:in `<main>': undefined method `id' for nil:NilClass (NoMethodError)

    SolidQueue.instrument(:release_claimed, job_id: job.id, process_id: process_id) do

So, it seems that job.id isn't found, so it can't release the execution.

wflanagan commented 5 months ago

I did more research. So, what's happening in my system is this.

The worker process is actually still there, but it's "idling." In Kubernetes, the pod was present. I went into the pod and found there was no work being done; it was idle.

When I killed the pod, it immediately released all the outstanding in-process jobs. So, this works. I guess the question becomes (and maybe it's related to my own infrastructure): why is the process still present and waiting, showing no activity, while the job doesn't perceive itself as completed?

I'm still working on it, but i wanted to give you the update.

rosa commented 5 months ago

So, It seems that the job.id isn't found. So, it can't release the process.

Hmmm... this doesn't seem possible, because there's a foreign key:

add_foreign_key "solid_queue_claimed_executions", "solid_queue_jobs", column: "job_id", on_delete: :cascade

so you can't have a row in the solid_queue_claimed_executions table that points to a missing record in solid_queue_jobs. I assume this is the reason the new branch's cleanup didn't work; it must have failed in the same way as when you ran this manually. But how you got to this state seems impossible from the DB point of view, unless you've perhaps skipped the foreign keys?

wflanagan commented 5 months ago

https://cln.sh/TyXXZ0b9

The foreign key is in the schema.

rosa commented 5 months ago

@wflanagan, and does this return any records for you? 😕

SolidQueue::ClaimedExecution.where.missing(:job)

wflanagan commented 4 months ago

@rosa No. Empty array.

espenbff commented 2 months ago

Hi, I have a similar problem to the one described above. My worker gets shut down, but the job gets stuck "In Progress". This happens when I run a job that runs for a while (1 minute or more) and does a lot of SQL and API calls.

This is the terminal output when it shuts down:

10:20:57 web.1 | SolidQueue-0.6.0 Error deregistering Worker (102.1ms)  process_id: 370, pid: 86548, hostname: "MacBook-Pro.local", name: "worker-65fdf60c40030f3e9ac0", last_heartbeat_at: "2024-09-02T08:19:55Z", claimed_size: nil, pruned: false, error: "ActiveRecord::ConnectionTimeoutError could not obtain a connection from the pool within 0.100 seconds (waited 0.101 seconds); all pooled connections were in use"
10:20:57 web.1 | SolidQueue-0.6.0 Shutdown Worker (5104.3ms)  pid: 86548, hostname: "MacBook-Pro.local", process_id: 370, name: "worker-65fdf60c40030f3e9ac0", polling_interval: 0.1, queues: "*", thread_pool_size: 1
10:20:57 web.1 | [ActiveJob] [FeedbackImports::NeedsFeedbackJob] [78cca2cc-3484-4ca8-a622-7ebdb46e2011] Performed FeedbackImports::NeedsFeedbackJob (Job ID: 78cca2cc-3484-4ca8-a622-7ebdb46e2011) from SolidQueue(default) in 12434.0ms
10:20:57 web.1 | /Users/.../.rvm/gems/ruby-3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:128:in `block in wait_poll': could not obtain a connection from the pool within 0.100 seconds (waited 0.101 seconds); all pooled connections were in use (ActiveRecord::ConnectionTimeoutError)

SolidQueue::ClaimedExecution.where.missing(:job) does not return anything.

The UI shows "Worker with id '370' not found".

If I restart my application, the job gets picked up again and runs until it finishes or is shut down again. I have tried tweaking thread counts and polling intervals in the solid_queue setup, as well as the pool size in my database.yml file, without any noticeable effect.

Hope this helps somewhat with the problem above. I can provide more information if needed.

rosa commented 2 months ago

@espenbff, thanks for writing this up 🙏 What's the size of your database connection pool? You'll find this in your database.yml configuration, as the pool option.

espenbff commented 2 months ago

It is set to 5: pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>. I tried one run with the value 20 as well, but I got the same error.

default: &default
  adapter: postgresql
  encoding: unicode
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 20 } %>

development:
  <<: *default
  database: name_development

I see I overwrite it with 5 anyway. I'll do a new test with 20.

rosa commented 2 months ago

Got it! Yes, it'll depend on how you have your workers configured as well (the number of threads you're using).

wflanagan commented 1 week ago

So, I'll state that this is better, maybe. But it's definitely still not solved. I'm wondering if there's some sort of callback diagram or something that will help us poke and prod the database without causing error escalations when we're trying to recover.

We're running into this now:

I, [2024-11-21T21:39:27.546444 #7]  INFO -- : SolidQueue-1.0.0 Shutdown Supervisor (0.4ms)  pid: 7, hostname: "audienti-worker-54c6d48f47-n5lsx", process_id: nil, name: "supervisor-579937326cb6fe89c75d"
rake aborted!
NoMethodError: undefined method `supervisees' for nil:NilClass (NoMethodError)

        if registered_process = process.supervisees.find_by(name: terminated_fork.name)
                                       ^^^^^^^^^^^^
/usr/local/bundle/ruby/3.2.0/gems/solid_queue-1.0.0/lib/solid_queue/supervisor.rb:174:in `handle_claimed_jobs_by'

When our jobs get shut down, or processed, this now throws errors, causing our system to thrash its workers.

I will admit, I'm getting to my wit's end. We love the idea of SolidQueue, but there have been a LOT of problems, and we still don't have them addressed.

What we are trying to do is this:

SolidQueue::Process.where(kind: "Worker").where("last_heartbeat_at < ?", 10.minutes.ago).each do |process|
  process.claimed_executions.each do |ce|
    ce.destroy!
  end
end

with the theory being that deleting the claimed execution will allow us to requeue it.

Help!

rosa commented 3 days ago

SolidQueue::Process.where(kind: "Worker").where("last_heartbeat_at < ?", 10.minutes.ago).each do |process|
  process.claimed_executions.each do |ce|
    ce.destroy!
  end
end

This will leave jobs in an inconsistent state, in limbo. I suspect all problems you have where things are in an inconsistent state might have to do with this kind of operation, where some data is deleted in a way that the system doesn't expect.

I'm wondering if there's some sort of callback diagram or something that will help us poke and prod the database without causing error escalations when we're trying to recover.

Just look at the code. It's really clear and rather small. Make sure you're on the latest version and just check the code.
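
For example, something along these lines would go through the release path instead (a sketch, not something you should need to run routinely; the release call I shared earlier does the equivalent for already-orphaned executions):

SolidQueue::Process.where(kind: "Worker")
                   .where("last_heartbeat_at < ?", 10.minutes.ago)
                   .find_each(&:destroy)  # after_destroy -> claimed_executions.release_all puts the jobs back in their queues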

wflanagan commented 2 days ago

I'll investigate again.

The thing is, I think that in the cloud, processes/pods/workers can "just disappear." I believe you stated earlier that the workers "should" get a SIGQUIT, then eventually might get SIGKILL'd.

Well, we might get a SIGQUIT or we might not. So, if cleanup is dependent on a graceful shutdown, this won't ever work in a cloud/Kubernetes environment, at least the way AWS, Google, or a private Kubernetes cloud does it.

I'll keep trying to figure this out, but it's been a problem since the beginning. We're really not doing anything special. We're just running a lot of worker processes with fairly long running times (1-3 minutes) in pods on Kubernetes. There's nothing magical about what we're doing. And we're on the most recent (1.0+) version.

rosa commented 2 days ago

I believe earlier that you stated that the workers "should" SIGQUIT then, eventually might get SIGKILL'd.

No, not exactly. The supervisor, upon being asked to terminate via TERM, will first send a TERM signal to the workers, then a QUIT if they haven't finished shutting down within the configured shutdown_timeout. This is all written in the README. Any KILL signals to the workers would come from outside. If the supervisor is the one killed, the workers will notice that and will also terminate in an orderly way.

if cleanup is dependent on a graceful shutdown

Cleanup will happen no matter what. The original issue here was jobs left abandoned, in limbo. No matter how a worker is terminated, jobs will be either released back to their queue or marked as failed. No jobs will ever be left abandoned.

In k8s you can use container lifecycle events to decide what to do when a pod is going to shut down. Sure, some might just crash in a bad way or simply disappear, but that shouldn't be the norm. You also mentioned deploys causing this, and in that case, there should always be a controlled termination.