ngageoint / scale

Processing framework for containerized algorithms
http://ngageoint.github.io/scale/
Apache License 2.0

Scheduler Logic and Finding Nodes To Schedule On #1851

Closed: emimaesmith closed this issue 4 years ago

emimaesmith commented 4 years ago

Description: Updates to the scheduler logic intended to fix #1801 have made job scheduling inconsistent. Debugging logs indicate this could be a problem with finding suitable resources on which to schedule new job executions.

Reproduction Steps:

  1. Deploy the scale code from the scheduler branch
  2. Watch as the queue grows but nothing new is scheduled
  3. If you look at the scheduler logs, you'll see tasks being starved of resources or failing to start within 2 minutes.

Expected behavior: The queue should be populated up to its QUEUE_LIMIT. If any job type is found to be unschedulable (due to insufficient resources, scheduled limits being met, etc.), all jobs of that type should be skipped for the rest of that scheduling cycle.
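To make the expected skipping behavior concrete, here is a minimal sketch of one scheduling cycle. It is not Scale's scheduler code: `QueuedJob`, `schedule_cycle`, the single `cpus` resource, and the reading of `queue_limit` as the maximum number of queue entries considered per cycle are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a queued job execution (not Scale's actual model).
QueuedJob = namedtuple("QueuedJob", ["job_id", "job_type", "cpus"])


def schedule_cycle(queue, available_cpus, queue_limit):
    """Run one illustrative scheduling cycle.

    Considers at most queue_limit queued jobs (an assumed meaning of
    QUEUE_LIMIT). Once a job type turns out to be unschedulable, every
    later job of that type is skipped for the rest of the cycle.
    """
    scheduled = []
    skipped_types = set()
    for job in queue[:queue_limit]:
        if job.job_type in skipped_types:
            # This type was already found unschedulable in this cycle.
            continue
        if job.cpus > available_cpus:
            # Not enough resources: mark the whole type as skipped.
            skipped_types.add(job.job_type)
            continue
        available_cpus -= job.cpus
        scheduled.append(job)
    return scheduled, skipped_types


queue = [
    QueuedJob(1, "type-a", 4.0),
    QueuedJob(2, "type-b", 1.0),
    QueuedJob(3, "type-a", 1.0),
    QueuedJob(4, "type-b", 1.0),
]
scheduled, skipped = schedule_cycle(queue, available_cpus=2.0, queue_limit=10)
print([job.job_id for job in scheduled], skipped)
```

In this sketch job 3 is skipped even though it would fit, because its type was already marked unschedulable earlier in the same cycle. That matches the behavior described above, but whether the real scheduler should be that strict is a design choice.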

Additional context: Added logging statements indicate that jobs are not being matched properly to nodes. Most node-to-job scores come back as 0 even though many of the reported nodes have no active jobs running on them. As a result, the queue grows indefinitely and only system jobs (such as ingests and strikes) are performed. Possibly related to #1825.
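For anyone debugging similar symptoms, the sketch below shows one way a score of 0 can come back for an idle node, and how logging the limiting resource makes that visible. It is a hypothetical scoring function, not the one in Scale: `score_node`, the resource names, and the "number of copies of the job that fit" scoring rule are assumptions.

```python
import logging

logger = logging.getLogger(__name__)


def score_node(node_free, job_needs):
    """Hypothetical score: how many copies of the job fit on the node.

    node_free and job_needs map resource names to amounts. A score of 0
    means the job does not fit, and the limiting resource is logged.
    """
    score = None
    for resource, needed in job_needs.items():
        if needed <= 0:
            # A resource the job does not actually need cannot limit it.
            continue
        free = node_free.get(resource, 0.0)
        fits = int(free // needed)
        if fits == 0:
            logger.debug(
                "score=0: job needs %s of %r but node reports %s free",
                needed, resource, free,
            )
            return 0
        score = fits if score is None else min(score, fits)
    # A job with no positive requirements fits anywhere.
    return 1 if score is None else score


job_needs = {"cpus": 1.0, "mem": 512.0}
idle_node = {"cpus": 4.0, "mem": 1024.0}
# Idle node whose offer is missing the "mem" resource entirely.
incomplete_node = {"cpus": 4.0}

print(score_node(idle_node, job_needs))
print(score_node(incomplete_node, job_needs))
```

Under these assumptions, a node with no running jobs still scores 0 if any required resource is absent from what it reports, which is one pattern worth ruling out when most scores are 0.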

emimaesmith commented 4 years ago

I have not been seeing this issue recently. I'll re-open if it pops up again.