ngageoint / scale

Processing framework for containerized algorithms
http://ngageoint.github.io/scale/
Apache License 2.0

500 Unschedulable Jobs #1801

Closed JohnPTobe closed 4 years ago

JohnPTobe commented 5 years ago

Description

If there are 500 unschedulable high-priority jobs in Scale, no lower-priority job can be scheduled. The scheduler grabs 500 jobs (QUEUE_LIMIT) and attempts to schedule them. If none of them can be scheduled, it does nothing and returns, only to get those same 500 jobs again the next time it runs.
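The starvation described above can be reproduced with a small, self-contained sketch. This is not Scale's scheduler code: apart from `QUEUE_LIMIT`, which the issue names, every name here (`QueuedJob`, `schedule_pass`, `free_gpus`) is hypothetical, and the sketch assumes that a lower priority number means higher priority and that a GPU is the only contended resource.

```python
from dataclasses import dataclass

QUEUE_LIMIT = 500  # the limit named in the issue


@dataclass
class QueuedJob:
    job_id: int
    priority: int          # assumption: lower number means higher priority
    needs_gpu: bool = False


def schedule_pass(queue, free_gpus=0):
    """Hypothetical pass: look only at the top QUEUE_LIMIT jobs by priority."""
    window = sorted(queue, key=lambda j: (j.priority, j.job_id))[:QUEUE_LIMIT]
    scheduled = []
    for job in window:
        if job.needs_gpu:
            if free_gpus == 0:
                continue   # unschedulable, but it still used up a window slot
            free_gpus -= 1
        scheduled.append(job)
    for job in scheduled:
        queue.remove(job)  # unscheduled jobs stay queued for the next pass
    return scheduled


# 500 GPU jobs at priority 1, then 10 small jobs at priority 5.
queue = [QueuedJob(i, priority=1, needs_gpu=True) for i in range(500)]
queue += [QueuedJob(500 + i, priority=5) for i in range(10)]

first = schedule_pass(queue)   # no GPU is free, so nothing fits
second = schedule_pass(queue)  # the same 500 GPU jobs fill the window again
print(len(first), len(second), len(queue))
```

In this sketch both passes schedule nothing and all 510 jobs remain queued: the 10 small jobs never enter the 500-job window, even though they would fit on a node.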

Reproduction Steps

Steps to reproduce the problem:

  1. Create a job type that requests a GPU.
  2. Use a cluster where only one node has a GPU resource, and have something other than Scale use that GPU. Scale should see that the GPU will theoretically be available at some point.
  3. Create 500 jobs of the GPU job type.
  4. Create other jobs that easily fit on nodes, some with lower priority than the GPU jobs and some with higher priority.
  5. Observe that the low-priority jobs never run, despite space being available on nodes.

Expected behavior

Scale should not return until it has scheduled 500 jobs, or it should skip jobs that have not been schedulable for a while. Unschedulable jobs should not count towards the limit of 500 jobs.
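One way to read the expected behavior is that the limit should count scheduled jobs, not examined jobs. The following is a minimal sketch of that idea, not a patch against Scale: `QueuedJob`, `schedule_pass`, and the `can_schedule` callback are hypothetical names, and the queue is assumed to be already ordered by priority.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

QUEUE_LIMIT = 500  # the limit named in the issue


@dataclass(frozen=True)
class QueuedJob:
    job_id: int
    priority: int          # assumption: lower number means higher priority
    needs_gpu: bool = False


def schedule_pass(
    ordered_queue: Iterable[QueuedJob],
    can_schedule: Callable[[QueuedJob], bool],
    limit: int = QUEUE_LIMIT,
) -> List[QueuedJob]:
    """Walk the queue in priority order; only scheduled jobs count to the limit."""
    scheduled: List[QueuedJob] = []
    for job in ordered_queue:
        if len(scheduled) >= limit:
            break
        if can_schedule(job):
            scheduled.append(job)
        # Unschedulable jobs are skipped and do not consume a slot.
    return scheduled


# 500 GPU jobs at priority 1, then 10 small jobs at priority 5.
jobs = [QueuedJob(i, priority=1, needs_gpu=True) for i in range(500)]
jobs += [QueuedJob(500 + i, priority=5) for i in range(10)]
ordered = sorted(jobs, key=lambda j: (j.priority, j.job_id))

# Assumption for the demo: no GPU is free, so only non-GPU jobs fit.
scheduled = schedule_pass(ordered, can_schedule=lambda j: not j.needs_gpu)
print(len(scheduled))
```

With this variant the 10 small jobs are scheduled even though 500 unschedulable GPU jobs sit ahead of them. The trade-off is that a pass may have to walk far more than 500 queue entries, so a real implementation would probably also cap the number of jobs examined or skip jobs that have been unschedulable for some time, as the issue suggests.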