Scheduler node fix

ngageoint / scale

Processing framework for containerized algorithms

Apache License 2.0


My thoughts on what happens any time the scheduler is killed while jobs are running (migration, restart, etc.):

When the scheduler comes back up and hits this logic, it will be given a list of nodes from the DB that were running jobs before it went down. That list won't be accurate, since those jobs are lost.
Once the scheduler cleans up the lost jobs (marking them CANCELED or FAILED), this function will stop returning those nodes in that list. They will then be purged from scheduler memory and updated in the DB (is_active set to False), assuming they are gone and no longer offering resources.
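
To make the sequence above concrete, here is a rough sketch of the reconciliation I have in mind. It is not the actual Scale code: `Job`, `Node`, `nodes_with_running_jobs`, and `reconcile` are hypothetical names, the DB is modeled as a plain dict, and I am assuming a node is kept if it either still has a RUNNING job or is currently offering resources.

```python
from dataclasses import dataclass


@dataclass
class Job:
    node_id: int
    status: str  # e.g. "RUNNING", "CANCELED", "FAILED"


@dataclass
class Node:
    id: int
    is_active: bool = True


def nodes_with_running_jobs(jobs):
    """Hypothetical stand-in for the DB query: ids of nodes with RUNNING jobs."""
    return {job.node_id for job in jobs if job.status == "RUNNING"}


def reconcile(scheduler_nodes, db_nodes, jobs, offering):
    """Purge nodes that have no running jobs and are not offering resources.

    scheduler_nodes: set of node ids held in scheduler memory (mutated in place)
    db_nodes:        dict of node id -> Node, standing in for the DB table
    jobs:            list of Job records as currently stored
    offering:        set of node ids that are currently offering resources
    Returns the set of node ids that were purged.
    """
    keep = nodes_with_running_jobs(jobs) | offering
    stale = scheduler_nodes - keep
    for node_id in stale:
        db_nodes[node_id].is_active = False  # mark inactive in the "DB"
    scheduler_nodes.difference_update(stale)  # drop from scheduler memory
    return stale
```

Right after a restart the lost jobs are still marked RUNNING, so `reconcile` purges nothing. Once cleanup flips them to CANCELED or FAILED, the next call drops any node that is not offering resources.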
