Description
The scheduler code silently skips queueing jobs that require the sharedmem resource. We should either:
A. Log that those jobs will not be scheduled because they require that resource, or
B. Add sharedmem to the invalid_resources list so the job type is flagged as having invalid resources.
scheduler/scheduling/manager.py:
def _process_queue(self, nodes, job_types, job_type_limits, job_type_resources, workspaces):
    """Retrieves the top of the queue and schedules new job executions on available nodes as resources and limits
    allow
    :param nodes: The dict of scheduling nodes stored by node ID for all nodes ready to accept new job executions
    :type nodes: dict
    :param job_types: The dict of job type models stored by job type ID
    :type job_types: dict
    :param job_type_limits: The dict of job type IDs mapping to job type limits
    :type job_type_limits: dict
    :param job_type_resources: The list of all of the job type resource requirements
    :type job_type_resources: list
    :param workspaces: A dict of all workspaces stored by name
    :type workspaces: dict
    :returns: The list of queued job executions that were scheduled
    :rtype: list
    """
    scheduled_job_executions = []
    ignore_job_type_ids = self._calculate_job_types_to_ignore(job_types, job_type_limits)
    started = now()
    max_cluster_resources = resource_mgr.get_max_available_resources()
    for queue in Queue.objects.get_queue(scheduler_mgr.config.queue_mode, ignore_job_type_ids)[:QUEUE_LIMIT]:
        job_exe = QueuedJobExecution(queue)
        ...
        invalid_resources = []
        insufficient_resources = []
        # get resource names offered and compare to job type resources
        for resource in job_exe.required_resources.resources:
            # skip sharedmem
            if resource.name.lower() == 'sharedmem':
                continue
            if resource.name not in max_cluster_resources._resources:
                # resource does not exist in cluster
                invalid_resources.append(resource.name)
            elif resource.value > max_cluster_resources._resources[resource.name].value:
                # resource exceeds the max available from any node
                insufficient_resources.append(resource.name)
        if invalid_resources:
            description = INVALID_RESOURCES.description % invalid_resources
            scheduler_mgr.warning_active(warning, description)
        if insufficient_resources:
            description = INSUFFICIENT_RESOURCES.description % insufficient_resources
            scheduler_mgr.warning_active(warning, description)
        if invalid_resources or insufficient_resources:
            invalid_resources.extend(insufficient_resources)
            jt.unmet_resources = ','.join(invalid_resources)
            jt.save(update_fields=["unmet_resources"])
            continue
        else:
            # reset unmet_resources flag
            jt.unmet_resources = None
            scheduler_mgr.warning_inactive(warning)
            jt.save(update_fields=["unmet_resources"])
        ...
    return scheduled_job_executions
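For discussion, here is a rough, standalone sketch of how options A and B might look if the inner resource loop were pulled out into a helper. Everything outside the excerpt above is an assumption made for illustration: the check_resources function, the Resource tuple, the flag_sharedmem switch, the plain dict standing in for max_cluster_resources._resources, and the wording of the log message. It is not the actual scheduler code.

```python
import logging
from collections import namedtuple

logger = logging.getLogger(__name__)

# Hypothetical stand-in for the scheduler's resource objects (name plus value)
Resource = namedtuple('Resource', ['name', 'value'])


def check_resources(job_type_name, required, max_available, flag_sharedmem=False):
    """Classifies the required resources against the cluster maximums

    :param job_type_name: The job type name, used only in the log message
    :param required: Iterable of Resource tuples that the job requires
    :param max_available: Dict mapping resource name to the max value offered by any node
    :param flag_sharedmem: If True, apply option B and report sharedmem as invalid
    :returns: Tuple of (invalid_resources, insufficient_resources)
    """
    invalid_resources = []
    insufficient_resources = []
    for resource in required:
        if resource.name.lower() == 'sharedmem':
            if flag_sharedmem:
                # Option B: surface sharedmem through the existing invalid-resources warning
                invalid_resources.append(resource.name)
            else:
                # Option A: keep skipping the check, but leave a trace in the log
                logger.warning('Job type %s requires resource %s, which the scheduler skips',
                               job_type_name, resource.name)
            continue
        if resource.name not in max_available:
            # resource does not exist in cluster
            invalid_resources.append(resource.name)
        elif resource.value > max_available[resource.name]:
            # resource exceeds the max available from any node
            insufficient_resources.append(resource.name)
    return invalid_resources, insufficient_resources
```

With option B the sharedmem name would flow into the existing INVALID_RESOURCES warning and the unmet_resources field without further changes, while option A only touches the log and leaves the warning state as it is today.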
Expected behavior
If these jobs are indeed not being scheduled, the scheduler should tell us why.