ngageoint / scale

Processing framework for containerized algorithms
http://ngageoint.github.io/scale/
Apache License 2.0
105 stars 45 forks source link

Verify 'sharedmem' jobs are not being scheduled #1826

Open emimaesmith opened 5 years ago

emimaesmith commented 5 years ago

Description The scheduler code silently skips queueing jobs that contain the resource sharedmem. We should either A. Log those jobs will not be scheduled due to requiring that resource, or B. add to the invalid_resources list to warn about the job type having invalid resources.

scheduler/scheduling/manager.py:
    def _process_queue(self, nodes, job_types, job_type_limits, job_type_resources, workspaces):
        """Retrieves the top of the queue and schedules new job executions on available nodes as resources and limits
        allow
        :param nodes: The dict of scheduling nodes stored by node ID for all nodes ready to accept new job executions
        :type nodes: dict
        :param job_types: The dict of job type models stored by job type ID
        :type job_types: dict
        :param job_type_limits: The dict of job type IDs mapping to job type limits
        :type job_type_limits: dict
        :param job_type_resources: The list of all of the job type resource requirements
        :type job_type_resources: list
        :param workspaces: A dict of all workspaces stored by name
        :type workspaces: dict
        :returns: The list of queued job executions that were scheduled
        :rtype: list
        """
        scheduled_job_executions = []
        ignore_job_type_ids = self._calculate_job_types_to_ignore(job_types, job_type_limits)
        started = now()

        max_cluster_resources = resource_mgr.get_max_available_resources()
        for queue in Queue.objects.get_queue(scheduler_mgr.config.queue_mode, ignore_job_type_ids)[:QUEUE_LIMIT]:
            job_exe = QueuedJobExecution(queue) 
            ... 
            invalid_resources = []
            insufficient_resources = []
            # get resource names offered and compare to job type resources
            for resource in job_exe.required_resources.resources:
                # skip sharedmem
                if resource.name.lower() == 'sharedmem':
                    continue
                if resource.name not in max_cluster_resources._resources:
                    # resource does not exist in cluster
                    invalid_resources.append(resource.name)
                elif resource.value > max_cluster_resources._resources[resource.name].value:
                    # resource exceeds the max available from any node
                    insufficient_resources.append(resource.name)

            if invalid_resources:
                description = INVALID_RESOURCES.description % invalid_resources
                scheduler_mgr.warning_active(warning, description)

            if insufficient_resources:
                description = INSUFFICIENT_RESOURCES.description % insufficient_resources
                scheduler_mgr.warning_active(warning, description)

            if invalid_resources or insufficient_resources:
                invalid_resources.extend(insufficient_resources)
                jt.unmet_resources = ','.join(invalid_resources)
                jt.save(update_fields=["unmet_resources"])
                continue
            else:
                # reset unmet_resources flag
                jt.unmet_resources = None
                scheduler_mgr.warning_inactive(warning)
                jt.save(update_fields=["unmet_resources"])        
            ...

        return scheduled_job_executions

Expected behavior We need to know why certain jobs are not being scheduled if these jobs are indeed not being scheduled.