psnc-qcg / QCG-PilotJob

The QCG Pilot Job service for execution of many computing tasks inside one allocation
Apache License 2.0

SchedulerAlgorithm::allocate_job() broken? #149

Closed by LourensVeen 2 years ago

LourensVeen commented 2 years ago

I'm looking at the allocate_job(self, reqs) method of the SchedulerAlgorithm class in components/core/qcg/pilotjob/scheduleralgo.py.

It starts by checking the requested resources: if nodes have been specified then it checks that either an exact number or a range has been given, and sets min_nodes to the number we're going to try for. If no node specification has been given, then min_nodes is set to (remains, rather) zero.

Next, it does the same for the number of requested cores, setting min_cores. However, it then computes (line 208) total_cores = min_nodes * min_cores, which seems like it would set total_cores to 0 if no nodes have been given, and then always fail on the subsequent resource check.
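To make the concern concrete, here is a minimal sketch of the computation as I read it. It is not the actual code from scheduleralgo.py: the function name `requested_totals` and the plain-dict requirement format are made up for illustration, and range handling is left out.

```python
def requested_totals(reqs):
    """Hypothetical reduction of the logic described above; not the real allocate_job()."""
    min_nodes = 0  # stays 0 when no node specification is given
    min_cores = 0

    if "nodes" in reqs:
        min_nodes = reqs["nodes"]
    if "cores" in reqs:
        min_cores = reqs["cores"]

    # The computation in question: it multiplies by zero when nodes are omitted.
    total_cores = min_nodes * min_cores
    return min_nodes, min_cores, total_cores


# Cores only: the total collapses to 0 even though 4 cores were requested.
print(requested_totals({"cores": 4}))              # (0, 4, 0)
# Nodes and cores: the total is nodes * cores, as one would expect.
print(requested_totals({"nodes": 2, "cores": 4}))  # (2, 4, 8)
```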

The example at https://qcg-pilotjob.readthedocs.io/en/develop/slurm_performance.html#user-parallel-applications suggests that it's legal to only specify cores, and that makes sense. Is this a bug or am I missing something?

pkopta commented 2 years ago

Yes, you are right, this is a bug, but curiously it does not affect the algorithm. total_cores is used only to check whether the requirements are less than the available resources, and because min_nodes equals 0, total_cores (the requested number of cores) also equals 0, so that check passes. And yes, it is legal to define only the number of cores; in that case the requested cores will be allocated on any number of nodes.
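A rough sketch of that effect, under stated assumptions: the helper names (`fits`, `total_cores_as_described`, `total_cores_cores_only_aware`) are hypothetical, the comparison is assumed to be `<=`, and the second total is only one plausible way to handle the cores-only case, not necessarily what the actual fix does.

```python
def fits(total_cores, available_cores):
    # Assumed form of the feasibility check: the request must not exceed what is available.
    return total_cores <= available_cores


def total_cores_as_described(min_nodes, min_cores):
    # Behaviour discussed in this issue: 0 whenever no nodes are specified.
    return min_nodes * min_cores


def total_cores_cores_only_aware(min_nodes, min_cores):
    # Assumption: with no node specification the cores may land on any number
    # of nodes, so the requested total is simply min_cores.
    if min_nodes == 0:
        return min_cores
    return min_nodes * min_cores


available = 8

# A cores-only request for 4 cores passes the check either way.
print(fits(total_cores_as_described(0, 4), available))       # True
print(fits(total_cores_cores_only_aware(0, 4), available))   # True

# A cores-only request for 16 cores: with a total of 0 the check cannot reject it.
print(fits(total_cores_as_described(0, 16), available))      # True
print(fits(total_cores_cores_only_aware(0, 16), available))  # False
```

In this sketch the zero total never causes a cores-only request to fail the check; it only means the check is not doing any work for such requests.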

pkopta commented 2 years ago

#151 should fix the bug.

LourensVeen commented 2 years ago

Makes sense to me, thanks!

LourensVeen commented 2 years ago

Merged, closing!