Closed AymenFJA closed 1 year ago
One reason I brought up the issue is that I was concerned that there might be something odd about the SAGA layer details in the way it requested a Pilot job in the MPI examples versus in the basic 00 example. As noted, the examples sat in the queue for tens of hours without launching. However, I was able to get manually submitted jobs with the same resource requirements to run in the same time.
[...] the examples sat in the queue for tens of hours without launching. However, I was able to get manually submitted jobs with the same resource requirements to run in the same time.
That is indeed puzzling - can you discern any differences in the batch scripts which could explain the difference?
As I recall, the generated job scripts included a --cores-per-node #SBATCH line, or something like that, from the resource definition, even if the --ntasks * --cpus-per-task was less than that. I wondered whether that was taking precedence in the resource allocation request, somehow, or if some other over-specified aspect was gumming up the works. My Slurm-Fu is not fierce, though, and I felt ill-equipped to investigate further at the time, especially while there were other changes in flight throughout the RCT stack.
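For anyone who wants to compare the generated and manually written batch scripts, here is a minimal sketch that collects the #SBATCH options from each script and reports the ones that differ. The two sample scripts are hypothetical and only illustrate the situation described above. In particular, --cores-per-node is the option name as recalled in this thread, not a verified Slurm directive, and the values are made up.

```python
import shlex


def sbatch_options(script_text):
    """Collect the #SBATCH options of a batch script into a dict."""
    options = {}
    for line in script_text.splitlines():
        line = line.strip()
        if not line.startswith("#SBATCH"):
            continue
        # Tokenize the rest of the line; trailing '# ...' comments are dropped.
        tokens = shlex.split(line[len("#SBATCH"):], comments=True)
        i = 0
        while i < len(tokens):
            tok = tokens[i]
            if tok.startswith("--") and "=" in tok:
                # Long form with value: --nodes=1
                key, value = tok.split("=", 1)
            elif i + 1 < len(tokens) and not tokens[i + 1].startswith("-"):
                # Option followed by a separate value: -p standard
                key, value = tok, tokens[i + 1]
                i += 1
            else:
                # Flag without a value
                key, value = tok, ""
            options[key] = value
            i += 1
    return options


def diff_options(a, b):
    """Return {option: (value_in_a, value_in_b)} for options that differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Hypothetical scripts: 'generated' stands in for the RP/SAGA output,
# 'manual' for a hand-written job with the same resource requirements.
generated = """#!/bin/sh
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --cores-per-node=40   # name as recalled in the thread, unverified
#SBATCH -p standard
"""

manual = """#!/bin/sh
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH -p standard
"""

differences = diff_options(sbatch_options(generated), sbatch_options(manual))
for option, (gen_value, man_value) in differences.items():
    print(f"{option}: generated={gen_value!r} manual={man_value!r}")
```

Any option that shows up only in the generated script would be a candidate for the over-specification suspected above.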
Thanks Eric. @AymenFJA : have you seen similar problems with RCT submitted jobs on Rivanna?
@andre-merzky I am testing it now without RP as I am trying to verify if it is RP's issue or if something is wrong with Rivanna's queueing system. So far, this request has been pending for ~40 minutes:
-bash-4.2$ ijob --nodes=1 -p standard --ntasks=40 --ntasks-per-node=40
salloc: Pending job allocation 49472640
salloc: job 49472640 queued and waiting for resources
This is roughly similar to what RP requests via SAGA. Even if my request were invalid, I would expect it to at least raise an error or be terminated, but nothing has happened so far.
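One way to see why a request like this stays pending is to ask Slurm for the job's state and pending reason. Below is a rough sketch, assuming squeue is on the PATH of the login node. The helper names are mine, and the job ID in the usage comment is the one from the salloc output above.

```python
import subprocess


def parse_squeue_line(line):
    """Split one 'STATE|REASON' line, as printed with squeue -o '%T|%r'."""
    state, _, reason = line.strip().partition("|")
    return state, reason


def pending_reason(job_id):
    """Ask squeue for the state and reason of a job (requires Slurm)."""
    result = subprocess.run(
        # -h drops the header line, %T is the job state, %r the reason
        ["squeue", "-h", "-j", str(job_id), "-o", "%T|%r"],
        capture_output=True, text=True, check=True,
    )
    return parse_squeue_line(result.stdout)


# Usage on the cluster (not run here):
#   state, reason = pending_reason(49472640)
#   print(state, reason)
```

A reason such as Resources or Priority would point at the queue itself rather than at an invalid request.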
@AymenFJA this was initial testing and it was completed, right?
Correct, two days ago, we were able to accomplish this. I will close it now.
This is related to https://github.com/radical-cybertools/radical.pilot/pull/2855#issuecomment-1479978719. It needs to be investigated and reproduced if possible.