Test and verify if MPI example works on Rivanna

radical-cybertools / radical.pilot

RADICAL-Pilot

http://radical-cybertools.github.io/radical-pilot/index.html


Test and verify if MPI example works on Rivanna #2922

Closed by AymenFJA 1 year ago

AymenFJA commented 1 year ago

This is related to https://github.com/radical-cybertools/radical.pilot/pull/2855#issuecomment-1479978719. It needs to be investigated and, if possible, reproduced.

eirrgang commented 1 year ago

One reason I brought up the issue is that I was concerned there might be something odd in how the SAGA layer requested a Pilot job in the MPI examples versus in the basic 00 example. As noted, the examples sat in the queue for tens of hours without launching. However, I was able to get manually submitted jobs with the same resource requirements to run in the same time.

andre-merzky commented 1 year ago

[...] the examples sat in the queue for tens of hours without launching. However, I was able to get manually submitted jobs with the same resource requirements to run in the same time.

That is indeed puzzling - can you discern any differences in the batch scripts which could explain the difference?

eirrgang commented 1 year ago

[...] the examples sat in the queue for tens of hours without launching. However, I was able to get manually submitted jobs with the same resource requirements to run in the same time.

That is indeed puzzling - can you discern any differences in the batch scripts which could explain the difference?

As I recall, the generated job scripts included an #SBATCH line with --cores-per-node, or something like that, taken from the resource definition, even when --ntasks * --cpus-per-task was less than that. I wondered whether that was somehow taking precedence in the resource allocation request, or whether some other over-specified aspect was gumming up the works. My Slurm-fu is not fierce, though, and I felt ill-equipped to investigate further at the time, especially while other changes were in flight throughout the RCT stack.
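One way to look for such differences is to extract the #SBATCH options from both scripts and compare them side by side. The following is only a minimal sketch: the helper names `sbatch_options` and `diff_options` are made up for illustration, and the two script bodies are hypothetical placeholders, not the scripts RP/SAGA actually generated or the ones submitted by hand.

```python
import re

# Matches lines such as "#SBATCH --ntasks=40", "#SBATCH -p standard",
# or flag-only lines such as "#SBATCH --exclusive".
SBATCH_RE = re.compile(r"^#SBATCH\s+(--?[\w-]+)(?:[=\s]+(\S+))?")


def sbatch_options(script_text):
    """Return {option: value} for every #SBATCH line in a batch script.

    Flag-only options (no value) are stored as True so that they can be
    told apart from options that are absent.
    """
    opts = {}
    for line in script_text.splitlines():
        m = SBATCH_RE.match(line.strip())
        if m:
            value = m.group(2)
            opts[m.group(1)] = value if value is not None else True
    return opts


def diff_options(a, b):
    """Return {option: (value_in_a, value_in_b)} for options that differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Hypothetical script contents, for illustration only.
    generated = (
        "#!/bin/sh\n"
        "#SBATCH --nodes=1\n"
        "#SBATCH --ntasks=40\n"
        "#SBATCH --cpus-per-task=1\n"
        "#SBATCH --exclusive\n"
    )
    manual = (
        "#!/bin/sh\n"
        "#SBATCH --nodes=1\n"
        "#SBATCH --ntasks=40\n"
    )
    differences = diff_options(sbatch_options(generated), sbatch_options(manual))
    for option, (in_generated, in_manual) in differences.items():
        print(f"{option}: generated={in_generated!r} manual={in_manual!r}")
```

Any option that shows up on only one side (value `None` on the other) is a candidate for the kind of over-specification described above.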

andre-merzky commented 1 year ago

Thanks Eric. @AymenFJA : have you seen similar problems with RCT submitted jobs on Rivanna?

AymenFJA commented 1 year ago

@andre-merzky I am testing it now without RP to verify whether this is an RP issue or whether something is wrong with Rivanna's queueing system. So far, this request has been pending for ~40 minutes:

-bash-4.2$ ijob --nodes=1 -p standard --ntasks=40 --ntasks-per-node=40
salloc: Pending job allocation 49472640
salloc: job 49472640 queued and waiting for resources

This is roughly what RP requests via SAGA. If my request were invalid, I would expect the scheduler to at least raise an error or terminate the request, but nothing has happened so far.
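When a request stays pending without any error, Slurm's own view of the job can help narrow things down. As a hedged sketch (the helper names are hypothetical, and it assumes `squeue` is on the PATH of the login node), the state and pending reason of a job can be queried like this:

```python
import subprocess


def squeue_command(job_id):
    """Build a squeue call that prints only the job state and its reason."""
    # -h suppresses the header; "%T" is the extended job state and "%r" is
    # the reason the job is in that state, separated here by a pipe.
    return ["squeue", "-h", "-j", str(job_id), "-o", "%T|%r"]


def parse_state_reason(output):
    """Split one line of 'STATE|REASON' output into a (state, reason) tuple.

    Returns None if squeue printed nothing (e.g. the job already left the queue).
    """
    lines = output.strip().splitlines()
    if not lines:
        return None
    state, _, reason = lines[0].partition("|")
    return state.strip(), reason.strip()


def pending_reason(job_id):
    """Run squeue for one job and return (state, reason), or None."""
    result = subprocess.run(
        squeue_command(job_id), capture_output=True, text=True, check=True
    )
    return parse_state_reason(result.stdout)
```

Calling `pending_reason(49472640)` on the cluster would return the scheduler's stated reason for the job shown above, which may help distinguish a request that is merely waiting for resources or priority from one that can never be satisfied as written.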

mtitov commented 1 year ago

@AymenFJA this was initial testing and it has been completed, right?

AymenFJA commented 1 year ago

Correct, two days ago, we were able to accomplish this. I will close it now.