radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

PBS script submission fails when ncpus is not multiple of 16 #359

Closed mturilli closed 9 years ago

mturilli commented 10 years ago

I remember we discussed and tested a patch for this one based on ceiling. Not sure why it is failing now. I am trying to run 1024 CUs on 3 pilots on trestles, gordon, and blacklight.

2014:09:13 06:39:13 30089  PilotLauncherWorker-1 saga.PTYShell         : [DEBUG   ] run_sync: SCRIPTFILE=`mktemp -t SAGA-Python-PBSJobScript.XXXXXX` &&  echo "
#!/bin/bash 
#PBS -V 
#PBS -o /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5413e65f20a6417589a9b1d1/AGENT.STDOUT 
#PBS -e /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5413e65f20a6417589a9b1d1/AGENT.STDERR 
#PBS -l walltime=0:30:00 
#PBS -q batch 
#PBS -A unc102 
#PBS -l ncpus=341
export    PBS_O_WORKDIR=/usr/users/9/mturilli/radical.pilot.sandbox/pilot-5413e65f20a6417589a9b1d1 
mkdir -p  /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5413e65f20a6417589a9b1d1
cd        /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5413e65f20a6417589a9b1d1
/bin/bash -l default_bootstrapper.sh -n radicalpilot -s 5413e65720a6417589a9b1ce -p 5413e65f20a6417589a9b1d1 -t 30 -d 10 -c 341 -v 0.19 -m ec2-184-72-89-141.compute-1.amazonaws.com:27017  -e 'source /usr/share/modules/init/bash'  -e 'module load python'  -l TORQUE  -j DPLACE  -k MPIRUN_DPLACE  -x luve -b " > $SCRIPTFILE && /usr/local/packages/torque/2.3.13_psc/bin/qsub $SCRIPTFILE && rm -f $SCRIPTFILE
2014:09:13 06:39:13 radical.pilot.MainProcess: [DEBUG   ] write: [   25] [ 1032] (SCRIPTFILE=`mktemp -t SAGA-Pyt ... IPTFILE && rm -f $SCRIPTFILE\n)
2014:09:13 06:39:13 radical.pilot.MainProcess: [DEBUG   ] read : [   25] [   46] (qsub: ncpus must be multiple of 16\nPROMPT-1->)
2014:09:13 06:39:13 30089  PilotLauncherWorker-1 saga.PBSJobService    : [ERROR   ] Error running job via 'qsub': qsub: ncpus must be multiple of 16
. Commandline was: SCRIPTFILE=`mktemp -t SAGA-Python-PBSJobScript.XXXXXX` &&  echo "
#!/bin/bash 
#PBS -V 
#PBS -o /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5413e65f20a6417589a9b1d1/AGENT.STDOUT 
#PBS -e /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5413e65f20a6417589a9b1d1/AGENT.STDERR 
#PBS -l walltime=0:30:00 
#PBS -q batch 
#PBS -A unc102 
#PBS -l ncpus=341
export    PBS_O_WORKDIR=/usr/users/9/mturilli/radical.pilot.sandbox/pilot-5413e65f20a6417589a9b1d1 
mkdir -p  /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5413e65f20a6417589a9b1d1
cd        /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5413e65f20a6417589a9b1d1
/bin/bash -l default_bootstrapper.sh -n radicalpilot -s 5413e65720a6417589a9b1ce -p 5413e65f20a6417589a9b1d1 -t 30 -d 10 -c 341 -v 0.19 -m ec2-184-72-89-141.compute-1.amazonaws.com:27017  -e 'source /usr/share/modules/init/bash'  -e 'module load python'  -l TORQUE  -j DPLACE  -k MPIRUN_DPLACE  -x luve -b " > $SCRIPTFILE && /usr/local/packages/torque/2.3.13_psc/bin/qsub $SCRIPTFILE && rm -f $SCRIPTFILE
2014:09:13 06:39:13 radical.pilot.MainProcess: [INFO    ] ComputePilot '5413e65f20a6417589a9b1d1' state changed from 'Launching' to 'Failed'.
2014:09:13 06:39:13 radical.pilot.MainProcess: [ERROR   ] Pilot launching failed: Error running job via 'qsub': qsub: ncpus must be multiple of 16
. Commandline was: SCRIPTFILE=`mktemp -t SAGA-Python-PBSJobScript.XXXXXX` &&  echo "
#!/bin/bash 
#PBS -V 
#PBS -o /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5413e65f20a6417589a9b1d1/AGENT.STDOUT 
#PBS -e /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5413e65f20a6417589a9b1d1/AGENT.STDERR 
#PBS -l walltime=0:30:00 
#PBS -q batch 
#PBS -A unc102 
#PBS -l ncpus=341
export    PBS_O_WORKDIR=/usr/users/9/mturilli/radical.pilot.sandbox/pilot-5413e65f20a6417589a9b1d1 
mkdir -p  /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5413e65f20a6417589a9b1d1
cd        /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5413e65f20a6417589a9b1d1
/bin/bash -l default_bootstrapper.sh -n radicalpilot -s 5413e65720a6417589a9b1ce -p 5413e65f20a6417589a9b1d1 -t 30 -d 10 -c 341 -v 0.19 -m ec2-184-72-89-141.compute-1.amazonaws.com:27017  -e 'source /usr/share/modules/init/bash'  -e 'module load python'  -l TORQUE  -j DPLACE  -k MPIRUN_DPLACE  -x luve -b " > $SCRIPTFILE && /usr/local/packages/torque/2.3.13_psc/bin/qsub $SCRIPTFILE && rm -f $SCRIPTFILE (/home/mturilli/Virtualenvs/AIMES-CCGRID2015/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py +99 (log_error_and_raise)  :  raise exception(message))
Traceback (most recent call last):
  File "/home/mturilli/Virtualenvs/AIMES-CCGRID2015/local/lib/python2.7/site-packages/radical/pilot/controller/pilot_launcher_worker.py", line 459, in run
    pilotjob.run()
  File "/home/mturilli/Virtualenvs/AIMES-CCGRID2015/local/lib/python2.7/site-packages/saga/job/job.py", line 397, in run
    return self._adaptor.run (ttype=ttype)
  File "/home/mturilli/Virtualenvs/AIMES-CCGRID2015/local/lib/python2.7/site-packages/saga/adaptors/cpi/decorators.py", line 51, in wrap_function
    return sync_function (self, *args, **kwargs)
  File "/home/mturilli/Virtualenvs/AIMES-CCGRID2015/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py", line 1101, in run
    self._id = self.js._job_run(self._api())
  File "/home/mturilli/Virtualenvs/AIMES-CCGRID2015/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py", line 626, in _job_run
    log_error_and_raise(message, saga.NoSuccess, self._logger)
  File "/home/mturilli/Virtualenvs/AIMES-CCGRID2015/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py", line 99, in log_error_and_raise
    raise exception(message)
NoSuccess: Error running job via 'qsub': qsub: ncpus must be multiple of 16

The same error occurs on both gordon and blacklight.

mturilli commented 9 years ago

This is still happening, and we need it fixed for the upcoming AIMES demo on the morning of Nov 19. From the logs:

2014:11:13 16:10:19 1178   PilotLauncherWorker-1 saga.PBSJobService    : [ERROR   ] Error running job via 'qsub': qsub: ncpus must be multiple of 16
. Commandline was: SCRIPTFILE=`mktemp -t SAGA-Python-PBSJobScript.XXXXXX` &&  echo "
#!/bin/bash 
#PBS -V 
#PBS -o /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946/AGENT.STDOUT 
#PBS -e /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946/AGENT.STDERR 
#PBS -l walltime=0:16:00 
#PBS -q batch 
#PBS -A unc102 
#PBS -l ncpus=11
export    PBS_O_WORKDIR=/usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946 
mkdir -p  /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946
cd        /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946
/bin/bash -l default_bootstrapper.sh -n radicalpilot -s 5464d7b120a641049a168942 -p 5464d7b820a641049a168946 -t 16 -c 11 -v 0.21 -m ec2-184-72-89-141.compute-1.amazonaws.com:27017  -a :  -e 'source /usr/share/modules/init/bash'  -e 'module load python'  -l TORQUE  -j DPLACE  -k MPIRUN_DPLACE  -x luve -d 10 -b " > $SCRIPTFILE && /usr/local/packages/torque/2.3.13_psc/bin/qsub $SCRIPTFILE && rm -f $SCRIPTFILE
2014:11:13 16:10:19 radical.pilot.MainProcess: [ERROR   ] [{'timestamp': datetime.datetime(2014, 11, 13, 16, 10, 15, 641634), 'logentry': 'Using pilot agent /home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/lib/python2.7/site-packages/radical/pilot/agent/radical-pilot-agent-multicore.py'}, {'timestamp': datetime.datetime(2014, 11, 13, 16, 10, 15, 656684), 'logentry': 'Using bootstrapper /home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/lib/python2.7/site-packages/radical/pilot/bootstrapper/default_bootstrapper.sh'}, {'timestamp': datetime.datetime(2014, 11, 13, 16, 10, 15, 657568), 'logentry': "Copying bootstrapper 'file://localhost//home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/lib/python2.7/site-packages/radical/pilot/bootstrapper/default_bootstrapper.sh' to agent sandbox (sftp://blacklight.psc.xsede.org/usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946//default_bootstrapper.sh)."}, {'timestamp': datetime.datetime(2014, 11, 13, 16, 10, 17, 333920), 'logentry': u"Copying agent 'file://localhost//home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/lib/python2.7/site-packages/radical/pilot/agent/radical-pilot-agent-multicore.py' to agent sandbox (sftp://blacklight.psc.xsede.org/usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946/)."}, {'timestamp': datetime.datetime(2014, 11, 13, 16, 10, 19, 42858), 'logentry': 'Submitting SAGA job with description: {\'Queue\': \'batch\', \'Executable\': \'/bin/bash\', \'WorkingDirectory\': \'/usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946\', \'Project\': \'unc102\', \'WallTimeLimit\': 16, \'Arguments\': [\'-l\', \'default_bootstrapper.sh\', "-n radicalpilot -s 5464d7b120a641049a168942 -p 5464d7b820a641049a168946 -t 16 -c 11 -v 0.21 -m ec2-184-72-89-141.compute-1.amazonaws.com:27017  -a :  -e \'source /usr/share/modules/init/bash\'  -e \'module load python\'  -l TORQUE  -j DPLACE  -k MPIRUN_DPLACE  -x luve -d 10 -b"], \'Error\': \'AGENT.STDERR\', \'Output\': \'AGENT.STDOUT\', 
\'TotalCPUCount\': 11}'}, {'timestamp': datetime.datetime(2014, 11, 13, 16, 10, 19, 317750), 'logentry': 'Pilot launching failed: Error running job via \'qsub\': qsub: ncpus must be multiple of 16\n. Commandline was: SCRIPTFILE=`mktemp -t SAGA-Python-PBSJobScript.XXXXXX` &&  echo "\n#!/bin/bash \n#PBS -V \n#PBS -o /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946/AGENT.STDOUT \n#PBS -e /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946/AGENT.STDERR \n#PBS -l walltime=0:16:00 \n#PBS -q batch \n#PBS -A unc102 \n#PBS -l ncpus=11\nexport    PBS_O_WORKDIR=/usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946 \nmkdir -p  /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946\ncd        /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946\n/bin/bash -l default_bootstrapper.sh -n radicalpilot -s 5464d7b120a641049a168942 -p 5464d7b820a641049a168946 -t 16 -c 11 -v 0.21 -m ec2-184-72-89-141.compute-1.amazonaws.com:27017  -a :  -e \'source /usr/share/modules/init/bash\'  -e \'module load python\'  -l TORQUE  -j DPLACE  -k MPIRUN_DPLACE  -x luve -d 10 -b " > $SCRIPTFILE && /usr/local/packages/torque/2.3.13_psc/bin/qsub $SCRIPTFILE && rm -f $SCRIPTFILE (/home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py +99 (log_error_and_raise)  :  raise exception(message))\nTraceback (most recent call last):\n  File "/home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/local/lib/python2.7/site-packages/radical/pilot/controller/pilot_launcher_worker.py", line 495, in run\n    pilotjob.run()\n  File "/home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/local/lib/python2.7/site-packages/saga/job/job.py", line 397, in run\n    return self._adaptor.run (ttype=ttype)\n  File "/home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/local/lib/python2.7/site-packages/saga/adaptors/cpi/decorators.py", line 51, in wrap_function\n    return sync_function (self, 
*args, **kwargs)\n  File "/home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py", line 1125, in run\n    self._id = self.js._job_run(self._api())\n  File "/home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py", line 649, in _job_run\n    log_error_and_raise(message, saga.NoSuccess, self._logger)\n  File "/home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py", line 99, in log_error_and_raise\n    raise exception(message)\nNoSuccess: Error running job via \'qsub\': qsub: ncpus must be multiple of 16\n. Commandline was: SCRIPTFILE=`mktemp -t SAGA-Python-PBSJobScript.XXXXXX` &&  echo "\n#!/bin/bash \n#PBS -V \n#PBS -o /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946/AGENT.STDOUT \n#PBS -e /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946/AGENT.STDERR \n#PBS -l walltime=0:16:00 \n#PBS -q batch \n#PBS -A unc102 \n#PBS -l ncpus=11\nexport    PBS_O_WORKDIR=/usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946 \nmkdir -p  /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946\ncd        /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946\n/bin/bash -l default_bootstrapper.sh -n radicalpilot -s 5464d7b120a641049a168942 -p 5464d7b820a641049a168946 -t 16 -c 11 -v 0.21 -m ec2-184-72-89-141.compute-1.amazonaws.com:27017  -a :  -e \'source /usr/share/modules/init/bash\'  -e \'module load python\'  -l TORQUE  -j DPLACE  -k MPIRUN_DPLACE  -x luve -d 10 -b " > $SCRIPTFILE && /usr/local/packages/torque/2.3.13_psc/bin/qsub $SCRIPTFILE && rm -f $SCRIPTFILE (/home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py +99 (log_error_and_raise)  :  raise exception(message))\n'}]
marksantcroos commented 9 years ago

A "real" fix for this at the SAGA level is not straightforward. However, it would be relatively easy to work around at the RP level: we could configure the

marksantcroos commented 9 years ago

Sorry, Alec pressed "send" :)

What I wanted to say: we could configure the cores per node at the resource level for systems that require it, and take that into account in pilot requests.
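
A minimal sketch of that idea, assuming a hypothetical per-resource `cores_per_node` table (the names `CORES_PER_NODE`, `round_up_cores`, and `cores_for_pilot` are illustrative only, not RP's or SAGA's actual configuration schema or API). It rounds a request up to the next multiple of the node size, so the requests from this thread (341 and 11 cores) would become 352 and 16:

```python
# Hypothetical per-resource setting; the keys and layout are assumptions
# made for illustration, not the real RP resource configuration format.
CORES_PER_NODE = {
    "xsede.blacklight": 16,
}


def round_up_cores(requested, cores_per_node):
    """Round a core request up to the next multiple of cores_per_node."""
    if requested <= 0:
        raise ValueError("requested cores must be positive")
    if cores_per_node <= 0:
        raise ValueError("cores_per_node must be positive")
    # Integer ceiling division, then scale back to a core count.
    nodes = (requested + cores_per_node - 1) // cores_per_node
    return nodes * cores_per_node


def cores_for_pilot(resource, requested):
    """Return the core count to submit for a pilot on the given resource.

    Resources without a configured node size are passed through unchanged.
    """
    cores_per_node = CORES_PER_NODE.get(resource)
    if cores_per_node is None:
        return requested
    return round_up_cores(requested, cores_per_node)


if __name__ == "__main__":
    for cores in (341, 11, 32):
        print("%d -> %d" % (cores, cores_for_pilot("xsede.blacklight", cores)))
```

Rounding up rather than rejecting the request means the pilot may hold a few more cores than asked for, which seems the safer trade-off on systems that only allocate whole nodes.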

mturilli commented 9 years ago

Hi Mark,

Sure, it seems reasonable to me.

marksantcroos commented 9 years ago

I remember we discussed and tested a patch for this one based on ceiling. Not sure why it is failing now.

Your memory was correct, and my "proposed" solution didn't come out of the blue either ...

marksantcroos commented 9 years ago

Matteo reports this is not fixed.

marksantcroos commented 9 years ago

Applied a fix on SAGA in https://github.com/radical-cybertools/saga-python/commit/32299fc8e6569395a051053ca93a09bfc9f2595c.

Added testcase test/issue359 in RP.

Currently that test fails because of #511, but if run one at a time it works.

mturilli commented 9 years ago

I tested with the AIMES experiment runs and it still fails with:

Traceback (most recent call last):
  File "/home/mturilli/Virtualenvs/AIMES-EXPERIMENTS-TESTING/local/lib/python2.7/site-packages/radical/pilot/controller/pilot_launcher_worker.py", line 589, in run
    pilotjob.run()
  File "/home/mturilli/Virtualenvs/AIMES-EXPERIMENTS-TESTING/local/lib/python2.7/site-packages/saga/job/job.py", line 416, in run
    return self._adaptor.run (ttype=ttype)
  File "/home/mturilli/Virtualenvs/AIMES-EXPERIMENTS-TESTING/local/lib/python2.7/site-packages/saga/adaptors/cpi/decorators.py", line 57, in wrap_function
    return sync_function (self, *args, **kwargs)
  File "/home/mturilli/Virtualenvs/AIMES-EXPERIMENTS-TESTING/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py", line 1195, in run
    self._id = self.js._job_run(self._api())
  File "/home/mturilli/Virtualenvs/AIMES-EXPERIMENTS-TESTING/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py", line 691, in _job_run
    log_error_and_raise(message, saga.NoSuccess, self._logger)
  File "/home/mturilli/Virtualenvs/AIMES-EXPERIMENTS-TESTING/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py", line 111, in log_error_and_raise
    raise exception(message)
NoSuccess: Error running job via 'qsub': qsub: ncpus must be multiple of 16

#!/bin/bash 
#PBS -V 
#PBS -o /usr/users/9/mturilli/radical.pilot.sandbox/pilot-551365bf23769c5b821fe130/AGENT.STDOUT 
#PBS -e /usr/users/9/mturilli/radical.pilot.sandbox/pilot-551365bf23769c5b821fe130/AGENT.STDERR 
#PBS -l walltime=0:57:00 
#PBS -q batch 
#PBS -A unc101 
#PBS -l ncpus=40
export    PBS_O_WORKDIR=/usr/users/9/mturilli/radical.pilot.sandbox/pilot-551365bf23769c5b821fe130 
mkdir -p  /usr/users/9/mturilli/radical.pilot.sandbox/pilot-551365bf23769c5b821fe130
cd        /usr/users/9/mturilli/radical.pilot.sandbox/pilot-551365bf23769c5b821fe130
export SAGA_PPN=16
/bin/bash -l default_bootstrapper.sh -n mturilli_aimesexperiments -s 551365bc23769c5b821fe12e -p 551365bf23769c5b821fe130 -t 57 -c 40 -v 0.23 -m 54.221.194.147:24242  -a :  -e 'source /usr/share/modules/init/bash'  -e 'module load python'  -l TORQUE  -j DPLACE  -k MPIRUN_DPLACE  -x luve -d 10 -b " > $SCRIPTFILE && /usr/local/packages/torque/2.3.13_psc/bin/qsub $SCRIPTFILE && rm -f $SCRIPTFILE (/home/mturilli/Virtualenvs/AIMES-EXPERIMENTS-TESTING/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py +111 (log_error_and_raise)  :  raise exception(message))
2015:03:26 01:49:59 radical.pilot.MainProcess: [INFO    ] ComputePilot '551365bf23769c5b821fe130' state changed from 'Launching' to 'Failed'.

The pilot description is:

xsede.blacklight:
    Allocation; None -> RP default : unc101
    Queue; None -> RP default      : None
    Number of cores                : 40
    Walltime in minutes            : 57
    Stop once the workflow is done : True

As expected, no pilot logs are available.
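
Note that the generated script exports `SAGA_PPN=16` but still requests `ncpus=40`, which is not a multiple of 16. A rough sketch of a pre-submission check that would catch this, assuming the script text is available as a string (the helper `check_ncpus` is hypothetical and not part of SAGA or RP):

```python
import re


def check_ncpus(script_text, cores_per_node):
    """Inspect the first '#PBS -l ncpus=N' directive in a job script.

    Returns (ncpus, ok): ok is True when ncpus is a multiple of
    cores_per_node. Returns (None, True) if no such directive is present.
    """
    match = re.search(r"^#PBS\s+-l\s+ncpus=(\d+)\s*$", script_text, re.MULTILINE)
    if match is None:
        return None, True
    ncpus = int(match.group(1))
    return ncpus, ncpus % cores_per_node == 0


# Abbreviated copy of the directives from the failing script above.
script = """#!/bin/bash
#PBS -V
#PBS -l walltime=0:57:00
#PBS -q batch
#PBS -A unc101
#PBS -l ncpus=40
export SAGA_PPN=16
"""

if __name__ == "__main__":
    ncpus, ok = check_ncpus(script, 16)
    print("ncpus=%s multiple_of_16=%s" % (ncpus, ok))
```

For this script the check reports that 40 is not a multiple of 16, matching the error returned by `qsub`.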

mturilli commented 9 years ago

Running some more tests to verify that this is not a problem with the virtualenv (VE) upgrade. Will report or close depending on the outcome of the tests.

marksantcroos commented 9 years ago

See #529.