Closed mturilli closed 9 years ago
This is still kicking and we would need it fixed for the upcoming AIMES demo on Nov 19 morning. From the logs:
2014:11:13 16:10:19 1178 PilotLauncherWorker-1 saga.PBSJobService : [ERROR ] Error running job via 'qsub': qsub: ncpus must be multiple of 16
. Commandline was: SCRIPTFILE=`mktemp -t SAGA-Python-PBSJobScript.XXXXXX` && echo "
#!/bin/bash
#PBS -V
#PBS -o /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946/AGENT.STDOUT
#PBS -e /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946/AGENT.STDERR
#PBS -l walltime=0:16:00
#PBS -q batch
#PBS -A unc102
#PBS -l ncpus=11
export PBS_O_WORKDIR=/usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946
mkdir -p /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946
cd /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946
/bin/bash -l default_bootstrapper.sh -n radicalpilot -s 5464d7b120a641049a168942 -p 5464d7b820a641049a168946 -t 16 -c 11 -v 0.21 -m ec2-184-72-89-141.compute-1.amazonaws.com:27017 -a : -e 'source /usr/share/modules/init/bash' -e 'module load python' -l TORQUE -j DPLACE -k MPIRUN_DPLACE -x luve -d 10 -b " > $SCRIPTFILE && /usr/local/packages/torque/2.3.13_psc/bin/qsub $SCRIPTFILE && rm -f $SCRIPTFILE
2014:11:13 16:10:19 radical.pilot.MainProcess: [ERROR ] [{'timestamp': datetime.datetime(2014, 11, 13, 16, 10, 15, 641634), 'logentry': 'Using pilot agent /home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/lib/python2.7/site-packages/radical/pilot/agent/radical-pilot-agent-multicore.py'}, {'timestamp': datetime.datetime(2014, 11, 13, 16, 10, 15, 656684), 'logentry': 'Using bootstrapper /home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/lib/python2.7/site-packages/radical/pilot/bootstrapper/default_bootstrapper.sh'}, {'timestamp': datetime.datetime(2014, 11, 13, 16, 10, 15, 657568), 'logentry': "Copying bootstrapper 'file://localhost//home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/lib/python2.7/site-packages/radical/pilot/bootstrapper/default_bootstrapper.sh' to agent sandbox (sftp://blacklight.psc.xsede.org/usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946//default_bootstrapper.sh)."}, {'timestamp': datetime.datetime(2014, 11, 13, 16, 10, 17, 333920), 'logentry': u"Copying agent 'file://localhost//home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/lib/python2.7/site-packages/radical/pilot/agent/radical-pilot-agent-multicore.py' to agent sandbox (sftp://blacklight.psc.xsede.org/usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946/)."}, {'timestamp': datetime.datetime(2014, 11, 13, 16, 10, 19, 42858), 'logentry': 'Submitting SAGA job with description: {\'Queue\': \'batch\', \'Executable\': \'/bin/bash\', \'WorkingDirectory\': \'/usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946\', \'Project\': \'unc102\', \'WallTimeLimit\': 16, \'Arguments\': [\'-l\', \'default_bootstrapper.sh\', "-n radicalpilot -s 5464d7b120a641049a168942 -p 5464d7b820a641049a168946 -t 16 -c 11 -v 0.21 -m ec2-184-72-89-141.compute-1.amazonaws.com:27017 -a : -e \'source /usr/share/modules/init/bash\' -e \'module load python\' -l TORQUE -j DPLACE -k MPIRUN_DPLACE -x luve -d 10 -b"], \'Error\': \'AGENT.STDERR\', \'Output\': \'AGENT.STDOUT\', 
\'TotalCPUCount\': 11}'}, {'timestamp': datetime.datetime(2014, 11, 13, 16, 10, 19, 317750), 'logentry': 'Pilot launching failed: Error running job via \'qsub\': qsub: ncpus must be multiple of 16\n. Commandline was: SCRIPTFILE=`mktemp -t SAGA-Python-PBSJobScript.XXXXXX` && echo "\n#!/bin/bash \n#PBS -V \n#PBS -o /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946/AGENT.STDOUT \n#PBS -e /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946/AGENT.STDERR \n#PBS -l walltime=0:16:00 \n#PBS -q batch \n#PBS -A unc102 \n#PBS -l ncpus=11\nexport PBS_O_WORKDIR=/usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946 \nmkdir -p /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946\ncd /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946\n/bin/bash -l default_bootstrapper.sh -n radicalpilot -s 5464d7b120a641049a168942 -p 5464d7b820a641049a168946 -t 16 -c 11 -v 0.21 -m ec2-184-72-89-141.compute-1.amazonaws.com:27017 -a : -e \'source /usr/share/modules/init/bash\' -e \'module load python\' -l TORQUE -j DPLACE -k MPIRUN_DPLACE -x luve -d 10 -b " > $SCRIPTFILE && /usr/local/packages/torque/2.3.13_psc/bin/qsub $SCRIPTFILE && rm -f $SCRIPTFILE (/home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py +99 (log_error_and_raise) : raise exception(message))\nTraceback (most recent call last):\n File "/home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/local/lib/python2.7/site-packages/radical/pilot/controller/pilot_launcher_worker.py", line 495, in run\n pilotjob.run()\n File "/home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/local/lib/python2.7/site-packages/saga/job/job.py", line 397, in run\n return self._adaptor.run (ttype=ttype)\n File "/home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/local/lib/python2.7/site-packages/saga/adaptors/cpi/decorators.py", line 51, in wrap_function\n return sync_function (self, *args, **kwargs)\n File 
"/home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py", line 1125, in run\n self._id = self.js._job_run(self._api())\n File "/home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py", line 649, in _job_run\n log_error_and_raise(message, saga.NoSuccess, self._logger)\n File "/home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py", line 99, in log_error_and_raise\n raise exception(message)\nNoSuccess: Error running job via \'qsub\': qsub: ncpus must be multiple of 16\n. Commandline was: SCRIPTFILE=`mktemp -t SAGA-Python-PBSJobScript.XXXXXX` && echo "\n#!/bin/bash \n#PBS -V \n#PBS -o /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946/AGENT.STDOUT \n#PBS -e /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946/AGENT.STDERR \n#PBS -l walltime=0:16:00 \n#PBS -q batch \n#PBS -A unc102 \n#PBS -l ncpus=11\nexport PBS_O_WORKDIR=/usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946 \nmkdir -p /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946\ncd /usr/users/9/mturilli/radical.pilot.sandbox/pilot-5464d7b820a641049a168946\n/bin/bash -l default_bootstrapper.sh -n radicalpilot -s 5464d7b120a641049a168942 -p 5464d7b820a641049a168946 -t 16 -c 11 -v 0.21 -m ec2-184-72-89-141.compute-1.amazonaws.com:27017 -a : -e \'source /usr/share/modules/init/bash\' -e \'module load python\' -l TORQUE -j DPLACE -k MPIRUN_DPLACE -x luve -d 10 -b " > $SCRIPTFILE && /usr/local/packages/torque/2.3.13_psc/bin/qsub $SCRIPTFILE && rm -f $SCRIPTFILE (/home/mturilli/Virtualenvs/AIMES-DEMO-SC2014/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py +99 (log_error_and_raise) : raise exception(message))\n'}]
A "real" fix for this at the SAGA level is not straightforward. However, it would be relatively easy to work around at the RP level: we could configure the
Sorry, Alec pressed "send" :)
What I wanted to say: We could configure the cores per node on resource level for systems that require it and take that into account in pilot requests.
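To make the idea concrete, here is a minimal sketch of what "take that into account in pilot requests" could look like. The `CORES_PER_NODE` table and the `round_up_cores` helper are hypothetical names for illustration only, not existing RP or SAGA API; the value 16 for blacklight is taken from the qsub error message above.

```python
import math

# Hypothetical per-resource setting (not an actual RP config key).
# 16 comes from the error "ncpus must be multiple of 16" on blacklight.
CORES_PER_NODE = {"xsede.blacklight": 16}


def round_up_cores(resource, requested):
    """Round a core request up to the next multiple of the
    resource's cores per node; leave it untouched if the
    resource has no such constraint configured."""
    cpn = CORES_PER_NODE.get(resource)
    if not cpn:
        return requested
    return int(math.ceil(requested / float(cpn))) * cpn


# The failing request above asked for 11 cores.
print(round_up_cores("xsede.blacklight", 11))  # 16
```

Rounding up rather than down means the pilot never gets fewer cores than the user asked for, at the cost of possibly charging for a few idle ones.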
Hi Mark,
Sure, it seems reasonable to me.
I remember we discussed and tested a patch for this one based on ceiling. Not sure why it is failing now.
Your memory was correct, and my "proposed" solution didn't come out of the blue either ...
Matteo reports this is not fixed.
Applied a fix on SAGA in https://github.com/radical-cybertools/saga-python/commit/32299fc8e6569395a051053ca93a09bfc9f2595c.
Added testcase test/issue359 in RP.
Currently that test fails because of #511, but if run one at a time it works.
I tested with the AIMES experiment runs and it still fails with:
Traceback (most recent call last):
File "/home/mturilli/Virtualenvs/AIMES-EXPERIMENTS-TESTING/local/lib/python2.7/site-packages/radical/pilot/controller/pilot_launcher_worker.py", line 589, in run
pilotjob.run()
File "/home/mturilli/Virtualenvs/AIMES-EXPERIMENTS-TESTING/local/lib/python2.7/site-packages/saga/job/job.py", line 416, in run
return self._adaptor.run (ttype=ttype)
File "/home/mturilli/Virtualenvs/AIMES-EXPERIMENTS-TESTING/local/lib/python2.7/site-packages/saga/adaptors/cpi/decorators.py", line 57, in wrap_function
return sync_function (self, *args, **kwargs)
File "/home/mturilli/Virtualenvs/AIMES-EXPERIMENTS-TESTING/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py", line 1195, in run
self._id = self.js._job_run(self._api())
File "/home/mturilli/Virtualenvs/AIMES-EXPERIMENTS-TESTING/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py", line 691, in _job_run
log_error_and_raise(message, saga.NoSuccess, self._logger)
File "/home/mturilli/Virtualenvs/AIMES-EXPERIMENTS-TESTING/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py", line 111, in log_error_and_raise
raise exception(message)
NoSuccess: Error running job via 'qsub': qsub: ncpus must be multiple of 16
#!/bin/bash
#PBS -V
#PBS -o /usr/users/9/mturilli/radical.pilot.sandbox/pilot-551365bf23769c5b821fe130/AGENT.STDOUT
#PBS -e /usr/users/9/mturilli/radical.pilot.sandbox/pilot-551365bf23769c5b821fe130/AGENT.STDERR
#PBS -l walltime=0:57:00
#PBS -q batch
#PBS -A unc101
#PBS -l ncpus=40
export PBS_O_WORKDIR=/usr/users/9/mturilli/radical.pilot.sandbox/pilot-551365bf23769c5b821fe130
mkdir -p /usr/users/9/mturilli/radical.pilot.sandbox/pilot-551365bf23769c5b821fe130
cd /usr/users/9/mturilli/radical.pilot.sandbox/pilot-551365bf23769c5b821fe130
export SAGA_PPN=16
/bin/bash -l default_bootstrapper.sh -n mturilli_aimesexperiments -s 551365bc23769c5b821fe12e -p 551365bf23769c5b821fe130 -t 57 -c 40 -v 0.23 -m 54.221.194.147:24242 -a : -e 'source /usr/share/modules/init/bash' -e 'module load python' -l TORQUE -j DPLACE -k MPIRUN_DPLACE -x luve -d 10 -b " > $SCRIPTFILE && /usr/local/packages/torque/2.3.13_psc/bin/qsub $SCRIPTFILE && rm -f $SCRIPTFILE (/home/mturilli/Virtualenvs/AIMES-EXPERIMENTS-TESTING/local/lib/python2.7/site-packages/saga/adaptors/pbs/pbsjob.py +111 (log_error_and_raise) : raise exception(message))
2015:03:26 01:49:59 radical.pilot.MainProcess: [INFO ] ComputePilot '551365bf23769c5b821fe130' state changed from 'Launching' to 'Failed'.
The pilot description is:
xsede.blacklight:
Allocation; None -> RP default : unc101
Queue; None -> RP default : None
Number of cores : 40
Walltime in minutes : 57
Stop once the workflow is done : True
No pilot logs available as expected.
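For what it's worth, the script above requests ncpus=40 while exporting SAGA_PPN=16, and 40 is not a multiple of 16, which matches the qsub error. A small check along these lines could flag that before submission. This is only a sketch: `check_ncpus` is a hypothetical helper, not part of SAGA or RP, and the script string is a shortened copy of the one in the log.

```python
import re


def check_ncpus(script, ppn):
    """Find the first '#PBS -l ncpus=N' directive in a job script
    and report whether N is a multiple of ppn.
    Returns a (ncpus, ok) tuple."""
    match = re.search(r"^#PBS\s+-l\s+ncpus=(\d+)\s*$", script, re.MULTILINE)
    if match is None:
        raise ValueError("no ncpus directive found")
    ncpus = int(match.group(1))
    return ncpus, ncpus % ppn == 0


# Shortened copy of the generated script from the log above.
script = """#!/bin/bash
#PBS -q batch
#PBS -l ncpus=40
export SAGA_PPN=16
"""

print(check_ncpus(script, 16))  # (40, False)
```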
Running some more tests, verifying that it was not a problem with the VE upgrade. Will report or close depending on the outcome of the test.
See #529.
I remember we discussed and tested a patch for this one based on ceiling. Not sure why it is failing now. I am trying to run 1024 CUs on 3 pilots on trestles, gordon, and blacklight.
And so on, on both gordon and blacklight.