mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

TORQUE trouble #149

Open wlandau-lilly opened 6 years ago

wlandau-lilly commented 6 years ago

I am having trouble running batchtools jobs on a local installation of TORQUE on Ubuntu 16.04. I think TORQUE is working because the following test.pbs produces the expected output.

#PBS -N test
#PBS -l nodes=1:ppn=1
#PBS -l walltime=0:01:00

cd $PBS_O_WORKDIR
touch done.txt
echo "done"

However, all my jobs hang in the E state. For example, the following R script waits indefinitely.

library("batchtools")
cf <- makeClusterFunctionsTORQUE("torque.tmpl") 
reg <- makeRegistry(NA)
reg$cluster.functions <- cf
batchMap(fun = identity, x = 1:4)
submitJobs()
waitForJobs() # waits here indefinitely
reduceResultsList() # not reached

In my case, the console message of waitForJobs()

Waiting (S:4 R:4 D:0 E:0) [-------------------]   0% eta:  ?s

does not match qstat, which shows jobs hanging in the E state.

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
98.localhost              ...8d7bd98804b04 wlandau         00:00:00 E batch          
99.localhost              ...fcce12fedcace wlandau         00:00:00 E batch          
100.localhost             ...dc63017b37ac6 wlandau         00:00:00 E batch          
101.localhost             ...b060e52879b8e wlandau         00:00:00 E batch 
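
To make the mismatch with the batchtools progress line easier to see, the job states in the qstat listing can be tallied. The following is only a minimal sketch: the function name count_job_states is hypothetical, and it assumes the default six-column qstat layout shown above, where the state letter is the fifth whitespace-separated field.

```shell
#!/bin/sh
# Tally TORQUE jobs by state from default `qstat` output read on stdin.
# Assumption: default layout (Job id, Name, User, Time Use, S, Queue),
# so the state letter is field 5. Header and separator lines are skipped
# because their first field does not start with a digit.
count_job_states() {
  awk '$1 ~ /^[0-9]/ && NF >= 6 { n[$5]++ }
       END { for (s in n) print s, n[s] }' | sort
}

# Usage on a live system:
#   qstat | count_job_states
```

For the listing above, this would report four jobs in state E, whereas the batchtools progress line counts them as running.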

I am using @HenrikBengtsson's torque.tmpl from future.batchtools.

Related: see my Stack Overflow post here and HenrikBengtsson/future.batchtools#12.

mllg commented 6 years ago

Looks like the system is not set up properly. Can you submit and run jobs manually?

wlandau-lilly commented 6 years ago

Pretty much. For jobs that do not depend on other jobs (as opposed to drake with the future-powered parallel backend), the following test.pbs script generates the correct output.

#PBS -N test
#PBS -l nodes=1:ppn=1
#PBS -l walltime=0:01:00

cd $PBS_O_WORKDIR
touch done.txt
echo "done"

Then the job hangs in the E state indefinitely.

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
46.localhost              test             wlandau         00:00:00 E batch   

I was just using a simple qsub test.pbs.

mllg commented 6 years ago

So the manual job also gets stuck in the E state (E for exiting)? Then this is a configuration issue.

wlandau-lilly commented 6 years ago

Seems about right. I just wish I knew what the right configuration was.