marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Error when running Canu in PBS system #54

Closed tangerzhang closed 8 years ago

tangerzhang commented 8 years ago

Hello, I am trying to run Canu on our PBS system, but I get an error like this. Do you have any idea how to fix it? Thanks!

___________________________________________________________________________
-- Detected 20 CPUs and 63 gigabytes of memory.
-- Detected Java(TM) Runtime Environment '1.8.0_60' (from 'java').
-- Detected PBS/Torque with 'pbsnodes' binary in /usr/local/bin/pbsnodes.
socket_connect_unix failed: 15137
pbsnodes: cannot connect to server melon, error=15137 (could not connect to trqauthd)
-- 
maxMemory 1048576 maxThreads 1024
maxMemory 1048576 maxThreads 1024
maxMemory 1048576 maxThreads 1024
maxMemory 1048576 maxThreads 1024
maxMemory 1048576 maxThreads 1024
maxMemory 1048576 maxThreads 1024
maxMemory 1048576 maxThreads 1024
maxMemory 1048576 maxThreads 1024
maxMemory 1048576 maxThreads 1024
maxMemory 1048576 maxThreads 1024
maxMemory 1048576 maxThreads 1024
maxMemory 1048576 maxThreads 1024
maxMemory 1048576 maxThreads 1024
maxMemory 1048576 maxThreads 1024
maxMemory 1048576 maxThreads 1024
--
-- Allowed to run under grid control, and use up to   4 compute threads and   16 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to  16 compute threads and    6 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to  16 compute threads and    6 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to  16 compute threads and    6 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    8 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to   4 compute threads and   32 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    4 GB memory for stage 'overlap store sequential building'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    4 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to   1 compute thread  and   16 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    6 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and    8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and    8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and    8 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to   4 compute threads and   16 GB memory for stage 'falcon_sense (read correction)'.
--
-- This is canu parallel iteration #1, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--   
--   genomeSize          -- 4800000
--   errorRate           -- 0.025
--   
--   corOvlErrorRate     -- 0.075
--   obtOvlErrorRate     -- 0.075
--   utgOvlErrorRate     -- 0.075
--   
--   obtErrorRate        -- 0.075
--   
--   utgGraphErrorRate   -- 0.05
--   utgBubbleErrorRate  -- 0.0625
--   utgMergeErrorRate   -- 0.0375
--   utgRepeatErrorRate  -- 0.05
--   
--   corErrorRate        -- 0.30
--   cnsErrorRate        -- 0.0625
--
--
-- BEGIN CORRECTION
--
--
-- GATEKEEPER (correction)
--
--
-- Starting command on Fri Feb 26 14:53:59 2016 with 908.7 GB free disk space
--
/share/workplace/home/zhangxt/software/canu-1.0/Linux-amd64/bin/gatekeeperCreate \
  -minlength 1000 \
  -o /share/bioinfo/zhangxt/test/Canu_test/ecoli-auto/correction/ecoli.gkpStore.BUILDING \
  /share/bioinfo/zhangxt/test/Canu_test/ecoli-auto/correction/ecoli.gkpStore.gkp \
> /share/bioinfo/zhangxt/test/Canu_test/ecoli-auto/correction/ecoli.gkpStore.err 2>&1
--
-- Finished on Fri Feb 26 14:55:30 2016 (91 seconds) with 908.1 GB free disk space
gnuplot < /share/bioinfo/zhangxt/test/Canu_test/ecoli-auto/correction/ecoli.gkpStore/readlengths.gp \
> /dev/null 2>&1

ERROR: Failed with signal 127

--
-- In gatekeeper store '/share/bioinfo/zhangxt/test/Canu_test/ecoli-auto/correction/ecoli.gkpStore':
--   Found 12528 reads.
--   Found 115899341 bases (24.14 times coverage).
--
--   Read length histogram (one '*' equals 20.62 reads):
--        0    999      0 
--     1000   1999   1444 **********************************************************************
--     2000   2999   1328 ****************************************************************
--     3000   3999   1065 ***************************************************
--     4000   4999    774 *************************************
--     5000   5999    668 ********************************
--     6000   6999    619 ******************************
--     7000   7999    618 *****************************
--     8000   8999    607 *****************************
--     9000   9999    560 ***************************
--    10000  10999    523 *************************
--    11000  11999    478 ***********************
--    12000  12999    429 ********************
--    13000  13999    379 ******************
--    14000  14999    366 *****************
--    15000  15999    353 *****************
--    16000  16999    329 ***************
--    17000  17999    297 **************
--    18000  18999    294 **************
--    19000  19999    283 *************
--    20000  20999    251 ************
--    21000  21999    195 *********
--    22000  22999    152 *******
--    23000  23999    132 ******
--    24000  24999     75 ***
--    25000  25999     66 ***
--    26000  26999     56 **
--    27000  27999     44 **
--    28000  28999     35 *
--    29000  29999     16 
--    30000  30999     21 *
--    31000  31999     18 
--    32000  32999     11 
--    33000  33999      8 
--    34000  34999      6 
--    35000  35999      6 
--    36000  36999     10 
--    37000  37999      2 
--    38000  38999      3 
--    39000  39999      2 
--    40000  40999      2 
--    41000  41999      2 
--    42000  42999      1 
-- MERYL (correction)
-- Meryl attempt 1 begins.
--
-- Starting command on Fri Feb 26 14:55:44 2016 with 907.9 GB free disk space
--
  qsub \
    -l mem=8g -l nodes=1:ppn=4 \
    -d `pwd` -N "meryl_ecoli" \
    -t 1-1 \
    -j oe -o /share/bioinfo/zhangxt/test/Canu_test/ecoli-auto/correction/0-mercounts/meryl.\$PBS_ARRAYID.out \
    /share/bioinfo/zhangxt/test/Canu_test/ecoli-auto/correction/0-mercounts/meryl.sh

socket_connect_unix failed: 15137
qsub: cannot connect to server (null) (errno=15137) could not connect to trqauthd
--
-- Finished on Fri Feb 26 14:55:46 2016 (2 seconds) with 907.9 GB free disk space

ERROR: Failed with signal NUM33 (33)

================================================================================
Please panic.  canu failed, and it shouldn't have.

Stack trace:

 at /share/workplace/home/zhangxt/software/canu-1.0/Linux-amd64/bin/lib/canu/Defaults.pm line 220
    canu::Defaults::caFailure('Failed to submit batch jobs', undef) called at /share/workplace/home/zhangxt/software/canu-1.0/Linux-amd64/bin/lib/canu/Execution.pm line 1125
    canu::Execution::submitOrRunParallelJob('/share/bioinfo/zhangxt/test/Canu_test/ecoli-auto', 'ecoli', 'meryl', '/share/bioinfo/zhangxt/test/Canu_test/ecoli-auto/correction/0...', 'meryl', 1) called at /share/workplace/home/zhangxt/software/canu-1.0/Linux-amd64/bin/lib/canu/Meryl.pm line 333
    canu::Meryl::merylCheck('/share/bioinfo/zhangxt/test/Canu_test/ecoli-auto', 'ecoli', 'cor') called at /share/workplace/home/zhangxt/software/canu-1.0/Linux-amd64/bin/canu line 402

canu failed with 'Failed to submit batch jobs'.
skoren commented 8 years ago

This sounds like a PBS configuration issue. Are your compute nodes allowed to submit jobs? Canu requires this because it re-submits array jobs and the master pipeline script as it runs. If you can, I would have the compute nodes (or a subset of them) configured to allow job submissions.
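A quick way to verify this mirrors what Canu does: submit a batch job whose script itself talks to PBS. This is a minimal sketch; the script name, job name, and resource request are illustrative:

#!/bin/bash
# submit-test.sh -- submit this with qsub, then inspect the output file;
# if the inner qsub succeeds, the node is allowed to submit jobs.
echo "running on $(hostname)"
pbsnodes > /dev/null && echo "pbsnodes OK"
echo 'sleep 1' | qsub -N inner_test && echo "qsub from node OK"

Submit it with something like qsub -l nodes=1:ppn=1 submit-test.sh and check the job's output file for the two OK lines.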

Without this, you would have to either run Canu on a single machine (useGrid=0) or manually keep restarting the pipeline. For example, above you would have to run the command

qsub \
-l mem=8g -l nodes=1:ppn=4 \
-d `pwd` -N "meryl_ecoli" \
-t 1-1 \
-j oe -o /share/bioinfo/zhangxt/test/Canu_test/ecoli-auto/correction/0-mercounts/meryl.\$PBS_ARRAYID.out \
/share/bioinfo/zhangxt/test/Canu_test/ecoli-auto/correction/0-mercounts/meryl.sh

and once it completes, re-launch Canu as you did initially. Many steps will stop this way, so you would have to restart many times.
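For reference, a single-machine run would look roughly like the sketch below. The genome size comes from the log above and useGrid=0 disables all grid submission; the read file name and the PacBio read type are assumptions (consistent with the correction stages in the log):

canu -p ecoli -d ecoli-auto \
  genomeSize=4.8m \
  useGrid=0 \
  -pacbio-raw reads.fastq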

tangerzhang commented 8 years ago

Hi Skoren, thanks for your suggestion. Our compute nodes are allowed to submit jobs, and we usually submit them from the command line with "qsub job_script.sh". I can successfully run Canu on a single machine, but I would prefer to run it on the grid because that is much faster. Do you have any suggestions for making our PBS system compatible with Canu? Thanks a lot.

skoren commented 8 years ago

The error messages in your log indicate that none of the PBS utilities work on the compute node: both pbsnodes and qsub fail to connect to the daemon (trqauthd). This usually means the nodes aren't set up to submit jobs or don't have the PBS daemon running on them. Unfortunately, there is nothing you can do from within Canu to work around it.
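If the only problem is that trqauthd isn't running on the node, an administrator can usually start it. The exact command depends on the Torque version and init system, so treat these as illustrative:

# as root on the affected node:
pgrep trqauthd || trqauthd      # start the authorization daemon directly
# or via the init system, if a service is installed:
service trqauthd start
systemctl start trqauthd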

Have you tried logging into a compute node and running qsub or pbsnodes to confirm they work? If you make the PBS utilities work on your compute nodes, Canu will run on your cluster.
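A minimal check, assuming interactive jobs are enabled on your cluster:

qsub -I -l nodes=1:ppn=1     # get a shell on a compute node
pbsnodes | head              # should list nodes, not a trqauthd error
echo 'sleep 1' | qsub        # should print a job id, not errno=15137

If either command fails there the same way it does in the Canu log, the fix is on the Torque side, not in Canu.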

skoren commented 8 years ago

Closing due to inactivity; this is a PBS configuration issue, not a Canu bug.