marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Running Canu 1.6 with SGE terminates the process before correction #631

Closed ml3958 closed 7 years ago

ml3958 commented 7 years ago

Hi, I am using Canu 1.6 on a remote cluster with SGE resources. I have no problem running Canu without SGE; the problem arises when I try to use the SGE resources:

canu-scripts: canu.01.out canu.01.sh


The following is the log output:

-- Canu 1.6
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
--
-- Read and contig alignments during correction, consensus and GFA building use:
--   Šošic M, Šikic M.
--   Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
--
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
--
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
--
--   Li H.
--   Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences.
--   Bioinformatics. 2016 Jul 15;32(14):2103-10.
--   http://doi.org/10.1093/bioinformatics/btw152
--
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
--
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
--
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_141' (from '/usr/lib/jvm/java-1.8.0/bin/java').
-- Detected gnuplot version '4.2 patchlevel 6 ' (from 'gnuplot') and image format 'png'.
-- Detected 32 CPUs and 126 gigabytes of memory.
-- Detected Sun Grid Engine in '/cm/shared/apps/sge/2011.11p1/default'.
-- Detected Grid Engine environment 'threaded'.
-- User supplied Grid Engine consumable '-l h_vmem=MEMORY -l mem_free=MEMORY'.
--
-- WARNING:
-- WARNING:  Queue 'gpu1.q' has start mode set to 'posix_behavior' and shell set to '/bin/csh'.
-- WARNING:
-- WARNING:  Some queues in your configuration will fail to start jobs correctly.
-- WARNING:  Jobs will be submitted with option:
-- WARNING:    gridOptions=-S /bin/sh
-- WARNING:
-- WARNING:  If jobs fail to start, modify the above option to use a valid shell
-- WARNING:  and supply it directly to canu.
-- WARNING:
--
-- Found   1 host  with  64 cores and 1009 GB memory under Sun Grid Engine control.
-- Found   5 hosts with  32 cores and  125 GB memory under Sun Grid Engine control.
-- Found   1 host  with   8 cores and   62 GB memory under Sun Grid Engine control.
-- Found   2 hosts with  48 cores and  755 GB memory under Sun Grid Engine control.
-- Found  63 hosts with  32 cores and  252 GB memory under Sun Grid Engine control.
--
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl      8 GB    4 CPUs  (k-mer counting)
-- Grid:  cormhap    6 GB    8 CPUs  (overlap detection with mhap)
-- Grid:  obtovl     8 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl     8 GB    8 CPUs  (overlap detection)
-- Grid:  cor        7 GB    2 CPUs  (read correction)
-- Grid:  ovb        3 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs        8 GB    1 CPU   (overlap store sorting)
-- Grid:  red        2 GB    4 CPUs  (read error detection)
-- Grid:  oea        1 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       15 GB    4 CPUs  (contig construction)
-- Grid:  cns       15 GB    4 CPUs  (consensus)
-- Grid:  gfa        8 GB    4 CPUs  (GFA alignment and processing)
--
-- Found Nanopore uncorrected reads in the input files.
--
-- Generating assembly 'oxk_loose' in '/ifs/data/blaserlab/menghan/OxfGenomes/OXK/nanopore_loose_canu'
--
-- Parameters:
--
--  genomeSize        2490000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.1440 ( 14.40%)
--    utgOvlErrorRate 0.1440 ( 14.40%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.1440 ( 14.40%)
--    utgErrorRate    0.1440 ( 14.40%)
--    cnsErrorRate    0.1920 ( 19.20%)

-- Starting command on Fri Sep 15 11:22:56 2017 with 205245.893 GB free disk space

cd /ifs/data/blaserlab/menghan/OxfGenomes/OXK/nanopore_loose_canu
qsub \
  -l h_vmem=8g \
  -l mem_free=8g \
  -pe threaded 1 \
  -S /bin/sh  \
  -cwd \
  -N 'canu_oxk_loose' \
  -j y \
  -o canu-scripts/canu.01.out canu-scripts/canu.01.sh

Your job 3550345 ("canu_oxk_loose") has been submitted

-- Finished on Fri Sep 15 11:22:56 2017 (lickety-split) with 205245.893 GB free disk space



Any chance you know what is going on? Why does enabling SGE kill the process?

Thanks!!
skoren commented 7 years ago

This is the expected behavior. When Canu runs on the grid, it submits itself to your grid and doesn't run any processes on the head node. If you have jobs in the queue (qstat), then Canu is running.

I would also suggest leaving off h_vmem: mem_free is probably requested on a per-core basis (Canu will set MEMORY to be total / # threads), but h_vmem is per job, not per core, so your jobs may not run properly with it.
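For illustration, here is a hedged sketch (not the exact submission Canu generates; job.sh stands in for one of Canu's generated scripts) of how the two consumables differ for a stage that wants 8 GB across 8 threads, assuming, as described above, that mem_free is accounted per slot while h_vmem applies to the whole job:

    # Canu substitutes MEMORY = 8 GB / 8 threads = 1g.

    # Per-slot consumable: 8 slots x 1 GB = 8 GB total, as intended.
    qsub -pe threaded 8 -l mem_free=1g job.sh

    # Per-job limit: the same substitution caps the entire job at 1 GB,
    # so it can be killed once it grows past that.
    qsub -pe threaded 8 -l h_vmem=1g job.sh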

ml3958 commented 7 years ago

Thanks for the fast response! I did check my job queue (qstat); a job appeared there briefly, but it finished within seconds without any error.

I tried again with your suggestion of leaving off h_vmem. I still got no output except the two folders, and the submitted job finished very quickly.

Please see the output below:

[lium14@phoenix2 nanopore]$ /ifs/home/lium14/tools/canu-1.6/*/bin/canu \
> -p oxk_loose \
> -d /ifs/data/blaserlab/menghan/OxfGenomes/OXK/nanopore/loose_assembly_canu \
> genomeSize=2.49m \
> -nanopore-raw /ifs/data/sequence/results/blaserlab/2017-06-12-nanopore/reads/reads.2D.fastq.gz \
> corMhapSensitivity=high corMinCoverage=0 \
> gridEngineMemoryOption="-l mem_free=MEMORY"

-- Canu 1.6
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
-- 
-- Read and contig alignments during correction, consensus and GFA building use:
--   Šošic M, Šikic M.
--   Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
-- 
--   Li H.
--   Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences.
--   Bioinformatics. 2016 Jul 15;32(14):2103-10.
--   http://doi.org/10.1093/bioinformatics/btw152
-- 
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
-- 
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
-- 
-- CONFIGURE CANU
--

-- Detected Java(TM) Runtime Environment '1.8.0_141' (from '/usr/lib/jvm/java-1.8.0/bin/java').
-- Detected gnuplot version '4.2 patchlevel 6 ' (from 'gnuplot') and image format 'png'.
-- Detected 32 CPUs and 126 gigabytes of memory.
-- Detected Sun Grid Engine in '/cm/shared/apps/sge/2011.11p1/default'.
-- Detected Grid Engine environment 'threaded'.
-- User supplied Grid Engine consumable '-l mem_free=MEMORY'.
--
-- WARNING:
-- WARNING:  Queue 'gpu1.q' has start mode set to 'posix_behavior' and shell set to '/bin/csh'.
-- WARNING:
-- WARNING:  Some queues in your configuration will fail to start jobs correctly.
-- WARNING:  Jobs will be submitted with option:
-- WARNING:    gridOptions=-S /bin/sh
-- WARNING:
-- WARNING:  If jobs fail to start, modify the above option to use a valid shell
-- WARNING:  and supply it directly to canu.
-- WARNING:
-- 
-- Found   1 host  with  64 cores and 1009 GB memory under Sun Grid Engine control.
-- Found   5 hosts with  32 cores and  125 GB memory under Sun Grid Engine control.
-- Found   1 host  with   8 cores and   62 GB memory under Sun Grid Engine control.
-- Found   2 hosts with  48 cores and  755 GB memory under Sun Grid Engine control.
-- Found  63 hosts with  32 cores and  252 GB memory under Sun Grid Engine control.
--
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl      8 GB    4 CPUs  (k-mer counting)
-- Grid:  cormhap    6 GB    8 CPUs  (overlap detection with mhap)
-- Grid:  obtovl     8 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl     8 GB    8 CPUs  (overlap detection)
-- Grid:  cor        7 GB    2 CPUs  (read correction)
-- Grid:  ovb        3 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs        8 GB    1 CPU   (overlap store sorting)
-- Grid:  red        2 GB    4 CPUs  (read error detection)
-- Grid:  oea        1 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       15 GB    4 CPUs  (contig construction)
-- Grid:  cns       15 GB    4 CPUs  (consensus)
-- Grid:  gfa        8 GB    4 CPUs  (GFA alignment and processing)
--
-- Found Nanopore uncorrected reads in the input files.
--
-- Generating assembly 'oxk_loose' in '/ifs/data/blaserlab/menghan/OxfGenomes/OXK/nanopore/loose_assembly_canu'
--
-- Parameters:
--
--  genomeSize        2490000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.1440 ( 14.40%)
--    utgOvlErrorRate 0.1440 ( 14.40%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.1440 ( 14.40%)
--    utgErrorRate    0.1440 ( 14.40%)
--    cnsErrorRate    0.1920 ( 19.20%)
----------------------------------------
-- Starting command on Fri Sep 15 11:35:18 2017 with 205218.223 GB free disk space

    cd /ifs/data/blaserlab/menghan/OxfGenomes/OXK/nanopore/loose_assembly_canu
    qsub \
      -l mem_free=8g \
      -pe threaded 1 \
      -S /bin/sh  \
      -cwd \
      -N 'canu_oxk_loose' \
      -j y \
      -o canu-scripts/canu.01.out canu-scripts/canu.01.sh
Your job 3550407 ("canu_oxk_loose") has been submitted

-- Finished on Fri Sep 15 11:35:18 2017 (lickety-split) with 205218.242 GB free disk space
----------------------------------------

After running the command, I checked the job queue and the job finished within seconds.

[lium14@phoenix2 nanopore]$ 

[lium14@phoenix2 nanopore]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
3550407 0.00000 canu_oxk_l lium14       qw    09/15/2017 11:35:18                                    1        
[lium14@phoenix2 nanopore]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
3550407 0.00000 canu_oxk_l lium14       qw    09/15/2017 11:35:18                                    1        
[lium14@phoenix2 nanopore]$ ls
canu_loose.sh  canu_run.sh  loose_assembly_canu  nanopore_canu_old  nanopore_loose_canu_old
[lium14@phoenix2 nanopore]$ qstat
[lium14@phoenix2 nanopore]$ qstat
[lium14@phoenix2 nanopore]$ qstat
[lium14@phoenix2 nanopore]$ cd loose_assembly_canu/
[lium14@phoenix2 loose_assembly_canu]$ ls
canu-logs  canu-scripts
skoren commented 7 years ago

The job status qw indicates it is waiting to be scheduled by the system, so it hasn't run yet. You would have a canu.out file in your run folder if it had been scheduled and started running. Is there a canu.out from your previous job? If so, can you post its contents?

Also, avoid running the same command (e.g. in the same directory) while another job is still in the queue or running, as the two will likely collide and cause an error.

ml3958 commented 7 years ago

Thank you so much, Sergey! I think I did run two jobs with the same command, which might have messed up the process.

This time I killed all jobs and started a new one. I put the following command in a run.sh file and then ran `qsub run.sh`:

/ifs/home/lium14/tools/canu-1.6/*/bin/canu \
-p oxk \
-d /ifs/data/blaserlab/menghan/OxfGenomes/OXK/nanopore/assembly_canu_SGE \
genomeSize=2.49m \
-nanopore-raw /ifs/data/sequence/results/blaserlab/2017-06-12-nanopore/reads/reads.2D.fastq.gz \
gridEngineMemoryOption="-l mem_free=MEMORY"

I got a bit further: Canu actually tried to do the correction, but it terminated again before correction. I didn't have a canu.out file in my folder this time (I had canu.out previously when I ran with `useGrid=false`).

This is what I have in the directory:

[lium14@phoenix2 assembly_canu_SGE]$ ls
canu-logs  canu-scripts  correction  correction.html  correction.html.files  oxk.report

And in oxk.report I only have this:

[CORRECTION/READS]
--
-- In gatekeeper store 'correction/oxk.gkpStore':
--   Found 31151 reads.
--   Found 110683679 bases (44.45 times coverage).
--
--   Read length histogram (one '*' equals 177.48 reads):
--        0    999      0 
--     1000   1999   6661 *************************************
--     2000   2999   5936 *********************************
--     3000   3999  12424 **********************************************************************
--     4000   4999   2030 ***********
--     5000   5999   1259 *******
--     6000   6999    857 ****
--     7000   7999    573 ***
--     8000   8999    371 **
--     9000   9999    238 *
--    10000  10999    177 
--    11000  11999    136 
--    12000  12999     84 
--    13000  13999     68 
--    14000  14999     59 
--    15000  15999     43 
--    16000  16999     53 
--    17000  17999     26 
--    18000  18999     28 
--    19000  19999     25 
--    20000  20999     18 
--    21000  21999     21 
--    22000  22999     15 
--    23000  23999     14 
--    24000  24999      9 
--    25000  25999      4 
--    26000  26999      4 
--    27000  27999      4 
--    28000  28999      3 
--    29000  29999      3 
--    30000  30999      2 
--    31000  31999      2 
--    32000  32999      0 
--    33000  33999      0 
--    34000  34999      0 
--    35000  35999      1 
--    36000  36999      0 
--    37000  37999      0 
--    38000  38999      1 
--    39000  39999      0 
--    40000  40999      0 
--    41000  41999      0 
--    42000  42999      1 
--    43000  43999      1
skoren commented 7 years ago

Can you confirm your compute nodes are allowed to submit jobs? What are the contents of the canu-scripts folder (are there any *.out files there)? What does qstat report?
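One quick way to check, for example (the job name and output file below are arbitrary placeholders), is to log in to a compute node and submit a trivial job from there:

    # Run from a shell on a compute node, not the head node.
    echo 'hostname' | qsub -cwd -S /bin/sh -N submit_test -j y -o submit_test.out
    qstat    # the test job should appear, run, and leave submit_test.out behind

If that submission is rejected, the compute nodes are not configured as submit hosts.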

ml3958 commented 7 years ago

Yes, I think the compute nodes are allowed to submit jobs. I've qsubbed other job scripts and they worked fine.

There is a canu.01.out file in the canu-scripts folder; it contains:

/cm/local/apps/sge/var/spool/node053/job_scripts/3550522: line 9: 
/cm/shared/apps/sge/2011.11p1//common/settings.sh: No such file or directory
skoren commented 7 years ago

Ah, then this is the same as issue #505, which is fixed in the tip. Unfortunately, there are a lot of other unrelated changes in the tip. You could edit canu-1.6/Linux-amd64/bin/lib/canu/Execution.pm to remove lines 649-651:

 649     print F "if [ \"x\$SGE_ROOT\" != \"x\" ]; then \n"                                  if (getGlobal("gridEngine") eq "SGE");
 650     print F "  . \$SGE_ROOT/\$SGE_CELL/common/settings.sh\n"                            if (getGlobal("gridEngine") eq "SGE");
 651     print F "fi\n"                                                                      if (getGlobal("gridEngine") eq "SGE");

and see if that fixes your error.
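If you would rather not edit the file by hand, something like the following should do the same thing, assuming the line numbers in your copy match the 1.6 release (keep a backup):

    # Back up, then delete the three SGE settings.sh lines (649-651 in canu 1.6).
    cp canu-1.6/Linux-amd64/bin/lib/canu/Execution.pm Execution.pm.bak
    sed -i '649,651d' canu-1.6/Linux-amd64/bin/lib/canu/Execution.pm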

ml3958 commented 7 years ago

Thanks so much! It worked! Canu 1.6 proceeded!

But now I have a new error regarding the Java version. The issue is that the compute node I'm using has Java 1.6 by default; I always manually load Java 1.8 with `module load java/1.8`.

However, when using SGE, Canu submits multiple scripts automatically. How can I modify the Canu script so it knows to load Java 1.8 before calling SGE?

Thanks!

skoren commented 7 years ago

It shouldn't need to load the module. Just point it to the proper java binary (java=/full/path/to/java/binary), or add -V to your grid options (which preserves your current environment for the submitted job).
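As a concrete example (paths abbreviated from the commands earlier in this thread; the Java path is the one Canu detected in your log), either of the following should work:

    # Option 1: point Canu at a specific Java 1.8 binary.
    canu -p oxk -d assembly_canu_SGE genomeSize=2.49m \
      -nanopore-raw reads.2D.fastq.gz \
      gridEngineMemoryOption="-l mem_free=MEMORY" \
      java=/usr/lib/jvm/java-1.8.0/bin/java

    # Option 2: load the module first, then have SGE export the current
    # environment to every submitted job (-V), keeping the shell override.
    module load java/1.8
    canu -p oxk -d assembly_canu_SGE genomeSize=2.49m \
      -nanopore-raw reads.2D.fastq.gz \
      gridEngineMemoryOption="-l mem_free=MEMORY" \
      gridOptions="-V -S /bin/sh"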