marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

useGrid=remote tries to run locally #1260

Closed itcarroll closed 5 years ago

itcarroll commented 5 years ago

The useGrid=remote parameter is ignored (see below). The default behavior (useGrid=true) submits a job as expected.

icarroll@sshgw02:test$ ../bin/canu genomeSize=4.8m useGrid=remote -p ecoli -d ecoli-oxford -nanopore-raw /nfs/icarroll-data/support/dhawthorne/test/oxford.fasta
-- Canu snapshot v1.8 +117 changes (r9327 dc859c7b4d2065d9412d5683e71d289af6ebf7ed)
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
-- 
-- Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM.
-- De novo assembly of haplotype-resolved genomes with trio binning.
-- Nat Biotechnol. 2018
-- https//doi.org/10.1038/nbt.4277
-- 
-- Read and contig alignments during correction, consensus and GFA building use:
--   Šošic M, Šikic M.
--   Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
-- 
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
-- 
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
-- 
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_191' (from 'java') with -d64 support.
--
-- WARNING:
-- WARNING:  Failed to run gnuplot using command 'gnuplot'.
-- WARNING:  Plots will be disabled.
-- WARNING:
--
-- Detected 2 CPUs and 8 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with 'MaxArraySize' limited to 1000 jobs.
-- 
-- Found   2 hosts with   4 cores and    9 GB memory under Slurm control.
-- Found   4 hosts with   8 cores and  121 GB memory under Slurm control.
-- Found  20 hosts with   8 cores and   59 GB memory under Slurm control.
--
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl      9 GB    4 CPUs  (k-mer counting)
-- Grid:  hap        8 GB    4 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap    6 GB    4 CPUs  (overlap detection with mhap)
-- Grid:  obtovl     4 GB    4 CPUs  (overlap detection)
-- Grid:  utgovl     4 GB    4 CPUs  (overlap detection)
-- Grid:  cor      --- GB    4 CPUs  (read correction)
-- Grid:  ovb        4 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs        8 GB    1 CPU   (overlap store sorting)
-- Grid:  red        8 GB    4 CPUs  (read error detection)
-- Grid:  oea        4 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       16 GB    4 CPUs  (contig construction with bogart)
-- Grid:  cns      --- GB    4 CPUs  (consensus)
-- Grid:  gfa        8 GB    4 CPUs  (GFA alignment and processing)
--
-- Found Nanopore uncorrected reads in the input files.
--
-- Generating assembly 'ecoli' in '/research-home/icarroll/support/dhawthorne/test/ecoli-oxford'
--
-- Parameters:
--
--  genomeSize        4800000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.1200 ( 12.00%)
--    utgOvlErrorRate 0.1200 ( 12.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.1200 ( 12.00%)
--    utgErrorRate    0.1200 ( 12.00%)
--    cnsErrorRate    0.2000 ( 20.00%)
--
--
-- BEGIN CORRECTION
--
----------------------------------------
-- Starting command on Fri Feb 22 16:52:30 2019 with 588.85 GB free disk space

    cd .
    /research-home/icarroll/src/canu/Linux-amd64/bin/sqStoreCreate \
      -o ./ecoli.seqStore.BUILDING \
      -minlength 1000 \
      ./ecoli.seqStore.ssi \
    > ./ecoli.seqStore.BUILDING.err 2>&1

-- Finished on Fri Feb 22 16:52:32 2019 (2 seconds) with 588.806 GB free disk space
----------------------------------------
--
-- WARNING: gnuplot failed.
--
----------------------------------------
--
-- In sequence store './ecoli.seqStore':
--   Found 20365 reads.
--   Found 140042151 bases (29.17 times coverage).
--
--   Read length histogram (one '*' equals 41.48 reads):
--     1000   1999    706 *****************
--     2000   2999   1682 ****************************************
--     3000   3999   1624 ***************************************
--     4000   4999   1543 *************************************
--     5000   5999   1905 *********************************************
--     6000   6999   2691 ****************************************************************
--     7000   7999   2904 **********************************************************************
--     8000   8999   2609 **************************************************************
--     9000   9999   1946 **********************************************
--    10000  10999   1280 ******************************
--    11000  11999    733 *****************
--    12000  12999    397 *********
--    13000  13999    181 ****
--    14000  14999    109 **
--    15000  15999     38 
--    16000  16999      9 
--    17000  17999      4 
--    18000  18999      2 
--    19000  19999      0 
--    20000  20999      0 
--    21000  21999      0 
--    22000  22999      1 
--    23000  23999      0 
--    24000  24999      0 
--    25000  25999      1 
----------------------------------------
-- Starting command on Fri Feb 22 16:52:33 2019 with 588.806 GB free disk space

    cd correction/0-mercounts
    ./meryl-configure.sh \
    > ./meryl-configure.err 2>&1

-- Finished on Fri Feb 22 16:52:33 2019 (in the blink of an eye) with 588.806 GB free disk space
----------------------------------------
--  segments   memory batches
--  -------- -------- -------

ABORT:
ABORT: Canu snapshot v1.8 +117 changes (r9327 dc859c7b4d2065d9412d5683e71d289af6ebf7ed)
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
ABORT:   failed to parse meryl configure output 'correction/0-mercounts/ecoli.ms16.config.01.out'.
ABORT:
ABORT: Disk space available:  588.806 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (correction/0-mercounts/ecoli.ms16.config.01.out):
ABORT:
ABORT:       equal-to N           return kmers that occur exactly N times in the input.  accepts exactly one input.
ABORT:       not-equal-to N       return kmers that do not occur exactly N times in the input.  accepts exactly one input.
ABORT:   
ABORT:       increase X           add X to the count of each kmer.
ABORT:       decrease X           subtract X from the count of each kmer.
ABORT:       multiply X           multiply the count of each kmer by X.
ABORT:       divide X             divide the count of each kmer by X.
ABORT:       modulo X             set the count of each kmer to the remainder of the count divided by X.
ABORT:   
ABORT:       union                return kmers that occur in any input, set the count to the number of inputs with this kmer.
ABORT:       union-min            return kmers that occur in any input, set the count to the minimum count
ABORT:       union-max            return kmers that occur in any input, set the count to the maximum count
ABORT:       union-sum            return kmers that occur in any input, set the count to the sum of the counts
ABORT:   
ABORT:       intersect            return kmers that occur in all inputs, set the count to the count in the first input.
ABORT:       intersect-min        return kmers that occur in all inputs, set the count to the minimum count.
ABORT:       intersect-max        return kmers that occur in all inputs, set the count to the maximum count.
ABORT:       intersect-sum        return kmers that occur in all inputs, set the count to the sum of the counts.
ABORT:   
ABORT:       difference           return kmers that occur in the first input, but none of the other inputs
ABORT:       symmetric-difference return kmers that occur in exactly one input
ABORT:   
ABORT:     MODIFIERS:
ABORT:   
ABORT:       output O             write kmers generated by the present command to an output  meryl database O
ABORT:                            mandatory for count operations.
ABORT:   
ABORT:     EXAMPLES:
ABORT:   
ABORT:     Example:  Report 22-mers present in at least one of input1.fasta and input2.fasta.
ABORT:               Kmers from each input are saved in meryl databases 'input1' and 'input2',
ABORT:               but the kmers in the union are only reported to the screen.
ABORT:   
ABORT:               meryl print \
ABORT:                       union \
ABORT:                         [count k=22 input1.fasta output input1] \
ABORT:                         [count k=22 input2.fasta output input2]
ABORT:   
ABORT:     Example:  Find the highest count of each kmer present in both files, save the kmers to
ABORT:               database 'maxCount'.
ABORT:   
ABORT:               meryl intersect-max input1 input2 output maxCount
ABORT:   
ABORT:     Example:  Find unique kmers common to both files.  Brackets are necessary
ABORT:               on the first 'equal-to' command to prevent the second 'equal-to' from
ABORT:               being used as an input to the first 'equal-to'.
ABORT:   
ABORT:               meryl intersect [equal-to 1 input1] equal-to 1 input2
ABORT:   
ABORT:   Requested memory 'memory=9' (GB) is more than physical memory 7.80 GB.
ABORT:
icarroll@sshgw02:test$ ../bin/canu -version
Canu snapshot v1.8 +117 changes (r9327 dc859c7b4d2065d9412d5683e71d289af6ebf7ed)
icarroll@sshgw02:test$ cat /proc/version
Linux version 4.4.0-142-generic (buildd@lgw01-amd64-033) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019

brianwalenz commented 5 years ago

This is normal. The executive does some lightweight computation to configure the compute-intensive parallel parts for grid execution, then stops and tells you to submit the jobs.

What isn't normal is:

ABORT:   Requested memory 'memory=9' (GB) is more than physical memory 7.80 GB.

The meryl stage is trying to configure for a 9 GB grid job, but it fails because the machine you're on only has 8 GB. A workaround is to set merylMemory=7g. For bacteria, that's way more memory than it needs.
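
To make the failure concrete, here is a minimal sketch of the kind of comparison behind that ABORT message. It is not Canu's or meryl's actual code; the function names `fits_in_memory` and `physical_gb` are made up for illustration, and the 7.80 GB figure is taken from the log above.

```shell
# Sketch only: mimics the comparison behind the "Requested memory ... is more
# than physical memory" message. Not the real meryl implementation.

# fits_in_memory REQUESTED_GB PHYSICAL_GB
# Exit status 0 if the request fits, non-zero otherwise.
fits_in_memory() {
    awk -v req="$1" -v phys="$2" 'BEGIN { exit !(req + 0 <= phys + 0) }'
}

# Physical memory of the current Linux host in GB, read from /proc/meminfo.
physical_gb() {
    awk '/^MemTotal:/ { printf "%.2f\n", $2 / 1024 / 1024 }' /proc/meminfo
}

# The two requests discussed in this thread, against the 7.80 GB head node.
for request in 9 7; do
    if fits_in_memory "$request" 7.80; then
        echo "memory=$request fits"
    else
        echo "memory=$request does not fit"
    fi
done
```

On the machine where canu is launched, `fits_in_memory 9 "$(physical_gb)"` would run the same check against the real memory size.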

itcarroll commented 5 years ago

Thanks for the rapid response! The workaround does not work, and I have a related question. Say I try this with genomeSize=980m: will the executive know not to do any work locally (I'm not supposed to use a head node for long-running jobs)? (Never mind, I re-read your answer.)

The workaround fails with the same error; it does not seem to be honoring the memory request:

icarroll@sshgw02:test$ ../bin/canu genomeSize=4.8m useGrid=remote merylMemory=7g -p ecoli -d ecoli-oxford -nanopore-raw /nfs/icarroll-data/support/dhawthorne/test/oxford.fasta
-- Canu snapshot v1.8 +117 changes (r9327 dc859c7b4d2065d9412d5683e71d289af6ebf7ed)
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
-- 
-- Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM.
-- De novo assembly of haplotype-resolved genomes with trio binning.
-- Nat Biotechnol. 2018
-- https//doi.org/10.1038/nbt.4277
-- 
-- Read and contig alignments during correction, consensus and GFA building use:
--   Šošic M, Šikic M.
--   Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
-- 
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
-- 
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
-- 
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_191' (from 'java') with -d64 support.
--
-- WARNING:
-- WARNING:  Failed to run gnuplot using command 'gnuplot'.
-- WARNING:  Plots will be disabled.
-- WARNING:
--
-- Detected 2 CPUs and 8 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with 'MaxArraySize' limited to 1000 jobs.
-- 
-- Found   2 hosts with   4 cores and    9 GB memory under Slurm control.
-- Found   4 hosts with   8 cores and  121 GB memory under Slurm control.
-- Found  20 hosts with   8 cores and   59 GB memory under Slurm control.
--
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl      7 GB    4 CPUs  (k-mer counting)
-- Grid:  hap        8 GB    4 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap    6 GB    4 CPUs  (overlap detection with mhap)
-- Grid:  obtovl     4 GB    4 CPUs  (overlap detection)
-- Grid:  utgovl     4 GB    4 CPUs  (overlap detection)
-- Grid:  cor      --- GB    4 CPUs  (read correction)
-- Grid:  ovb        4 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs        8 GB    1 CPU   (overlap store sorting)
-- Grid:  red        8 GB    4 CPUs  (read error detection)
-- Grid:  oea        4 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       16 GB    4 CPUs  (contig construction with bogart)
-- Grid:  cns      --- GB    4 CPUs  (consensus)
-- Grid:  gfa        8 GB    4 CPUs  (GFA alignment and processing)
--
-- In 'ecoli.seqStore', found Nanopore reads:
--   Raw:        20365
--   Corrected:  0
--   Trimmed:    0
--
-- Generating assembly 'ecoli' in '/research-home/icarroll/support/dhawthorne/test/ecoli-oxford'
--
-- Parameters:
--
--  genomeSize        4800000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.1200 ( 12.00%)
--    utgOvlErrorRate 0.1200 ( 12.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.1200 ( 12.00%)
--    utgErrorRate    0.1200 ( 12.00%)
--    cnsErrorRate    0.2000 ( 20.00%)
--
--
-- BEGIN CORRECTION
--
--  segments   memory batches
--  -------- -------- -------

ABORT:
ABORT: Canu snapshot v1.8 +117 changes (r9327 dc859c7b4d2065d9412d5683e71d289af6ebf7ed)
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
ABORT:   failed to parse meryl configure output 'correction/0-mercounts/ecoli.ms16.config.01.out'.
ABORT:
ABORT: Disk space available:  588.775 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (correction/0-mercounts/ecoli.ms16.config.01.out):
ABORT:
ABORT:       equal-to N           return kmers that occur exactly N times in the input.  accepts exactly one input.
ABORT:       not-equal-to N       return kmers that do not occur exactly N times in the input.  accepts exactly one input.
ABORT:   
ABORT:       increase X           add X to the count of each kmer.
ABORT:       decrease X           subtract X from the count of each kmer.
ABORT:       multiply X           multiply the count of each kmer by X.
ABORT:       divide X             divide the count of each kmer by X.
ABORT:       modulo X             set the count of each kmer to the remainder of the count divided by X.
ABORT:   
ABORT:       union                return kmers that occur in any input, set the count to the number of inputs with this kmer.
ABORT:       union-min            return kmers that occur in any input, set the count to the minimum count
ABORT:       union-max            return kmers that occur in any input, set the count to the maximum count
ABORT:       union-sum            return kmers that occur in any input, set the count to the sum of the counts
ABORT:   
ABORT:       intersect            return kmers that occur in all inputs, set the count to the count in the first input.
ABORT:       intersect-min        return kmers that occur in all inputs, set the count to the minimum count.
ABORT:       intersect-max        return kmers that occur in all inputs, set the count to the maximum count.
ABORT:       intersect-sum        return kmers that occur in all inputs, set the count to the sum of the counts.
ABORT:   
ABORT:       difference           return kmers that occur in the first input, but none of the other inputs
ABORT:       symmetric-difference return kmers that occur in exactly one input
ABORT:   
ABORT:     MODIFIERS:
ABORT:   
ABORT:       output O             write kmers generated by the present command to an output  meryl database O
ABORT:                            mandatory for count operations.
ABORT:   
ABORT:     EXAMPLES:
ABORT:   
ABORT:     Example:  Report 22-mers present in at least one of input1.fasta and input2.fasta.
ABORT:               Kmers from each input are saved in meryl databases 'input1' and 'input2',
ABORT:               but the kmers in the union are only reported to the screen.
ABORT:   
ABORT:               meryl print \
ABORT:                       union \
ABORT:                         [count k=22 input1.fasta output input1] \
ABORT:                         [count k=22 input2.fasta output input2]
ABORT:   
ABORT:     Example:  Find the highest count of each kmer present in both files, save the kmers to
ABORT:               database 'maxCount'.
ABORT:   
ABORT:               meryl intersect-max input1 input2 output maxCount
ABORT:   
ABORT:     Example:  Find unique kmers common to both files.  Brackets are necessary
ABORT:               on the first 'equal-to' command to prevent the second 'equal-to' from
ABORT:               being used as an input to the first 'equal-to'.
ABORT:   
ABORT:               meryl intersect [equal-to 1 input1] equal-to 1 input2
ABORT:   
ABORT:   Requested memory 'memory=9' (GB) is more than physical memory 7.80 GB.
ABORT:

brianwalenz commented 5 years ago

First, the failing memory: it found the result from the first run of 'meryl-configure.sh' (using 9 GB of memory). A long-standing problem with Canu is that changing parameters in the middle of an assembly doesn't always work correctly. Removing 0-mercounts/ will let this be rerun.
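
A minimal sketch of that cleanup, assuming the stale configuration lives under `correction/0-mercounts` inside the directory given to canu with `-d` (the path shown in the ABORT message). The helper name `reset_mercounts` is hypothetical.

```shell
# Sketch only: remove the cached meryl configuration so the next canu run
# regenerates it with the new merylMemory setting.
# Usage: reset_mercounts ASSEMBLY_DIR   (the directory passed to canu via -d)
reset_mercounts() {
    assembly_dir="$1"
    stage_dir="$assembly_dir/correction/0-mercounts"
    if [ -d "$stage_dir" ]; then
        rm -r -- "$stage_dir"
        echo "removed $stage_dir"
    else
        echo "nothing to remove at $stage_dir"
    fi
}
```

For the run in this thread, that would be `reset_mercounts ecoli-oxford`, followed by the same canu command with `merylMemory=7g`.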

As for head nodes and 'remote' - yup, it'll try to run some long-running jobs on the head node. In particular, there are several steps that want to process all reads or all overlaps.

An alternate strategy would be to grab an interactive node and run canu useGrid=remote on there. This would solve both the memory and time problems.
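
On a Slurm cluster, one possible way to get such an interactive node is `srun --pty`. The sketch below only prints the command so it can be reviewed first. The helper name and the default memory and CPU values are placeholders, and many sites also require a partition or time limit.

```shell
# Sketch only: print (do not run) a Slurm command that opens an interactive
# login shell on a compute node. The defaults are placeholders, not
# recommendations from the Canu authors.
# Usage: interactive_shell_cmd [MEM_GB] [CPUS]
interactive_shell_cmd() {
    mem_gb="${1:-16}"
    cpus="${2:-4}"
    printf 'srun --pty --mem=%sG --cpus-per-task=%s bash -l\n' "$mem_gb" "$cpus"
}

interactive_shell_cmd 16 4
```

Once the shell opens on the compute node, the canu command with `useGrid=remote` can be run there instead of on the head node.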

What are you trying to accomplish with useGrid=remote? Maybe we can think up a different solution.

itcarroll commented 5 years ago

> First, the failing memory: it found the result from the first run of 'meryl-configure.sh' (using 9 GB of memory). A long-standing problem with Canu is that changing parameters in the middle of an assembly doesn't always work correctly. Removing 0-mercounts/ will let this be rerun.

Yep, should have caught that. Working now. So I'm all set, but leaving open for the bug you flagged.

> As for head nodes and 'remote' - yup, it'll try to run some long-running jobs on the head node. In particular, there are several steps that want to process all reads or all overlaps.

Okay, with the merylMemory config low enough to run on the head node, it actually doesn't take too much time. The reason I am trying useGrid=remote is to understand before submission what the resource request will be.
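
For that purpose, the `Grid:` table canu prints during configuration can be reduced to one line per stage. A minimal sketch, assuming the banner format shown in the logs above; the helper name `grid_requests` is made up:

```shell
# Sketch only: extract "tag memory_GB threads" from the "-- Grid:" lines of
# a canu configuration banner read on stdin. Stages whose memory is not yet
# decided keep the "---" placeholder from the log.
grid_requests() {
    awk '$1 == "--" && $2 == "Grid:" { print $3, $4, $6 }'
}

# Three lines copied from the log in this thread.
printf '%s\n' \
    '-- Grid:  meryl      9 GB    4 CPUs  (k-mer counting)' \
    '-- Grid:  cor      --- GB    4 CPUs  (read correction)' \
    '-- Grid:  ovb        4 GB    1 CPU   (overlap store bucketizer)' |
    grid_requests
```

If the banner was saved to a file (say a hypothetical `canu.log`), `grid_requests < canu.log` gives the same summary for every stage.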

brianwalenz commented 5 years ago

I (finally) fixed the meryl configuration problem. You can now happily configure it for memory larger than available on the head node.