marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
659 stars 179 forks source link

2.1.1: overlapInCorePartition generates 67GB large shellscript full of just if and fi #1924

Closed mmokrejs closed 3 months ago

mmokrejs commented 3 years ago

Hi, it happened to me that overlapInCorePartition generated too many partitions and a 67GB large correction/1-overlapper/overlap.sh shellscript, practically just huge if/else code which a call to perl utility afterwards. Interpreting this file takes several CPU cores.

$ canu/build/bin/overlapInCorePartition
ERROR:  Hash length (-hl) must be specified.
ERROR:  Reference length (-rl) must be specified.
ERROR:  seqStore (-S) must be supplied.
usage: /storage/plzen1/home/mmokrejs/apps/canu/build/bin/overlapInCorePartition [opts]
  Someone should write the command line help.
  But this is only used interally to canu, so...
$

When canu configured itself, it logged:

--                                (tag)Concurrency
--                         (tag)Threads          |
--                (tag)Memory         |          |
--        (tag)             |         |          |       total usage      algorithm
--        -------  ----------  --------   --------  --------------------  -----------------------------
-- Local: meryl     64.000 GB    8 CPUs x  16 jobs  1024.000 GB 128 CPUs  (k-mer counting)
-- Local: hap       16.000 GB   64 CPUs x   2 jobs    32.000 GB 128 CPUs  (read-to-haplotype assignment)
-- Local: corovl     8.000 GB    1 CPU  x 128 jobs  1024.000 GB 128 CPUs  (overlap detection)
-- Local: obtovl    24.000 GB   16 CPUs x   8 jobs   192.000 GB 128 CPUs  (overlap detection)
-- Local: utgovl    24.000 GB   16 CPUs x   8 jobs   192.000 GB 128 CPUs  (overlap detection)
-- Local: cor       24.000 GB    4 CPUs x  32 jobs   768.000 GB 128 CPUs  (read correction)
-- Local: ovb        4.000 GB    1 CPU  x 128 jobs   512.000 GB 128 CPUs  (overlap store bucketizer)
-- Local: ovs       32.000 GB    1 CPU  x  62 jobs  1984.000 GB  62 CPUs  (overlap store sorting)
-- Local: red       64.000 GB    8 CPUs x  16 jobs  1024.000 GB 128 CPUs  (read error detection)
-- Local: oea        8.000 GB    1 CPU  x 128 jobs  1024.000 GB 128 CPUs  (overlap error adjustment)
-- Local: bat      1024.000 GB   64 CPUs x   1 job   1024.000 GB  64 CPUs  (contig construction with bogart)
-- Local: cns        -.--- GB    8 CPUs x   - jobs     -.--- GB   - CPUs  (consensus)

Then, it did its math:

-- OVERLAPPER (normal) (correction) erate=0.32
--
----------------------------------------
-- Starting command on Sat Feb 13 23:39:51 2021 with 270611.71 GB free disk space

    cd correction/1-overlapper
    /auto/plzen1/home/mmokrejs/apps/canu-2.1.1/build/bin/overlapInCorePartition \
     -S  ../../my_genome.seqStore \
     -hl 2500000 \
     -rl 2000000 \
     -ol 500 \
     -o  ./my_genome.partition \
    > ./my_genome.partition.err 2>&1

-- Finished on Sat Feb 13 23:56:51 2021 (1020 seconds) with 270368.935 GB free disk space
----------------------------------------
--
-- Configured 492959931 overlapInCore jobs.
-- Finished stage 'cor-overlapConfigure', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'corovl' concurrent execution on Sun Feb 14 04:54:43 2021 with 269805.128 GB free disk space (492959931 processes; 128 concurrently)

I think for sure a roadblock should be installed somewhere so this huge piece of code should be prevented from further processing. Than we can think of what to change in the code (dunno where).

BTW, the if and fi should at least be turned into if and elif.

if [ $jobid -eq 90 ] ; then
  bat="001"
  job="001/000090"
  opt="-h 5595-5812 -r 752-1503 --hashdatalen 2525793"
fi

if [ $jobid -eq 91 ] ; then
  bat="001"
  job="001/000091"
  opt="-h 5595-5812 -r 1504-2250 --hashdatalen 2525793"
fi

if [ $jobid -eq 92 ] ; then
  bat="001"
  job="001/000092"
  opt="-h 5595-5812 -r 2251-2770 --hashdatalen 2525793"
fi

if [ $jobid -eq 93 ] ; then
  bat="001"
  job="001/000093"
  opt="-h 5595-5812 -r 2771-3197 --hashdatalen 2525793"
fi

if [ $jobid -eq 94 ] ; then
  bat="001"
  job="001/000094"
  opt="-h 5595-5812 -r 3198-3562 --hashdatalen 2525793"
fi
skoren commented 3 years ago

You're running correction which should not use overlapInCore by default and so likely it hasn't been configured correctly/tested. It is definitely not recommended or supported to run overlapInCore for this step. Did you set canu to use overlapInCore instead of the default mhap? As for the if/elif, the script is auto-generated so it makes sense to keep all the format of all the partitions the same.

mmokrejs commented 3 years ago

Here is the commandline from the logs how it was configured for the first time:

canu useGrid=false -p my_genome -d my_genome_2.1.1_correctedErrorRate_0.16_v4 genomeSize=6.8g correctedErrorRate=0.16 Overlapper=ovl corMhapSensitivity=high MhapMerSize=10 -nanopore

The Overlapper=ovl is wrong?

Here is what I tried so far (some did not finish yet) ;-)

canu ... genomeSize=6.8g correctedErrorRate=0.16 corMhapSensitivity=high -nanopore blah.fastq
canu ... genomeSize=6.8g correctedErrorRate=0.16 Overlapper=ovl corMhapSensitivity=high
canu ... genomeSize=6.8g correctedErrorRate=0.16 Overlapper=ovl corMhapSensitivity=high MhapMerSize=10 -nanopore
canu ... genomeSize=6.8g correctedErrorRate=0.16 Overlapper=ovl corMhapSensitivity=high ovlMerSize=10 MhapMerSize=10
canu ... genomeSize=6.8g correctedErrorRate=0.16 Overlapper=minimap MMapMerSize=10 minInputCoverage=8 stopOnLowCoverage=8

canu ... genomeSize=6.8g correctedErrorRate=0.16 Overlapper=ovl corMhapSensitivity=high ovlMerSize=10 MhapMerSize=10
canu ... genomeSize=6.8g correctedErrorRate=0.30 Overlapper=minimap MMapMerSize=10 minInputCoverage=4 stopOnLowCoverage=3 
canu ... genomeSize=6.8g correctedErrorRate=0.30 Overlapper=minimap OvlMerSize=10 OvlMerThreshold=10000 MMapMerSize=10 \
                        minOverlapLength=250 minReadLength=500 minInputCoverage=4  corMhapSensitivity=high stopOnLowCoverage=3 \
                        corMinCoverage=0 corOutCoverage=999 contigFilter="1 0 1.0 0.5 0"

I have only 5x genomic DNA coverage (haploid), 10x seemingly but the input sample was a diploid tissue, hence only 5x.

skoren commented 3 years ago

Yeah, overlapper=ovl is the issue, we should just not allow it as an option for correction. The ovlMerSize of 10 is pretty small, I don't know that will actually work or produce tons of overlaps. You're also not likely to get any reasonable assembly from 5-10x of data and it will take a while to run on a single node given the genome size.

mmokrejs commented 3 years ago

About, the limited amount data I know. We may get more, but I wonder whether we hop to 20-30x range (haploid). Provided canu cannot mix Nanopore and Pacbio HiFi reads I think we should go for another batch of Nanopore reads, probably again 1D reads. We just received some Illumina 1.3 billion of 2x150 PE reads for polishing. We have some 800 Mbp assembled using canu (read correction) and shasta assembler (somebody else did that). Probably not ideal but at least something to work with. How would you tweak my above commands to get something meaningful out?

My above attempts with canu yielded:

attempt6 gave correctedErrorRate=0.16 Overlapper=minimap MMapMerSize=10 minInputCoverage=8 stopOnLowCoverage=8

[UNITIGGING/CONSENSUS] -- Found, in version 2, after consensus generation: -- contigs: 3369 sequences, total length 243235699 bp (including 18 repeats of total length 2686338 bp). -- bubbles: 149 sequences, total length 9140444 bp. -- unassembled: 4158198 sequences, total length 62316023640 bp.

attempt8 correctedErrorRate=0.30 Overlapper=minimap MMapMerSize=10 minInputCoverage=4 stopOnLowCoverage=3

[UNITIGGING/CONSENSUS] -- Found, in version 2, after consensus generation: -- contigs: 2175 sequences, total length 146873953 bp (including 22 repeats of total length 2434558 bp). -- bubbles: 103 sequences, total length 6134802 bp. -- unassembled: 4236519 sequences, total length 65007713001 bp.

attempt9 gave correctedErrorRate=0.30 minOverlapLength=250 minReadLength=500 Overlapper=minimap OvlMerSize=10 OvlMerThreshold=10000 MMapMerSize=10 minInputCoverage=4 corMhapSensitivity=high stopOnLowCoverage=3 corMinCoverage=0 corOutCoverage=999 contigFilter="1 0 1.0 0.5 0"

[UNITIGGING/CONSENSUS] -- Found, in version 2, after consensus generation: -- contigs: 8485 sequences, total length 421363278 bp (including 18 repeats of total length 2316440 bp). -- bubbles: 93 sequences, total length 4680486 bp. -- unassembled: 4443380 sequences, total length 64856037836 bp.

skoren commented 3 years ago

I'd vote for 20x HiFi over 20x ONT personally.

When you use the non-default overlapper options like minimap or mhap, you probably also need to specify utgReAlign=true (the shortcut -fast is overlapper=mhap utgReAlign=true), otherwise you'll likely end up with lots of overlaps that aren't usable for assembly. This will slow down the run. Honestly, having about 4gb assembled for a 7gb genome at 5-10x isn't bad.

mmokrejs commented 3 years ago

Thak you for you comments. If you think it is better to sacrifice the ONT data, I am fine to follow you. Myself would not dare to say that. Great, I wil incorporate the options you propose. I was happy to get any output from canu so far. It was a random shot. You may want to improve the docs.

Did I say 4Gbp assembled? Noo, 420 Mbp unless I am blind. It is not bad either, we care about protein-coding regions only, so I would myself be happy to shrink down the 62-64 Gbp of the remaining unassembled data into something more compact and easier to search through using blastx. ;-)

skoren commented 3 years ago

I added some info for the utgReAlign option to the docs. The tip also defaults it to on so you'd have to turn it off explicitly.

Yes, 420 you're right, I misread it on GitHub, 4gb would be too good. You don't have to throw away the ONT data, you can use it to fill gaps and/or for validation of the HiFi assembly.

mmokrejs commented 3 years ago

So how would you modify the commands I tried so far in https://github.com/marbl/canu/issues/1924#issuecomment-806080381 ? The minimap2-based assemblies finished, the remaining not. I will recompile canu to current master and explicitly include overlapper=mhap utgReAlign=true or overlapper=minimap utgReAlign=true in the cmdlin. Any additional tweaks?

Meanwhile may have some first set of Ilumina-only contigs, so I could think of correcting the nanopore raw reads before canu assembly.

skoren commented 3 years ago

There isn't a command that'd speed it up significantly, a 7gb genome will take time, especially on a single node with 100ish cores. You can try overlapper=mhap or minimap and utgReAlign=true, it's compatible with both overlappers and might help. If you have Illumina data + low coverage nanopore, you could try a hybrid assembler like Masurca which is designed for that data type.

mmokrejs commented 3 years ago

Thank you.

mmokrejs commented 2 years ago

Hi @skoren , please replace the fi\n\nif with elif as I already mentioned in original posting. This will also speedup evaluation of the code by the shell.

skoren commented 2 years ago

The shell code here is auto-generated so it would add complexity to the perl scripts to special case first/last statements then. In the interest of keeping that code simple I'd prefer to leave the if construct alone. I don't think it makes much difference in shell speed assuming a reasonable number of jobs/partitions.