johnomics closed this issue 7 years ago
Just to answer the questions:
The obtOvlHashBlockLength optimization won't impact anything. We probably should make it smaller for smaller assemblies. If this is too small on large assemblies, you can end up with tens of thousands of jobs, which is a pain to run.
There are no algorithmic restrictions on minOverlapLength and minReadLength. It doesn't make a whole lot of sense to have minOverlap > minRead (reads shorter than minOverlap won't have any overlaps), so that isn't allowed. minRead (obviously) throws out data that you're claiming won't be useful for assembly, while minOverlap throws out data that could be used to get across low-coverage areas, at the risk of confusing the assembler. If your coverage supports it, increasing minOverlapLength past the nasty repeats is ideal. However, any repeat not spanned by a read won't be assembled through, and edges won't exist in the GFA outputs.
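For example (purely illustrative, not a recommendation - the values below are placeholders), a small genome might be run with something like:

canu -p example -d example_canu genomeSize=15m minReadLength=10000 minOverlapLength=3000 obtOvlHashBlockLength=10000000 utgOvlHashBlockLength=10000000 -nanopore-raw reads.fastq.gz

where minReadLength >= minOverlapLength (the reverse isn't allowed), and the smaller hash block lengths just split the trimming/unitigging overlap work into more, shorter jobs.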
Thanks for the clarifications. I'll continue to use these options.
Any thoughts on what's causing the crash? I now have three separate assemblies for three separate species failing at this point (I re-ran the assembly above with a larger minReadLength just in case this was causing problems). Here are the command lines (sorry for being cagey about the species, they're not my samples):
/cluster/john/bin/canu/canu/Linux-amd64/bin/canu gridOptions="-q cluster" gridOptionscormhap="-l h_vmem=4G -pe smp 12" stopOnReadQuality=false stageDirectory=/data/tmp/john-\$JOB_ID-\$SGE_TASK_ID gnuplotImageFormat=svg -p asm -d asm_canu genomeSize=13.3m obtOvlHashBlockLength=10000000 utgOvlHashBlockLength=10000000 corMhapSensitivity=normal corOutCoverage=100 minOverlapLength=3000 minReadLength=10000 correctedErrorRate=0.065 -nanopore-raw asm.fastq.gz
/cluster/john/bin/canu/canu/Linux-amd64/bin/canu gridOptions="-q cluster" gridOptionscormhap="-l h_vmem=4g -pe smp 12" stopOnReadQuality=false stageDirectory=/data/tmp/john-\$JOB_ID-\$SGE_TASK_ID -p asm2 -d asm2_canu genomeSize=14.1m obtOvlHashBlockLength=10000000 utgOvlHashBlockLength=10000000 correctedErrorRate=0.075 -nanopore-raw asm2.fastq.gz
/cluster/john/bin/canu/canu/Linux-amd64/bin/canu gridOptions="-q cluster" gridOptionscor="-l h_vmem=4g -pe smp 12" stopOnReadQuality=false stageDirectory=/data/tmp/john-\$JOB_ID-\$SGE_TASK_ID -p asm3 -d asm3_canu genomeSize=100m correctedErrorRate=0.075 obtOvlHashBlockLength=10000000 utgOvlHashBlockLength=10000000 -nanopore-raw asm3.fastq.gz
Corrected read stats going into trimming, from seqkit:
file                          format  type  num_seqs  sum_len        min_len  avg_len   max_len  sum_gap  N50     L50
asm.correctedReads.fasta.gz   FASTA   DNA   53,521    1,353,218,428  10,097   25,283.9  98,585   0        25,676  20,259
asm2.correctedReads.fasta.gz  FASTA   DNA   9,011     552,864,587    1,121    61,354.4  217,770  0        61,160  3,661
asm3.correctedReads.fasta.gz  FASTA   DNA   431,393   3,016,419,984  1,001    6,992.3   62,835   0        9,948   104,858
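(Produced with something along these lines - the exact seqkit flags and paths are my guess; plain seqkit stats prints the basic columns and -a adds N50 and related statistics:)

seqkit stats -a asm_canu/asm.correctedReads.fasta.gz asm2_canu/asm2.correctedReads.fasta.gz asm3_canu/asm3.correctedReads.fasta.gz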
All overlap stores appear to build successfully, and the errors are exactly the same as above.
Worth retrying with standard OvlHashBlockLength settings? Or rolling back to 1.5, or forward to newest commit?
I don't see anything obvious that could break right there. I can instrument the code and let you run that.
Can you recompile with BUILDDEBUG=1 and manually run the trimReads command? Bonus points if you can run it in gdb (gdb -silent --args <trimReads and options>), then type 'where' to get a better stack trace.
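Something like this (a sketch only - assuming BUILDDEBUG is passed as a make variable, the usual src/ build layout, and whatever trimReads arguments the failing overlapBasedTrimming job script used):

cd canu/src
make clean
make BUILDDEBUG=1
gdb -silent --args ../Linux-amd64/bin/trimReads <trimReads options>
(gdb) run
(gdb) where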
Ugh, sorry, dumb mistake - didn't make clean before make after pull. Recompiled after cleaning and it runs fine. Apologies. Thanks for your help.
I had that happen too - overlapInCore was crashing spectacularly and oddly. Not sure what caused this; the Makefile should have rebuilt everything. Odd dates left over from the pull?
Anyway, you got one 'bug fix' out of this - that uninitialized value in Carp.pm.
trimReads failed with a segmentation fault, running commit fa3eede13415829dbb5c8abc769e73d18f3be658. From canu-scripts/canu.07.out:

overlapBasedTrimming/trimReads.C::331 is the last line of this call from main:

Here's the canu command:
I'm trying to optimise our assembly, but maybe some of these options are poor choices - I'm not sure what's relevant, so I'll try to explain them all. obtOvlHashBlockLength is reduced to speed up the trimming and unitigging overlap steps: previous runs with the default of 100 Mb launched only 5 overlap jobs, which took a few days to run even though there was plenty of room on the cluster, whereas on this run 136 jobs were launched and all completed in 3 hours. The overlap store appears to have been created successfully - file sizes look sensible and logs are clean.
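As a rough sanity check on the job count (assuming, possibly naively, about one trimming-overlap job per hash block and ignoring any reference-block splitting):

# ~1.35 Gb of corrected reads split into 10 Mb hash blocks
echo $(( 1353218428 / 10000000 ))   # ~135, in line with the 136 jobs observed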
minOverlapLength and minReadLength are increased because the minimum read length for our top 100x coverage is 14 kb, and we have some complex repeats. I just increased minReadLength so it wasn't shorter than minOverlapLength. But is it a bad idea for these to be the same? Should I increase minReadLength a little?
This is R9.4 1D data, so I started with correctedErrorRate=0.075 (from the FAQ) and then reduced it by 1% to 0.065 because we have high coverage (following the Parameter Reference). corOutCoverage and corMhapSensitivity are increased because I'm trying to resolve some complex haplotypes. The gridOptionscormhap setting is another clumsy attempt to get Java to play nicely with our cluster, still not really resolved (https://github.com/marbl/canu/issues/298), although it worked this time with no sign of any errors during the correction stage.
Canu configuration from initial job: