Closed mmokrejs closed 3 months ago
You're running correction which should not use overlapInCore by default and so likely it hasn't been configured correctly/tested. It is definitely not recommended or supported to run overlapInCore for this step. Did you set canu to use overlapInCore instead of the default mhap? As for the if/elif, the script is auto-generated so it makes sense to keep all the format of all the partitions the same.
Here is the commandline from the logs how it was configured for the first time:
canu useGrid=false -p my_genome -d my_genome_2.1.1_correctedErrorRate_0.16_v4 genomeSize=6.8g correctedErrorRate=0.16 Overlapper=ovl corMhapSensitivity=high MhapMerSize=10 -nanopore
The Overlapper=ovl
is wrong?
Here is what I tried so far (some did not finish yet) ;-)
canu ... genomeSize=6.8g correctedErrorRate=0.16 corMhapSensitivity=high -nanopore blah.fastq
canu ... genomeSize=6.8g correctedErrorRate=0.16 Overlapper=ovl corMhapSensitivity=high
canu ... genomeSize=6.8g correctedErrorRate=0.16 Overlapper=ovl corMhapSensitivity=high MhapMerSize=10 -nanopore
canu ... genomeSize=6.8g correctedErrorRate=0.16 Overlapper=ovl corMhapSensitivity=high ovlMerSize=10 MhapMerSize=10
canu ... genomeSize=6.8g correctedErrorRate=0.16 Overlapper=minimap MMapMerSize=10 minInputCoverage=8 stopOnLowCoverage=8
canu ... genomeSize=6.8g correctedErrorRate=0.16 Overlapper=ovl corMhapSensitivity=high ovlMerSize=10 MhapMerSize=10
canu ... genomeSize=6.8g correctedErrorRate=0.30 Overlapper=minimap MMapMerSize=10 minInputCoverage=4 stopOnLowCoverage=3
canu ... genomeSize=6.8g correctedErrorRate=0.30 Overlapper=minimap OvlMerSize=10 OvlMerThreshold=10000 MMapMerSize=10 \
minOverlapLength=250 minReadLength=500 minInputCoverage=4 corMhapSensitivity=high stopOnLowCoverage=3 \
corMinCoverage=0 corOutCoverage=999 contigFilter="1 0 1.0 0.5 0"
I have only 5x genomic DNA coverage (haploid), 10x seemingly but the input sample was a diploid tissue, hence only 5x.
Yeah, overlapper=ovl
is the issue, we should just not allow it as an option for correction. The ovlMerSize of 10 is pretty small, I don't know that will actually work or produce tons of overlaps. You're also not likely to get any reasonable assembly from 5-10x of data and it will take a while to run on a single node given the genome size.
About, the limited amount data I know. We may get more, but I wonder whether we hop to 20-30x range (haploid). Provided canu cannot mix Nanopore and Pacbio HiFi reads I think we should go for another batch of Nanopore reads, probably again 1D reads. We just received some Illumina 1.3 billion of 2x150 PE reads for polishing. We have some 800 Mbp assembled using canu (read correction) and shasta assembler (somebody else did that). Probably not ideal but at least something to work with. How would you tweak my above commands to get something meaningful out?
My above attempts with canu yielded:
attempt6 gave
correctedErrorRate=0.16 Overlapper=minimap MMapMerSize=10 minInputCoverage=8 stopOnLowCoverage=8
[UNITIGGING/CONSENSUS] -- Found, in version 2, after consensus generation: -- contigs: 3369 sequences, total length 243235699 bp (including 18 repeats of total length 2686338 bp). -- bubbles: 149 sequences, total length 9140444 bp. -- unassembled: 4158198 sequences, total length 62316023640 bp.
attempt8
correctedErrorRate=0.30 Overlapper=minimap MMapMerSize=10 minInputCoverage=4 stopOnLowCoverage=3
[UNITIGGING/CONSENSUS] -- Found, in version 2, after consensus generation: -- contigs: 2175 sequences, total length 146873953 bp (including 22 repeats of total length 2434558 bp). -- bubbles: 103 sequences, total length 6134802 bp. -- unassembled: 4236519 sequences, total length 65007713001 bp.
attempt9 gave
correctedErrorRate=0.30 minOverlapLength=250 minReadLength=500 Overlapper=minimap OvlMerSize=10 OvlMerThreshold=10000 MMapMerSize=10 minInputCoverage=4 corMhapSensitivity=high stopOnLowCoverage=3 corMinCoverage=0 corOutCoverage=999 contigFilter="1 0 1.0 0.5 0"
[UNITIGGING/CONSENSUS] -- Found, in version 2, after consensus generation: -- contigs: 8485 sequences, total length 421363278 bp (including 18 repeats of total length 2316440 bp). -- bubbles: 93 sequences, total length 4680486 bp. -- unassembled: 4443380 sequences, total length 64856037836 bp.
I'd vote for 20x HiFi over 20x ONT personally.
When you use the non-default overlapper options like minimap or mhap, you probably also need to specify utgReAlign=true (the shortcut -fast is overlapper=mhap utgReAlign=true), otherwise you'll likely end up with lots of overlaps that aren't usable for assembly. This will slow down the run. Honestly, having about 4gb assembled for a 7gb genome at 5-10x isn't bad.
Thak you for you comments. If you think it is better to sacrifice the ONT data, I am fine to follow you. Myself would not dare to say that. Great, I wil incorporate the options you propose. I was happy to get any output from canu
so far. It was a random shot. You may want to improve the docs.
Did I say 4Gbp assembled? Noo, 420 Mbp unless I am blind. It is not bad either, we care about protein-coding regions only, so I would myself be happy to shrink down the 62-64 Gbp of the remaining unassembled data into something more compact and easier to search through using blastx
. ;-)
I added some info for the utgReAlign option to the docs. The tip also defaults it to on so you'd have to turn it off explicitly.
Yes, 420 you're right, I misread it on GitHub, 4gb would be too good. You don't have to throw away the ONT data, you can use it to fill gaps and/or for validation of the HiFi assembly.
So how would you modify the commands I tried so far in https://github.com/marbl/canu/issues/1924#issuecomment-806080381 ? The minimap2-based assemblies finished, the remaining not. I will recompile canu to current master
and explicitly include overlapper=mhap utgReAlign=true
or overlapper=minimap utgReAlign=true
in the cmdlin. Any additional tweaks?
Meanwhile may have some first set of Ilumina-only contigs, so I could think of correcting the nanopore raw reads before canu assembly.
There isn't a command that'd speed it up significantly, a 7gb genome will take time, especially on a single node with 100ish cores. You can try overlapper=mhap or minimap and utgReAlign=true, it's compatible with both overlappers and might help. If you have Illumina data + low coverage nanopore, you could try a hybrid assembler like Masurca which is designed for that data type.
Thank you.
Hi @skoren , please replace the fi\n\nif
with elif
as I already mentioned in original posting. This will also speedup evaluation of the code by the shell.
The shell code here is auto-generated so it would add complexity to the perl scripts to special case first/last statements then. In the interest of keeping that code simple I'd prefer to leave the if construct alone. I don't think it makes much difference in shell speed assuming a reasonable number of jobs/partitions.
Hi, it happened to me that
overlapInCorePartition
generated too many partitions and a 67GB largecorrection/1-overlapper/overlap.sh
shellscript, practically just huge if/else code which a call to perl utility afterwards. Interpreting this file takes several CPU cores.When canu configured itself, it logged:
Then, it did its math:
I think for sure a roadblock should be installed somewhere so this huge piece of code should be prevented from further processing. Than we can think of what to change in the code (dunno where).
BTW, the
if
andfi
should at least be turned intoif
andelif
.