marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
653 stars 179 forks source link

Advice difficult assembly #1038

Closed fmarletaz closed 5 years ago

fmarletaz commented 6 years ago

Hi - I am attempting to assemble the genome of a marine worm. Genome size was estimated by k-mer analysis around 550-600Mb with fairly high polymorphism (2.5%). I have 46Gb of sequel data (80x) with a read N50 of ~11kb, I was hoping for longer reads, but the DNA quality was limiting. My first canu run (with default parameters) yielded an assembly of 572Mb with a disappointing 42kb N50. Going through the report, I realised that I only obtained 25x of corrected reads, therefore, following the FAQ suggestions, I rerun CANU with the parameters corMhapSensitivity=normaland corOutCoverage=100, which yielded 38x of corrected reads. The assembly using those is 824Mb with a 60kb N50. This probably means that haplotypes have been partially split. Finally, I tried to use the "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" parameter to improve haplotype separation, but it didn't change anything. I was hoping to improve these results, and I wanted to ask for any suggestion to tweak the parameters further. The full report can be seen here: Worm.report.txt

Thanks a lot!

skoren commented 6 years ago

Is this sequencing from an individual worm or from a pool of individuals? The corrected reads seem to have no coverage peak which usually indicates you don't have very much coverage. This would make sense if you have a varied population of individuals and so each individual has low coverage. We've typically seen this be an issue with insect genomes like mosquitos.

You can try increasing sensitivity more, corMhapSensitivity=high corOutCoverage=100 corMinCoverage=0 and increasing the corrected error rate which is recommended for sequel data if you didn't already correctedErrorRate=0.105.

fmarletaz commented 6 years ago

Thanks for your reply. The sequencing is from DNA extracted from a single individual. I also have illumina data, and the k-mer profile looked a bit odd, with a very high first shoulder (see k31.pdf), which could also be consistent with a contamination, I am checking that using blobtools. However, trying to rerun using the parameters you adviced, I stumbled on a segmentation fault error: /var/spool/slurmd/job4008776/slurm_script: line 2062: 27329 Segmentation fault (core dumped) $bin/mhapConvert -G ../Priap.gkpStore -o ./results/$qry.mhap.ovb.WORKING ./results/$qry.mhap

skoren commented 6 years ago

Your illumina histogram is consistent with a very heterozygous genome. Really, you have 40x of pacbio sequencing not 80. I think the parameters I suggested should help.

The crash is likely a disk I/O issue due to corrupted files, remove any files named *.WORKING in your correction/1-overlapper/results/ folder and re-launch the assembly.

skoren commented 5 years ago

Closing idle, heterozygous + sequel parameters should help this assembly. Post an update if you do get a new assembly.