marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Increase contiguity in the satellite genome #846

Closed AyushSaxena closed 6 years ago

AyushSaxena commented 6 years ago

This is my sample script

canu 1.7

canu \
  -d N2_assembly_17 -p pacbio_N2_17 \
  genomeSize=100m maxMemory=100g maxThreads=50 \
  minReadLength=1000 minOverlapLength=500 corOutCoverage=1000 corMinCoverage=0 \
  utgMhapOptions="--ordered-kmer-size 14 --ordered-sketch-size 3000" \
  corMhapOptions="--ordered-kmer-size 14 --ordered-sketch-size 3000" \
  gridOptions="--time=96:00:00 --account=ufgi --qos=ufgi-b" \
  gridOptionsmeryl="--time=96:00:00 --account=ufgi --qos=ufgi-b" \
  -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_A01.subreads.fastq.gz \
  -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_B01.subreads.fastq.gz \
  -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_C01.subreads.fastq.gz

I've been assessing the contiguity of my assembly by varying sketch size and k-mer size. We're working with the C. elegans genome, and our aim is to parse out structural variation in satellite DNA, so minor improvements in assembly contiguity may not translate into gains in the 'right' regions of the genome. We have ~91x data from 3 separate PacBio Sequel runs (raw-read N50 of 14,500 bp).

I'm attaching the report. pacbio_N2_17.report.txt

Compared to previously reported PacBio-based C. elegans assemblies, this one is mighty fine! I have three questions.

  1. I have seen improvements in my assembly contiguity with increasing sketch size, as I had anticipated. However, to identify structural variants in low-complexity regions, are we better off assembling with PBcR-BLASR instead of a MinHash-based approach? (Khost et al., Genome Research: "Single-molecule sequencing resolves the detailed structure of complex satellite DNA loci in Drosophila melanogaster".) We intend to use this assembly as our reference genome for aligning PacBio reads from mildly mutagenized strains.

  2. We have about 80,000 unassembled sequences representing ~40x coverage. Is that unusually high? Some of these unassembled sequences align unambiguously to the C. elegans reference genome. Should I drop corOutCoverage=1000, so that only the longest ~40x of the data is used for assembly? Can including more short, uninformative reads lead to a worse assembly? (Tyson et al., Genome Research: "MinION-based long-read sequencing and assembly extends the Caenorhabditis elegans reference genome".)

  3. Lastly, I have a technical question. I believe I am successfully changing k-mer sizes because I see different results, but I don't see that change reflected in the final report. For this run I used utgMhapOptions="--ordered-kmer-size 14", yet the unitigging report shows a distribution of 22-mers. I am not sure what k-mer size was used for finding overlaps among reads. The same is true for the correction phase.

It's a wonderfully easy tool to use; I'm just trying to understand the nuts and bolts of the operation better.

Thank you

Ayush

skoren commented 6 years ago
  1. There have been significant changes since the assembly version in that manuscript, so I'd expect Canu to produce better corrected reads and a better assembly than the BLASR-based pipeline. Plus, running BLASR is going to be computationally prohibitive.

  2. I think it's likely due to the high coverage going into correction and to the corMinCoverage=0 option. That option keeps the full sequence of the raw read even when it has no support from other data, and it is then up to the trimming step to identify and remove bad sequence. Anything trimming doesn't catch will end up as unassembled. Yes, having more data can hurt the assembly, especially if it's lower-quality data or if the sample is heterozygous. I'd suggest trying a couple of values of corOutCoverage (the default of 40, maybe 60 and 80). Have you also tried not setting corMinCoverage=0, to see whether that improves the assembly (you could try 4, the default, or 2)? (See the sketch after this list.)

  3. The utgMhapOptions are only used if you also set overlapper=mhap; otherwise the overlapInCore algorithm is used, whose k-mer size you can adjust with utgOvlMerSize/obtOvlMerSize. You're not currently changing the seed k-mer size for MHAP (that is changed by the corMhapMerSize parameter). The option you're changing affects the second-stage filter, likely giving your overlaps more accurate identity estimates. A 14-mer should already be the default given your coverage. You can tell the k-mer size used from the histogram k-mer size in the report (in your case 16 for correction and 22 for the rest). See the sketch below for how these knobs fit together.
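
For concreteness, a minimal sketch of how those knobs would map onto your original command. The output directory/prefix names and the specific values here are only illustrative, not recommendations:

# corOutCoverage/corMinCoverage: the coverage sweep from point 2.
# corMhapMerSize sets the MHAP seed k-mer size; utgOvlMerSize/obtOvlMerSize
# set the overlapInCore k-mer sizes (point 3).
canu \
  -d N2_assembly_cov60 -p pacbio_N2_cov60 \
  genomeSize=100m \
  corOutCoverage=60 corMinCoverage=4 \
  corMhapMerSize=16 \
  utgOvlMerSize=22 obtOvlMerSize=22 \
  -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_A01.subreads.fastq.gz \
  -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_B01.subreads.fastq.gz \
  -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_C01.subreads.fastq.gz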

AyushSaxena commented 6 years ago

Thanks for your quick response, and clarifying my missteps.

I have tried corOutCoverage=40-90 and corMinCoverage=0,4, and the assembly didn't change much to my naive eye. I'm quite new to genome assembly (God bless the C. elegans genome!); I'm looking at the size of the largest contig, N50, L50, contiguity in MUMmer plots, and plots of GFA files using Bandage.

The report I attached here is the best assembly I have, and most of the gains in contiguity I observed followed an increase in sketch size (which I now understand only affects the second-stage 'bottom sketch' during correction, and nothing else?).

So far these assemblies have put only a moderate strain on our computational resources, and I've been exploring ways to improve the assembly further at 'some' computational cost (hence my question). I plan to run a few tests with utgOvlMerSize=24,26,28, corMinCoverage=4, corOutCoverage=40,60, corMhapOptions="--ordered-sketch-size 3000", and defaults for the rest.
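
In script form, the sweep I have in mind looks something like this (the loop structure and assembly names are mine):

# One assembly per (utgOvlMerSize, corOutCoverage) combination.
for k in 24 26 28; do
  for cov in 40 60; do
    canu \
      -d N2_utgK${k}_cov${cov} -p N2_utgK${k}_cov${cov} \
      genomeSize=100m \
      corMinCoverage=4 corOutCoverage=${cov} \
      corMhapOptions="--ordered-sketch-size 3000" \
      utgOvlMerSize=${k} \
      -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_A01.subreads.fastq.gz \
      -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_B01.subreads.fastq.gz \
      -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_C01.subreads.fastq.gz
  done
done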

  1. For assembling satellite DNA better, do you recommend tweaking any other parameters? If I understand correctly, I will lose sensitivity if I increase the k-mer size during overlapping past a certain point, so I should increase it until I begin to observe greater fragmentation in contigs, right?
  2. Should I also use the overlapper=mhap option over a similar k-mer range?

Thank you

Ayush

skoren commented 6 years ago

The sketch size controls how well you estimate identity, which is then used in Canu's filtering to select the best overlaps for correcting a read (based on identity * overlap length), so having better identity estimates may have improved this filter's specificity and led to better assemblies.
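
To make that concrete (my arithmetic, not Canu's exact internals): with overlaps scored as identity * overlap length, a 5,000 bp overlap estimated at 85% identity scores 0.85 * 5000 = 4250 and beats a 4,000 bp overlap at 92% identity (0.92 * 4000 = 3680). With a noisier identity estimate those two can swap rank, so a larger sketch, by tightening the estimate, changes which overlaps survive the filter.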

You can also try varying the correction mer size with corMhapMerSize=12,14,16,18,20 or so.
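
For instance, one assembly per candidate seed size (a sketch; the names here are illustrative):

for k in 12 14 16 18 20; do
  canu \
    -d N2_corK${k} -p N2_corK${k} \
    genomeSize=100m \
    corMhapMerSize=${k} \
    -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_A01.subreads.fastq.gz \
    -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_B01.subreads.fastq.gz \
    -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_C01.subreads.fastq.gz
done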

  1. Most of the parameters essentially get auto-tuned based on your data; we rarely modify defaults. You could try increasing the minimum read/overlap sizes, but I expect it won't change much other than breaking your assembly when the sizes get too large. You could also try the bat options (batOptions="-dg 3 -db 3 -dr 1") and vary dg/db/dr between 1 and 6 (keep dg == db); this controls how strictly overlaps are filtered and might help separate some close repeats (see the sketch after this list).

  2. Not really necessary; that option is used by the -fast mode to speed up the trimming/unitigging steps.
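
A sketch of the batOptions variation from point 1 (names illustrative):

# Keep dg == db as suggested; dr is left at 1 here but can be varied 1-6 as well.
for d in 1 2 3 4 5 6; do
  canu \
    -d N2_bat_d${d} -p N2_bat_d${d} \
    genomeSize=100m \
    batOptions="-dg ${d} -db ${d} -dr 1" \
    -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_A01.subreads.fastq.gz \
    -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_B01.subreads.fastq.gz \
    -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_C01.subreads.fastq.gz
done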

AyushSaxena commented 6 years ago

Does the option corMhapOptions="--ordered-sketch-size 3000" change the sketch size in both the first- and second-stage filters? Can they be manipulated independently? I'm sorry, I couldn't find that in the original paper or the documentation.

Thanks for recommending tweaks in batOptions. I hadn't considered that.

skoren commented 6 years ago

That option only changes the second-stage filter. They can be manipulated independently; the first-stage filter is controlled with --num-hashes.
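
For example, to set the two stages independently (the specific values are illustrative, not recommendations):

# --num-hashes sets the first-stage (min-hash) sketch size;
# --ordered-sketch-size sets the second-stage (ordered/bottom) sketch size.
canu \
  -d N2_sketch_test -p N2_sketch_test \
  genomeSize=100m \
  corMhapOptions="--num-hashes 768 --ordered-sketch-size 3000" \
  -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_A01.subreads.fastq.gz \
  -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_B01.subreads.fastq.gz \
  -pacbio-raw XTCFL_20171002_S54088_PL100088827-1_C01.subreads.fastq.gz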