marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Unexpected loss of coverage during trimming followed by shorter than expected assembly #1092

Closed peterdfields closed 5 years ago

peterdfields commented 5 years ago

I'm running into some unexpected results in attempting to assemble a snail genome that, at least based on flow cytometry, should be around 500 Mbp. I have a bit more than 70x coverage of the expected genome size, though the DNA was extracted from a pool of individuals rather than a single individual. The data were generated on a Sequel machine. I initially started with the recommended Sequel settings, but I've iteratively moved to a higher error allowance, hoping to better handle the 'heterozygosity' introduced by the pooled sample and to improve the total length and biological content of the assembly.

The longest assembly so far is ~230 Mbp. The approach that produced it is similar to issue #1038: I allowed correctedErrorRate=0.105 for the assembly step, and for the correction and trimming steps I added the corMhapSensitivity=normal correctedErrorRate=0.105 corMinCoverage=0 corOutCoverage=100 flags. I broke the larger run into individual steps because in earlier attempts I noticed a surprising loss of coverage between steps (only ~11x going into the assembly step), and I wanted to carry more coverage into assembly (though, given such a large reduction, I figured there may well be problems in the data). Following a blobtools analysis, I don't think the issue arises from non-specific contamination. The reads are not as long as we would like, simply because we couldn't get a good enough extraction, but at this point I'm less concerned about how fragmented the assembly is than about not getting near the expected length. I would also note that a k-mer based genome size estimate from Illumina data (jellyfish followed by GenomeScope on another pooled DNA extraction) suggested a genome about twice the size of the flow cytometry estimate.
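For reference, the k-mer based estimate mentioned above can be produced with a short jellyfish run feeding into GenomeScope; this is a sketch, with the input file name, k-mer size, hash size, and thread count as placeholder choices:

```shell
# Count canonical 21-mers in the Illumina pool
# (illumina_pool.fastq is a placeholder file name).
jellyfish count -C -m 21 -s 1G -t 16 -o reads.jf illumina_pool.fastq

# Export the k-mer coverage histogram.
jellyfish histo -t 16 reads.jf > reads.histo

# reads.histo is then given to GenomeScope, which fits genome size,
# heterozygosity, and repeat content from the histogram shape.
```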

The reports for each of the steps going into the 230Mbp assembly can be found at the following:

- Correction
- Trimming
- Assembly

The version of Canu used for all of the above is Canu snapshot v1.7 +277 changes (r8969 51947aec93acc395042ebe4e6c046a605532664c). The software is operating on a largish Linux machine running CentOS.

Please let me know if additional information would be useful. Thank you for your time and advice.

skoren commented 5 years ago

The data is very suspicious: there is no peak in coverage after trimming or correction, and all the k-mers appear to be unique. Do you have any idea how much diversity there is between the pooled individuals? Is the Illumina data you're using for the heterozygosity/genome-size estimate from a pool of a similar number of individuals, or fewer? The doubled genome size estimate could be due to very high diversity.

Suggestions to try: first, switch to a fixed release (1.7.1) rather than the snapshot you're using now, because there were some large changes post-1.7 that might have bugs affecting your assembly. Most of the reads do have overlaps, but not enough to correct them, so I'd suggest multiple rounds of correction. Run something like corMhapSensitivity=high corMinCoverage=0 corOutCoverage=100 and feed the corrected output back in as input to correction several times (see the FAQ for details). You may not need corMhapSensitivity=high; corMhapSensitivity=normal may be sufficient. Then try trimming with correctedErrorRate=0.12 ovlMerThreshold=500, or maybe even 0.15 (you need the ovlMerThreshold to speed up the step given the high error rate). There's no point in trying the assembly unless you end up with 20x+ of trimmed data and you see some peak in the histogram.
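The multi-round scheme above might look like the following sketch. The prefix, directory names, and input file names are placeholders, the 500m genome size is the flow cytometry estimate from this thread, and the options are the ones suggested above:

```shell
# Round 1: correction only, with the relaxed settings suggested above.
canu -correct -p snail -d round1 genomeSize=500m \
     corMhapSensitivity=normal corMinCoverage=0 corOutCoverage=100 \
     -pacbio-raw raw_reads.fastq.gz

# Rounds 2..N: feed each round's corrected reads back in as "raw" input.
canu -correct -p snail -d round2 genomeSize=500m \
     corMhapSensitivity=normal corMinCoverage=0 corOutCoverage=100 \
     -pacbio-raw round1/snail.correctedReads.fasta.gz

# After the final round, trim at the higher error tolerance; assemble only
# if 20x+ of trimmed reads remain and the k-mer histogram shows a peak.
canu -trim -p snail -d trimmed genomeSize=500m \
     correctedErrorRate=0.12 ovlMerThreshold=500 \
     -pacbio-corrected round2/snail.correctedReads.fasta.gz
```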

peterdfields commented 5 years ago

I just wanted to provide updates on this analysis so that the issue doesn't go completely dormant. The diversity in the pool does seem to be a few percent. The pool size should be similar between the Illumina and PacBio data.

I have been going forward with the iterative correction and am now on the fourth iteration. With each iteration, the time taken and the size of the intermediate files have increased substantially, to the point where I was wondering whether it would be reasonable to apply some of the FAQ suggestions for speeding up correction and reducing intermediate temp file size. I didn't want to try this without first determining whether those additional parameters might negate the iterative process.

Thank you again for your advice.

skoren commented 5 years ago

You can probably try assembly after 3-4 rounds and see what the results look like rather than keep iterating and changing the parameters.

peterdfields commented 5 years ago

After finishing the fourth iteration of correction, I went ahead with the trimming step and ended up with ~17x coverage. So while the coverage has gone up, it's not quite at the >20x mentioned above. I haven't yet tried corMhapSensitivity=high; so far I've used corMhapSensitivity=normal. I'm now thinking of trying corMhapSensitivity=high together with the FAQ suggestions for speeding up correction and reducing intermediate temporary file size. Does this pair of options seem reasonable?

skoren commented 5 years ago

The 17x might be enough; the question is what the k-mer distribution looks like post-trimming. Is there any reasonable peak in coverage? If it still shows no peak, you could try the high sensitivity with the FAQ options.
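One quick way to check for a peak and a rough genome size in a k-mer histogram of the trimmed reads (e.g. jellyfish histo output, columns: depth and count) is a short awk pass. This is a sketch: the histogram here is synthetic for illustration, and the depth cutoff of 5 for discarding error k-mers is an arbitrary assumption:

```shell
# Synthetic two-column histogram (depth count); a real run would use the
# histogram produced from the trimmed reads instead.
printf '1 100000\n2 20000\n28 900\n30 1500\n32 800\n' > reads.histo

# Find the peak depth (ignoring low-depth error k-mers) and estimate
# genome size as total k-mer instances divided by peak depth.
awk '$1 <= 5 { next }
     $2 > max { max = $2; peak = $1 }
     { total += $1 * $2 }
     END { printf "peak=%d size=%d\n", peak, total / peak }' reads.histo
```

If no depth bin stands out above its neighbours, there is effectively no peak and the size estimate is meaningless, which matches the "no peak in coverage" concern above.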

skoren commented 5 years ago

Idle