Closed: marisalim closed this issue 5 years ago
Canu defaults to the longest 40x for assembly (each contig can be a different coverage than this) but it also keeps low coverage variants that aren't represented by the longest 40x.
Normally you don't need to do any filtering post assembly. However, amplicons aren't really assemblies so Canu tends to keep too much data. Case in point, it seems like you have over 5000x coverage. You have one 651bp contig with 5000x, the rest all have <30x. I expect the 5000x coverage contig is the one you want to analyze, the rest probably represent off-target or artifacts in the data. Canu won't collapse these into the main contig if they are sufficiently different.
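If you only want the dominant amplicon contig, one option is a simple post-filter on the tigInfo table before downstream analysis. A minimal sketch, assuming a tab-separated `*.contigs.layout.tigInfo` file with coverage in the fifth column (the demo data below are hypothetical; check the header of your own file and adjust the field number):

```shell
# Build a tiny hypothetical tigInfo excerpt (tab-separated) for illustration.
printf '#tigID\ttigLen\tcoordType\tcovStat\tcoverage\ttigClass\n' > demo.tigInfo
printf '1\t651\tungapped\t9000.00\t5000.00\tcontig\n' >> demo.tigInfo
printf '2\t616\tungapped\t3918.02\t18.38\tcontig\n' >> demo.tigInfo

# Keep the header plus any contig with at least 100x coverage.
# Adjust the field number ($5) to match your tigInfo header.
awk -F'\t' 'NR == 1 || $5 + 0 >= 100' demo.tigInfo > high_cov.tigInfo
cat high_cov.tigInfo
```

The surviving tig IDs can then be pulled out of the contigs FASTA for BLAST or mapping.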
I'm not sure what you mean by "Why do reads for a single species map to multiple contigs instead of just one?"
Thanks for the fast reply!
For other samples in my dataset, the contig coverage values are <20x. With samples that start with very few reads, Canu gives me a warning message saying that coverage is too low to do assembly. In other cases (as below), Canu runs without errors, yet none of the contigs are >=40x coverage. I'm still puzzled by the <40x coverage contigs - are these all artifactual contigs? They're not all off-target, as most of these contigs blast to the correct species and gene. I'd appreciate any suggestions for parameter defaults I could adjust to work better with amplicon data!
tigLen | coordType | covStat | coverage | tigClass | sugRept | sugCirc | numChildren |
---|---|---|---|---|---|---|---|
616 | ungapped | 3918.02 | 18.38 | contig | no | no | 25 |
551 | ungapped | 2631.54 | 7.07 | contig | no | no | 12 |
383 | ungapped | 550.88 | 5.48 | contig | no | no | 6 |
586 | ungapped | 2677.88 | 3.69 | contig | no | no | 6 |
596 | ungapped | 2605.58 | 3.33 | contig | no | no | 6 |
523 | ungapped | 1569.88 | 3.48 | contig | no | no | 5 |
554 | ungapped | 2244.74 | 3.1 | contig | no | no | 5 |
655 | ungapped | 269.07 | 3.93 | contig | no | no | 4 |
602 | ungapped | 329.32 | 3.9 | contig | no | no | 4 |
590 | ungapped | 2010.44 | 3.44 | contig | no | no | 4 |
342 | ungapped | 1040.33 | 3.24 | contig | no | no | 4 |
353 | ungapped | 895.72 | 3.11 | contig | no | no | 4 |
368 | ungapped | 1462.12 | 2.79 | contig | no | no | 4 |
528 | ungapped | 1666.98 | 2.68 | contig | no | no | 4 |
520 | ungapped | 2010.44 | 2.66 | contig | no | no | 4 |
Also, does the numChildren number of reads represent the depth or breadth of coverage along the contig? From this discussion, I am assuming that the 'coverage' column is indeed average depth of coverage. Just want to be extra sure I'm interpreting the outputs correctly!
For my question about the multiple contigs - in the contig table from my 1st post, 9 of the 13 contigs blast to one species (this includes the very high coverage contig). I am wondering why the reads that make up these 9 contigs don't get thrown into a single contig. Based on your answer to my other question, it seems that these contigs are sufficiently different (even though they blast to the same species) so Canu will not collapse them into a single contig. Is this right?
Thanks again!
Canu will make contigs from as little as a single read, so the coverage doesn't have to reach 40x before it makes an assembly. How much input coverage (raw and corrected) do you have for the datasets where you end up with all coverage <20x? numChildren is the number of reads in the contig.
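As a back-of-the-envelope check on how numChildren relates to the coverage column: average depth is roughly (number of reads × mean read length) / tigLen. Using the first contig in the table above (25 reads, ~608 bp median raw read length, 616 bp contig), a rough sketch:

```shell
# depth ~= reads * read_length / contig_length; trimming during correction
# makes Canu's reported 18.38x somewhat lower than this raw estimate.
awk 'BEGIN { printf "%.1f\n", 25 * 608 / 616 }'   # prints 24.7
```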
It is a combination of the reads being sufficiently different and the fact that, in the case of amplicons, you aren't really doing assembly. Each read is the full result, so the "contigs" are going to be the longest reads plus everything contained in them. That will likely leave lots of variation between the different reads. You could try the smash-haplotypes parameters from the FAQ to see if that reduces these extra contigs. You could also take the contig with the most coverage/reads and (assuming the other contigs blast to it) just keep that for downstream analysis.
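For reference, a sketch of that FAQ suggestion applied to the command used in this thread (variable names as in the original command; parameter values taken from the FAQ):

```shell
canu -p $myprefix -d $myinputdir genomeSize=1000 \
  minReadLength=200 minOverlapLength=50 \
  corOutCoverage=200 correctedErrorRate=0.15 \
  -nanopore-raw $myfastafile
```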
For the example with <20x coverage, these are the correction/layout tables from the Canu report output:
category | original raw reads w/ overlaps | original raw reads w/o overlaps |
---|---|---|
Number of Reads | 19607 | 455 |
Number of Bases | 11798318 | 116180 |
Coverage | 11798.318 | 116.18 |
Median | 608 | 0 |
Mean | 601 | 255 |
N50 | 609 | 604 |
Minimum | 206 | 0 |
Maximum | 815 | 760 |
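A note on reading these tables: the Coverage rows are simply total bases divided by the genomeSize given on the command line (1000 in this thread), which is why the numbers look so large for an amplicon. For example, for the raw reads with overlaps:

```shell
# 11,798,318 bases / genomeSize of 1,000 -> the 11798.318 "coverage" above
awk 'BEGIN { printf "%.3f\n", 11798318 / 1000 }'   # prints 11798.318
```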
category | evidence reads | corrected raw | corrected expected | rescued raw | rescued expected |
---|---|---|---|---|---|
Number of Reads | 2721 | 63 | 63 | 13458 | 13458 |
Number of Bases | 1672930 | 41048 | 40367 | 8136412 | 6593961 |
Coverage | 1672.93 | 41.048 | 40.367 | 8136.412 | 6593.961 |
Median | 616 | 649 | 638 | 608 | 529 |
Mean | 614 | 651 | 640 | 604 | 489 |
N50 | 617 | 650 | 639 | 608 | 548 |
Minimum | 348 | 636 | 634 | 261 | 201 |
Maximum | 707 | 698 | 660 | 814 | 634 |
These values look very similar in magnitude to the other example with the 5000x contig. Ah, so for amplicons, does that mean the coverage requirement to create a contig is lower, since the majority of reads should already overlap over the full region (or most of it, anyway)? After Canu, I use Minimap2 to map raw reads back to the Canu contigs. In this dataset, 500-2500 raw reads map back per contig, which would explain why the contigs seemed pretty good despite the low coverage values.
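Concretely, that mapping step looks roughly like the sketch below (file names are placeholders; the `-F 0x904` filter drops unmapped, secondary, and supplementary records so each raw read is counted at most once per contig):

```shell
minimap2 -ax map-ont asm.contigs.fasta raw_reads.fastq |
  samtools view -b -F 0x904 - |
  samtools sort -o mapped.bam -
samtools index mapped.bam
samtools idxstats mapped.bam   # reads mapped per contig
```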
Thanks for the suggestion, I will check out the haplotype parameters. Would you recommend any adjustment to the values in the FAQ on haplotype smashing (corOutCoverage=200 and correctedErrorRate=0.15)?
Thanks!!
The 40x is only a target for correction; when assembling, even a couple of reads are enough to build a contig. You can see the 40x in the expected corrected column. These are corrected reads in the assembly, though, so each one of them could have been built from hundreds of other raw reads. That's why their accuracy will be high (even one read is already relatively high accuracy) and why you get many more mapped reads when you map the raw reads back. You would probably get just as good a result if you took the trimmed reads and picked the longest one.
As I expected, with amplicons Canu tends to over-rescue a lot of coverage (over 6000x in your table). These reads seem shorter than the amplicon, but it is a little surprising that the highest coverage contig is only a few reads in this case; I expect there are lots of short variants between these reads due to random noise preventing them from being collapsed. I'd try increasing correctedErrorRate to see what happens; otherwise, stick with post-filtering out all but the contig with the highest number of reads.
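That post-filtering can be sketched as picking the tig with the largest numChildren value from the tigInfo table (the demo data below are hypothetical and use only four columns; numChildren is the last column in the real file, so adjust the sort key to match your header):

```shell
# Hypothetical 4-column tigInfo excerpt: tigID, tigLen, coverage, numChildren.
printf '#tigID\ttigLen\tcoverage\tnumChildren\n' > demo.tigInfo
printf '1\t616\t18.38\t25\n' >> demo.tigInfo
printf '2\t551\t7.07\t12\n' >> demo.tigInfo
printf '3\t651\t5000.00\t4800\n' >> demo.tigInfo

# Sort the data rows by the numChildren column (4th here), keep the top tig ID.
best=$(awk -F'\t' 'NR > 1' demo.tigInfo | sort -k4,4nr | head -n 1 | cut -f1)
echo "$best"   # prints 3
```

The resulting tig ID identifies the sequence to extract from the contigs FASTA for downstream use.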
Aha, I was wondering why the rescued coverage was so high. I will try adjusting the correctedErrorRate parameter.
Thanks for your help!
Hi - I wanted to follow up with adjusting the correctedErrorRate parameter. My previous runs had correctedErrorRate=0.12 (that seems to be the default as I didn't set it in my canu command, although the Canu documentation says 0.144 is the default for Nanopore reads...), and I increased it to 0.16.
I was following the suggestion from the Canu FAQ:
The default is 0.045 for PacBio reads, and 0.144 for Nanopore reads.
For low coverage:
For less than 30X coverage, increase the allowed difference in overlaps by a few percent
(from 4.5% to 8.5% (or more) with correctedErrorRate=0.105 for PacBio and from 14.4% to 16%
(or more) with correctedErrorRate=0.16 for Nanopore), to adjust for inferior read correction.
Canu will automatically reduce corMinCoverage to zero to correct as many reads as possible.
As you suggested, this had the effect of reducing the number of contigs per sample. Most of these contigs were closer to the target length of 650bp, and the coverage increased as well. Unfortunately, this change only improved my consensus sequence for 1 of 3 samples (validated by BLAST alignment to a Sanger sequence for the same sample). So perhaps 16% was too big a jump from a 12% corrected error rate and allowed too many differences? I'm not sure how sensitive this parameter is.
The default is 12% since version 1.8; it was 14.4% before that. It's good that the increased error rate collapsed more of the contigs down.
The consensus isn't going to change much, and I wouldn't expect it to improve with the higher error rate. You would need to run something like Medaka or Nanopolish to get an improved consensus from the assembly.
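A sketch of such a polishing pass with medaka (file names are placeholders; pick the `-m` model matching your basecaller, or let medaka use its default):

```shell
medaka_consensus -i raw_reads.fastq -d asm.contigs.fasta -o medaka_out -t 4
# polished sequence lands in medaka_out/consensus.fasta
```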
Ok, thanks for all your help!
Hi,
I am using Canu v1.8 to assemble contigs from Nanopore Minion amplicon data on a Mac. My target gene is ~650bp. This is the command I am using:
canu -p $myprefix -d $myinputdir genomeSize=1000 minReadLength=200 minOverlapLength=50 -nanopore-raw $myfastafile
Canu runs without errors, but I'm not sure how to evaluate the quality of my contigs. My understanding is that Canu requires 40x coverage (I'm using the default `corOutCoverage` setting) to generate contigs. However, in the contigs.layout.tigInfo file, the coverage and numChildren values are often much lower than 40. For example, this is the subset of a tigInfo file where tigClass == 'contig'. In most of these contigs, the number of reads and coverage are <10, so I don't think these values represent the depth of coverage for my contigs. I would really appreciate clarification on what these statistics mean, especially with respect to the 40x `corOutCoverage` setting, and whether there are other outputs from Canu that I should use to evaluate contig quality.

In addition, I get many contigs from Canu that BLAST to the same species (with varying degrees of percent identity to the reference sequence). Why do reads for a single species map to multiple contigs instead of just one? Is there a recommended approach for filtering which contigs to use for downstream analysis?
Thank you! Marisa