marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Contig quality and read coverage help #1313

Closed marisalim closed 5 years ago

marisalim commented 5 years ago

Hi,

I am using Canu v1.8 to assemble contigs from Nanopore MinION amplicon data on a Mac. My target gene is ~650bp. This is the command I am using:

canu -p $myprefix -d $myinputdir genomeSize=1000 minReadLength=200 minOverlapLength=50 -nanopore-raw $myfastafile

Canu runs without errors, but I'm not sure how to evaluate the quality of my contigs. My understanding is that Canu requires 40x coverage (I'm using the default corOutCoverage setting) to generate contigs. However, in the contigs.layout.tigInfo file, the coverage and numChildren values are often much lower than 40. For example, this is a subset of a tigInfo file where tigClass == 'contig':

tigLen coordType covStat coverage tigClass sugRept sugCirc numChildren
462 ungapped 1139.41 3.46 contig no no 5
620 ungapped 2629.27 14.99 contig no no 17
624 ungapped 1869.11 7.34 contig no no 8
651 ungapped 1713.5 4974.31 contig no no 3414
627 ungapped 96.81 3.98 contig no no 4
557 ungapped 1254.52 2.28 contig no no 3
612 ungapped 2517.91 29.56 contig no no 35
648 ungapped 0 1 contig no no 1
517 ungapped 1979.97 3.33 contig no no 5
566 ungapped 3253.67 8.85 contig no no 15
522 ungapped 1486.22 2.78 contig no no 4
564 ungapped 0 1 contig no no 1
485 ungapped 1692.5 3.51 contig no no 6
583 ungapped 2495.08 10.26 contig no no 18
563 ungapped 1446.66 2.58 contig no no 4
568 ungapped 2118.28 28.95 contig no no 48
528 ungapped 1249.57 2.15 contig no no 3
646 ungapped 1852.11 3.34 contig no no 4
443 ungapped 1291.99 4.2 contig no no 6
438 ungapped 1474.25 4.78 contig no no 7
519 ungapped 1510.24 4.32 contig no no 5
537 ungapped 0 1 contig no no 1
514 ungapped 0 1 contig no no 1

In most of these contigs, the number of reads and the coverage are <10, so I don't think these values represent the depth of coverage for my contigs. I would really appreciate clarification on what these statistics mean, especially with respect to the 40x corOutCoverage setting, and whether there are other outputs from Canu that I should use to evaluate contig quality.

In addition, I get many contigs from Canu that Blast to the same species (with varying degrees of percent identity to reference sequence). Why do reads for a single species map to multiple contigs instead of just one? Is there a recommended approach for filtering which contigs to use for downstream analysis?

Thank you! Marisa

skoren commented 5 years ago

Canu defaults to using the longest 40x of reads for assembly (each contig can have a different coverage than this), but it also keeps low-coverage variants that aren't represented by the longest 40x.

Normally you don't need to do any filtering post-assembly. However, amplicons aren't really assemblies, so Canu tends to keep too much data. Case in point: it seems like you have over 5000x coverage. You have one 651bp contig with 5000x; the rest all have <30x. I expect the 5000x contig is the one you want to analyze; the rest probably represent off-target reads or artifacts in the data. Canu won't collapse these into the main contig if they are sufficiently different.
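
If you do want to rank them, here is a quick sketch using the tigInfo columns as pasted above (tigClass in field 5, numChildren in field 8; the full file may also have a leading tig ID column that shifts each field right by one, and "prefix" stands in for your -p value):

# sort contig rows by read count (numChildren) so the dominant amplicon contig is on top
awk '$5 == "contig"' prefix.contigs.layout.tigInfo | sort -k8,8nr | head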

I'm not sure what you mean by "Why do reads for a single species map to multiple contigs instead of just one?"

marisalim commented 5 years ago

Thanks for the fast reply!

For other samples in my dataset, the contig coverage values are <20x. For samples that start with very few reads, Canu gives me a warning message saying that coverage is too low to do assembly. In other cases (as below), Canu runs without errors, yet none of the contigs are >=40x coverage. I'm still puzzled by the <40x coverage contigs: are these all artifactual contigs? They're not all off-target, since most of these contigs blast to the correct species and gene. I'd appreciate any suggestions for parameter defaults I could adjust to work better with amplicon data!

tigLen coordType covStat coverage tigClass sugRept sugCirc numChildren
616 ungapped 3918.02 18.38 contig no no 25
551 ungapped 2631.54 7.07 contig no no 12
383 ungapped 550.88 5.48 contig no no 6
586 ungapped 2677.88 3.69 contig no no 6
596 ungapped 2605.58 3.33 contig no no 6
523 ungapped 1569.88 3.48 contig no no 5
554 ungapped 2244.74 3.1 contig no no 5
655 ungapped 269.07 3.93 contig no no 4
602 ungapped 329.32 3.9 contig no no 4
590 ungapped 2010.44 3.44 contig no no 4
342 ungapped 1040.33 3.24 contig no no 4
353 ungapped 895.72 3.11 contig no no 4
368 ungapped 1462.12 2.79 contig no no 4
528 ungapped 1666.98 2.68 contig no no 4
520 ungapped 2010.44 2.66 contig no no 4

Also, does numChildren (the number of reads) represent the depth or breadth of coverage along the contig? From this discussion, I am assuming that the 'coverage' column is indeed the average depth of coverage. Just want to be extra sure I'm interpreting the outputs correctly!

For my question about the multiple contigs - in the contig table from my 1st post, 9 of the 13 contigs blast to one species (this includes the very high coverage contig). I am wondering why the reads that make up these 9 contigs don't get thrown into a single contig. Based on your answer to my other question, it seems that these contigs are sufficiently different (even though they blast to the same species) so Canu will not collapse them into a single contig. Is this right?

Thanks again!

skoren commented 5 years ago

Canu will make contigs from as little as a single read, so the coverage doesn't have to be 40x before it makes an assembly. How much input coverage (raw and corrected) do you have for the datasets where you end up with all coverage <20x? numChildren is the number of reads in the contig.

It is a combination of the contigs being sufficiently different and the fact that with amplicons you aren't really doing assembly. Each read is the full product, so the "contigs" are going to be the longest reads plus everything contained in them. That will likely leave lots of variation between the different reads. You could try the smash-haplotypes parameters from the FAQ to see if that reduces these extra contigs. You could also take the contig with the most coverage/reads (assuming the other contigs blast to it) and just keep that for downstream analysis.
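
As a rough sketch, those FAQ settings applied to your original command would look something like this (corOutCoverage=200 and correctedErrorRate=0.15 are the FAQ values; everything else is carried over from your first post):

canu -p $myprefix -d $myinputdir genomeSize=1000 minReadLength=200 minOverlapLength=50 corOutCoverage=200 correctedErrorRate=0.15 -nanopore-raw $myfastafile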

marisalim commented 5 years ago

For the example with <20x coverage, these are the correction/layout tables from the Canu report output:

category original raw reads (w/ overlaps) original raw reads (w/o overlaps)
Number of Reads 19607 455
Number of Bases 11798318 116180
Coverage 11798.318 116.18
Median 608 0
Mean 601 255
N50 609 604
Minimum 206 0
Maximum 815 760
category evidence reads corrected (raw) corrected (expected) rescued (raw) rescued (expected)
Number of Reads 2721 63 63 13458 13458
Number of Bases 1672930 41048 40367 8136412 6593961
Coverage 1672.93 41.048 40.367 8136.412 6593.961
Median 616 649 638 608 529
Mean 614 651 640 604 489
N50 617 650 639 608 548
Minimum 348 636 634 261 201
Maximum 707 698 660 814 634

These values look very similar in magnitude to the other example with the 5000x contig. Ah, so for amplicons, does that mean the coverage requirement to create a contig is lower, since the majority of reads should already overlap over the full region (or most of it, anyway)? After Canu, I use Minimap2 to map the raw reads back to the Canu contigs. In this dataset, 500-2500 raw reads map back per contig, which would explain why the contigs seemed pretty good despite the low coverage values.
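
(For reference, my mapping step is roughly along these lines; this is just a sketch with placeholder file names, and samtools idxstats is one way to get the per-contig mapped-read counts:)

minimap2 -ax map-ont prefix.contigs.fasta rawreads.fastq | samtools sort -o reads_to_contigs.bam
samtools index reads_to_contigs.bam
samtools idxstats reads_to_contigs.bam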

Thanks for the suggestion, I will check out the haplotype-smashing parameters. Would you recommend any adjustment to the values given in the FAQ for haplotype smashing (corOutCoverage=200 and correctedErrorRate=0.15)?

Thanks!!

skoren commented 5 years ago

The 40x is only a target for correction; when assembling, even a couple of reads are enough to build a contig. You can see the 40x in the expected correction column. These are corrected reads in the assembly, though, so each one of them could have been built from hundreds of other raw reads. That's why their accuracy will be high (even a single corrected read is already relatively high accuracy) and why you get many more mapped reads when you map the raw reads back. You would probably get just as good a result if you took the trimmed reads and picked the longest one.
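
Picking the longest trimmed read is a one-liner, for example (a sketch only; it assumes the default <prefix>.trimmedReads.fasta.gz output from canu):

# report the header and length of the longest trimmed read
gunzip -c prefix.trimmedReads.fasta.gz | awk '/^>/{if(len>max){max=len; best=hdr} hdr=$0; len=0; next} {len+=length($0)} END{if(len>max){max=len; best=hdr} print best, max" bp"}'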

As I expected, with amplicons Canu tends to over-rescue a lot of coverage (over 6000x in your table). These reads seem shorter than the amplicon, but it is a little surprising that the highest-coverage contig has only a few reads in this case; I expect there are lots of small variants between these reads, caused by random noise, that prevent them from being collapsed. I'd try increasing correctedErrorRate to see what happens; otherwise, stick with post-filtering out everything except the contig with the highest number of reads.

marisalim commented 5 years ago

Aha, I was wondering why the rescued coverage was so high. I will try adjusting the correctedErrorRate parameter.

Thanks for your help!

marisalim commented 5 years ago

Hi - I wanted to follow up on adjusting the correctedErrorRate parameter. My previous runs had correctedErrorRate=0.12 (that seems to be the default since I didn't set it in my canu command, although the Canu documentation says 0.144 is the default for Nanopore reads...), and I increased it to 0.16.

I was following the suggestion from the Canu FAQ:

The default is 0.045 for PacBio reads, and 0.144 for Nanopore reads.

For low coverage:

        For less than 30X coverage, increase the allowed difference in overlaps by a few percent (from 4.5% to 8.5% (or more) with correctedErrorRate=0.105 for PacBio and from 14.4% to 16% (or more) with correctedErrorRate=0.16 for Nanopore), to adjust for inferior read correction. Canu will automatically reduce corMinCoverage to zero to correct as many reads as possible.

As you suggested, this had the effect of reducing the number of contigs per sample. Most of these contigs were closer to the target length of 650bp, and the coverage increased as well. Unfortunately, this change only improved my consensus sequence for 1 of 3 samples (validated by blast alignment to a Sanger sequence for the same sample). So perhaps 16% was too big a jump from the 12% corrected error rate and allowed too many differences? I'm not sure how sensitive this parameter is.
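
One way I could test the sensitivity is to just sweep the value, something like the following (a sketch only; the intermediate values are arbitrary and the variables are the same placeholders as in my original command):

for rate in 0.12 0.13 0.14 0.15 0.16; do
    canu -p ${myprefix}_cer${rate} -d ${myinputdir}_cer${rate} genomeSize=1000 minReadLength=200 minOverlapLength=50 correctedErrorRate=${rate} -nanopore-raw $myfastafile
done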

skoren commented 5 years ago

The default has been 12% since version 1.8; it was 14.4% before that. It's good that the increased error rate collapsed more of the contigs.

The consensus isn't going to change much, and I wouldn't expect it to improve with the higher error rate. You would need to run something like medaka or nanopolish to get an improved consensus from the assembly.
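
For example, with medaka the polishing step is roughly the following (a sketch, assuming medaka is installed, using placeholder file names, and picking the model matching your basecaller with -m if needed):

medaka_consensus -i rawreads.fastq -d prefix.contigs.fasta -o medaka_polish -t 4
# the polished sequence ends up in medaka_polish/consensus.fasta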

marisalim commented 5 years ago

Ok, thanks for all your help!