marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Contig quality and read coverage help #1313

Closed marisalim closed 5 years ago

marisalim commented 5 years ago

Hi,

I am using Canu v1.8 to assemble contigs from Nanopore MinION amplicon data on a Mac. My target gene is ~650bp. This is the command I am using:

canu -p $myprefix -d $myinputdir genomeSize=1000 minReadLength=200 minOverlapLength=50 -nanopore-raw $myfastafile

Canu runs without errors, but I'm not sure how to evaluate the quality of my contigs. My understanding is that Canu requires 40x coverage (I'm using the default corOutCoverage setting) to generate contigs. However, in the contigs.layout.tigInfo file, the coverage and numChildren values are often much lower than 40. For example, this is a subset of a tigInfo file where tigClass == 'contig':

tigLen coordType covStat coverage tigClass sugRept sugCirc numChildren
462 ungapped 1139.41 3.46 contig no no 5
620 ungapped 2629.27 14.99 contig no no 17
624 ungapped 1869.11 7.34 contig no no 8
651 ungapped 1713.5 4974.31 contig no no 3414
627 ungapped 96.81 3.98 contig no no 4
557 ungapped 1254.52 2.28 contig no no 3
612 ungapped 2517.91 29.56 contig no no 35
648 ungapped 0 1 contig no no 1
517 ungapped 1979.97 3.33 contig no no 5
566 ungapped 3253.67 8.85 contig no no 15
522 ungapped 1486.22 2.78 contig no no 4
564 ungapped 0 1 contig no no 1
485 ungapped 1692.5 3.51 contig no no 6
583 ungapped 2495.08 10.26 contig no no 18
563 ungapped 1446.66 2.58 contig no no 4
568 ungapped 2118.28 28.95 contig no no 48
528 ungapped 1249.57 2.15 contig no no 3
646 ungapped 1852.11 3.34 contig no no 4
443 ungapped 1291.99 4.2 contig no no 6
438 ungapped 1474.25 4.78 contig no no 7
519 ungapped 1510.24 4.32 contig no no 5
537 ungapped 0 1 contig no no 1
514 ungapped 0 1 contig no no 1

In most of these contigs, the number of reads and the coverage are <10, so I don't think these values represent the depth of coverage for my contigs. I would really appreciate clarification on what these statistics mean, especially with respect to the 40x corOutCoverage setting, and whether there are other outputs from Canu that I should use to evaluate contig quality.

In addition, I get many contigs from Canu that Blast to the same species (with varying degrees of percent identity to reference sequence). Why do reads for a single species map to multiple contigs instead of just one? Is there a recommended approach for filtering which contigs to use for downstream analysis?

Thank you! Marisa

skoren commented 5 years ago

Canu defaults to using the longest 40x of reads for assembly (each contig can have a different coverage than this), but it also keeps low-coverage variants that aren't represented by the longest 40x.

Normally you don't need to do any filtering post-assembly. However, amplicons aren't really assemblies, so Canu tends to keep too much data. Case in point: it seems like you have over 5000x coverage. You have one 651bp contig with 5000x; the rest all have <30x. I expect the 5000x contig is the one you want to analyze; the rest probably represent off-target reads or artifacts in the data. Canu won't collapse these into the main contig if they are sufficiently different.
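
If you do want to rank them, here is a quick sketch using the tigInfo columns as pasted above (tigClass in field 5, numChildren in field 8; the full file may also have a leading tig ID column that shifts each field right by one, and "prefix" stands in for your -p value):

# sort contig rows by read count (numChildren) so the dominant amplicon contig is on top
awk '$5 == "contig"' prefix.contigs.layout.tigInfo | sort -k8,8nr | head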

I'm not sure what you mean by "Why do reads for a single species map to multiple contigs instead of just one?"

marisalim commented 5 years ago

Thanks for the fast reply!

For other samples in my dataset, the contig coverage values are <20x. For samples that start with very few reads, Canu gives me a warning message saying that coverage is too low to do assembly. In other cases (as below), Canu runs without errors, yet none of the contigs are >=40x coverage. I'm still puzzled by the <40x coverage contigs: are these all artifactual contigs? They're not all off-target, since most of these contigs blast to the correct species and gene. I'd appreciate any suggestions for parameter defaults I could adjust to work better with amplicon data!

tigLen coordType covStat coverage tigClass sugRept sugCirc numChildren
616 ungapped 3918.02 18.38 contig no no 25
551 ungapped 2631.54 7.07 contig no no 12
383 ungapped 550.88 5.48 contig no no 6
586 ungapped 2677.88 3.69 contig no no 6
596 ungapped 2605.58 3.33 contig no no 6
523 ungapped 1569.88 3.48 contig no no 5
554 ungapped 2244.74 3.1 contig no no 5
655 ungapped 269.07 3.93 contig no no 4
602 ungapped 329.32 3.9 contig no no 4
590 ungapped 2010.44 3.44 contig no no 4
342 ungapped 1040.33 3.24 contig no no 4
353 ungapped 895.72 3.11 contig no no 4
368 ungapped 1462.12 2.79 contig no no 4
528 ungapped 1666.98 2.68 contig no no 4
520 ungapped 2010.44 2.66 contig no no 4

Also, does numChildren (the number of reads) represent the depth or breadth of coverage along the contig? From this discussion, I am assuming that the 'coverage' column is indeed the average depth of coverage. Just want to be extra sure I'm interpreting the outputs correctly!

For my question about the multiple contigs - in the contig table from my 1st post, 9 of the 13 contigs blast to one species (this includes the very high coverage contig). I am wondering why the reads that make up these 9 contigs don't get thrown into a single contig. Based on your answer to my other question, it seems that these contigs are sufficiently different (even though they blast to the same species) so Canu will not collapse them into a single contig. Is this right?

Thanks again!

skoren commented 5 years ago

Canu will make contigs from as little as a single read, so the coverage doesn't have to be 40x before it makes an assembly. How much input coverage (raw and corrected) do you have for the datasets where you end up with all coverage <20x? numChildren is the number of reads in the contig.

It is a combination of the contigs being sufficiently different and the fact that with amplicons you aren't really doing assembly. Each read is the full product, so the "contigs" are going to be the longest reads plus everything contained in them. That will likely leave lots of variation between the different reads. You could try the smash-haplotypes parameters from the FAQ to see if that reduces these extra contigs. You could also take the contig with the most coverage/reads (assuming the other contigs blast to it) and just keep that for downstream analysis.
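
As a rough sketch, those FAQ settings applied to your original command would look something like this (corOutCoverage=200 and correctedErrorRate=0.15 are the FAQ values; everything else is carried over from your first post):

canu -p $myprefix -d $myinputdir genomeSize=1000 minReadLength=200 minOverlapLength=50 corOutCoverage=200 correctedErrorRate=0.15 -nanopore-raw $myfastafile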

marisalim commented 5 years ago

For the example with <20x coverage, these are the correction/layout tables from the Canu report output:

category original raw reads (w/ overlaps) original raw reads (w/o overlaps)
Number of Reads 19607 455
Number of Bases 11798318 116180
Coverage 11798.318 116.18
Median 608 0
Mean 601 255
N50 609 604
Minimum 206 0
Maximum 815 760
category evidence reads corrected (raw) corrected (expected) rescued (raw) rescued (expected)
Number of Reads 2721 63 63 13458 13458
Number of Bases 1672930 41048 40367 8136412 6593961
Coverage 1672.93 41.048 40.367 8136.412 6593.961
Median 616 649 638 608 529
Mean 614 651 640 604 489
N50 617 650 639 608 548
Minimum 348 636 634 261 201
Maximum 707 698 660 814 634

These values look very similar in magnitude to the other example with the 5000x contig. Ah, so for amplicons, does that mean the coverage requirement to create a contig is lower, since the majority of reads should already overlap over the full region (or most of it, anyway)? After Canu, I use Minimap2 to map the raw reads back to the Canu contigs. In this dataset, 500-2500 raw reads map back per contig, which would explain why the contigs seemed pretty good despite the low coverage values.
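
(For reference, my mapping step is roughly along these lines; this is just a sketch with placeholder file names, and samtools idxstats is one way to get the per-contig mapped-read counts:)

minimap2 -ax map-ont prefix.contigs.fasta rawreads.fastq | samtools sort -o reads_to_contigs.bam
samtools index reads_to_contigs.bam
samtools idxstats reads_to_contigs.bam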

Thanks for the suggestion, I will check out the haplotype-smashing parameters. Would you recommend any adjustment to the values given in the FAQ for haplotype smashing (corOutCoverage=200 and correctedErrorRate=0.15)?

Thanks!!

skoren commented 5 years ago

The 40x is only a target for correction; when assembling, even a couple of reads are enough to build a contig. You can see the 40x in the expected correction column. These are corrected reads in the assembly, though, so each one of them could have been built from hundreds of other raw reads. That's why their accuracy will be high (even a single corrected read is already relatively high accuracy) and why you get many more mapped reads when you map the raw reads back. You would probably get just as good a result if you took the trimmed reads and picked the longest one.
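
Picking the longest trimmed read is a one-liner, for example (a sketch only; it assumes the default <prefix>.trimmedReads.fasta.gz output from canu):

# report the header and length of the longest trimmed read
gunzip -c prefix.trimmedReads.fasta.gz | awk '/^>/{if(len>max){max=len; best=hdr} hdr=$0; len=0; next} {len+=length($0)} END{if(len>max){max=len; best=hdr} print best, max" bp"}'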

As I expected, with amplicons Canu tends to over-rescue a lot of coverage (over 6000x in your table). These reads seem shorter than the amplicon, but it is a little surprising that the highest-coverage contig has only a few reads in this case; I expect there are lots of small variants between these reads, caused by random noise, that prevent them from being collapsed. I'd try increasing correctedErrorRate to see what happens; otherwise, stick with post-filtering out everything except the contig with the highest number of reads.

marisalim commented 5 years ago

Aha, I was wondering why the rescued coverage was so high. I will try adjusting the correctedErrorRate parameter.

Thanks for your help!

marisalim commented 5 years ago

Hi - I wanted to follow up on adjusting the correctedErrorRate parameter. My previous runs had correctedErrorRate=0.12 (that seems to be the default since I didn't set it in my canu command, although the Canu documentation says 0.144 is the default for Nanopore reads...), and I increased it to 0.16.

I was following the suggestion from the Canu FAQ:

The default is 0.045 for PacBio reads, and 0.144 for Nanopore reads.

For low coverage:

        For less than 30X coverage, increase the allowed difference in overlaps by a few percent (from 4.5% to 8.5% (or more) with correctedErrorRate=0.105 for PacBio and from 14.4% to 16% (or more) with correctedErrorRate=0.16 for Nanopore), to adjust for inferior read correction. Canu will automatically reduce corMinCoverage to zero to correct as many reads as possible.

As you suggested, this had the effect of reducing the number of contigs per sample. Most of these contigs were closer to the target length of 650bp, and the coverage increased as well. Unfortunately, this change only improved my consensus sequence for 1 of 3 samples (validated by blast alignment to a Sanger sequence for the same sample). So perhaps 16% was too big a jump from the 12% corrected error rate and allowed too many differences? I'm not sure how sensitive this parameter is.
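
One way I could test the sensitivity is to just sweep the value, something like the following (a sketch only; the intermediate values are arbitrary and the variables are the same placeholders as in my original command):

for rate in 0.12 0.13 0.14 0.15 0.16; do
    canu -p ${myprefix}_cer${rate} -d ${myinputdir}_cer${rate} genomeSize=1000 minReadLength=200 minOverlapLength=50 correctedErrorRate=${rate} -nanopore-raw $myfastafile
done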

skoren commented 5 years ago

The default has been 12% since version 1.8; it was 14.4% before that. It's good that the increased error rate collapsed more of the contigs.

The consensus isn't going to change much, and I wouldn't expect it to improve with the higher error rate. You would need to run something like medaka or nanopolish to get an improved consensus from the assembly.
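
For example, with medaka the polishing step is roughly the following (a sketch, assuming medaka is installed, using placeholder file names, and picking the model matching your basecaller with -m if needed):

medaka_consensus -i rawreads.fastq -d prefix.contigs.fasta -o medaka_polish -t 4
# the polished sequence ends up in medaka_polish/consensus.fasta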

marisalim commented 5 years ago

Ok, thanks for all your help!