marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Assemble the lambda phage genome #552

Closed: efratca closed this issue 7 years ago

efratca commented 7 years ago

Hi, I am trying to assemble the lambda phage genome from an Oxford Nanopore dataset with 2548.91x coverage (15899 reads). Running canu-1.5 with the default parameters used 40X coverage (50 reads) for the correction step (I understand that by default only the longest 40X of data, based on the specified genome size, is used for correction) and produced 0 contigs. I then set corOutCoverage=100 and got 1 contig, 80247 bp long (larger than the genome size). Aligning the assembly to the reference genome with Quast revealed a mis-assembled contig. I also tried taking a random portion of the reads (4000 reads, 587x coverage) with corOutCoverage=400, and again got one contig, 66269 bp long (larger than the genome size) and mis-assembled. Can you please explain why I got an assembly larger than the genome size, and how I can improve my assembly to get a correct contig? In addition, I wanted to ask about the default of using the longest 40X of coverage for correction: how does it handle a case where a few incorrect, very long reads make up that 40X of coverage? Thanks

skoren commented 7 years ago

Two comments. First, the full-coverage dataset is probably giving 0 contigs because the whole genome is contained in a single read. By default such contigs get filtered by Canu, but you can relax the filtering parameters to avoid this: contigFilter="2 1000 1.0 1.0 2" (see the FAQ: http://canu.readthedocs.io/en/latest/faq.html#my-asm-contigs-fasta-is-empty-why). This filtering will be improved in the next release to avoid dropping these kinds of short genomes.

Second, the longest 40X is counted after estimating trimmed sequence lengths, so a few incorrect long reads wouldn't affect the assembly. When you say mis-assembled, what do you mean exactly? When the genome is circular, the final contig can start at an arbitrary position in the genome and will wrap around the origin. The contig can also include a one-read overlap with itself, making it larger than the genome size. This can look like a mis-assembly but is actually correct sequence given a circular genome:

[screenshot: dot plot of an assembled contig aligned to a 9 kb plasmid reference, showing the wrap around the origin]

This is an example of a 9 kb plasmid where the assembly wraps around the origin of the reference and also has a 3 kb overlap with itself (so the assembled contig is 12 kb, not the expected 9).

This is consistent with your output. When you use all the data, the assembled reads are very long, so it is plausible that a 30 kb read produced a 30 kb overlap in the 80 kb contig, making it the correct size. As you reduce coverage you reduce the longest read, so you end up with a shorter contig and a shorter overlap. The GFA output captures this information and lists the overlap between the contig and itself, which confirms its circularity and gives the approximate position where you can trim the contig.
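A quick way to check for this self-overlap is to scan the GFA for L records whose source and target are the same contig. A minimal sketch in Python (this is not Canu's own tooling, and the file name "asm.contigs.gfa" is illustrative):

# Minimal sketch: find GFA "L" records that link a contig to itself,
# the signal for a circular contig with an end-to-start overlap.
def self_links(gfa_path):
    with open(gfa_path) as gfa:
        for line in gfa:
            fields = line.rstrip("\n").split("\t")
            # L <from> <fromOrient> <to> <toOrient> <overlap CIGAR>
            if len(fields) >= 6 and fields[0] == "L" and fields[1] == fields[3]:
                yield fields[1], fields[2], fields[4], fields[5]

for tig, o_from, o_to, cigar in self_links("asm.contigs.gfa"):
    print(tig, "links to itself", o_from + "/" + o_to, "CIGAR:", cigar[:30] + "...")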

efratca commented 7 years ago

Thanks a lot for the reply. As you suggested, I looked at the GFA output, and indeed, for the run on the randomly sampled reads, the GFA output (in part) was:

H   VN:Z:bogart/edges
S   tig00000001 *   LN:i:66269
L   tig00000001 +   tig00000001 +   14M4I102M........

Do I understand correctly that the CIGAR string means the overlap between the contig and itself starts where there is an alignment to the first 14 bases of the contig? Trimming the contig at that position left a contig 47655 bp long, and aligning it to the reference genome with Quast showed a correct contig. In contrast, the GFA output when using all the data (corOutCoverage=100) was:

H   VN:Z:bogart/edges
S   tig00000001 *   LN:i:82056

So in that case the overlap of the contig with itself can't be seen; can you please explain that? In addition, I ran all the data (default corOutCoverage=40) with contigFilter="2 1000 1.0 1.0 2" as you suggested, and indeed got 1 contig, 69777 bp long. Aligning the assembly to the reference genome with Quast showed a correct contig in the contig size viewer, but the contig alignment viewer showed a correct portion along with mis-assembled blocks. This is probably because of the overlap of the contig with itself, but again the overlap wasn't shown in the GFA output.

In summary, can you please clarify whether my reading of the overlap is correct in the case where the link line was present in the GFA output, and why the link line is missing in the other cases? Also, when coverage is high, is it recommended to take a random portion of the reads or to use all of them, and to keep the default corOutCoverage=40 or to increase it? Finally, I want to understand when the longest 40X of reads is chosen. You wrote that the longest 40X is counted after trimming sequences, but the documentation says Canu tries to correct the longest 40X of data, and from Canu's report it seemed the 40X is chosen before the correction step. Is there an additional trimming step before correction? Thanks

skoren commented 7 years ago

The CIGAR string indicates the alignment starts with 14 matches, then 4 insertions, then 102 matches, and so on, so the overlap is much longer than 14 bp; but yes, there should be an L line in your GFA file. Are you running the release 1.5 version? There have been improvements in the latest code to detect and properly mark circular contigs, so it should capture the edges missing from your output. If you are able to share your data, we can test with the latest code and confirm it fixes the missing link.
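To see how long the overlap really is, you can sum the CIGAR operations that consume bases on the target contig. A minimal sketch assuming the standard SAM operator meanings, fed with the truncated string from your GFA above:

import re

# Minimal sketch: overlap length on the target contig from a GFA
# link's CIGAR. M/=/X and D/N consume target bases; I does not.
def target_overlap_length(cigar):
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MDN=X")

# First three operations of the link line above; the real string
# continues, so the actual overlap is longer still.
print(target_overlap_length("14M4I102M"))  # -> 116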

As for trimming and 40x: it is typically OK to give Canu all the data and let it choose the longest 40x. If you have extremely deep coverage (>200x) you can take a random 200x subset to save compute time, but using all the data should still work. The 40x is chosen before correction but after computing overlaps, so Canu can accurately estimate the corrected read lengths while accounting for read quality. That is, long reads which have no overlaps aren't considered "long" anymore, and reads with no coverage in the middle (e.g. chimeric reads) are split when estimating their lengths. So the 40x is based on the estimated trim points of the raw reads.
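In pseudocode, the selection is roughly the following; this is a sketch of the idea using estimated post-trim lengths, not Canu's actual implementation:

# Minimal sketch: keep the longest reads (by estimated trimmed length)
# until their total reaches coverage * genome_size.
def pick_longest(trimmed_lengths, genome_size, coverage=40):
    target = coverage * genome_size
    picked, total = [], 0
    for length in sorted(trimmed_lengths, reverse=True):
        if total >= target:
            break
        picked.append(length)
        total += length
    return picked

# For lambda phage (~48502 bp) at ~2500x input coverage, only a small
# fraction of the longest reads survives this cut.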

efratca commented 7 years ago

Thanks a lot for the answer. Yes, I am running canu-1.5; is there later code? Otherwise, do you have any suggestions for why the link line is missing?

Another thing: I am running another project, and the GFA output is:

H   VN:Z:bogart/edges
S   tig00000001 *   LN:i:2296736
S   tig00000002 *   LN:i:1230690
S   tig00000005 *   LN:i:603201
S   tig00000006 *   LN:i:705049
S   tig00000009 *   LN:i:293267
S   tig00000011 *   LN:i:248542
S   tig00000013 *   LN:i:272127
S   tig00000015 *   LN:i:162082
S   tig00000017 *   LN:i:133012
S   tig00000018 *   LN:i:205129
S   tig00000019 *   LN:i:171842
S   tig00000022 *   LN:i:25387
S   tig00000024 *   LN:i:12548
L   tig00000017 +   tig00000017 +   20M2I14M1I24M1I31M1I1M1I43M1I2M1I11M1D1M1D52M1D56M1D3M1D8M2I50M1I38M2D45M2I169M1I2M1I18M4I3M1I1M2I83M1I1M2I1M1I71M5I80M1I56M1I166M1D42M1D72M1I16M1I63M2I1M2I103M1I2M3I22M3I15M1D1M1D1M2D8M1D16M1I114M1I88M1I68M1I38M1D3M2I8M1D72M1D20M1D13M1I143M1I9M1D156M1I113M2I139M1D4M2I233M1I8M1I1M1I62M1D1M1I8M1I1M2I45M1I1M1I12M1I1M1I60M1I43M1I13M1D43M1I9M1D28M1I54M2D1M1D6M1I4M1I1M1I165M1I44M1D56M1I1M1I34M1I1M3I16M2I41M1I35M1I62M1I1M1I197M1I43M1I5M2I19M1I2M2I120M1I24M1D4M1I90M1I16M1I92M1D41M1D2M1D20M1I175M1D1M1I49M1I1M1I30M1D7M1I53M1D67M1I20M1I2M1I48M1D70M2I25M1D2M1D22M2D1M1D24M2D2M1D45M1D44M1D42M1I51M1D1M1D45M1D2M1D44M1D11M1D3M1D3M1D21M1D37M2I1M1I76M1I135M2I6M2I6M2I4M1I12M2D13M5D1M1D1M1D146M2D18M2I26M1I10M2D1M1D1M1D59M1I4M1I20M2D6M2D34M4D82M1I39M1I77M1I11M1I11M1D16M1D25M1D24M1I36M1D126M1I14M1D56M1I34M1D104M1D4M1I8M2D102M1D3M1D19M1D48M3D40M1I32M1D18M1D23M1I42M1I101M2D4M1D26M2I38M2D1M1D77M1D3M1D26M2I14M1D7M1I183M1D1M1I15M4D27M1D80M1D18M1D8M2D32M1D56M1D86M4D1M1D24M1D50M1D43M1D12M1I1M1I74M2D13M2D18M2I9M2D1M2D29M1I7M1D12M1D2M1I3M1D3M2D10M1I73M4D21M1I2M1I7M1I10M1D2M1I17M1D51M2D2M1D1M2D43M3D11M1I50M1D47M1I19M1D29M
L   tig00000017 -   tig00000017 -   20M2I14M1I24M1I31M1I1M1I43M1I2M1I11M1D1M1D52M1D56M1D3M1D8M2I50M1I38M2D45M2I169M1I2M1I18M4I3M1I1M2I83M1I1M2I1M1I71M5I80M1I56M1I166M1D42M1D72M1I16M1I63M2I1M2I103M1I2M3I22M3I15M1D1M1D1M2D8M1D16M1I114M1I88M1I68M1I38M1D3M2I8M1D72M1D20M1D13M1I143M1I9M1D156M1I113M2I139M1D4M2I233M1I8M1I1M1I62M1D1M1I8M1I1M2I45M1I1M1I12M1I1M1I60M1I43M1I13M1D43M1I9M1D28M1I54M2D1M1D6M1I4M1I1M1I165M1I44M1D56M1I1M1I34M1I1M3I16M2I41M1I35M1I62M1I1M1I197M1I43M1I5M2I19M1I2M2I120M1I24M1D4M1I90M1I16M1I92M1D41M1D2M1D20M1I175M1D1M1I49M1I1M1I30M1D7M1I53M1D67M1I20M1I2M1I48M1D70M2I25M1D2M1D22M2D1M1D24M2D2M1D45M1D44M1D42M1I51M1D1M1D45M1D2M1D44M1D11M1D3M1D3M1D21M1D37M2I1M1I76M1I135M2I6M2I6M2I4M1I12M2D13M5D1M1D1M1D146M2D18M2I26M1I10M2D1M1D1M1D59M1I4M1I20M2D6M2D34M4D82M1I39M1I77M1I11M1I11M1D16M1D25M1D24M1I36M1D126M1I14M1D56M1I34M1D104M1D4M1I8M2D102M1D3M1D19M1D48M3D40M1I32M1D18M1D23M1I42M1I101M2D4M1D26M2I38M2D1M1D77M1D3M1D26M2I14M1D7M1I183M1D1M1I15M4D27M1D80M1D18M1D8M2D32M1D56M1D86M4D1M1D24M1D50M1D43M1D12M1I1M1I74M2D13M2D18M2I9M2D1M2D29M1I7M1D12M1D2M1I3M1D3M2D10M1I73M4D21M1I2M1I7M1I10M1D2M1I17M1D51M2D2M1D1M2D43M3D11M1I50M1D47M1I19M1D29M

I understood, as in the former example, that contig tig00000017 overlaps itself and that the overlap starts with an alignment to the first 20 bases, but when I looked for a match to the first 20 bases, I didn't find one. Can you please explain that? Finally, I understood that it's OK to run with the default corOutCoverage=40, but when I tried corOutCoverage=40 and corOutCoverage=100 corMhapSensitivity=normal on the second project, the number of contigs differed between the two assemblies. Do you have any explanation for that?

Thanks

skoren commented 7 years ago

You can try the latest unreleased code by cloning the GitHub repo rather than downloading a release, following the install instructions at http://canu.readthedocs.io/en/latest/. Or you can share the data and we can run it locally.

The alignment is output as a CIGAR string; see https://samtools.github.io/hts-specs/SAMv1.pdf for details. You can convert it to an alignment by following the operations, using the end of tig00000017 as the reference and the start of tig00000017 as the query. The total alignment length is much larger than 20 bases. You could also align the contig to itself using MUMmer to find the exact positions.
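If you want to script the self-alignment rather than use MUMmer, the minimap2 Python binding (mappy) can do it too. A minimal sketch with an illustrative file name; it may need parameter tuning, and MUMmer remains the more standard tool for this:

# Minimal sketch: align a contig to itself with mappy (pip install
# mappy) and report off-diagonal hits, which locate the repeated
# sequence at the ends of a circular contig.
import mappy as mp

aligner = mp.Aligner("tig00000017.fasta", preset="asm5")
if not aligner:
    raise RuntimeError("failed to load/build the index")

name, seq, _ = next(mp.fastx_read("tig00000017.fasta"))
for hit in aligner.map(seq):
    # Skip the trivial full-length diagonal hit.
    if (hit.q_st, hit.q_en) != (hit.r_st, hit.r_en):
        print(name, hit.q_st, hit.q_en, hit.r_st, hit.r_en, hit.strand)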

As for the assembly differences: yes, you are outputting more bases and using different overlapping parameters (corMhapSensitivity), so the assemblies won't be identical. The length of the largest contig will probably change, and the overlaps in the circular contigs will change as well, since those depend on the corrected read lengths; but if you aligned the two assemblies to each other, I expect they would largely match.

efratca commented 7 years ago

Thanks a lot for the answer. I ran the lambda phage data with the latest unreleased code, and this time I got 2 contigs (instead of 1) that seem to overlap each other. The GFA output is:

H   VN:Z:bogart/edges
S   tig00000007 *   LN:i:54385
S   tig00000008 *   LN:i:53254
L   tig00000008 -   tig00000007 -   7M1D2M2D12M1I3M2I1M2I23M2I56M2I33M3I26M6I48M1I15M1I8M1D14M2I2M4I1M1I22M1I38M1I4M3I36M1D15M1I46M1I26M3I2M3I1M1I126M1I1M1I56M1D8M1D2M1D5M2D1M3D2M11D1M3D5M7D2M1D2M1D1M2D1M5D4M1D2M1D1M2D2M3D1M6D1M3D2M1D3M3D1M4D2M2D2M5D1M6D3M6D22M1D14M1D74M1D1M1D16M1D109M3D11M1I109M1D22M1D3M1D9M1I99M7D98M3D58M2I77M2I40M1I122M1D2M1D83M1I26M1D1M5D13M2I18M1D74M4D43M2D59M1I184M2D8M1I1M2I104M1D13M2D64M1I4M1I73M1D26M1D2M1D40M1D25M2D25M1D51M1D1M2D119M1I230M2D108M2D1M1D23M1D1M5D24M1D83M2D1M1D28M4D150M5D47M3D2M1D1M1D13M1D9M1D13M2D12M1D69M2I2M1D35M1I83M2D1M1D187M4D12M1D40M1D2M1D14M1D86M1D1M2D111M1D6M1D1M1D67M4D72M1D1M1D3M2D38M1D27M1D8M1D85M2D40M1D45M4D5M2D33M2I4M1D42M4D17M2D15M4D9M1D34M1D1M2D35M1I35M1I28M1D45M2I56M1I201M1D78M4D50M1I72M1I160M2D17M2D1M1D100M1D1M1D129M1D1M1D59M1D95M1D71M1I204M4D125M2D100M2D45M1D197M1I65M1D90M1I134M1D81M1D15M3D117M3D132M1D4M1D34M1D2M1D54M1I4M1D19M3D82M1D3M1D2M1D54M1D3M1D15M1D72M1I3M1I2M1D70M1D44M1I22M1I15M1D104M1D53M1I166M1D1M1D10M1D2M1D54M2D265M1I17M1D82M1D48M2I58M2D33M2I2M1I28M1D1M1D11M1D76M1I8M1D153M2D386M3I78M2D144M2D68M1D2M1D54M1I52M2D7M1D1M1I7M1D29M1D161M1D3M1D1M2D1M1D89M1D50M1D23M1D81M2D1M1D19M1D13M2D3M1D1M1D141M2D71M2D1M1D122M1D1M1D1M2D53M1D1M1D290M2I20M4D47M2D50M2D2M2D20M1D2M2D49M1D2M1D48M4D149M1D46M1D151M1D1M1D101M1I2M1D4M2I9M1D92M1D143M1D94M2D61M1D53M1D1M1D4M1D116M1D10M1D4M3D146M3D1M1D50M5D58M1D45M1D23M1I170M1D12M1D102M2D1M1D31M2D30M1D2M1I435M2D1M1D3M1D3M1D7M1D232M1D93M1D99M2D19M1D88M1D40M2D1M3D171M1D49M1D52M3D19M1I11M1I41M2D2M1D78M3D2M3D50M1D48M1D47M1D58M1D17M2D197M2D64M1I1M1I1M1I231M1D12M1D143M1I44M2D38M2D127M2D18M1D120M4D35M1D71M1D91M1D43M3D2M1D197M1D14M3D32M1D6M1I4M2D2M3D14M1D1M2D293M1D138M2D14M1D30M1I5M1D23M1D119M1D239M2D75M2D97M2D1M1D57M1D38M1D45M5D77M1D30M1I60M1I9M1D25M1D23M1D29M1I62M5D8M2D10M1D20M2D1M1D168M2D26M1D127M1D18M1D10M1I68M2D84M1D16M3D22M2D21M2D2M1D36M1I122M2D1M1D5M2D8M1D1M1D25M2D11M1I112M1D57M1D2M2D25M3D87M1D2M1D89M3D145M3D4M3D30M1D1M3D1M1D3M2D10M1I11M1D15M3D53M1D94M1D87M3D115M5D17M2D1M1D35M4D54M3D53M1D1M2D3M3D8M1D56M2D55M2D49M1D24M1D4M2D1M1D12M2I36M1D30M2D50M2D43M1D1M1D32M1I3M4D19M1D28M1D1M1D85M2D70M2D1M2D3M1D7M2D22M1I28M1I9M1D63M2D11M2D41M1D10M1D59M4D103M1D4M1D2M3D64M1I9M2D16M3D1M1D9M2D10M1D94M1I7M4D59M2D55M1D65M1I6M1D112M1I25M1D46M1D39M1D24M2D1M1D27M1D27M1I161M2D2M3D21M2D1M1D76M1D88M1D22M3D1M3D55M2D1M6D1M1D290M2D1M1D30M1D1M1D83M1D53M2D13M1D1M1D80M1D1M3D2M1D1M4D3M1D2M1I2M1D2M1D5M2D46M1I11M1D14M4D20M2D18M1D2M1D1M7D15M4D25M1D43M1D4M1D1M3D5M1D3M2D2M1D3M3D8M1D27M1D41M2D1M1D1M1D6M1D38M3D25M1D5M1D86M1D82M3D1M2D33M1D7M1D78M1I35M2D99M1D43M1I7M1I30M1D63M1D2M1D12M1D39M2D1M2D1M1D41M4D49M1D1M1D107M3D14M1I116M1D56M1D15M1D14M2D30M2D14M1D4M1D68M2D1M2D77M1D54M2D1M5D16M1D38M2D25M1D81M2D84M2D14M2D29M1D8M2D29M5D31M1D12M1D88M1D2M1D172M1D31M1I27M3D12M1D2M3D105M1D3M1D107M2D28M1D123M1D57M2D45M2D16M2D54M3D138M2D100M2D8M2D30M1D15M3D164M1D31M2D52M1I16M5D1M1D57M3D4M1D23M1D31M1D29M1D66M1D40M3D14M2D2M1D128M1D3M2D71M4D62M1I93M3D31M6D30M1D311M1D2M2D26M4D59M1D57M3D57M1D39M1D2M1D1M2D1M2D89M2D141M1D27M1D2M1D46M2D1M4D43M3D24M4D16M1D98M3D19M1D29M2D33M1D47M4D62M1I68M3D1M1D55M1D142M6D2M1D163M1D97M3D26M1I48M1D81M1D2M1D21M4D262M1D2M1D1M1D1M7D1M1D59M3D2M3D16M2D52M1D19M2D171M3D73M3D129M1D4M3D127M1I9M1D5M1D1M3D1M2D24M1D68M2D98M4D86M1D109M1I9M1D73M1D67M5D5M2D32M1D8M2D71M2D26M2D1M1D1M2D25M1D8M1I9M2D3M2D66M1D111M1D108M1D10M1D46M1D5M1D39M1D1M1D9M1D6M3D13M3D10M1D1M1D1M2D169M1D48M1D24M1D1M1D123M1D49M1D52M1D128M1D3M1D49M3D1M1D1M1D30M1D164M1D8M5D35M1D17M3D44M1D45M4D1M3D7M1D1M2D193M1D172M2D30M1D109M2D2M2D119M1D3M1I73M1D73M1D36M2D18M4D1M2D11M1I90M1D17M1D1M2D81M3D38M1D5M1D1M1D122M1D4M3D8M3D28M5D72M3D16M5D15M2D12M1D12M2D44M4D72M2D70M1D16M1I13M3D28M1D11M4D85M2D6M1D8M4D39M1D56M2D2M1D55M3D27M2D3M1D18M1D113M1D57M1I118M1D24M2D5M1I12M2D87M1D51M4D36M1D7M1D36M1D56M1D1M1D2M2D31M1D14M1D43M1D25M1D21M2D91M1D20M1I22M1D1M7D1M1D21M1D17M1I159M1I14M1D67M1I30M1D29M1D38M2D1M2D1M1D37M1I2M1I9M1D90M2D91M1I24M1I87M2D2M1D2M4D38M1D24M2D15M1D20M1D53M1D46M1D13M1I11M2D51M2D8M3D1M3D106M3D40M4D35M2I5M1D18M1D94M2D226M1D9M1D5M1D65M2D94M1D73M1D69M1D58M1D28M1D5M1D1M3D1M2D45M1D67M1I62M1D129M1D23M1D52M2D119M3D2M1D45M1D5M3D23M2D29M1D8M2D15M1D1M8D30M1D40M1I9M2D37M3D87M1D10M1I13M1D10M1D90M1D78M3D27M4D49M2D2M1D5M1D193M1I63M1D110M1D32M1I4M2I2M

Another thing, about the second project: the Oxford Nanopore dataset had around 70x coverage. Running with the default corOutCoverage=40 left 26.39x coverage after the correction step, and running with corOutCoverage=100 corMhapSensitivity=normal left 38.84x. Why was the coverage not 40 and 100, respectively? And in such a case, with around 26x coverage after correction, is it recommended to use a corOutCoverage greater than the default 40, like corOutCoverage=100? Thanks

skoren commented 7 years ago

There's always some loss in correction, typically around 30%, so you don't get exactly 40x out unless you have very good quality data. For example, selecting the longest 40x of raw reads and losing ~30% leaves roughly 28x of corrected reads, close to the 26.39x you saw. So it is OK to keep the defaults in most cases.

efratca commented 7 years ago

Thanks!