marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
658 stars 179 forks source link

bad contigs being assembled that map to multiple chromosomes #1125

Closed ryandrosophila closed 5 years ago

ryandrosophila commented 6 years ago

I have 2 flowcells of nanopore data for a Drosophila species and I have been repeatedly assembling decent sized, but erroneous contigs that map to multiple chromosomes. To try to get to the bottom of this I have been using version 1.7 and various corrected error rates (0.122 - 0.166) and subsets of the data to change coverage. I typically use the -fast option. Using all the data is around 100x, reads >18kb is ~50x , and >20kb is ~30x. I also just tried version 1.8 with error 0.122 and >20kb reads and am still getting contigs that map to multiple chromosomes. Oddly, a minimap2/miniasm assembly of the exact same data produces better assemblies that just break at repeats (like expected for that assembly method) and map to single chromosomes. In other Drosophila species where I have had comparable data, I typically get near chromosome length contigs with canu.

example canu command:

canu -p Dsub_canu_v1.7_err_0.144_fast_18kb -d Dsub_canu_v1.7_err_0.144_fast_18kb -fast correctedErrorRate=0.144 genomeSize=150m useGrid=false -nanopore-raw run23_and_24_18kb.fastq.gz

When I map the nanopore reads to these canu assemblies and their contigs I find few, or no reads spanning the regions where the assembler is stitching together some of the unitigs. I have done a lot of canu assemblies with pacbio data and nanopore data and haven't ran into this before and I am at a loss. Any suggestions would be greatly appreciated.

Stats for a canu version 1.7, err 0.144, reads >18kb assembly: N50= 950,083 MAX CONTIG SIZE= 7,726,782 (but a bad contig) GENOME SIZE= 146,205,271 NUMBER OF CONTIGS= 570

Stats for a minimap2/miniasm assembly of >18kb data: N50= 2,435,266 MAX CONTIG LENGTH= 11,130,435 GENOME SIZE= 144,892,919 NUMBER OF CONTIGS= 518

I am not entirely sure what parts of the output would be most useful for diagnosing the issue so please let me know and I will attach.

Thanks for the help!!!

skoren commented 6 years ago

A couple of suggestions would be to use all the data and let Canu pick the best longest reads. By selecting longest reads by length you may be selecting for artifacts. You should also try the default instead of the -fast option to see how it compares. If you're able to share the data (see FAQ for sending it to us), we can run it locally as well.

ryandrosophila commented 6 years ago

Sounds great! I'll try without -fast and with all the data, it just takes a bit to get an answer because of computational limitations (24 cores, 126GB of memory). I have done runs with all the data (~100x and corrected error 0.144, -fast) and found I was still getting odd results. I also did two independent assemblies, one for each flowcell, and found them to be rather inconsistent with each other, and neither with particularly good stats. I can definitely share the data with you. I'll send it along. I thought it might be a data problem at first but the stats for the reads look okay and the minimap2/miniasm assembly looks fine.

Thanks for the help and quick reply!

ryandrosophila commented 6 years ago

Data has been transferred. It is named Dsub_run23_and_run24.fastq.gz

skoren commented 6 years ago

I ran a default assembly and got a max contig of 11mb and N50 of 2.7mb. What reference are you using to compare with? It seems quite diverged from D. melanogaster so the alignments aren't very informative. I also ran wtdbg and it got a similar assembly (max=7.7, N50=2.0mb) and the two assemblies show good structural agreement with each other.

ryandrosophila commented 6 years ago

Hmmm. Interesting! The closest species is D. guancha. The genome is here and there should be good agreement, just a number of inversions.

http://denovo.cnag.cat/genomes/dgua/

Your canu stats sound very similar (but better) than my minimap2/miniasm assembly. My canu assemblies without the -fast option are still running but are close. I tried version 1.6 and version 1.7 (24 cores). I am wondering if the -fast option could be causing this?

ryandrosophila commented 6 years ago

Is it possible for you to share the canu assembly you did? I am very interested in looking at it and comparing to my other assemblies.

skoren commented 6 years ago

You can download it here: https://obj.umiacs.umd.edu/sergek/shared/dNano.contigs.fasta

I do see some alignments of contigs between chromosomes but the identity is still pretty low (below 90%) so I'm not sure if these are mis-mapping in repeat areas or missing regions in the reference. The fast assembly is less continuous (N50 700kb) and doesn't have these cross-chromosome mappings as I would expect for any less continuous assembly. You could also use the unitigs which are less continuous and more conservative. Another option is to turn down the identity allowed by bogart, using the heterozygous parameters from the FAQ: batOptions="-dg 3 -db 3 -dr 1". I also posted that assembly: https://obj.umiacs.umd.edu/sergek/shared/dNano.conservative.fasta

That said, any assembly may have some errors so if you see a region that has low raw read support across it you could break there or use some other datatype like HiC or bionano to identify/fix errors.

ryandrosophila commented 6 years ago

Thanks so much for doing this! Yeah, I took a look at your assemblies and I am seeing the same problem that I was having before, just not as bad. The conserved assembly is better, but the largest contig still maps to multiple chromosomes. I don't think the problem is from repeats because the regions where things are breaking aren't overly repetitive in either species.

I took the canu corrected reads from one of my assemblies and mapped those to the D. guanche assembly and noticed that I seem to have a non-trivial number of chimeric reads in the corrected reads. Further, the longer reads are enriched for chimeric reads. These are pretty obviously chimeric reads because part maps cleanly to one chromosome and then the other part maps cleanly as a supplemental alignment to another chromosome. I don't know the source of the chimeric reads (library prep issues or sequencing issues).

I am trying to figure out a way to weed these reads out since I have a hunch these might be messing up the assembly. Thoughts? Would having 1-5% of error corrected long reads being chimeric artifacts really mess with the assembly?

Thanks again for all the help.

skoren commented 6 years ago

The trimmed reads are assumed to be correct so a single one is sufficient to build a contig. If you've got 1-5% of chimeric artifacts they can definitely confuse the assembly.

However, if you've got these artifacts in the corrected reads I expect they're also present in the raw reads as well. I'd trust several long reads all indicating the same chimera, I don't know a systematic error that would introduce cross-chromosomal chimera in nanopore sequencing. In that case, I'd probably trust the chimeric sequences over the old assembly.

ryandrosophila commented 6 years ago

Ok. I am thinking these chimeric reads must be part of the problem. They seem to be randomly distributed and don't show any pattern consistent with them being real. For example, if I look at my reads mapped to a genome, the chimeric reads will all have different start/stops and map to different chromosomes. So basically half the read maps to the correct chromosome and the other half maps to something else.

The weird thing is that when I look at the raw reads and compare those to the error corrected, only a subset of the chimeric reads make it through to the final error corrected reads set, but there doesn't seem to be much evidence that they should.

If there is a way to share images with you I will. I have a number of IGV screenshots showing the raw and corrected reads mapped to the D guanche genome and it would make clear what I am talking about.

Regardless, I am going to try to weed out these reads and see what the genome assembly looks like.

Thanks again for all the help.

skoren commented 6 years ago

Interesting, the correction/trimming is based on random error so if you have multiple chimera supporting the same junction they would likely be kept. Are there any or are they all randomly distributed? You should be able to post images by attaching them to this issue.

ryandrosophila commented 6 years ago

Here are a few of the pictures. The first shows the canu contigs compared to just one of the D. guanche chromosomes and highlights two missassemblies. The other pictures are IGV screenshots showing the corrected reads mapped to my canu assembly. The corrected reads were mapped using minimap2 and the heavily colored portions show where the read is softclipped. The softclipped portions often map to other chromosomes. You can see they tend to be somewhat random. I zoomed in on regions where the canu assembly seems to be putting together pieces of different chromosomes. Obviously, there is pretty low support for some of these junctions.

chro_coverage_in_canu_assembly png

break_point1

break_point2

break_point3

break_point4

skoren commented 5 years ago

The interesting thing about those regions is there are at least two corrected reads supporting all the joins. The default support is only one other read in Canu but it can be increased. I ran an assembly with some stricter parameters and here is the alignment plot I get. asm contigs

I don't see contigs crossing chromosomes so I am guessing these problematic reads are trimmed. The assembly is actually a bit more continuous as well (max=13.9mb, n50=4.8mb). The parameters I used are:

canu -d test -p 'asm' -pacbio-raw Dsub_run23_and_run24.fastq.gz corMhapSensitivity=normal genomeSize=150m correctedErrorRate=0.065 corMinCoverage=8 batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50 trimReadsCoverage=4 trimReadsOverlap=500

I think the only important ones are trimReadsCoverage=4 trimReadsOverlap=500 but I haven't tested that yet.

skoren commented 5 years ago

As I suspected, the assembly with defaults:

canu -p asm -d test -nanopore-raw Dsub_run23_and_run24.fastq.gz correctedErrorRate=0.065 trimReadsCoverage=4 trimReadsOverlap=500genomeSize=150m

also seems to produce good contigs. The trimming is too permissive given your read chimera rate. The stats are a bit worse (max=9.7mb N50=3.4mb) than the previous one I posted. I posted both the default and the above assemblies here: https://obj.umiacs.umd.edu/sergek/shared/dNano.tar.gz

We've wanted to revisit trimming in the next Canu release and this will be a useful test case.

skoren commented 5 years ago

I'm going to close this and track through our project as a feature for the next release (trimming tuning). If you do find out why this dataset seems to have generated a higher than normal rate of chimeric reads, let us know.

ryandrosophila commented 5 years ago

Sounds great! It definitely seems like your recommendations fixed the issue. I think what might be causing the increase in the number of chimeric reads in our data was that the person who was preparing the libraries extended the adapter ligation time which might have caused some ligation issues. It looks like the longer the read, the more likely it is to be chimeric and interestingly, there isn't adapter in the middle of these chimeric reads, so Poretools doesn't find them or split them.

Thanks again for all your help!!!! I really appreciate it!

On Tue, Nov 27, 2018 at 6:45 AM Sergey Koren notifications@github.com wrote:

I'm going to close this and track through our project as a feature for the next release (trimming tuning). If you do find out why this dataset seems to have generated a higher than normal rate of chimeric reads, let us know.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/marbl/canu/issues/1125#issuecomment-442083683, or mute the thread https://github.com/notifications/unsubscribe-auth/Aj2krPSuHrgM_X1iPNPFdJDPgbdMhDTrks5uzVAZgaJpZM4X6xRi .