isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/racon
MIT License
268 stars 48 forks source link

Illumina paired end reads worsens results #128

Open TGatter opened 5 years ago

TGatter commented 5 years ago

We try to polish a rather fragmented genome assembly with racon. So far we are testing thing on Yeast S288C, so we have a decent reference to check for performance. We use very low coverage nanopore reads to create unpolished contigs. We attempted to polish the data with 150bp paired-end Illumina-Miseq reads.

We run the following pipeline at standard options:

minimap2 -ax sr [unpolished contigs] [Illumina 1] [Illumina 2] > [sam]
cat [Illumina 1] [Illumina 2]  > [Illumina Joined]
racon  [Illumina Joined] [sam] [unpolished contigs]  > [polished contigs] 

Using this setup, we see noticeable worsening of the results. We map against the reference using Quast. Before Racon: Before After Racon: After

Please notice how especially local mis-assemblies especially increase, but also translocations and relocations.

We would greatly appreciate your opinion on this effect.

rvaser commented 5 years ago

Hello,

could you please verify if your paired end reads have equal names up to the first white space?

Best regards, Robert

TGatter commented 5 years ago

Yes, this is the case.

File 1

[thomas@vodka experimental]$ head -n 12 /scr/k70san2/thomas/yeast_nanopore/YeastStrainsStudy/fastqs/miseq/s288c_1.fastq
@ERR1938683.1 MS5_18429:1:2113:11939:14368#1/1
GTGGTGTGTGGTGTGGTGTGTGGAGTGGTGGGGGGGTGGGGTGGGTGGGTGGGGAGTGGGGTGGTGAGGGGGTGTGGTGTGTGGGGGGGTGGGTGGGGAGGGGGGGGGGTGGGGGGGTGGGGGGGGGGGGTGGGGGGTGGGGGGGTGGGG
+
AAAAAAAA1CA1FFEEFFAEFGG101BF100A//<<---9@.99?99--.;-=;---9-9---9:-/--9---9--;--;9-9---;-;-----;----;--------9-@->-;---;----9-;----9-;9=---9-----------
@ERR1938683.2 MS5_18429:1:2106:27932:17792#1/1
GTGTGTGTGTGGGTGTGGTGTGGTGTGGGGGGTGGGGGGTGGGGGTGGGGGGGGGGGGCGGGGGGGATGTGGGGGCGGGTGGTGGTGGGGTGTGGGGGGGGGGGAGGGGGGGCGCGGGGATGCGCGAGGTGCGGTCGTGGGGTATGTTGT
+
AAAAAFFBCF>AEEEEEE0AAAE0AAE0B0//<E////9-;-A--;@--9@-9----;-9@@------:/9/--9-->-------99----9-----------;-9-------;-------9-9-9--9--/-----9----;-///9/-
@ERR1938683.3 MS5_18429:1:2104:19853:5516#1/1
CACCACACCCACACACCCACACACCACACCCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCCCACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTCCCCAACCTGTCTCTCAACTTA
+
BBB@BBBBBBBBGGGGFGGGGGEGGD222EEAECE2FGA1AEFGAE?FEGGCEAEGGGEEGDHAEEEE1GHHGGH11EE?/>EE/EEA/</BFCB2222FB??FC22GH2F/CA<C2<0?F1?1G0??///<0.<<001>1F=<G0<=0D

File 2:

[thomas@vodka experimental]$ head -n 12 /scr/k70san2/thomas/yeast_nanopore/YeastStrainsStudy/fastqs/miseq/s288c_2.fastq
@ERR1938683.1 MS5_18429:1:2113:11939:14368#1/2
ACACACCACACCCACACCCACACACCACAACCACGCCCCACACCCAAAAACAGACCAACACCACGCCCGCCGAACCACTACACAAAACACAAACCCCCACACCAACACAGCACCCCAACCATCAACAACACACCCCCACCCTCCTAAACC
+
AAAAAAAAAAAAAAEECEGGGG?E0EA0F00/AF//////AE///AFF////000?0?//B//?///>/////>>/0/10010B/F00/0/<0//?/////<///////0/00?/><....1>><01....<...---..:..:/;;0/9
@ERR1938683.2 MS5_18429:1:2106:27932:17792#1/2
CACACACCACACCCAACCCCAAACACCACCAATCCCCACAGAGGCACGAAGAGCAAAAGGGCCAGACACCGCGCACACCCCAACACAAAAAAACTCAGAAACAAGGTATCGAACAAAATAATTAGCAACATCTCGCAAATACGACACCAC
+
1>AA?1AAAAAAG1110A0A000000B0B00AB00B/B/A010/B/0//A//00AB100//////0/0BF/>///>/0?/>//>F///00?///B111B1100/0/112>//00/0?011111?1011000<110>-.-000...<-.;.
@ERR1938683.3 MS5_18429:1:2104:19853:5516#1/2
GTGGTTCGGAGTGGTATGGTTGAATGGGACAGGGTAACGAGTGGAGGCAGGGTAATGGAGGGTAAGTTGAGAGACAGGTTGGCCAGGGTTAGATTAGGGCTGTGTTAGGGTAGTGTTAGGATGTGTGTGTGTGGGTGTGGTGTGGTGGGG
+
ABABAABA?ADBCFGGGFFFGEFHHHCGHHGGGGFEGFAFAEFGEAEDFEGCFEBGHFGGGE?CFGGFH5CGHGGFHG3@GGF3FGHEGH1FBGHDGFHFAFHGFFFHGFB3?BFBB?EGFDG2BFGGFCGBGGG/?/?/0???/<?/A<
rvaser commented 5 years ago

Well that is probably the problem. Racon expects that all reads have unique identifiers. In your case, only half of your Illumina reads will be used in polishing. You should add something to the identifiers to distinguish them and then run minimap2 and racon again.

On the other hand, what coverage of long reads do you have? I would advise one iteration of long read polishing before short read polishing.

TGatter commented 5 years ago

Thank you. I indeed did not think about this problem. As it seems, minimap2 and racon both remove the /[0-9] suffix, as they consider only the id to the first space, and minimap2 will remove this in any case. However, this means it is not possible to run minimap2 in paired-end mode, but rather needs to be run as single-end on the joined file with changed ids. Is this correct or is there we workaround I do not see?

rvaser commented 5 years ago

Yes, you will have to run single-end mode. Not sure if you will lose much accuracy in comparison to paired-end mode.

TGatter commented 5 years ago

I now joined both Illumina files and assigned running ids for each read. The results now look better, but are overall still worse. This is consistent, independent of whether running a correction step with nanopore reads before or not.

After2

rvaser commented 5 years ago

Hmm, quite odd that it decreases the accuracy. Would you mind sharing the data so I can investigate locally what is happening? Did you maybe try any other polisher in the same setting?

TGatter commented 5 years ago

I have tested the same setup using PILON, seeing only improvements. I tested even the same alignment files for both Racon and PILON, Racon worsening results but PILON improving them.

The assembled contigs ("contigs.fa"), as well as the true reference of Yeast S288C can be found in the attached archive. sequences.tar.gz

The Illumina reads used can be downloaded here: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR193/003/ERR1938683/ERR1938683_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR193/003/ERR1938683/ERR1938683_2.fastq.gz

Thank you for your help.

rvaser commented 5 years ago

Thanks, will check it out!

rvaser commented 5 years ago

I have run your data with both Racon and Pilon, and evaluated the results with dnadiff (from the mummer package). The assembly is quite fragmented, but I am not sure what is causing the increase in breakpoints of Racon polished contigs even thought the number of SNPs and indels is smaller (and higher average identity). I tried different alignment parameters but the result is almost the same. I'll try inspecting the poa graphs to see what is happening.

Length 1-to-1 alignments Breakpoints SNPs Indels Avg Identity (%) Complete BUSCOs (%) Fragmented BUSCOs (%) Missing BUSCOs (%)
Layout 14859762 659 2090 55076 216744 96.51
Pilon 14998813 661 2130 21416 42462 98.54 4.2 3.8 92.0
Racon 14790137 851 3588 13644 35218 98.99 5.6 2.9 91.5

Were the long reads and the Illumina reads obtained from the same strain?

TGatter commented 5 years ago

Hi. Yes. The organism/strain is exactly the same (Saccharomyces cerevisiae s288c). The fragmentation of the contigs is due to our method in development that is still missing a few components. However, I do not understand how this could break racon. Thank your very much for spending time on this problem.

rvaser commented 5 years ago

I have updated the table above with BUSCO scores (found with saccharomyceta_odb9). It seems that the breakpoints do not have that much of an impact I guess, not sure though.

apredeus commented 5 years ago

Hello. I've recently ran racon/illumina polishing and seems like the latest racon version I have installed (1.3.3) is doing something strange with the assemblies. My bacterial long-read (flye) assembly got 25 kb shorter on the first round of polishing. I've done similar things before and changes are usually in the range of several hundred nt. And a bigger (600 Mb) genome assembled by flye or canu has number of BUSCOs go down substantially after polishing.

rvaser commented 5 years ago

Shorter assemblies after polishing are not that uncommon. Although, I am not sure why the BUSCO scores would drop. Did you try Pilon in the same setting?

apredeus commented 5 years ago

Yes, and the scores have increased. That's why I became fairly certain something is wrong.

rvaser commented 5 years ago

No idea why. Nothing special changed between 1.3.2 and 1.3.3. If you had PE reads, did you combine them as described above?

apredeus commented 5 years ago

Yes I did forget to replace the space in the concatenated file :( very embarrassing. Sorry for bothering you over nothing.

rvaser commented 5 years ago

Haha, no problem! :)