mcveanlab / mccortex

De novo genome assembly and multisample variant calling
https://github.com/mcveanlab/mccortex/wiki
MIT License

Overlapping paired-end reads #33

Closed tseemann closed 8 years ago

tseemann commented 8 years ago

I'm impressed by how cleanly mccortex installed and runs!

I got warnings like this one during read threading:

[16 Apr 2016 12:18:33-LOx][generate_paths.c:422] Warn: Reads may overlap in fragment: 151 + 151 > frag len min: 0; max: 1000

We get lots of overlapping PE reads from NextSeq and MiSeq due to suboptimal Nextera XT library prep.

Should I be concerned? Will this affect the results?

noporpoise commented 8 years ago

I've added a bit of information on the read threading wiki page.

If you don't merge the paired-end reads when they overlap, you'll see that very few read pairs have their insert gaps filled. This means you may lose a lot of the long-distance connectivity information that is in the reads. In some cases it may also increase the rate of errors in your graph links.

If you have a lot of overlapping read pairs and you can't merge them, I recommend using only single-ended reads in the threading stage. This will reduce your contig N50, but you'll make fewer assembly mistakes.
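For reference, the arithmetic behind the warning can be sketched as follows. This is only an illustration of the overlap condition as the message reads (two read lengths summing to more than the fragment length), not mccortex's actual code; the function names and the example fragment lengths are hypothetical.

```python
def insert_gap(read1_len: int, read2_len: int, fragment_len: int) -> int:
    """Bases left uncovered between the two reads of a pair.

    A negative value means the reads overlap by that many bases.
    """
    return fragment_len - (read1_len + read2_len)


def reads_overlap(read1_len: int, read2_len: int, fragment_len: int) -> bool:
    """True when the two reads together are longer than the fragment."""
    return insert_gap(read1_len, read2_len, fragment_len) < 0


if __name__ == "__main__":
    # 2 x 151 bp reads, as in the warning; fragment lengths are made up.
    for fragment_len in (250, 302, 500):
        gap = insert_gap(151, 151, fragment_len)
        state = "overlap" if reads_overlap(151, 151, fragment_len) else "no overlap"
        print(f"fragment={fragment_len} gap={gap} ({state})")
```

With 2 x 151 bp reads, any fragment shorter than 302 bp gives a negative gap, which is the case the warning is flagging.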

tseemann commented 8 years ago

It's always possible to merge them using PEAR or FLASH etc., and end up with some unmerged PE reads and the rest as merged SE reads. I guess I worry about PE merging with respect to short exact tandem repeats (e.g. CRISPR-style).
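A toy sketch of that worry: with an exact tandem repeat at the junction, more than one overlap length is consistent with the two reads, so the merged length is ambiguous. This is a naive exact-match scan, not how PEAR or FLASH score overlaps; the function name and sequences are hypothetical, and the second read is assumed to be already reverse-complemented.

```python
from typing import List


def candidate_overlaps(left: str, right: str, min_overlap: int = 4) -> List[int]:
    """Return every length k where the last k bases of `left`
    exactly match the first k bases of `right`."""
    max_k = min(len(left), len(right))
    return [k for k in range(min_overlap, max_k + 1) if left[-k:] == right[:k]]


if __name__ == "__main__":
    # Unique sequence at the junction: only one overlap fits.
    unique_left = "ACGTTGCAGGCT"
    unique_right = "GCAGGCTTTACA"
    print("unique:", candidate_overlaps(unique_left, unique_right))

    # Three copies of a 4 bp unit at the junction: several overlaps fit.
    repeat = "ACGT" * 3
    repeat_left = "TTGA" + repeat
    repeat_right = repeat + "CCAT"
    print("tandem repeat:", candidate_overlaps(repeat_left, repeat_right))
```

In the repeat case every multiple of the unit length up to the full repeat is an exact match, so a merger has to pick one and may collapse or expand the repeat count.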

Thanks for updating the wiki! Your docs are very thorough and they are helping me a lot to understand how to make use of mccortex.

michaelbarton commented 7 years ago

Piggybacking on this issue: do you have any intuition or benchmark data for how mccortex might perform as a single-genome assembler, for example in comparison with SPAdes? I looked in the benchmark folder and it appears to be focused on mixed samples, unless I have misinterpreted it.