mahulchak / quickmerge

A simple and fast metassembler and assembly gap filler designed for long molecule based assemblies.
GNU General Public License v3.0

Unexpected sequence in merged fasta #53

Closed (mrvollger closed this issue 4 years ago)

mrvollger commented 4 years ago

Hi,

I am merging two assemblies and finding that the merged output is dominated by the reference rather than the query, which is not what I expected based on the documentation.

Below I am showing my contig(s) aligned to chrX in hg38 for both my reference and query. The contigs are colored in alternating blue and orange bars.

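My reference looks like: https://user-images.githubusercontent.com/6935283/80296409-d7aeef00-872f-11ea-8d86-a89b634f9b2e.png

My query looks like: https://user-images.githubusercontent.com/6935283/80296421-ef867300-872f-11ea-96f2-2c20b079c45f.png

And my merge looks like: https://user-images.githubusercontent.com/6935283/80296444-2fe5f100-8730-11ea-9d83-73e9b4da23d6.png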

This looks great to start; however, if I align my merged sequence back to the reference sequence, I find that the bases match perfectly, implying that none of the bases from the query were used. I thought that the reference would only be used to bridge the gaps in the query, not to totally replace the smaller contigs from the query. Is this the expected behavior?

Thanks! Mitchell

mahulchak commented 4 years ago

Hi Mitchell, sorry for the confusion, but that is expected. Please see Fig. 4 in https://academic.oup.com/nar/article/44/19/e147/2468393. We gave priority to the reference because the reference (often from Canu or a similar assembler) used to be more accurate. If you want to use the query sequence in your final assembly, you may have to chop the reference contig into smaller fragments (so that it is less contiguous than the query) and then use that reference as a query.
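The chopping step suggested above could be done along these lines. This is only a minimal sketch and not part of quickmerge: the function names (`read_fasta`, `chop_contigs`, `write_fasta`, `chop_fasta_file`), the `_frag` naming scheme, and the fixed-length splitting are all assumptions made for illustration. The fragment length `max_len` would have to be chosen by the user so that the chopped assembly ends up less contiguous than the other one.

```python
from typing import Iterable, Iterator, TextIO, Tuple

Record = Tuple[str, str]  # (contig name, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the contig name.
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def chop_contigs(records: Iterable[Record], max_len: int) -> Iterator[Record]:
    """Split every contig into consecutive fragments of at most max_len bases.

    Fragment names are hypothetical: "<contig>_frag<index>".
    Contigs with an empty sequence produce no fragments.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    for name, seq in records:
        for index, start in enumerate(range(0, len(seq), max_len)):
            yield f"{name}_frag{index}", seq[start:start + max_len]


def write_fasta(records: Iterable[Record], handle: TextIO, width: int = 60) -> None:
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for name, seq in records:
        handle.write(f">{name}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")


def chop_fasta_file(in_path: str, out_path: str, max_len: int) -> None:
    """Read a FASTA file, chop its contigs, and write the fragments out."""
    with open(in_path) as src, open(out_path, "w") as dst:
        write_fasta(chop_contigs(read_fasta(src), max_len), dst)
```

For example, `chop_fasta_file("reference.fasta", "reference_chopped.fasta", 1_000_000)` (file names and length are placeholders) would write fragments of at most 1 Mb, and the chopped file could then be supplied in the query role as described above. Splitting at fixed offsets is the simplest choice; it ignores where the two assemblies actually disagree, so the fragment length may need some trial and error.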

--
Mahul Chakraborty
Department of Ecology and Evolutionary Biology
University of California-Irvine
Phone: 949 824 9559
Fax: 949 824 9559
Website: https://mahulchakraborty.wordpress.com/
Github: https://github.com/mahulchak

mrvollger commented 4 years ago

I should have looked at the paper; thanks for the clarification.

I like your suggestion and I may try it. Thanks!

You might consider changing/clarifying this line on your wiki page:

"So the merged assembly receives the most sequences from the query assembly, and the reference assembly provides only the sequences that bridge gaps in the query assembly."

Cheers, Mitchell