mahulchak / quickmerge

A simple and fast metassembler and assembly gap filler designed for long molecule based assemblies.
GNU General Public License v3.0

sometimes quick merge does not merge contigs? #67

Open zmz1988 opened 2 years ago

zmz1988 commented 2 years ago

Dear developers, thanks a lot for writing this nice tool! I frequently use quickmerge to merge my assemblies from PacBio and Nanopore. Most of the time I see a big improvement in NG values, but sometimes quickmerge doesn't seem to merge the query contigs, even though no errors were reported and all output files were generated.

For example, I used a Nanopore assembly (NG50 ~ 10M) as the reference to merge contigs from PacBio assemblies (NG50 ~ 4M). In the failed cases, the resulting merged assembly has the same NG50 value (or differs by only a few kbp) and the same number of contigs as the query assembly, even when the parameter -l is set to the N50 value of the reference (Nanopore). In such a case, if I lower the -l value significantly, say to 2M, the contiguity of the resulting assembly improves. But I'm somewhat hesitant to use the merged assemblies generated with a lower -l value...
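For readers unfamiliar with the metrics above, here is a minimal sketch of how N50 and NG50 can be computed from a list of contig lengths. The function name `nx50`, the contig lengths, and the genome-size estimate are all made up for illustration; none of this is part of quickmerge.

```python
def nx50(lengths, total=None):
    """Return the N50 of `lengths`, or the NG50 when `total` is a genome-size estimate."""
    if total is None:
        total = sum(lengths)  # N50: measure against the assembly's own size
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # The first contig that pushes the running sum to half of `total` is the answer.
        if 2 * running >= total:
            return length
    return 0  # the contigs cover less than half of `total`


# Hypothetical contig lengths in bp, for illustration only.
contigs = [10_000_000, 6_000_000, 4_000_000, 2_000_000, 1_000_000]

n50 = nx50(contigs)                      # measured against the 23 Mb assembly
ng50 = nx50(contigs, total=40_000_000)   # measured against an assumed 40 Mb genome
print(n50, ng50)
```

Because NG50 is measured against a fixed genome size rather than the assembly size, it is the more comparable number when two assemblies of the same genome differ in total length.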

I have merged around 8 genomes, three of which were failed cases. I don't know where the problem could be, since the contigs seem to align well between the Nanopore and PacBio assemblies when I align them with MUMmer outside of quickmerge. Could you please give me some hints about where the problem could be, or how I should deal with it?

Thanks a lot in advance!

mahulchak commented 2 years ago

Hi, from your description, I don't see a 'problem' here. Using the contig N50 as the seed length is more of a rule of thumb, mentioned in the Readme to guard against spurious merging. However, it does not necessarily mean that lowering that cutoff will always lead to false merges. If you're merging with a lower cutoff (2M) than your contig N50 and you trust the merging events, then that is okay.

-- Mahul Chakraborty Department of Ecology and Evolutionary Biology University of California-Irvine Phone: 949 824 9559 Fax: 949 824 9559 Website: https://mahulchakraborty.wordpress.com/ Github: https://github.com/mahulchak

zmz1988 commented 2 years ago

Hi, thanks for your reply. Yes, I found that lowering the cutoff to 2M doesn't introduce more duplicated sequence, so that's fine. But I recently realised that the places that can't be merged are mostly heterozygous regions. For example, the reference assembly has seq1 (haplotype A) + seq2 (haplotype A) in one contig, whereas the query assembly has contig1 (haplotype A) and contig2 (haplotype B). The reference assembly does retain the alternative allele of seq2 as a small separate contig in the whole-genome file, and it can be aligned to contig2 (haplotype B) in the query assembly. Even so, contig1 and contig2 from the query assembly are still not merged, because there is no hint about where haplotype B should be placed.
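The duplication check mentioned at the start of this comment can be roughly sketched as a before/after comparison of assembly summaries. This assumes contig lengths have already been extracted from the query and merged FASTA files. The helper names and the numbers are hypothetical, and growth in total length is only a crude proxy for duplicated sequence, not a substitute for an alignment-based check.

```python
def summarize(lengths):
    """Basic summary of an assembly given its contig lengths in bp."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
    }


def compare(before, after):
    """Report how an assembly changed between two sets of contig lengths."""
    b, a = summarize(before), summarize(after)
    return {
        # Positive when contigs were joined during merging.
        "contigs_joined": b["contigs"] - a["contigs"],
        # A large increase may hint at duplicated sequence (crude proxy only).
        "total_bp_change": a["total_bp"] - b["total_bp"],
        "longest_change": a["longest"] - b["longest"],
    }


# Hypothetical contig lengths in bp, for illustration only.
query = [4_000_000, 3_000_000, 2_000_000]
merged = [7_050_000, 2_000_000]

report = compare(query, merged)
print(report)
```

If the contig count drops while the total length stays close to the original, the merges at least did not inflate the assembly; a jump in total length would be a reason to inspect the alignments more closely.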

I'm not sure how I can solve this problem without generating a phased assembly (our species is highly inbred). But I do have quite a few gaps for this reason, even though the reference genome is nearly gapless, albeit without a high QV. Do you think we could make use of a GFA file in this case?

mahulchak commented 2 years ago

I have not really experimented with GFA files in this context. I will have to think about it.