mahulchak / quickmerge

A simple and fast metassembler and assembly gap filler designed for long molecule based assemblies.
GNU General Public License v3.0
192 stars 31 forks source link

Where have all the buscos gone? #28

Open EarlyEvol opened 6 years ago

EarlyEvol commented 6 years ago

Hi,

While trying to decide which order I should merge assemblies, I realized there is something happening with quickmerge which I don't understand.

As you recommended on the wiki, I used the "best" assembly as the query. (best was based just on n50 and busco content). After merging, the number of missing busco genes goes up.

merge_wrapper.py query.fasta ref.fasta -l 90000 -lm 5000

query: C:94.5%[S:93.1%,D:1.4%],F:3.3%,M:2.2%,n:1658

ref: C:92.6%[S:91.9%,D:0.7%],F:2.2%,M:5.2%,n:1658

quick_merged: C:94.0%[S:90.4%,D:3.6%],F:1.4%,M:4.6%,n:1658

I looked in the full output table from busco to get a handle on what is happening. There are indeed genes that are complete in the query assembly but are absent in the merged assembly. Basically all transitions between complete, duplicated, fragmented and missing are occurring during the merging process.

From your publication and your explanation in issue #22, it seems that it should be impossible to lose genes from the query if the reference sequences are only used in gaps to stitch together query contigs, and unaligned query contigs make it to the merged.fasta. Am I missing something?

Thanks, Earl

esolares commented 6 years ago

Hi

Thank you for your email. Have you tried switching the query and reference? Also, did you try one round or two rounds of merging?

Thank you,

Edwin

On Tue, May 8, 2018, 5:07 PM BurlEarl notifications@github.com wrote:

Hi,

While trying to decide which order I should merge assemblies, I realized there is something happening with quickmerge which I don't understand.

As you recommended on the wiki, I used the "best" assembly as the query. (best was based just on n50 and busco content). After merging, the number of missing busco genes goes up.

merge_wrapper.py query.fasta ref.fasta -l 90000 -lm 5000

query: C:94.5%[S:93.1%,D:1.4%],F:3.3%,M:2.2%,n:1658

ref: C:92.6%[S:91.9%,D:0.7%],F:2.2%,M:5.2%,n:1658

quick_merged: C:94.0%[S:90.4%,D:3.6%],F:1.4%,M:4.6%,n:1658

I looked in the full output table from busco to get a handle on what is happening. There are indeed genes that are complete in the query assembly but are absent in the merged assembly. Basically all transitions between complete, duplicated, fragmented and missing are occurring during the merging process.

From your publication and your explanation in issue #22 https://github.com/mahulchak/quickmerge/issues/22, it seems that it should be impossible to lose genes from the query if the reference sequences are only used in gaps to stitch together query contigs, and unaligned query contigs make it to the merged.fasta. Am I missing something?

Thanks, Earl

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/mahulchak/quickmerge/issues/28, or mute the thread https://github.com/notifications/unsubscribe-auth/AEI6vlyD4qdfHORCHR2ExaWTsHZnq18Iks5twjKmgaJpZM4T3g65 .

EarlyEvol commented 6 years ago

Edwin, Thanks for the speedy reply!

I have merged in both directions. Each time busco genes are lost similar to above. basically the merged.fasta is always in the middle of the the query/ref assemblies when it comes to missing busco genes. I think there is something fundamental about QM that I don't understand. Am I correct in thinking the unaligned query contigs are added to the merged.fasta file? Also having a gene go from complete in the query to fragmented in merged.fasta is hard to explain. Is QM breaking query contigs? Do you mean the 2 rounds mentioned in second part of the wiki? i.e. (asm1 + asm2) + asm 1(2). I have not tried that. I'll give it a try after I get a handle on whats happening in the first round.

Earl

EarlyEvol commented 6 years ago

Well, I have figured out part of my confusion. In the QM pub, it states "In the final step, the ordered chain of contigs found in the previous step is joined by swapping portions of the reference assembly into the query assembly in a manner that maximizes retention of sequences from the reference assembly (Figure 4A)." but in the wiki (and elsewhere) it states "So the merged assembly receives the most sequences from the query assembly, and the reference assembly provides only the sequences that bridge gaps in the query assembly."

Hmm, either way, one merge direction should have >= #complete BUSCOs right? (which I am not seeing)

mahulchak commented 6 years ago

If there are misassemblies in either of the two assemblies, part of a misassembled contig can get discarded. This is because splicing points are at the locations where correspondence between the two assemblies (aligned contigs from two assemblies) end. If a misassembled segment lies outside that aligned homologous chunk, it may not make into the final merged assembly. You can try the second round to see if some of the missed busco are gained back.

On Tue, May 8, 2018, 21:57 BurlEarl notifications@github.com wrote:

Well, I have figured out part of my confusion. In the QM pub https://academic.oup.com/nar/article/44/19/e147/2468393, it states "In the final step, the ordered chain of contigs found in the previous step is joined by swapping portions of the reference assembly into the query assembly in a manner that maximizes retention of sequences from the reference assembly (Figure 4A)." but in the wiki (and elsewhere) it states "So the merged assembly receives the most sequences from the query assembly, and the reference assembly provides only the sequences that bridge gaps in the query assembly."

Hmm, either way, one merge direction should have >= #complete BUSCOs right? (which I am not seeing)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/mahulchak/quickmerge/issues/28#issuecomment-387596272, or mute the thread https://github.com/notifications/unsubscribe-auth/AHMD6F7DFx4Pf17D1lfcfw4rZtjUBviQks5twkzAgaJpZM4T3g65 .

EarlyEvol commented 6 years ago

hmm...could you speak a little more about how this process works?
If there are inconsistencies between assemblies, i.e. contig QueryA and contig RefA align for the first half of their length but not the second half, which takes precedence? Is the second half of RefA clipped and overlap expansion continues, is it the other way around? Or is it based on something other than Q vs. R? Are the clipped sections of contigs allowed to be used in other overlap paths, or are they removed from the available pool?

In the wiki, you mention breaking unreliable contigs before merging. Do you know of a script to do this on a large scale? Like take .delta and .rq.delta files and point out inconsistent? I still have 10K contigs and don't think I can do it by hand. If there isn't one, I think I could write a program to take one-one alignments of Query and Ref, then note where they stop in the middle of a contig (as you mention above). After identifying inconsistencies, using bedtools to print split contigs should be easy.

Thanks, Earl

EarlyEvol commented 6 years ago

Edwin and Mahul,

Also, the large amount of duplicated BUSCOs after running QM is a little disconcerting, and like #15, every run of QM seems to increase the duplication rate. The duplicates are mostly on different contigs, so they aren't false inverted repeats caused by PB adapter skipping. The majority of duplicates have one copy on a contig named for the query and the second on a contig named from the reference. So, if my understanding is correct, one participated in merging and the other didn't.

Thanks, Earl

mahulchak commented 6 years ago

Hi Earl, We don't have a formal script/software to break unassembled contigs. I have thought about writing it but couldn't manahe time to write one. If you are able to come up with a script, please let us know.

The duplication issue could be a bug. What is your ml value ? Also, would you be able to send me a query contig and the reference contig that resulted in a duplicate-increasing merged contig? I'm currently traveling so please excuse my slow responses. Mahul

On Thu, May 10, 2018, 21:56 BurlEarl notifications@github.com wrote:

Edwin and Mahul,

Also, the large amount of duplicated BUSCOs after running QM is a little disconcerting, and like #15 https://github.com/mahulchak/quickmerge/issues/15, every run of QM seems to increase the duplication rate. The duplicates are mostly on different contigs, so they aren't false inverted repeats caused by PB adapter skipping. The majority of duplicates have one copy on a contig named for the query and the second on a contig named from the reference. So, if my understanding is correct, one participated in merging and the other didn't.

Thanks, Earl

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/mahulchak/quickmerge/issues/28#issuecomment-388236394, or mute the thread https://github.com/notifications/unsubscribe-auth/AHMD6FFylTawX2NXkZYjXJroC0dKTHrbks5txO-WgaJpZM4T3g65 .

EarlyEvol commented 6 years ago

Alright, I put a post on Biostars asking if anyone has written one. Seems like it should have been done before, especially with people getting all sorts of different data types and using different assemblers. If I don't hear anything, hopefully I'll get time to write something to do this soon.

The -ml value I used was 5000 I'll try some higher values and see if that removes some of the duplicates.

The alignment of the whole gene sequences shows 100% match...so not real. I emailed the contigs and a bed file with the duplicated genes locations to your UCI address.

Earl

ViriatoII commented 4 years ago

Hey, I have the same problem.

I have 90 missing Buscos in relation to QUERY (10x assembly) and 130 in relation to REFERENCE (canu assembly). Switching reference and query doesn't help, neither does a second round of quickmerge.

So is there a way of recovering these lost segments? Did you solve your problem, @EarlyEvol ?

Here I searched the missing buscos of the hybrid assembly in the original query (10x):

# Busco id  Status  Contig  Start   End Score   Length
1   Complete    196 789843  813344  10268.5 4703
10148   Missing
10182   Complete    604 110438  116766  725.8   418
10188   Complete    295 316700  318765  989.5   453
10280   Complete    130984  15058   32477   875.5   456
10522   Complete    582 59007   64776   941.0   443
(...) it goes on (...)

I could just copy these contigs into my hybrid assembly, but genes that are unkown by busco will be lost.

mahulchak commented 4 years ago

It is hard to say what really happened.What parameters did you use for c, hco, l,ml ? Can you take a look at those contigs in the canu vs 10x nucmer alignment (the one that quickmerge used) and see if you see any inconsistencies between the two assemblies?

On Wed, Jul 31, 2019, 17:34 ViriatoII notifications@github.com wrote:

Hey, I have the same problem.

Does it make sense that genes are lost in relation to the query?? I have 90 missing Buscos in relation to QUERY (10x assembly) and 130 in relation to REFERENCE (canu assembly).

Here I searched the missing BUSCOS of the hybrid assembly in the original query (10x):

Busco id Status Contig Start End Score Length

1 Complete 196 789843 813344 10268.5 4703 10148 Missing 10182 Complete 604 110438 116766 725.8 418 10188 Complete 295 316700 318765 989.5 453 10280 Complete 130984 15058 32477 875.5 456 10522 Complete 582 59007 64776 941.0 443 (...) it goes on (...)

Did you solve your problem, @EarlyEvol https://github.com/EarlyEvol ?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mahulchak/quickmerge/issues/28?email_source=notifications&email_token=ABZQH2BFYUQJZIKLANKBTELQCF5ONA5CNFSM4E66B242YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3HARXA#issuecomment-516819164, or mute the thread https://github.com/notifications/unsubscribe-auth/ABZQH2FONKPODTISB2O2RE3QCF5ONANCNFSM4E66B24Q .

esolares commented 4 years ago

Hi,

My apologies for the late reply. We have a colleague from the NIH that tested the merging of a 10x assembly and a pacbio assembly. The result of this merge caused exactly what you mentioned in your email and worse. Since 10x is filled with N's and the # of N's can also be inaccurate, you get regions from the pacbio assembly which are replaced by N's and at times also concatenated, thereby causing a drop in BUSCO's. quickmerge assumes the alignments are between actual sequences and not scaffolds. If you would like to use 10x data (which I do not recommend), you could try eliminating the N's from the 10x assembly and then use positional information from the original 10x assembly to perform scaffolding after merging. I still however do not recommend this, because you could be introduce many missamblies from the 10x assembly. Our colleague found a large amount of discordance between the two assemblies that could cause serious issues in the subsequent merges using quickmerge. You can investigate this by converting your delta file into an alignment file either via mummer or a script that can convert a delta file to a sam file. If you require further assistance with this, please feel free to ask.

Please let me know if you have any other questions.

Thank you,

Edwin Solares, M.S. PhD Candidate in Comparative Genomics and Evolutionary Biology Department of Ecology and Evolutionary Biology Gaut Lab 5438 McGaugh Hall University of California, Irvine Irvine, CA 92697 USA

On Wed, Jul 31, 2019 at 10:44 AM Mahul Chakraborty notifications@github.com wrote:

It is hard to say what really happened.What parameters did you use for c, hco, l,ml ? Can you take a look at those contigs in the canu vs 10x nucmer alignment (the one that quickmerge used) and see if you see any inconsistencies between the two assemblies?

On Wed, Jul 31, 2019, 17:34 ViriatoII notifications@github.com wrote:

Hey, I have the same problem.

Does it make sense that genes are lost in relation to the query?? I have 90 missing Buscos in relation to QUERY (10x assembly) and 130 in relation to REFERENCE (canu assembly).

Here I searched the missing BUSCOS of the hybrid assembly in the original query (10x):

Busco id Status Contig Start End Score Length

1 Complete 196 789843 813344 10268.5 4703 10148 Missing 10182 Complete 604 110438 116766 725.8 418 10188 Complete 295 316700 318765 989.5 453 10280 Complete 130984 15058 32477 875.5 456 10522 Complete 582 59007 64776 941.0 443 (...) it goes on (...)

Did you solve your problem, @EarlyEvol https://github.com/EarlyEvol ?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub < https://github.com/mahulchak/quickmerge/issues/28?email_source=notifications&email_token=ABZQH2BFYUQJZIKLANKBTELQCF5ONA5CNFSM4E66B242YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3HARXA#issuecomment-516819164 , or mute the thread < https://github.com/notifications/unsubscribe-auth/ABZQH2FONKPODTISB2O2RE3QCF5ONANCNFSM4E66B24Q

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mahulchak/quickmerge/issues/28?email_source=notifications&email_token=ABBDVPTLBVMVGOXD3AZS7Z3QCHFOHA5CNFSM4E66B242YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3IAS7A#issuecomment-516950396, or mute the thread https://github.com/notifications/unsubscribe-auth/ABBDVPXMZLXAEDUDZJGMOSTQCHFOHANCNFSM4E66B24Q .

ViriatoII commented 4 years ago

Hey, thank you both for the useful answers! @mahulchak , I used all default parameters. Here is the mummerplot of the quickmerge run:

mumuumum

So, I understand that a 10x assembly is not a good input for quickmerge. @mahulchak , do you confirm this?

And @esolares , do you then have an ideal way on how to combine the strengths of 10x and pacbio to create a better assembly?

Really appreciate your help, Ricardo

esolares commented 4 years ago

My ideas for combining both 10x and pacbio, would be to first take the 10x assembly and convert to contig form (i.e. removing the N's). then performing quickmerge on this. I don't think it will help much, but you can try. The second, would be to take the 2x quickmerged assembly and scaffold it against the 10x assembly, for filling gaps.

Please let me know if yo have any questions.

Thank you,

Edwin Solares, M.S. PhD Candidate in Comparative Genomics and Evolutionary Biology Department of Ecology and Evolutionary Biology Gaut Lab 5438 McGaugh Hall University of California, Irvine Irvine, CA 92697 USA

On Fri, Aug 2, 2019 at 6:56 AM ViriatoII notifications@github.com wrote:

Hey, thank you both for the useful answers! @mahulchak https://github.com/mahulchak , I used all default parameters. Here is the mummerplot of the quickmerge run:

[image: mumuumum] https://user-images.githubusercontent.com/25958099/62374731-dd7af980-b53c-11e9-8675-d705d08b1b25.png

So, I understand that a 10x assembly is not a good input for quickmerge. @mahulchak https://github.com/mahulchak , do you confirm this?

And @esolares https://github.com/esolares , do you then have an ideal way on how to combine the strengths of 10x and pacbio to create a better assembly?

Really appreciate your help, Ricardo

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mahulchak/quickmerge/issues/28?email_source=notifications&email_token=ABBDVPXHE2QEQPJPINC52VLQCQ4E7A5CNFSM4E66B242YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3N2A7I#issuecomment-517709949, or mute the thread https://github.com/notifications/unsubscribe-auth/ABBDVPR7LU5RDZN7EIWXAS3QCQ4E7ANCNFSM4E66B24Q .