tanghaibao / jcvi

Python library to facilitate genome assembly, annotation, and comparative genomics
BSD 2-Clause "Simplified" License

Wrong orientation with Linkage Map #231

Open francicco opened 4 years ago

francicco commented 4 years ago

Dear @tanghaibao,

I'm experiencing problems using AllMaps with a linkage map (LM). In some instances, scaffolding with the LM in AllMaps produces the wrong orientation, resulting in artifactual inversions. Here I'll show you one.

[Screenshot: Screen Shot 2020-04-03 at 11 47 28]

In this chr reconstruction, I have reason to believe that at least the two larger scaffolds were misoriented in the final scaffolding. That happens in other chrs too.

I also got a false chr fusion:

[Screenshot: Screen Shot 2020-04-03 at 11 50 37]

I think there's a problem in the way AllMaps interprets the information in the LM. Would you help me with this issue? I'd like to keep using AllMaps and avoid switching to other software, such as Lep-Anchor, for many reasons.

Thanks a lot, F

tanghaibao commented 4 years ago

@francicco

In both cases, you have a single contig that appears to be chimeric when compared to the linkage maps. Remember that ALLMAPS will not attempt to make a split within the contig.

So in short, these 'misorientations' or artifacts already exist in that input contig. Since the main design goal of ALLMAPS is to find the order and orientation between different contigs and not within, it will not automatically correct it.

If you insist on correcting this, however, you have to split your input contigs. I don't think it can do anything about the first case (203003), but for the second case (209001) you can follow the instructions here, or do some scripting yourself to split the sequences prior to ALLMAPS.
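As a rough illustration of the "do some scripting yourself" route, here is a minimal Python sketch that cuts one contig at a chosen breakpoint and moves its markers onto the two new pieces. Everything in it is an assumption for illustration: the marker tuple layout `(contig, position, linkage group, genetic position)`, the `_a`/`_b` suffixes, and the function names are hypothetical, and choosing the breakpoint itself is left to you. It is not part of ALLMAPS.

```python
from typing import Dict, List, Tuple

# Hypothetical marker record: (contig, 1-based position, linkage group, genetic position)
Marker = Tuple[str, int, str, float]


def split_contig(seq: str, breakpoint: int, name: str) -> Dict[str, str]:
    """Cut seq after base `breakpoint` (1-based); return the two named pieces."""
    if not 0 < breakpoint < len(seq):
        raise ValueError("breakpoint must fall strictly inside the contig")
    return {f"{name}_a": seq[:breakpoint], f"{name}_b": seq[breakpoint:]}


def remap_markers(markers: List[Marker], name: str, breakpoint: int) -> List[Marker]:
    """Reassign markers on contig `name` to the piece that now carries them."""
    remapped = []
    for contig, pos, lg, cm in markers:
        if contig != name:
            remapped.append((contig, pos, lg, cm))  # other contigs are untouched
        elif pos <= breakpoint:
            remapped.append((f"{name}_a", pos, lg, cm))
        else:
            # positions on the right-hand piece restart at 1
            remapped.append((f"{name}_b", pos - breakpoint, lg, cm))
    return remapped
```

The split sequences and the remapped markers would then be written back out in whatever formats your ALLMAPS run already uses, so that each piece can be ordered and oriented independently.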

Haibao

francicco commented 4 years ago

Wait a minute @tanghaibao, those inverted contigs are single contigs. This is the synteny after the LM scaffolding. I cannot show you the LM dot plot because AllMaps fails to plot it. Does that make sense? F

francicco commented 4 years ago

I've been talking with Pasi Rastas, the guy who developed Lep-MAP3. He told me: "Allmap's default behavior is not very good; for some reason it does not orient all contigs even if the data is clear", and he continued: "Obviously I am recommending Lep-Anchor to be used but allmaps can be made work better by sampling multiple maps from intervals output by Lep-MAP3 (see Lep-Anchor paper). Just make 10 independent samples (by sampling one position within each interval) and give them all to allmaps."

So the point is that I honestly don't know how to do the "sampling". I'm not familiar with this methodology at all, but maybe you can help me? I'm kind of at a dead end, since Lep-Anchor is not very user-friendly.
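For readers with the same question, one possible reading of "sampling one position within each interval" is sketched below in Python. This is only a guess at the idea, not the Lep-Anchor procedure: the interval record `(contig, position, linkage group, low, high)` is a hypothetical layout (the actual Lep-MAP3 output format is not shown in this thread), and drawing uniformly within each interval is an assumption.

```python
import random
from typing import List, Tuple

# Hypothetical interval record: (contig, position, linkage group, low cM, high cM)
Interval = Tuple[str, int, str, float, float]
# One sampled marker: (contig, position, linkage group, genetic position)
Marker = Tuple[str, int, str, float]


def sample_maps(intervals: List[Interval], n_maps: int = 10, seed: int = 0) -> List[List[Marker]]:
    """Draw n_maps independent maps, picking one genetic position per interval."""
    rng = random.Random(seed)  # fixed seed so the samples are reproducible
    maps = []
    for _ in range(n_maps):
        sampled = [
            (contig, pos, lg, rng.uniform(low, high))  # assumed: uniform within the interval
            for contig, pos, lg, low, high in intervals
        ]
        maps.append(sampled)
    return maps
```

Each of the resulting maps would be saved as its own input map and all of them given to ALLMAPS together, as the quoted suggestion describes.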

I'd appreciate your help a lot. I'm spending so much time trying to fix this assembly, and it's a shame because the data is there to make it good!

Best, F

tanghaibao commented 4 years ago

@francicco

In both of these cases, ALLMAPS did nothing other than let the 'chimeric' contig pass through; in other words, it didn't join any sequences together. As I said earlier, if the input has errors within a contig, ALLMAPS can't do anything to correct them.

If you have contigs A, B, and C, ALLMAPS will order and orient them into a form like ABC. If you have an inversion within contig A, ALLMAPS can't correct it and will output the same A as before.
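The point about inversions inside a contig can be illustrated with a toy Python sketch. It is not the ALLMAPS algorithm; it simply votes on a contig's orientation from pairs of markers and reports the fraction of pairs that disagree with the majority. The function names and the `(physical position, genetic position)` marker layout are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

# Hypothetical marker layout on a single contig: (physical position, genetic position)
Marker = Tuple[int, float]


def orient_contig(markers: List[Marker]) -> Tuple[str, float]:
    """Return ('+', '-' or '?', fraction of marker pairs that contradict the majority)."""
    concordant = discordant = 0
    for (p1, g1), (p2, g2) in combinations(sorted(markers), 2):
        if p1 == p2 or g1 == g2:
            continue  # ties carry no orientation information
        if g2 > g1:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    if total == 0:
        return "?", 0.0
    conflict = min(concordant, discordant) / total
    if concordant == discordant:
        return "?", conflict
    return ("+" if concordant > discordant else "-"), conflict


def flip(markers: List[Marker], length: int) -> List[Marker]:
    """Reverse-complement the contig: mirror every physical position."""
    return [(length - p + 1, g) for p, g in markers]


# A contig whose second half is inverted relative to the linkage map.
contig_a = [(100, 1.0), (200, 2.0), (300, 3.0), (400, 6.0), (500, 5.0), (600, 4.0)]
print(orient_contig(contig_a))
print(orient_contig(flip(contig_a, 700)))
```

Flipping the whole contig changes the reported strand from `+` to `-`, but the conflict fraction stays at 0.2 in both cases: no choice of order or orientation for the intact contig removes the internal disagreement, which is why the contig would have to be split first.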

I am not familiar with Lep-MAP3 at all and am not sure I understand what problem the sampling is intended to solve. The issue here is whether you trust the linkage map or the de novo assembly more. I don't think sampling attacks the heart of the problem.

Haibao

francicco commented 4 years ago

"First, we notice that LA reports five chimeric scaffolds for the yellow catfish data (scaffolds 44, 75, 123, 165 and 230), whereas ALLMAPS on default puts all contigs into at most one chromosome. The mapping of the two genomes against each other with minimap2 (Li, 2018) supports all the chimerics found by LA. The start of scaffold 230 maps to unanchored part of GT (suggesting that GT could be improved with this linkage map) and the end to chromosome 23. The other four chimerics map to different chromosomes consistent with the marker positions in the LAs result. We tried the option split in ALLMAPS, meant to cut chimeric scaffolds: It reported 47 breakpoints within scaffolds. One of the ALLMAPS breakpoint scaffolds, scaffold 44, was also reported by LA but the reported breakpoint was not the same. Thus, all the five verified chimerics were missed by ALLMAPS.

Second, we compared the score (number of supporting markers) of the two programmes. LA reports the final result with the total of 6061 supporting markers summing up all 26 chromosomes, whereas ALLMAPS reports 5979 supporting markers. We verified the support by evaluating ALLMAPS results with LA. According to LA, ALLMAPS result had 6019 supporting markers counted by LA: in 10 chromosomes, the scores were equal and in the remaining 16 chromosomes, the score calculated by LA was greater. The difference is most likely due to ALLMAPS removing 225 markers (2.3%) as outliers."

Lep-Anchor: automated construction of linkage map anchored haploid genomes, Pasi Rastas

https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz978/5698268

tanghaibao commented 4 years ago

@francicco

Thank you for the reference. This helps.

It seems that LA pays special attention to chimerics, which are not targeted by ALLMAPS. I don't have a particularly good idea of how to solve this issue generically, though. Splitting input contigs in a general way is not trivial (where do you break, and how do you propagate the evidence?).

When I initially wrote ALLMAPS, circa 2014, most de novo assemblies had smallish contigs/scaffolds, and chimerics were a smaller problem.

Finally, if you don't have a way to split the contigs before ALLMAPS, then sadly your best option may be just to use LA 😞

Haibao

francicco commented 4 years ago

LA is as easy to use as maneuvering a helicopter. F