malonge / RagTag

Tools for fast and flexible genome assembly scaffolding and improvement
MIT License

Unconventional Use of RagTag #33

Closed goeckeritz closed 3 years ago

goeckeritz commented 3 years ago

Originally posted on RaGOO's GitHub on 11/25/2020 (my bad!)

Hi there,

Recently I ran RagTag on a scaffolded assembly (created with long-read sequences and Hi-C data) from a plant allopolyploid whose two suspected progenitors are both extant. One progenitor has a published genome, which was assembled from short reads and scaffolded against a related species. It's quite possible that my scaffolded assembly is actually more contiguous than this progenitor's reference, but that's beside the point of my question. The other suspected progenitor does not have a published genome.

Simply put, since the allopolyploidization event is recent (estimated at well under 1 mya), and there is marker evidence that many regions of the genome segregate as a diploid (some don't -- this species is a segmental allotetraploid!), I was interested in using RagTag to estimate the subgenome groupings of the scaffolds. I figured that if the grouping_confidence scores of two scaffolds assigned to the same progenitor reference chromosome were substantially different, the higher-scoring one is likely derived from that progenitor. By elimination, the other scaffold would be assigned to the other progenitor.

I'm sure you can think of a number of flaws with this approach -- but the main one I am struggling with at the moment is I don't have a great sense of how to tell when the difference between 2 grouping_confidence scores is substantial enough to assign the scaffolds confidently to a subgenome. I suppose I could do a t-test of the differences of the 8 groupings and see which are significant... but being the creator of RagTag, I was interested in what you thought of this approach?
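To make the comparison concrete, here is a minimal Python sketch of the pairing step described above. The scaffold names, chromosome names, and scores are made up for illustration, and `min_gap` is a placeholder threshold rather than a recommended value; choosing it is exactly the open question.

```python
from collections import defaultdict

# Hypothetical records: (scaffold, assigned reference chromosome, grouping_confidence).
# None of these values come from the attached spreadsheet.
RECORDS = [
    ("scaf_1", "chr1", 0.92),
    ("scaf_2", "chr1", 0.61),
    ("scaf_3", "chr2", 0.80),
    ("scaf_4", "chr2", 0.78),
]


def pairwise_differences(records):
    """For each chromosome with exactly two scaffolds, return
    (higher-scoring scaffold, lower-scoring scaffold, score difference)."""
    by_chrom = defaultdict(list)
    for scaffold, chrom, score in records:
        by_chrom[chrom].append((score, scaffold))

    diffs = {}
    for chrom, entries in by_chrom.items():
        if len(entries) != 2:
            continue  # the pairing logic only applies to two homoeologous scaffolds
        (hi_score, hi_name), (lo_score, lo_name) = sorted(entries, reverse=True)
        diffs[chrom] = (hi_name, lo_name, round(hi_score - lo_score, 4))
    return diffs


def confident_calls(diffs, min_gap):
    """Keep only chromosomes whose score gap reaches min_gap (an arbitrary cutoff)."""
    return {chrom: d for chrom, d in diffs.items() if d[2] >= min_gap}


diffs = pairwise_differences(RECORDS)
calls = confident_calls(diffs, min_gap=0.1)
```

With these toy numbers, only the chr1 pair clears the cutoff; the chr2 pair would stay unassigned.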

Ideally I would be doing a Ks comparison between the progenitor and my scaffolds, but my assembly is not yet annotated, so the coding regions haven't been picked out quite yet. I was hoping to label the scaffolds before doing so, but maybe I should just suck it up and name them later! I also thought about using polyCRACKER, but I'm not familiar with docker whatsoever and the thing seemed like it would be a pain in the ass to get running.

Attached is a file containing my confidence scores. Any advice is much appreciated!

Kindly, Charity

Attachment: ragtag.confidence1_16B.xlsx

malonge commented 3 years ago

Hi there,

This is a rather interesting idea. I suppose the location confidence score may actually be more informative for this purpose, though. The clustering/grouping score is simply the proportion of the reference bases covered by a query sequence that fall on its assigned reference sequence. In other words, this score decreases when a query sequence aligns well to multiple reference sequences, not when it aligns with more divergence to a single reference sequence. However, perhaps you have a reason to suspect that scaffolds originating from the other subgenome would align with more ambiguity across multiple reference chromosomes. For example, I could see this being the case if there was some major TE amplification since the divergence of these progenitor species.
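As a rough illustration of that definition (a sketch of the idea as described here, not RagTag's actual code), the function below takes a hypothetical mapping from reference sequence to the number of reference bases a query covers and returns the assigned sequence with its grouping score. The function name and the numbers are assumptions made for the example.

```python
def grouping_confidence(covered_bp):
    """covered_bp maps reference sequence name -> reference bases covered
    by one query's alignments. Returns (assigned sequence, score)."""
    total = sum(covered_bp.values())
    if total == 0:
        raise ValueError("query covers no reference bases")
    # The query is assigned to the reference sequence it covers the most.
    assigned = max(covered_bp, key=covered_bp.get)
    return assigned, covered_bp[assigned] / total


# Divergent but unambiguous: fewer bases covered, all on one chromosome.
unique_hit = grouping_confidence({"chr1": 600_000})

# Well aligned but ambiguous: coverage is split across two chromosomes.
split_hit = grouping_confidence({"chr1": 900_000, "chr2": 300_000})
```

The first query scores 1.0 despite covering fewer bases, while the second scores 0.75, which is why divergence alone does not lower this score.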

Anyways, like I said, you may also consider looking at the location confidence score, though that can be rather noisy. As for efficacy, I am not sure whether this would work. It would be interesting to try it with data where both progenitors are available. Perhaps the recent apple pan-genome paper has suitable test data for this scenario?

And if the confidence scores are not suitable for this task, you could still use RagTag to assign each scaffold to a reference chromosome. Then, you can use your own method to further split the scaffolds assigned to each chromosome into subgenomes. Once the subgenomes are separated, you can run RagTag on each subgenome separately to finish scaffolding.
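A minimal sketch of the first step, assuming the scaffold-to-chromosome assignments are read from an AGP-format file: in AGP, lines with component type `W` place a sequence (column 6) into an object (column 1), and other types describe gaps. The AGP text and all names below are invented for the example.

```python
from collections import defaultdict

# Toy AGP content; object and scaffold names are hypothetical.
AGP_TEXT = """\
##agp-version 2.1
chr1_RagTag\t1\t1000\t1\tW\tscaf_1\t1\t1000\t+
chr1_RagTag\t1001\t1100\t2\tU\t100\tscaffold\tyes\talign_genus
chr1_RagTag\t1101\t1900\t3\tW\tscaf_2\t1\t800\t-
chr2_RagTag\t1\t500\t1\tW\tscaf_3\t1\t500\t+
"""


def scaffolds_by_chromosome(agp_text):
    """Group placed query scaffolds by the object (chromosome) they were assigned to."""
    groups = defaultdict(list)
    for line in agp_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and AGP header comments
        fields = line.split("\t")
        if fields[4] == "W":  # sequence component; gap lines are skipped
            groups[fields[0]].append(fields[5])
    return dict(groups)


groups = scaffolds_by_chromosome(AGP_TEXT)
```

Each per-chromosome list can then be handed to whatever method is used to split the scaffolds into subgenomes.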

Thanks, Mike

malonge commented 3 years ago

I will close this issue for now but please reopen if you would like to continue the discussion.