mikolmogorov / Ragout

Chromosome-level scaffolding using multiple references

When scaffolding against a high quality reference, what do outputs of "hybrid" scaffolds most likely indicate? #84

Open DaRinker opened 1 year ago

DaRinker commented 1 year ago

I am scaffolding several ONT+Illumina assemblies (Flye) against a T2T reference genome of a sister species. Each of my assemblies represents a specific strain of my species of interest, so while I expect some variation, I don't necessarily expect massive structural rearrangements.

MOST of my output scaffolds have a 1-to-1 correspondence with the T2T reference chromosomes. However, I'm seeing multiple cases that are not so clear-cut. For example, in 70% of my samples I get a single scaffold corresponding to "chr_1" (using the T2T headers), but in 30% I see "chr1" PLUS "chr1_chr6" (so it looks like a chunk of chr1 moved to chr6). And it doesn't look random: when these "hybrid" scaffolds appear, they almost always involve the same pairs of reference chromosomes.

Since I've tried multiple strategies (assembly parameter variation, different Ragout reference sequences), I'm beginning to think that what I'm seeing is at least supported by my sequencing data. Are there any "sanity checks" that I can do (within the Ragout framework) to convince myself that what appear to be chromosomal translocations are actually real?

DaRinker commented 1 year ago

Looking at my output scaffolds in more detail, I don't think they're all correct. Not sure why, but for one reference chromosome, Ragout is consistently inserting lots of small fragments that both a) do not align well to my reference and b) end up extending some scaffolds by over 2Mb(!!).

UPDATE: I tried soft masking all my contigs, as well as soft masking the T2T assembly, but nothing I try seems to stop this behavior. And it occurs in ALL my de novo assembled samples, so it's something beyond a random edge case...

mikolmogorov commented 1 year ago

Can you post the log file? Are you using the repeat resolution mode? How small are the fragments? You can perhaps adjust the synteny block size to prevent it from inserting them.
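To answer the "how small are the fragments" question, one option is to list the short components per scaffold. The sketch below assumes the scaffold layout is available as a standard AGP file (nine tab-separated columns); whether your Ragout run produced one, and under what file name, is something to check in your output directory. The function name `small_components` and the 5 kb default cutoff are illustrative choices, not Ragout settings.

```python
from collections import defaultdict


def small_components(agp_lines, max_len=5000):
    """Return {scaffold: [(component_id, length), ...]} for short components.

    Assumes standard AGP columns: object, object_beg, object_end,
    part_number, component_type, component_id, component_beg,
    component_end, orientation. Gap lines (type N or U) are skipped.
    """
    hits = defaultdict(list)
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or AGP header comment
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a well-formed AGP record
        scaffold, comp_type = cols[0], cols[4]
        if comp_type in ("N", "U"):
            continue  # gap record, not a sequence fragment
        comp_id = cols[5]
        length = int(cols[7]) - int(cols[6]) + 1  # 1-based inclusive coords
        if length <= max_len:
            hits[scaffold].append((comp_id, length))
    return dict(hits)


# Usage with a hypothetical file name:
# with open("my_scaffolds.agp") as fh:
#     for scaffold, frags in small_components(fh).items():
#         print(scaffold, len(frags), sum(n for _, n in frags))
```

Comparing the fragment lengths with the synteny block sizes used in the run may show whether a larger minimum block size would keep those fragments out.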

In general, Ragout won't make a connection (e.g. chr1 - chr6 fusion) unless there is evidence of it in at least one of the reference genomes or the target genome. For each iteration, you should have the file with synteny block order in each genome, and you may be able to tell which reference supports the fusion. Also, the links file should have the list of genomes that support each adjacency. If you can pinpoint which adjacency corresponds to the fusion, you can see which genomes support it.
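Once the adjacencies and their supporting genomes have been read off the links file, filtering for joins with no reference support is straightforward. This is only a sketch: it does not parse Ragout's links file, and the `(left, right, supporters)` tuple layout, the function name, and the genome names in the usage example are all hypothetical.

```python
def unsupported_by_references(adjacencies, references):
    """Return adjacencies that none of the reference genomes support.

    adjacencies: iterable of (left, right, supporters) tuples, where
    supporters is a collection of genome names (hypothetical layout,
    filled in by hand or by your own parser of the links file).
    references: names of the reference genomes used in the run.
    """
    refs = set(references)
    return [
        (left, right)
        for left, right, supporters in adjacencies
        if not refs & set(supporters)  # no overlap with any reference
    ]


# Usage with made-up names:
# joins = [("contig_1", "contig_2", ["t2t_ref"]),
#          ("contig_2", "contig_9", ["target"])]
# unsupported_by_references(joins, ["t2t_ref"])
```

Any adjacency this returns is one to inspect more closely, since its support would have to come from the target assembly.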

DaRinker commented 1 year ago

> In general, Ragout won't make a connection (e.g. chr1 - chr6 fusion) unless there is evidence of it in at least one of the reference genomes or the target genome.

This is useful. Since no references support the translocation, it sounds like I can assume the evidence is coming from the target assembly itself. And since the translocations I'm seeing DO make (parsimonious) sense within the phylogenetic context, I'm starting to think they may be real.

mikolmogorov commented 1 year ago

Could be! If I remember correctly, Ragout may keep an adjacency that is unsupported by references if (i) it sees complementary breakpoints (e.g. an inversion should have two) and (ii) it does not alter the chromosome structure significantly. Does the chr1-chr6 fusion lead to a karyotype change? If so, not sure why this happens. But if it's more like a smaller translocation, it means that all its breakpoints should be contained in your assembled genome.
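One check outside Ragout for whether the breakpoints are contained in the assembled genome is to look for individual contigs that align to two reference chromosomes. The sketch below assumes the contigs have been aligned to the T2T reference with a tool that writes PAF (for example minimap2, which is not mentioned above and is only one possible choice). The function name and both thresholds are illustrative and would need tuning for your data.

```python
from collections import defaultdict


def contigs_spanning_two_chromosomes(paf_lines, min_aln_len=50000, min_mapq=20):
    """Return {contig: {chromosome: aligned_bases}} for contigs that have
    long, confidently placed alignments on at least two chromosomes.

    Uses the standard PAF columns: 1 query name, 3 query start,
    4 query end, 6 target name, 12 mapping quality.
    """
    aligned = defaultdict(lambda: defaultdict(int))
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # not a complete PAF record
        contig, q_start, q_end = cols[0], int(cols[2]), int(cols[3])
        chrom, mapq = cols[5], int(cols[11])
        aln_len = q_end - q_start
        if aln_len < min_aln_len or mapq < min_mapq:
            continue  # ignore short or ambiguous (e.g. repeat) hits
        aligned[contig][chrom] += aln_len
    return {c: dict(t) for c, t in aligned.items() if len(t) >= 2}


# Usage with a hypothetical file name:
# with open("contigs_vs_t2t.paf") as fh:
#     print(contigs_spanning_two_chromosomes(fh))
```

A contig reported here carries the chr1/chr6 junction within a single assembled sequence, so the join is not something the scaffolder introduced. It could still be an assembly chimera, so checking that long reads span the junction would be a sensible follow-up.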