mikolmogorov / Ragout

Chromosome-level scaffolding using multiple references

Best practices question: is "overfitting" to the reference sequence a problem? #82

Open DaRinker opened 1 year ago

DaRinker commented 1 year ago

I have 20 de novo hybrid genome assemblies (ONT plus Illumina; Flye plus Pilon polishing) of different strains of the same species. As an initial reference genome, we also have a high-quality (T2T, full-chromosome) assembly of a closely related sister species.

My analysis process at the moment is:

  1. Ragout refine each of my 20 assemblies vs the T2TReference
  2. Rank my ragout scaffolds from best to worst (including consideration of how much of the initial assembly was unplaced)
  3. Iteratively ragout refine each of the remaining 19 assemblies vs the T2TReference PLUS my best assembly's ragout refined scaffolds (i.e. two reference fastas).
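
The two rounds above differ only in which references are listed in the Ragout recipe file. Below is a minimal sketch of generating both recipes, assuming the `.references` / `.target` / `<name>.fasta` recipe keys as I understand them from the Ragout documentation. The strain name, reference names, and all file paths are hypothetical placeholders.

```python
def make_recipe(target, target_fasta, references):
    """Return Ragout recipe text for one target assembly.

    references: dict mapping a reference name to its FASTA path
    (insertion order is preserved in the .references line).
    """
    lines = [
        ".references = " + ",".join(references),
        ".target = " + target,
        "",
    ]
    for name, path in references.items():
        lines.append(f"{name}.fasta = {path}")
    lines.append(f"{target}.fasta = {target_fasta}")
    return "\n".join(lines) + "\n"


# Round 1 (step 1): T2T reference only. All names/paths are hypothetical.
round1 = make_recipe(
    "strain07",
    "assemblies/strain07.fasta",
    {"t2t": "refs/t2t.fasta"},
)

# Round 2 (step 3): T2T reference plus the best refined assembly.
round2 = make_recipe(
    "strain07",
    "assemblies/strain07.fasta",
    {"t2t": "refs/t2t.fasta", "best": "round1/best_scaffolds.fasta"},
)

print(round2)
```

Each recipe would then be written to a file and passed to Ragout with the `--refine` option (for example `ragout strain07.rcp --outdir out_strain07 --refine`; the output directory name is again a placeholder).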

Does this sound reasonable, or might this result in the overfitting of the remaining 19 assemblies to that one, best assembly?

While I don't have reason to think that any of my 20 strains should have massive differences between them, I don't want to obscure any smaller differences by over-favoring the assembly that just happened to have the best quality/coverage.

mikolmogorov commented 1 year ago

Hi @DaRinker

I think the simpler strategy of just using the T2T reference for each strain may be sufficient. How much structural variation do you expect between the strains? Otherwise I think your approach makes sense, but it is definitely worth comparing it against the simple "baseline" approach with one reference!

DaRinker commented 1 year ago

Thanks for your input.

The problem with the simplest strategy is that using only the T2T genome for each strain gives very mixed results. Some of the 20 strains scaffold out nicely (i.e. I get 1-to-1 scaffolds for each of the T2T chromosomes) and others do not (I attribute this behavior to the 0.095 substitutions per site distance between the T2T sister species and each of the 20 strain assemblies). This issue can be remedied by adding back in the best of my scaffolded assemblies (so T2T reference plus best "in-species" assembly).
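
One way to make the "1-to-1" check and the unplaced-sequence criterion from step 2 concrete is to score each strain on how close its scaffold count is to the chromosome count, then on the fraction of assembly bases that were placed. This is only a sketch of one possible ranking, not something Ragout provides. The `ScaffoldStats` fields, the chromosome count, and the numbers are illustrative assumptions, not results from these assemblies.

```python
from dataclasses import dataclass


@dataclass
class ScaffoldStats:
    strain: str
    n_scaffolds: int   # scaffolds produced for this strain
    placed_bp: int     # assembly bases placed into scaffolds
    unplaced_bp: int   # assembly bases left unplaced

    @property
    def placed_fraction(self) -> float:
        total = self.placed_bp + self.unplaced_bp
        return self.placed_bp / total if total else 0.0


def rank(stats, n_chromosomes):
    """Best first: scaffold count closest to the chromosome count,
    ties broken by the highest placed fraction."""
    return sorted(
        stats,
        key=lambda s: (abs(s.n_scaffolds - n_chromosomes), -s.placed_fraction),
    )


# Illustrative values only (hypothetical strains, 8 chromosomes assumed).
example = [
    ScaffoldStats("A", 8, 9_500_000, 500_000),
    ScaffoldStats("B", 8, 9_900_000, 100_000),
    ScaffoldStats("C", 11, 9_990_000, 10_000),
]
ranked = [s.strain for s in rank(example, n_chromosomes=8)]
print(ranked)
```

The top-ranked strain would be the candidate to add as the second, in-species reference. Other weightings (for example, putting placed fraction first) are equally defensible.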

So I've decided to go with this approach moving forward.

mikolmogorov commented 1 year ago

I see, so some genomes are quite distant from each other - then I think it's a good strategy!