Misleading allotetraploid sturgeon k-mer histogram

hannesbecher / shiny-k-mers

Tetmer, an R package and Shiny app for interactively fitting population parameters to k-mer spectra of diploids, triploids, and tetraploids (allo and auto)

GNU General Public License v3.0


Hi Hannes, I would like to ask for your advice on a problem with the k-mer histogram of a sturgeon species that is probably allotetraploid (Acipenser naccarii).

The problem is that the histogram looks strange: it does not resemble a typical allotetraploid histogram, and the Tetmer model does not fit it well, so the estimates of theta, T, nucleotide divergence, etc. are unexpected (I tried adjusting the fit manually, but that did not change anything). Although I have good coverage (89x, that is, ~22x per haploid genome), the first peak of the histogram does not separate from the contamination peak (figure 1, attached). I suspected a coverage problem, so I artificially increased the coverage to 142x by combining the reads of two individuals, but the problem persists. I wonder whether what Tetmer marks as the first peak is actually the second peak, and the true first peak is the one I circled in red in figure 2, also attached. Do you think this is a coverage problem, or is there something else I am missing?

Another thing to consider is the evolutionary history of this species, which could bias the analysis. It is hypothesized that the species reached octaploidy before undergoing diploidization. It is therefore currently considered tetraploid, although some loci could still be octaploid or even diploid, because diploidization proceeds at different speeds across the genome.
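The coverage arithmetic in this post can be written out as a short sketch. It divides the quoted 89x total coverage by the assumed ploidy and converts read coverage to an approximate k-mer coverage, which is roughly where the 1x peak of a k-mer histogram would be expected. The read length (150 bp) and k (21) are assumptions for illustration only; neither is stated in the thread, and the conversion ignores sequencing errors.

```python
def per_haploid_coverage(total_coverage, ploidy):
    """Read coverage per haploid genome copy, given total read coverage."""
    return total_coverage / ploidy


def kmer_coverage(read_coverage, read_length, k):
    """Approximate k-mer coverage: a read of length L contains L - k + 1 k-mers."""
    return read_coverage * (read_length - k + 1) / read_length


TOTAL_COVERAGE = 89   # from the post
READ_LENGTH = 150     # assumption, not given in the thread
K = 21                # assumption, not given in the thread

for ploidy in (4, 8):
    haploid = per_haploid_coverage(TOTAL_COVERAGE, ploidy)
    expected_1x = kmer_coverage(haploid, READ_LENGTH, K)
    print(f"ploidy {ploidy}: {haploid:.1f}x per haploid, "
          f"1x k-mer peak expected near multiplicity {expected_1x:.1f}")
```

Under these assumed values, halving the per-haploid coverage (tetraploid versus octaploid) halves the expected position of the 1x peak, which is why the ploidy assumption matters when deciding which peak is the first one.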

I can send you the histograms that I used in Tetmer if you want to have a look. Looking forward to your answer, thank you!

Víctor Muñoz

Hi Víctor,

Thanks for getting in touch!

First of all, I don’t recommend merging k-mer data sets from multiple individuals. They are likely to contain different genetic variants, which generate additional peaks in your spectrum (unless the individuals are clones). Also, it is unlikely that all samples were sequenced at exactly the same depth, so their peaks would not align. So, let’s focus on your top (single-individual) plot.

You have cut off the y-axis a bit low, so it is hard to see the first peak (multiplicity approx. 12). It would be good to know whether this is a data peak or contamination. One way to check would be to generate a quick-and-dirty assembly and run blobtools on it. Even quicker would be to use smudgeplot. There are then two options:

The multiplicity 12 peak is due to contamination: This would be unfortunate. Because this peak overlaps with the true 1x peak, fitting parameters is unreliable. You could still try and I’d be happy to help.

The multiplicity 12 peak is a data peak: You are dealing with an octoploid spectrum. According to what you told me this seems plausible. Tetmer is not made for octoploids but could be extended for auto-octoploids. I’d be interested to try this. Allo would be too complicated because there are too many possible homology relationships with eight genomes (and in reality you are probably dealing with some intermediate state).
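As a rough first look before running blobtools or smudgeplot, the peak positions in a k-mer histogram can be located and compared with the lowest peak. Data peaks are expected near integer multiples of the 1x peak, whereas a contaminant peak need not be. The sketch below is a minimal local-maximum search, not part of Tetmer. The function name `find_peaks`, its parameters, and the synthetic spectrum are all made up for illustration, and the spectrum is not the sturgeon data.

```python
import math


def find_peaks(hist, min_multiplicity=5, window=2):
    """Return multiplicities whose count exceeds all neighbours within `window`.

    `hist` maps multiplicity -> number of distinct k-mers with that multiplicity.
    Multiplicities below `min_multiplicity` are skipped to avoid the error tail.
    """
    peaks = []
    for m in sorted(hist):
        if m < min_multiplicity:
            continue
        neighbours = [hist[n] for n in range(m - window, m + window + 1)
                      if n != m and n in hist]
        if len(neighbours) < 2 * window:
            continue  # too close to the edge of the histogram
        if all(hist[m] > c for c in neighbours):
            peaks.append(m)
    return peaks


def bump(m, centre, height, width):
    """Gaussian-shaped peak used only to build the synthetic example."""
    return height * math.exp(-((m - centre) ** 2) / (2 * width ** 2))


# Synthetic spectrum (made-up numbers): an error tail plus three peaks.
synthetic = {
    m: 1e6 * math.exp(-1.5 * m)
       + bump(m, 12, 1000, 3) + bump(m, 24, 3000, 4) + bump(m, 48, 2000, 6)
    for m in range(1, 81)
}

peaks = find_peaks(synthetic)
ratios = [p / peaks[0] for p in peaks]
print("peaks:", peaks)
print("ratio to lowest peak:", ratios)
```

A search like this only reports where the peaks are. It cannot by itself tell a data peak from contamination, so it complements the checks suggested above and does not replace them.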

If you send your spectrum to the contact email given in the Tetmer paper, I'm happy to take a look.

All the best, Hannes

Misleading allotetraploid sturgeon k-mer histogram #8