schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0
258 stars 56 forks source link

Model Fit vs k-mer size #32

Open bmansfeld opened 4 years ago

bmansfeld commented 4 years ago

Hi Mike, Hope all is well with you I'm revisiting our GenomeScope analyses with cleaner reads and was curious about a trend I see in many papers. As well as looking at other k-mer based assembly evaluation tools such as Merqury

It seems that in lieu of knowing what size k-mer to use, many people (self included) run Jellyfish with different k's and after using GenomeScope to estimate their genome parameters use the Model fit reported by GS to select the optimal value for k.

The fit values usually range within the 90's so I wonder how useful/accurate this approach is...

Furthermore, it seems that in some cases people report moderately different results from GS using different k's.

In my case its rather marginal to the final result (704 vs 711Mb size; 1.4 vs 1.33% het) , but using different ks results in visibly different histograms. k=21 - http://qb.cshl.edu/genomescope/analysis.php?code=IKX3YGi4tZUIurkadcZO k=31 - http://qb.cshl.edu/genomescope/analysis.php?code=Dzsv8ydHbwFiRgfkmyOe

Which makes me wonder if there is a "correct" value for k? I've done some searching but answers are confusing.

The tool I mention above (Merqury) has bash script to calculate best k which seems a rather simple equation based on Fofanov et al (2004).

Software like kmergenie estimate the best kmer length for assembly by finding the max distinct kmers. Is this appropriate for size and heterzygosity estimation? In any case any advice would be welcome. How should one find the best k for running GS? Thanks in advance, Ben

In general do you have best practice re

mschatz commented 4 years ago

Hi Ben,

Thanks for your interest. I would say this is great consistency between the two runs: a 1% variance in genome size and a 5% variance in the expected heterozygosity rate is within the limits of the data. I noticed your kmer peaks are truncated at 100,000 which probably contributes to some of this variance, especially for the genome size. If you increase this to 1,000,000 I would be the values to be even closer.

For most genomes, we recommend a kmer length of 21 to balance out repetitive elements vs the impact of sequencing errors and heterozygosity. This is more-or-less in agreement with the value Merqury suggests over typical genome sizes (a few megabases to a few gigabases). If you use a shorter kmer, then more of the genome becomes repetitive until the peaks are poorly resolved. If you use a longer kmer, more of the kmers will be unique, although this will increasingly obscure how the heterozygosity is distributed within the genome. For a given amount of heterozygosity observed in the kmers, GenomeScope infers the level of heterozygosity at the nucleotide level by assuming the heterozygosity is distributed at a uniform rate. This was necessary for modeling purposes, but in practice, heterozygosity is not uniformly distributed due to variable selective pressures (coding/non-coding etc) so there can be some variability depending on the kmer length used.

As an extreme example, in a typical human genome there is a heterozygous variant every 1000bp. So, if you used 1,000 mers or longer mers almost every kmer in a human genome would look heterozygous but it is ambiguous as to if every nucleotide is heterozygous or only 1 in 1000. As you shrink the kmer length, fewer and fewer kmers will overlap a heterozyous nucleotide until the true value can be found of about 0.1%.

Hope this helps!

Mike

On Mon, May 4, 2020 at 10:49 PM Ben Mansfeld notifications@github.com wrote:

Hi Mike, Hope all is well with you I'm revisiting our GenomeScope analyses with cleaner reads and was curious about a trend I see in many papers. As well as looking at other k-mer based assembly evaluation tools such as Merqury https://github.com/marbl/merqury

It seems that in lieu of knowing what size k-mer to use, many people (self included) run Jellyfish with different k's and after using GenomeScope to estimate their genome parameters use the Model fit reported by GS to select the optimal value for k.

The fit values usually range within the 90's so I wonder how useful/accurate this approach is...

Furthermore, it seems that in some cases people report moderately different results from GS using different k's.

In my case its rather marginal to the final result (704 vs 711Mb size; 1.4 vs 1.33% het) , but using different ks results in visibly different histograms. k=21 - http://qb.cshl.edu/genomescope/analysis.php?code=IKX3YGi4tZUIurkadcZO k=31 - http://qb.cshl.edu/genomescope/analysis.php?code=Dzsv8ydHbwFiRgfkmyOe

Which makes me wonder if there is a "correct" value for k? I've done some searching but answers are confusing.

The tool I mention above (Merqury) has bash script https://github.com/marbl/merqury/blob/master/best_k.sh to calculate best k which seems a rather simple equation based on Fofanov et al (2004).

Software like kmergenie estimate the best kmer length for assembly by finding the max distinct kmers. Is this appropriate for size and heterzygosity estimation? In any case any advice would be welcome. How should one find the best k for running GS? Thanks in advance, Ben

In general do you have best practice re

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/32, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP347AMRJAZ7FPFZXZP6DRP55DNANCNFSM4MZG7VTQ .

bmansfeld commented 4 years ago

Mike as usual thanks for the quick and informative reply. The example helps understand a lot. Oh and I agree I think these numbers are all fair estimates in both values of k.

I noticed your kmer peaks are truncated at 100,000 which probably contributes to some of this variance, especially for the genome size. If you increase this to 1,000,000 I would be the values to be even closer.

I'm not sure I understand your meaning about truncated peaks. I've included an upper count of 1e6 in jellyfish and set that as max kmer coverage in GS as well. Am I confusing these terms?

Yes, Merqury suggested a k-mer of 20 as optimal for our genome size whether diploid or haploid. But looking at the distributions I noticed that the 1x-het-peak looks shorter with that value, and was wondering if we were excluding potential hets. As you say, larger k more unique kmers.

So, where does the model fit... um fit? It seems arbitrary to me to use this as a "kmer selection method"? What are your thoughts?

Thanks again, Ben

mschatz commented 4 years ago

Since jellyfish imposes a maximum count on the histogram, there might be kmers that occur more than 100,000 times, just they are reported as 100,000 times in the file. The number (and sequence) of kmers that exceed this limit will depend in part on the kmer length -- short kmers are less impacted by the ends of the reads and sequencing errors to tend to have higher coverage. So you might partially stabilize the results by increasing the maximum kmer count to around 500,000 times so all kmers are considered in both runs.

And for your second question, the relative heights of the peaks directly depends on the choice of k since a single heterozygous base leads to 2*k heterozygous kmers (k from the maternal haplotype and k from the paternal). Consequently, the heterozygous peak for 31mers will be about 50% taller than it will be for 21mers from the same data (it is actually a little bit more subtle than this since changing k will also change the average kmer coverage). The GenomeScope model knows how to correct for this, which is why the estimates different by only a few percent and not 50%. I would encourage you to read the GenomeScope 1.0 supplement where these factors are discussed - you may even want to implement Supplemental equations 2 and 3 in R/python so you can play with how different parameters impact the shapes of the curves. The nls_4peak function here will get you started: https://github.com/schatzlab/genomescope/blob/master/genomescope.R

Hope this helps

Mike

On Thu, May 7, 2020 at 11:34 PM Ben Mansfeld notifications@github.com wrote:

Mike as usual thanks for the quick and informative reply. The example helps understand a lot. Oh and I agree I think these numbers are all fair estimates in both values of k.

I noticed your kmer peaks are truncated at 100,000 which probably contributes to some of this variance, especially for the genome size. If you increase this to 1,000,000 I would be the values to be even closer.

I'm not sure I understand your meaning about truncated peaks. I've included an upper count of 1e6 in jellyfish and set that as max kmer coverage in GS as well. Am I confusing these terms?

Yes, Merqury suggested a k-mer of 20 as optimal for our genome size whether diploid or haploid. But looking at the distributions I noticed that the 1x-het-peak looks shorter with that value, and was wondering if we were excluding potential hets. As you say, larger k more unique kmers.

So, where does the model fit... um fit? It seems arbitrary to me to use this as a "kmer selection method"? What are your thoughts?

Thanks again, Ben

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/32#issuecomment-625611658, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP342GF3PMF43RQS5E43DRQN4TPANCNFSM4MZG7VTQ .

bmansfeld commented 4 years ago

Thanks mike! I'll re-read the supplement.