schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0
253 stars 56 forks source link

Low fit with simple genome #70

Closed soungalo closed 2 years ago

soungalo commented 2 years ago

Hello, I ran genomeScope 2.0 on a data set of reads derived from a plant with a homozygous diploid genome of estimated size ~5.2 Gb. The sequencing depth is ~15x. I created the k-mer histogram using KMC3 according to the instructions on the web page.
The observed histogram looks as I would expect - single peak around the k-mer coverage. However, for some reason I see low fit and a ten fold underestimation of the genome size. Below you can see some results of different runs, but all seem to have the same problems. Looking at the plots, it seems that the full model just doesn't fit very well and the error model behaves pretty strange too.
21-mers - max 10000 21-mers - max 10000000 21-mers - max 10000, min 3 21-mers - max 10000, Max k-mer coverage=50 - this one actually has good fit, but the genome size is still small. 51-mers - max 10000

Since the k-mer coverage peak is at 12, read length is 150 and the number of reads is ~0.5E09, I'd expect a genome size estimation of G = ((150-21) * 0.5E09 * 150)/(150*12) = 5.3Gb. Am I doing something wrong, or is there something going on with my data?
Thanks.

mschatz commented 2 years ago

Thanks for your interest. The automatic model fitting routines get confused with low coverage data like you have. If you run GenomeScope on the command line there are a few extra options that can be applied that do a better job (I also just added a few more for your data - make sure to pull the latest code from github).

This command gives a pretty good fit (using your "21-mers - max 10000" histogram)

$ ~/build/genomescope2.0/genomescope.R -i glick.hist -o kcov7.shift2.typicalerror3 -p 2 -k 21 --kcov 7 --start_shift 2 --typical_error 3

This estimates a haploid genome size of 3.6Gbp (so a diploid size of 7.2Gb) and essentially zero heterozygosity (totally inbred?). This is bigger than your estimate, but I find the sequence based estimates to be more reliable than other methods (like flow cytometry) since you are making billions of observations of what is present in the genome

Good luck! Mike

On Sat, Jan 15, 2022 at 3:20 AM Lior Glick @.***> wrote:

Hello, I ran genomeScope 2.0 on a data set of reads derived from a plant with a homozygous diploid genome of estimated size ~5.2 Gb. The sequencing depth is ~15x. I created the k-mer histogram using KMC3 according to the instructions on the web page http://qb.cshl.edu/genomescope/genomescope2.0/. The observed histogram looks as I would expect - single peak around the k-mer coverage. However, for some reason I see low fit and a ten fold underestimation of the genome size. Below you can see some results of different runs, but all seem to have the same problems. Looking at the plots, it seems that the full model just doesn't fit very well and the error model behaves pretty strange too. 21-mers - max 10000 http://genomescope.org/genomescope2.0/analysis.php?code=LHm4tnb11LQiysR7wJZC 21-mers - max 10000000 http://genomescope.org/genomescope2.0/analysis.php?code=1kp0l7h0qjT3124dkq8b 21-mers - max 10000, min 3 http://genomescope.org/genomescope2.0/analysis.php?code=e0fXBUVine9g3FPLpwJY 21-mers - max 10000, Max k-mer coverage=50 http://genomescope.org/genomescope2.0/analysis.php?code=cAfQtGhZFiEAf8OMeKcq

Am I doing something wrong, or is there something going on with my data? Thanks.

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/70, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP346RQPWY5NX223NQ2L3UWEU43ANCNFSM5MAW25MA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you are subscribed to this thread.Message ID: @.***>

GenomeScope version 2.0 input file = glick.hist output directory = kcov7.shift2.typicalerror3 p = 2 k = 21 initial kmercov estimate = 7

property min max
Homozygous (aa) 98.238% 100%
Heterozygous (ab) 0% 1.762%
Genome Haploid Length 2,499,455,365 bp 3,650,640,047 bp
Genome Repeat Length 1,559,853,296 bp 2,278,281,497 bp
Genome Unique Length 939,602,068 bp 1,372,358,550 bp
Model Fit 42.9915% 95.9251%
Read Error Rate 0.160559% 0.160559%

soungalo commented 2 years ago

Thanks, I'll use the CLI from now on.
Could you please explain the parameters --kcov 7 --start_shift 2 --typical_error 3 and how you chose them?
Also, I see you picked the "max" estimate. Is that a general recommendation, or just in this case because of the very low fit of the "min" estimate?

mschatz commented 2 years ago

Great.

--kcov is a "hint" to the model fitting for the coverage for the heterozygous peak. Your sample has very little heterozygosity, so I picked it by looking at the kmer profile, seeing the main peak had a coverage of about 15, and then dividing it in half. The values arent supposed to correspond with the very "top" of the peak since we fit negative binomials that have a larger standard deviation to account for PCR duplicates and other biases

--typical_error means when fitting the model, it should ignore kmers below this coverage as likely being sequencing errors. I fit it by picking a value around the local minimum between high coverage error kmers (with coverage 1 or 2) and the main peak from your data.

--start_shift is also used to exclude sequencing errors. The first round will try fitting excluding kmers below 3 (--typical_error), and then the next round will try fitting excluding below 5 (--typical _error + --start_shift), and then 7, and then 9. The model that has the smallest residual kmers is selected for the final report (the fewest kmers that are outside of the expected value from the fit model).

I used your "21-mers - max 10000" histogram just because it was the first one you had listed. The "21-mers - max 10000000" would probably give a slightly larger and slightly more accurate result.

Good luck!

Mike

On Mon, Jan 17, 2022 at 2:56 AM Lior Glick @.***> wrote:

Thanks, I'll use the CLI from now on. Could you please explain the parameters --kcov 7 --start_shift 2 --typical_error 3 and how you chose them? Also, I see you picked the "max" estimate. Is that a general recommendation, or just in this case because of the very low fit of the "min" estimate?

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/70#issuecomment-1014233733, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP34Y373MT44XVTD3NEI3UWPDTLANCNFSM5MAW25MA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you commented.Message ID: @.***>

soungalo commented 2 years ago

@mschatz - I am returning to this issue because I'm still having some trouble. Specifically, I am getting underestimated genome size.
I made a small experiment: I downloaded 40x data for a homozygous barley genome for which an assembly is available. The assembly size is ~4.5 Gb. I created 21-mers using KMC and ran them through genomeScope2. I tried different parameters as you suggested above, but still couldn't get a result which was even close to 4.5 Gb (histogram is attached). Am I doing something wrong? Maybe 40x is still not deep enough? Or maybe this genome is too complex/repetitive? Would appreciate any help.
Thanks! Akashinriki_dedup_21mer_cx10k.hist.txt

mschatz commented 2 years ago

Hi Lior,

You describe this as a 40x coverage dataset, but the central peak is at about 25x. Are you sure you have all of the data? Also the histogram is truncated at 10,000 coverage which will skip all of the very frequent repeats (TEs, centromeres, etc) that is likely to make up a large fraction of the barley genome. You will need to recompute the KMC histogram with a much larger cutoff (1,000,000 is a safe choice):

$ kmc_tools transform reads histogram reads.histo -cx1000000

Good luck!

Mike

On Sat, Mar 12, 2022 at 12:47 PM Lior Glick @.***> wrote:

@mschatz https://github.com/mschatz - I am returning to this issue because I'm still having some trouble. Specifically, I am getting underestimated genome size. I made a small experiment: I downloaded 40x data for a homozygous barley genome for which an assembly is available. The assembly size is ~4.5 Gb. I created 21-mers using KMC and ran them through genomeScope2. I tried different parameters as you suggested above, but still couldn't get a result which was even close to 4.5 Gb (histogram is attached). Am I doing something wrong? Maybe 40x is still not deep enough? Or maybe this genome is too complex/repetitive? Would appreciate any help. Thanks! Akashinriki_dedup_21mer_cx10k.hist.txt https://github.com/schatzlab/genomescope/files/8238071/Akashinriki_dedup_21mer_cx10k.hist.txt

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/70#issuecomment-1065928992, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP34ZZUUA6QPRXA77G5KDU7TKCZANCNFSM5MAW25MA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

mschatz commented 2 years ago

Actually looking at this again, you will need to reduce both the counting and the histogram to use a cutoff of 1,000,000:

$ kmc -k21 -t10 -m64 -ci1 -cs1000000 @FILES reads tmp/

$ kmc_tools transform reads histogram reads.histo -cx1000000

Good luck! Mike

On Sat, Mar 12, 2022 at 1:07 PM Michael Schatz @.***> wrote:

Hi Lior,

You describe this as a 40x coverage dataset, but the central peak is at about 25x. Are you sure you have all of the data? Also the histogram is truncated at 10,000 coverage which will skip all of the very frequent repeats (TEs, centromeres, etc) that is likely to make up a large fraction of the barley genome. You will need to recompute the KMC histogram with a much larger cutoff (1,000,000 is a safe choice):

$ kmc_tools transform reads histogram reads.histo -cx1000000

Good luck!

Mike

On Sat, Mar 12, 2022 at 12:47 PM Lior Glick @.***> wrote:

@mschatz https://github.com/mschatz - I am returning to this issue because I'm still having some trouble. Specifically, I am getting underestimated genome size. I made a small experiment: I downloaded 40x data for a homozygous barley genome for which an assembly is available. The assembly size is ~4.5 Gb. I created 21-mers using KMC and ran them through genomeScope2. I tried different parameters as you suggested above, but still couldn't get a result which was even close to 4.5 Gb (histogram is attached). Am I doing something wrong? Maybe 40x is still not deep enough? Or maybe this genome is too complex/repetitive? Would appreciate any help. Thanks! Akashinriki_dedup_21mer_cx10k.hist.txt https://github.com/schatzlab/genomescope/files/8238071/Akashinriki_dedup_21mer_cx10k.hist.txt

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/70#issuecomment-1065928992, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP34ZZUUA6QPRXA77G5KDU7TKCZANCNFSM5MAW25MA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

soungalo commented 2 years ago

Thanks, that did the trick. I tried increasing the max kmer coverage also for my low depth data set, but this didn't change the results, so my conclusion is that for barley, 10x sequencing is simply too low to distinguish the coverage and the error peaks.