schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0

some questions about the principle of genomescope #78

Open ChenDepp opened 2 years ago

ChenDepp commented 2 years ago

Hi, I have read the GenomeScope and GenomeScope 2 articles, but I could not find a detailed description of the method and its underlying principles, so I am still confused. If I want to study the rationale behind GenomeScope, can you point me to some information? Thanks.

mschatz commented 2 years ago

I think the GenomeScope 1 paper presents the key ideas pretty clearly:

Here we introduce GenomeScope to estimate the overall genome characteristics (total and haploid genome length, percentage of repetitive content and heterozygosity rate) as well as overall read characteristics (read coverage, read duplication and error rate) from raw short read sequencing data. The estimates do not require a reference genome and they can be automatically inferred via a statistical analysis of the k-mer profile. The k-mer profile (sometimes called k-mer spectrum) measures how often k-mers, substrings of length k, occur in the sequencing reads and can be computed quickly using tools such as Jellyfish (Marcais and Kingsford, 2011) or approximated using faster streaming approaches (Melsted and Halldorsson, 2014). The profiles reflect the complexity of the genome: homozygous genomes have a simple Poisson profile while heterozygous ones have a characteristic bimodal profile (Kajitani et al., 2014). Repeats add additional peaks at even higher k-mer frequencies, while sequencing errors and read duplications distort the profiles with low frequency false k-mers and increased variances (Kelley et al., 2010; Miller et al., 2011).
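
To make the notion of a k-mer profile concrete, here is a minimal Python sketch (not part of GenomeScope or Jellyfish). The function name `kmer_profile` and the toy reads are assumptions made for illustration. It counts every k-mer in the reads and then tallies how many distinct k-mers occur once, twice, and so on. Unlike real k-mer counters, it does not merge a k-mer with its reverse complement and it keeps everything in memory.

```python
from collections import Counter


def kmer_profile(reads, k):
    """Return {multiplicity: number of distinct k-mers seen that many times}.

    Hypothetical helper for illustration only; real pipelines use a
    dedicated k-mer counter and usually count canonical k-mers.
    """
    kmer_counts = Counter()
    for read in reads:
        # Slide a window of length k across the read.
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1
    # Histogram of the counts: this is the k-mer profile (spectrum).
    return Counter(kmer_counts.values())


# Toy reads, invented for this example.
reads = ["ACGTACGT", "CGTACGTA", "TTTTACGT"]
profile = kmer_profile(reads, k=4)
for multiplicity in sorted(profile):
    print(multiplicity, profile[multiplicity])
```

On real data the x axis of this histogram is the k-mer coverage, and the peaks described in the quoted passage appear at multiples of the haploid coverage.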

Aware of these possible complexities, GenomeScope fits a mixture model of four evenly spaced negative binomial distributions to the k-mer profile to measure the relative abundances of heterozygous and homozygous, unique and two-copy sequences (Supplementary Eq. S2). GenomeScope uses a mixture model of negative binomial model terms rather than Poisson terms since real sequencing data is often over-dispersed compared to a Poisson distribution (Miller et al., 2011). The model fitting is computed using a non-linear least squares estimate as implemented by the nls function in R (Bates and Watts, 1988). To make the model fitting more robust, GenomeScope attempts several rounds of model fitting excluding different fractions of low frequency k-mers that are likely caused by sequencing errors, and adjusting for the ambiguity in determining the correct heterozygous and homozygous peak. The final set of parameters is selected as those parameters that minimize the residual sum of squares errors (RSSE) of the model relative to the observed k-mer profile. Afterwards, sequence errors and higher copy repeats are identified by k-mers falling outside the model range, and the total genome size is estimated by normalizing the observed k-mer frequencies to the average coverage value for homozygous sequences, excluding likely sequencing errors. See Supplementary Note S1 for a detailed description of the model and fitting procedure.
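
The fitting idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the GenomeScope implementation: the free `weights`, the `size = mean / bias` dispersion parameterization, the `min_cov` error cutoff, and all function names are assumptions, and a tiny grid search stands in for the non-linear least squares fit that the paper performs with `nls` in R. The sketch evaluates a mixture of four evenly spaced negative binomial peaks, scores candidate coverages by their residual sum of squares against the profile, and estimates genome size by dividing the total k-mer count above the error cutoff by the homozygous coverage.

```python
import math


def nb_pmf(x, mu, size):
    """Negative binomial probability of x, given mean mu and dispersion size."""
    log_p = (math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
             + size * math.log(size / (size + mu))
             + x * math.log(mu / (size + mu)))
    return math.exp(log_p)


def mixture(x, kmercov, bias, weights):
    """Four evenly spaced peaks at 1x..4x kmercov, scaled by assumed weights."""
    return sum(w * nb_pmf(x, mu=kmercov * i, size=kmercov * i / bias)
               for i, w in enumerate(weights, start=1))


def rss(profile, kmercov, bias, weights, min_cov):
    """Residual sum of squares, ignoring low-coverage (likely error) k-mers."""
    return sum((n - mixture(x, kmercov, bias, weights)) ** 2
               for x, n in profile.items() if x >= min_cov)


def genome_size(profile, hom_cov, min_cov):
    """Total k-mers above the error cutoff, normalized by homozygous coverage."""
    total = sum(x * n for x, n in profile.items() if x >= min_cov)
    return total / hom_cov


# Synthetic profile generated from the model itself (invented parameters).
true_weights = [1e5, 5e5, 0.0, 0.0]
synthetic = {x: mixture(x, 20, 1.5, true_weights) for x in range(1, 200)}

# Stand-in for the real fit: pick the candidate coverage with the lowest RSS.
candidates = [18, 20, 22]
best_cov = min(candidates,
               key=lambda c: rss(synthetic, c, 1.5, true_weights, min_cov=5))
print("best kmercov:", best_cov)

# Toy profile: an error peak at coverage 1-2 plus peaks at 20x and 40x.
toy = {1: 1000, 2: 50, 20: 100, 40: 900}
print("genome size:", genome_size(toy, hom_cov=40, min_cov=5))
```

In a diploid genome the peak at 1x the haploid coverage is usually read as heterozygous sequence and the peak at 2x as homozygous sequence, which is why the toy genome size is normalized by 40 rather than 20.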

Do you have any specific questions?

Mike


ChenDepp commented 2 years ago

Hi @mschatz, thank you very much for your reply; your detailed answer has helped me a lot. I'm interested in genome surveys and the rationale behind them. I have read the GCE (Genomic Character Estimator) and GenomeScope articles and found that the two tools are based on different rationales, but I can't find any information or paper comparing their results. I have already read the GCE source code, and now I want to read the GenomeScope source code; that may be a good way for me to understand it more deeply. If I run into a problem while reading the source code, I will ask you again. Thank you very much, @mschatz.

mschatz commented 2 years ago

Sure, I would start with the GenomeScope 1.0 source code, as it is more straightforward to understand: https://github.com/schatzlab/genomescope/blob/master/genomescope.R

The supplement also describes the model in more detail, especially Supplementary Note 1:

https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bioinformatics/33/14/10.1093_bioinformatics_btx153/4/btx153_supp.zip?Expires=1658488184&Signature=ih2B5YOwqeOHvjuQEre6SF7-ErW-x6J8yy6yRQIZ4l5qdDCE5Diar~U1pL8uEpg136xAcy3ozuBq9J5wAjpnWuAHnnhyqG7LhV-rBrLQYAh-4mtAU~Y997P37PEBHdzBg~sVW1nVRfdzvqB0FwloRKHE8ZnHVqgGbwI1PjNZdcV7nie03SjA8yIesX1Fjga0Ug7MMkPhMKEACdHEXXJ0~DdqIAjwAIzvFtSxVZbIEBy5D3fEtlWiyxcRHKjzCRdIpVcu8jJQqNqdCkFLacUaZXSOQRnVkaaGMXKbi3q1GSpccXJyKCwYiuNYcxUjkI8d4P3WCWjx9Hf3JliJUCf8xA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA

Good luck

Mike
