schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0

kmer_max_cov value setting #30

Open giulialopatriello95 opened 4 years ago

giulialopatriello95 commented 4 years ago

Hi there!

We used Jellyfish and GenomeScope on paired-end NGS data to perform k-mer analysis on a newly sequenced species of unknown genome size and unknown ploidy.

When I ran GenomeScope with default parameters (kmer_max_cov=1000), I got a genome size of 100 Mbp. http://qb.cshl.edu/genomescope/analysis.php?code=TARKVJSNnwnCrXHXRS29

Whereas with kmer_max_cov=1000000, the genome size was predicted to be ~243 Mbp. http://qb.cshl.edu/genomescope/analysis.php?code=9i6yLDWTB7ZO2u73Lof2

I've already seen the discussion in issue #22 (https://github.com/schatzlab/genomescope/issues/22), where setting kmer_max_cov to 1M is recommended for highly repetitive genomes. However, I was wondering if you have any advice on how to choose this parameter.

Best regards, Giulia Lopatriello
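
Editorial note: to illustrate why the cutoff can change the estimate so much, here is a minimal sketch. It assumes a deliberately simplified estimator (total k-mer occurrences up to the cutoff, divided by the haploid k-mer coverage) and is not GenomeScope's actual model fit. The function name, the toy histogram, and the coverage value are all hypothetical, and the low-coverage error tail is ignored.

```python
def naive_genome_size(histo, kcov, max_cov):
    """Rough size estimate: sum(cov * count) for cov <= max_cov, divided by kcov.

    histo   -- iterable of (coverage, number_of_distinct_kmers) pairs
    kcov    -- assumed haploid k-mer coverage (position of the 1-copy peak)
    max_cov -- k-mers seen more often than this are ignored
    """
    total_kmers = sum(cov * count for cov, count in histo if cov <= max_cov)
    return total_kmers / kcov


# Hypothetical toy histogram, NOT the data from this issue.
toy_histo = [
    (20, 4_000_000),   # single-copy peak
    (40, 500_000),     # two-copy sequence
    (5_000, 10_000),   # high-copy repeats, above the default cutoff
]
KCOV = 20

size_default = naive_genome_size(toy_histo, KCOV, max_cov=1_000)
size_full = naive_genome_size(toy_histo, KCOV, max_cov=1_000_000)
print(f"max_cov=1000:    {size_default:,.0f} bp")
print(f"max_cov=1000000: {size_full:,.0f} bp")
```

In this toy case only 10,000 distinct k-mers sit above the default cutoff, but because each occurs about 5,000 times they account for a third of the total k-mer mass, so dropping them shrinks the estimate considerably.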

mschatz commented 4 years ago

Hi Giulia,

I generally recommend setting kmer_max_cov=1000000 so that you can include highly repetitive k-mers from the centromeres and other high-copy repeats that would otherwise be left out. However, in your data it looks like there are some higher-order peaks around 1e4 that might be inflating the genome size. These are often organelle sequences (mitochondrial or chloroplast genomes) and/or contamination of some sort, which can inflate the estimate a bit. Based on these plots, I would estimate the true haploid genome size to be around 220 Mb. Does that seem plausible relative to what you know about related species? If it seems off, we have a new version that we could try on your data.

Hope this helps

Mike
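
Editorial note: one rough way to gauge how much a high-coverage peak contributes is to compute the estimate with and without the k-mers in that coverage band. This is a sketch under stated assumptions, not the procedure used in the reply above: the estimator is the same simplified "k-mer mass divided by haploid coverage" approximation, the band limits are picked by eye, and the function name, histogram, and coverage value are hypothetical. A histogram alone cannot tell you whether a peak is organelle, contamination, or a genuine nuclear repeat.

```python
def size_without_band(histo, kcov, band_lo, band_hi):
    """Return (full_estimate, estimate_excluding_band) in base pairs.

    histo            -- iterable of (coverage, number_of_distinct_kmers) pairs
    kcov             -- assumed haploid k-mer coverage
    band_lo, band_hi -- inclusive coverage range of the suspect peak
    """
    full_mass = sum(cov * count for cov, count in histo)
    band_mass = sum(cov * count for cov, count in histo
                    if band_lo <= cov <= band_hi)
    return full_mass / kcov, (full_mass - band_mass) / kcov


# Hypothetical toy histogram, NOT the data from this issue.
toy_histo = [
    (25, 8_000_000),   # single-copy peak
    (50, 400_000),     # two-copy sequence
    (10_000, 500),     # suspect peak near 1e4
]
KCOV = 25

full, trimmed = size_without_band(toy_histo, KCOV, band_lo=5_000, band_hi=20_000)
print(f"all k-mers:     {full:,.0f} bp")
print(f"excluding band: {trimmed:,.0f} bp")
print(f"band adds:      {full - trimmed:,.0f} bp")
```

If the difference is small relative to the total, the suspect peak matters little; if it is large, checking those k-mers against organelle or contaminant references would be the more reliable next step.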


giulialopatriello95 commented 4 years ago

Thank you for your suggestion! I would like to test the new version on my data. Best regards, Giulia Lopatriello

mschatz commented 4 years ago

Thanks for your interest. The new version is available here: http://genomescope.org/genomescope2.0

The main new feature is support for higher-ploidy genomes. It works very similarly to version 1, although the model fit is usually better, including for diploids. We have a preprint describing the new methods here: https://www.biorxiv.org/content/10.1101/747568v1

The final manuscript is in press at Nature Communications and should appear online within the next week or so.

Good luck! Mike
