schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0
254 stars 57 forks source link

Incorrect peak identification in TE-rich genome #13

Closed Adamtaranto closed 2 years ago

Adamtaranto commented 6 years ago

I'm working with a very repeat-rich fungal genome, with expected haploid genome size around 70 - 80 Mb. Genoscope seems to be locking onto the second (repeat-associated) peak and is producing a much lower than expected genome size estimate.

Here are the kmer coverage counts from the jellyfish histo file, trimmed at 4 and with the manually identified peak at 8. A second peak occurs between counts ~30-110. Jellyfish_K25.pdf

Genoscope is lumping the primary peak in with the low coverge "error" segment. k25_genomescopea k25_genomescopeb

Histo file: 25mer_out.txt

Would it be possible to force Genoscope to fit a model around a user-specified peak?

mschatz commented 6 years ago

Hi Adam,

Thanks for your note. We looked into this but are a little confused with what we see. We agree with you that there does appear to be a local maximum at coverage 8, but then the next clear peak is at around 68. If the baseline coverage really is 8, that would imply most of the genome is organized into ~8 copy nearly perfect repeats. We have never seen anything like that before. I wonder if it could be something more straightforward -- the genomescope model reports haploid genome size, but I wonder if your estimates are of the diploid genome size. For example in human, genomescope would report the haploid genome size at about 3Gb, but the diploid genome size is 6Gb. Could this explain it? Then I would expect the peak at 8x is an artifact of sequencing error and the trimming you have used.

Good luck!

Mike

On Wed, May 2, 2018 at 6:00 PM Adam Taranto notifications@github.com wrote:

I'm working with a very repeat-rich fungal genome, with expected haploid genome size around 70 - 80 Mb. Genoscope seems to be locking onto the second (repeat-associated) peak and is producing a much lower than expected genome size estimate.

Here are the kmer coverage counts from the jellyfish histo file, trimmed at 4 and with the manually identified peak at 8. A second peak occurs between counts ~30-110. Jellyfish_K25.pdf https://github.com/schatzlab/genomescope/files/1968989/Jellyfish_K25.pdf

Genoscope is lumping the primary peak in with the low coverge "error" segment. [image: k25_genomescopea] https://user-images.githubusercontent.com/2160099/39550585-a01f5646-4e16-11e8-8d9f-f48c5a54b5e6.png [image: k25_genomescopeb] https://user-images.githubusercontent.com/2160099/39550586-a0372bf4-4e16-11e8-8a30-f8528180dab6.png

Histo file: 25mer_out.txt https://github.com/schatzlab/genomescope/files/1968988/25mer_out.txt

Would it be possible to force Genoscope to fit a model around a user-specified peak?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/13, or mute the thread https://github.com/notifications/unsubscribe-auth/AAL983ITpiAHnG8muC3E_yE59h88s1Wsks5tuixlgaJpZM4TwOaN .

Adamtaranto commented 6 years ago

Hi Mike,

I've managed to get something more sensible by lowering the TYPICAL_ERROR to "2" and START_SHIFT to "1". This seems to prevent the peak at 8x from being ignored when looking for the homozygous peak. I confess that this is supposed to be a haploid species, though I was hoping the genoscope estimate would be ~ n/2. The new unique length is ~38.4 Mb which is indeed approx half the expected genome size.

plot

Do you have any suggestions re modifying the nls() call to fit just two negative binomial distributions?(i.e. single copy peak , and repeat peak).

Thanks for writing a really great tool!

mschatz commented 6 years ago

That explains it -- If you are certain the genome is haploid, I would would just double the estimate from the default genomescope run that estimated the genome length to be 31Mbp as the current model assumes the genome is diploid.

Good luck!

Mike

On Fri, May 4, 2018 at 4:56 PM Adam Taranto notifications@github.com wrote:

Hi Mike,

I've managed to get something more sensible by lowering the TYPICAL_ERROR to "2" and START_SHIFT to "1". This seems to prevent the peak at 8x from being ignored when looking for the homozygous peak. I confess that this is supposed to be a haploid species, though I was hoping the genoscope estimate would be ~ n/2. The new unique length is ~38.4 Mb which is indeed approx half the expected genome size.

[image: plot] https://user-images.githubusercontent.com/2160099/39651232-f1a13dec-4f9f-11e8-9897-55d36c79f989.png

Do you have any suggestions re modifying the nls() call to fit just two negative binomial distributions?(i.e. single copy peak , and repeat peak).

Thanks for writing a really great tool!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/13#issuecomment-386732360, or mute the thread https://github.com/notifications/unsubscribe-auth/AAL987CTG29qohKmjlPtJJNJaa1IpkJGks5tvMCZgaJpZM4TwOaN .

Unalibun commented 5 years ago

Hi Mike, can you tell me what does means the purple lines in my genome_scope graph? plot log

mschatz commented 5 years ago

It is supposed to highlight error kmers, meaning the kmers that are not explained by the model fit.

We actually have a new version of genomescope that generally does a better job with the fit. It works pretty much the same as before, although it now has support for higher ploidy values (in addition to diploid samples). Can you give it a try: http://genomescope.org/genomescope2.0/

Good luck

Mike

On Tue, Sep 17, 2019 at 5:51 AM unalibun notifications@github.com wrote:

Hi Mike, can you tell me what does means the purple lines in my genome_scope graph? [image: plot log] https://user-images.githubusercontent.com/51210427/65031388-77f29900-d941-11e9-98a0-b9cccb30e1b1.png

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/13?email_source=notifications&email_token=AABP34Y57LQKNXRFPQXF6FLQKCSDTA5CNFSM4E6A42G2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD637HUI#issuecomment-532149201, or mute the thread https://github.com/notifications/unsubscribe-auth/AABP346ET42LCAZQHUIL2NLQKCSDTANCNFSM4E6A42GQ .

Unalibun commented 5 years ago

Thank you so much, I will work on it!

dlata123 commented 4 years ago

Hello Sir,

I hope you are doing well. Screenshot (154) Screenshot (155) Screenshot (156)

I am trying to use genome scope in order to know the repeat content of the genome. I am having a hard time understanding the output. Can you please please help me out with this. The results I obtained are attached below.

I ran the following command in order to get the results.

  1. download jellyfish:

wget https://github.com/gmarcais/Jellyfish/releases/download/v2.3.0/jellyfish-2.3.0.tar.gz

  1. untar the file

tar -xvzf jellyfish-2.3.0.tar.gz

  1. download the linux one too.

wget https://github.com/gmarcais/Jellyfish/releases/download/v2.3.0/jellyfish-linux

  1. went to the jellyfish directory.

cd jellyfish-2.3.0/

  1. ran count to find kmers:

./bin/jellyfish count - C -m 21 -s 1000000000 -t 10 *.sra.fasta -o reads.jf

6.make histo file and dragged it to the site.

./jellyfish-2.3.0/bin/jellyfish histo -t 10 reads.jf > reads.histo

Your suggestion will really helpful.