schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0

I am a user of the genomescope software and have encountered some problems and seek your help. I hope to get your reply. #143

Open wjw-yj opened 1 day ago

wjw-yj commented 1 day ago

Attachments:
transformed_log_plot.png: https://github.com/user-attachments/assets/fb237975-232e-4bcd-9518-a817c99c7377
summary.txt: https://github.com/user-attachments/files/17161668/summary.txt

Thank you so much for developing such great software! Recently, I've been using next-generation sequencing data to perform a genome survey with Jellyfish + GenomeScope, using k-mer sizes 21 and 41, but the results differ significantly from my expectations. The species I study are locusts, which have huge genomes, and we have 300 Gb of paired-end second-generation sequencing data (depth of coverage >50x). I suspect the cause may be that the amount of data is beyond the software's processing range, or that there is too much heterozygosity in the second-generation data.

Here are the commands I used and the resulting plots.

    jellyfish count -m 17 -s 20G -t 60 -o kmer41.out -C Unknown_BJ731-02R0001_good_1.fq Unknown_BJ731-02R0001_good_2.fq
    jellyfish histo kmer41.out -o kmer41.histo
    genomescope2.0 -i kmer41.histo -o genomescope -p 2 -k 41 -m 1000000

transformed_linear_plot.png: https://github.com/user-attachments/assets/9255bd6b-c691-44bb-9b02-dfe785d1f855
transformed_log_plot.png: https://github.com/user-attachments/assets/00b0a946-1b97-4a27-8fbf-e0b9ceb10241

I hope to hear back from you to help me with these issues.

mschatz commented 8 hours ago

Thanks for your interest! This is reporting a fairly large haploid genome size of 7.2 Gb, although I see some locusts have genome sizes of more than 8 Gb. I noticed your k-mer distribution is truncated at 10,000, so it is underreporting some of the very high frequency k-mers, which can lead to underreporting the genome size. You will need to rerun the jellyfish histo command to account for the very high frequency k-mers (-m 10000000). If this doesn't change the plot, you may need to rerun the count command too, with -U.

Good luck

Mike
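
One way to check whether a histogram has been cut off at a cap like the one mentioned above is to look at its last row. The following is a minimal sketch, assuming the `.histo` file has two whitespace-separated columns (coverage, then number of distinct k-mers at that coverage). `histo_tail_summary` is a hypothetical helper written for illustration; it is not part of Jellyfish or GenomeScope.

```shell
# Hypothetical helper (not part of Jellyfish or GenomeScope).
# Assumes a two-column histogram: <coverage> <number of distinct k-mers>.
# Prints the highest coverage bin present and the share of all k-mer
# occurrences (coverage * count) that falls in that last bin.
histo_tail_summary() {
    awk '
        NF >= 2 {
            rows++
            last_cov = $1
            last_mass = $1 * $2      # k-mer occurrences in this bin
            total += last_mass       # running total over all bins
        }
        END {
            if (rows == 0 || total == 0) {
                print "empty histogram"
                exit 1
            }
            printf "last_bin=%d share_of_kmers=%.4f\n", last_cov, last_mass / total
        }
    ' "$1"
}
```

Usage would look like `histo_tail_summary kmer41.histo`. If the last bin sits at or just above the cap (10,000 here) and holds a noticeable share of the k-mer occurrences, the histogram is probably truncated. If the tool folds all higher counts into that last bin, the share printed is only a lower bound on what is being missed.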
