I am a user of the genomescope software and have encountered some problems and seek your help. I hope to get your reply.

wjw-yj commented 1 month ago

Attachments: transformed_log_plot.png (https://github.com/user-attachments/assets/fb237975-232e-4bcd-9518-a817c99c7377), summary.txt (https://github.com/user-attachments/files/17161668/summary.txt)

Thank you so much for developing such great software! Recently, I've been using next-generation sequencing data to perform a genome survey with Jellyfish and GenomeScope, using k-mer sizes 21 and 41, but the results differ significantly from my expectations. The species I study are locusts, which have huge genomes. We have 300 Gb of paired-end second-generation sequencing data (depth of coverage >50x). I suspect the cause may be that the amount of data is beyond the software's processing range, as well as too much heterozygosity in the second-generation data. Here are the commands I used and the resulting plots.

    jellyfish count -m 17 -s 20G -t 60 -o kmer41.out -C Unknown_BJ731-02R0001_good_1.fq Unknown_BJ731-02R0001_good_2.fq
    jellyfish histo kmer41.out -o kmer41.histo
    genomescope2.0 -i kmer41.histo -o genomescope -p 2 -k 41 -m 1000000

Here are transformed_linear_plot.png (https://github.com/user-attachments/assets/9255bd6b-c691-44bb-9b02-dfe785d1f855) and transformed_log_plot.png (https://github.com/user-attachments/assets/00b0a946-1b97-4a27-8fbf-e0b9ceb10241).

I hope to hear back from you to help me with these issues.

mschatz commented 1 month ago

Thanks for your interest! This is reporting a fairly large haploid genome size of 7.2 Gb, although I see some locusts have genome sizes of more than 8 Gb. I noticed your kmer distribution is truncated at 10,000, so it is underreporting some of the very high frequency kmers, which can lead to underestimating the genome size. You will need to rerun the jellyfish histo command with a higher upper limit to account for the very high frequency kmers (-h 10000000). If this doesn't change the plot, you may need to rerun the count command too with -U.

Good luck

Mike
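
To illustrate why a truncated histogram can lower the size estimate, here is a minimal Python sketch. It assumes that counts above the histogram's upper limit are pooled into a single final bin (which is how I understand Jellyfish's histogram output to behave), and it uses a naive estimator (total kmers divided by kmer coverage) rather than GenomeScope's actual model. The function names and the toy histogram are hypothetical.

```python
def truncate_histogram(hist, high):
    """Pool every count above `high` into one bin at high + 1 (assumed behaviour)."""
    out = {}
    for count, n_kmers in hist.items():
        key = count if count <= high else high + 1
        out[key] = out.get(key, 0) + n_kmers
    return out


def naive_genome_size(hist, kmer_coverage):
    """Naive estimate: total kmer occurrences divided by the kmer coverage."""
    total_kmers = sum(count * n_kmers for count, n_kmers in hist.items())
    return total_kmers / kmer_coverage


# Toy histogram {kmer count: number of distinct kmers}; values are invented.
toy_hist = {10: 1_000_000, 20: 50_000, 50_000: 100, 2_000_000: 5}

full = naive_genome_size(toy_hist, kmer_coverage=10)
truncated = naive_genome_size(
    truncate_histogram(toy_hist, high=10_000), kmer_coverage=10
)
print(f"full: {full:,.0f}  truncated: {truncated:,.0f}")
```

In this toy case the few very high frequency kmers carry a large share of the total kmer occurrences, so pooling them at the cutoff shrinks the estimate considerably.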

wjw-yj commented 1 month ago

Thank you for getting back to me so promptly. I changed the commands as per your instructions, but the results are still not satisfactory. Could you please give me more guidance on how to solve this issue, preferably with details on how to modify the commands? Looking forward to your reply.

    jellyfish count -U 64 -m 41 -s 20G -t 60 -o kmer41.out -C Unknown_good_1.fq Unknown_good_2.fq
    jellyfish histo -h 1000000 kmer41.out -o kmer41.histo
    genomescope2.0 -i kmer41.histo -o genomescope -p 2 -k 41 -m 100000000

Attachments: transformed_linear_plot.png (https://github.com/user-attachments/assets/3508f8fc-e9b2-41ab-b0dd-ef0cfece5874), transformed_log_plot.png (https://github.com/user-attachments/assets/fb295c63-410b-4085-8af0-47da9796e223)

mschatz commented 1 month ago

What is your concern? Note that with the extra high frequency kmers, the estimated genome size has increased to 8.8 Gbp. This is the haploid genome size, so the total DNA content will be close to 18 Gbp.

Hope this helps

Mike
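
The arithmetic behind the reply above can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, and the coverage figure assumes that the 300 Gb reported earlier in the thread is the total number of sequenced bases across both read files.

```python
def total_dna_content(haploid_size_bp, ploidy=2):
    """Total DNA content of a cell, given the haploid genome size."""
    return haploid_size_bp * ploidy


def nominal_coverage(total_bases, haploid_size_bp):
    """Sequencing coverage, conventionally stated relative to the haploid size."""
    return total_bases / haploid_size_bp


haploid = 8.8e9          # haploid estimate quoted in the thread
sequenced_bases = 300e9  # assumption: 300 Gb is the total across both read files

print(f"total DNA content: {total_dna_content(haploid) / 1e9:.1f} Gbp")
print(f"nominal coverage: {nominal_coverage(sequenced_bases, haploid):.1f}x")
```

Under these assumptions the diploid content comes to 17.6 Gbp, which matches "close to 18 Gbp".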

wjw-yj commented 1 month ago

My concern is whether the plot shown in transformed_linear_plot is trustworthy: the blue vertical line does not converge under the black curve, and the black and yellow lines do not overlap (is this an error?). Can such a plot be used in a paper? It seems to me that this is the correct graph, and the genome size is more plausible, even if it shows a single peak. ![Uploading b6bdacb3020a4c17b4f13df6b5e175e4.png…]()

mschatz commented 1 month ago

It is not uncommon for the fit of the statistical model to deviate from the observed kmer counts, especially when you have low coverage and a very large genome as you have here. This can skew the estimated heterozygosity rate so that it is probably a bit higher than reported but it should have little impact on the estimated genome size.

I did notice your file names have "good" in them. Does that mean you trimmed the reads already? It sometimes helps to use the raw sequencing reads before trimming. Otherwise, there is not much more that can be done, except for sequencing to a greater depth.

Good luck

Mike

wjw-yj commented 1 month ago

My second-generation sequencing data has a depth of coverage of about 50x, which is considered high coverage, and all of the data I use is raw and has not been trimmed in any way. If, as you say, the low fit of the curves does not affect the genome size, can my resulting plot be inserted into the article?

mschatz commented 1 month ago

Yes, you can include them. Have you tried a shorter kmer size such as 21? Your plots so far show a peak in the kmer coverage at about 10x. Kmer coverage depends on the length of the kmer and the length of the reads: each read of length L contributes L - k + 1 kmers. If you are using k=41 with 100 bp reads, your kmer coverage will be reduced by about 40% relative to the sequencing coverage, so it looks like you have about 20x sequence coverage, not 50x.

Good luck

Mike
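
The relation between kmer coverage and sequencing coverage mentioned above can be sketched as follows. This is a minimal illustration using the standard relation kmer coverage = sequencing coverage × (L - k + 1) / L. It ignores sequencing errors, the function names are hypothetical, and the 100 bp read length is the hypothetical value from the reply, not a confirmed property of the data.

```python
def kmer_coverage(seq_coverage, read_length, k):
    """Expected kmer coverage for a given sequencing coverage (errors ignored)."""
    return seq_coverage * (read_length - k + 1) / read_length


def sequence_coverage(kmer_cov, read_length, k):
    """Invert the relation: sequencing coverage implied by an observed kmer peak."""
    return kmer_cov * read_length / (read_length - k + 1)


# Observed kmer peak of about 10x, k=41, assumed 100 bp reads.
implied = sequence_coverage(10, read_length=100, k=41)
print(f"implied sequencing coverage: {implied:.1f}x")

# What 50x sequencing coverage would look like at k=41 and at k=21.
for k in (41, 21):
    print(f"k={k}: expected kmer coverage {kmer_coverage(50, 100, k):.1f}x")
```

With k=41 and 100 bp reads the factor is 0.60, so a kmer peak near 10x implies sequencing coverage well under 50x. A shorter kmer such as 21 keeps more of the sequencing coverage (factor 0.80 under the same assumption).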

wjw-yj commented 1 month ago

So you still maintain that the problem is insufficient depth of coverage in my data, and that changing the parameters alone won't solve it.

mschatz commented 1 month ago

I'm just trying to help you.

Good luck!

Mike

wjw-yj commented 1 month ago

OK, thank you very much for your patience in replying; this helps me a lot. I will try adjusting some more parameters to make the results look as good as possible, and of course, if the depth of coverage is the reason, I will consider increasing the amount of sequencing data.

schatzlab / genomescope

I am a user of the genomescope software and have encountered some problems and seek your help. I hope to get your reply. #143