schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0
253 stars 56 forks source link

Failed to converge #19

Closed jdmontenegro closed 2 years ago

jdmontenegro commented 5 years ago

Hello everyone, This tool looks quite useful since I was about to start writing some R code with the same purpose. To get familiar with the program I simulated illumina data from a small plant diploid genome (genome size 13Mbp) and produced histograms for 17, 19, 21, 23 and 25 mers with kmc3 and jellyfish. The read simulation uses Illumina error profile to generate the reads and produces around 40X coverage of the haploid genome. All kmer sizes histograms fail to converge even though I can clearly see two peaks in my data (see data below). plot plot log

The command line being used is the following:

for kmer in {17,19,21,23,25}
do
  Rscript genoscope.R ${kmer}.hg ${kmer} 100 Genscope_${kmer}
done

I would like to test this tool in real data, but I am afraid that I won be using it if it cannot handle simple simulated data. Any help would be more than welcome.

BTW, KMC3 histogram requires replacement of "\t" for spaces in order to be compatible with current version of Genoscope.

Kind regards,

Juan D. Montenegro

mschatz commented 5 years ago

Hi Juan,

Could you send me your kmer histogram file? There might be other formatting issues at play since we typically use Jellyfish instead of KMC3

Cheers

Mike

jdmontenegro commented 5 years ago

Hi Mike,

Thanks for your quick reply. Please find attached the histogram for k=21. It was obtained with jellyfish as suggested for Genoscope. As I mentioned before I tried both KMC3 and Jellyfish and both failed to converge. sim.k21.hg.txt Regards, Juan D.

mschatz commented 5 years ago

Thanks for sending. Rhyker, could you try this with the new model using the yx or yx^2 models? I think the high rate of heterozygosity is confusing the GenomeScope 1.0 model

Thanks!

Mike

On Tue, Mar 19, 2019 at 12:24 AM jdmontenegro notifications@github.com wrote:

Hi Mike,

Thanks for your quick reply. Please find attached the histogram for k=21. It was obtained with jellyfish as suggested for Genoscope. As I mentioned before I tried both KMC3 and Jellyfish and both failed to converge. sim.k21.hg.txt https://github.com/schatzlab/genomescope/files/2981515/sim.k21.hg.txt Regards, Juan D.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/19#issuecomment-474193240, or mute the thread https://github.com/notifications/unsubscribe-auth/AAL989MkrKWqlGdM5QH6boNZd5NtEKqGks5vYGZigaJpZM4b6sm4 .

tbenavi1 commented 5 years ago

Juan, I have attached the results for your histogram file. The new version of GenomeScope should be able to easily handle your data. This version should be available very soon for you to try. sim.k21.hg_model.txt sim k21 hg_plot log sim k21 hg_plot sim.k21.hg_progress.txt sim.k21.hg_summary.txt

jdmontenegro commented 5 years ago

Hi guys, Thank you for your quick reply. I just have a question that may be silly. Is the model predicting the haploid genome size to be 6Mbp? Because I am expecting a 600 - 800 Mbp genome, based on flow cytometry and close relativesassemblies. Could you help me understand this better? On another subject, is the model predicting between 11-13% of the genome to be heterozygous? and 0.41% to be repetitive? Thank you again for your help. I look forward to hearing back from you.

Kind regards,

Juan Daniel

mschatz commented 5 years ago

In your first email you said this was a simulated data set from "a small plant diploid genome (genome size 13Mbp)". Does 13Mbp represent one or both haplotypes? For example, the haploid genome size of human is 3Gbp, but the total diploid genome size is 6Gb). The size that is reported here (6.5Mbp) is the haploid genome size, and it infers a heterozygosity rate of 13.1% and a very low rate of repetitive sequences (0.41%).

If you send the histogram for the real dataset (~600Mbp) we can help look at that too.

Cheers

Mike

On Fri, Mar 22, 2019 at 7:04 PM jdmontenegro notifications@github.com wrote:

Hi guys, Thank you for your quick reply. I just have a question that may be silly. Is the model predicting the haploid genome size to be 6Mbp? Because I am expecting a 600 - 800 Mbp genome, based on flow cytometry and close relativesassemblies. Could you help me understand this better? On another subject, is the model predicting between 11-13% of the genome to be heterozygous? and 0.41% to be repetitive? Thank you again for your help. I look forward to hearing back from you.

Kind regards,

Juan Daniel

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/19#issuecomment-475810313, or mute the thread https://github.com/notifications/unsubscribe-auth/AAL983c_ZwVfuKDYZFJrgqgOdFoQlWI3ks5vZWF3gaJpZM4b6sm4 .

jdmontenegro commented 5 years ago

Dear Michael,

Apologies for the confusion. You are right, the test data was simulated from a 13Mbp reference genome. I was expecting a homozygous genome since I did not simulate heterozygosity. I will double check the reference genome to see if it was in fact a phased assembly which would explain the results.

I look forward to using the new version on the actual 600Mbp genome I am working on.

Kind regards,

Juan D. Montenegro

El dom., 24 mar. 2019 a las 19:22, Michael Schatz (notifications@github.com) escribió:

In your first email you said this was a simulated data set from "a small plant diploid genome (genome size 13Mbp)". Does 13Mbp represent one or both haplotypes? For example, the haploid genome size of human is 3Gbp, but the total diploid genome size is 6Gb). The size that is reported here (6.5Mbp) is the haploid genome size, and it infers a heterozygosity rate of 13.1% and a very low rate of repetitive sequences (0.41%).

If you send the histogram for the real dataset (~600Mbp) we can help look at that too.

Cheers

Mike

On Fri, Mar 22, 2019 at 7:04 PM jdmontenegro notifications@github.com wrote:

Hi guys, Thank you for your quick reply. I just have a question that may be silly. Is the model predicting the haploid genome size to be 6Mbp? Because I am expecting a 600 - 800 Mbp genome, based on flow cytometry and close relativesassemblies. Could you help me understand this better? On another subject, is the model predicting between 11-13% of the genome to be heterozygous? and 0.41% to be repetitive? Thank you again for your help. I look forward to hearing back from you.

Kind regards,

Juan Daniel

— You are receiving this because you commented. Reply to this email directly, view it on GitHub < https://github.com/schatzlab/genomescope/issues/19#issuecomment-475810313 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AAL983c_ZwVfuKDYZFJrgqgOdFoQlWI3ks5vZWF3gaJpZM4b6sm4

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/19#issuecomment-476017299, or mute the thread https://github.com/notifications/unsubscribe-auth/AI8luluGJ6-HyAkJa8semi_pg-HP_VLrks5vaBbSgaJpZM4b6sm4 .

mschatz commented 5 years ago

If the data are supposed to be haploid, we can also use the ploidy=1 model which will look for a single main peak for the unique kmers at ~45x, and then the higher peak at ~90x will represent 2 copy repeats.

Good luck!

Mike

On Sun, Mar 24, 2019 at 11:11 PM jdmontenegro notifications@github.com wrote:

Dear Michael,

Apologies for the confusion. You are right, the test data was simulated from a 13Mbp reference genome. I was expecting a homozygous genome since I did not simulate heterozygosity. I will double check the reference genome to see if it was in fact a phased assembly which would explain the results.

I look forward to using the new version on the actual 600Mbp genome I am working on.

Kind regards,

Juan D. Montenegro

El dom., 24 mar. 2019 a las 19:22, Michael Schatz (< notifications@github.com>) escribió:

In your first email you said this was a simulated data set from "a small plant diploid genome (genome size 13Mbp)". Does 13Mbp represent one or both haplotypes? For example, the haploid genome size of human is 3Gbp, but the total diploid genome size is 6Gb). The size that is reported here (6.5Mbp) is the haploid genome size, and it infers a heterozygosity rate of 13.1% and a very low rate of repetitive sequences (0.41%).

If you send the histogram for the real dataset (~600Mbp) we can help look at that too.

Cheers

Mike

On Fri, Mar 22, 2019 at 7:04 PM jdmontenegro notifications@github.com wrote:

Hi guys, Thank you for your quick reply. I just have a question that may be silly. Is the model predicting the haploid genome size to be 6Mbp? Because I am expecting a 600 - 800 Mbp genome, based on flow cytometry and close relativesassemblies. Could you help me understand this better? On another subject, is the model predicting between 11-13% of the genome to be heterozygous? and 0.41% to be repetitive? Thank you again for your help. I look forward to hearing back from you.

Kind regards,

Juan Daniel

— You are receiving this because you commented. Reply to this email directly, view it on GitHub <

https://github.com/schatzlab/genomescope/issues/19#issuecomment-475810313

, or mute the thread <

https://github.com/notifications/unsubscribe-auth/AAL983c_ZwVfuKDYZFJrgqgOdFoQlWI3ks5vZWF3gaJpZM4b6sm4

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub < https://github.com/schatzlab/genomescope/issues/19#issuecomment-476017299 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AI8luluGJ6-HyAkJa8semi_pg-HP_VLrks5vaBbSgaJpZM4b6sm4

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/19#issuecomment-476040282, or mute the thread https://github.com/notifications/unsubscribe-auth/AAL98wdtL1QjgE9D2fxjT1k2PP0UiRc4ks5vaD5WgaJpZM4b6sm4 .