schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0
241 stars 56 forks source link

small bump explanation and -l parameter #103

Open amv33576 opened 1 year ago

amv33576 commented 1 year ago

Hello,

I have HiFi data for a diploid insect, with an estimated genome size of ~425 Mb according to flow cytometry data. I used Kraken to filter reads and make sure there were no adapters in the reads. Then, I used kmc for the counts and get the frequency data:

kmc -k21 -t20 -m64 out.fastq 21mer ./kmc_tmp/ for the counts kmc_tools transform 21mer histogram 21mer.histo -cx255

I used cx 255, because I noticed after that number, everything was zero.

Finally, I run in the conda environment: genomescope2 -i 21mer.histo -o gs.kraken.def -k 21 --testing -m 1000. That resulted in panel A below. The predicted genome size seems close to what I expected, but genomescope calculated kcov=34.3 leaves me with three peaks.

genoscope2_output

If I force the program to ignore the first peak and set -l 50, then I get ~ half of the genome size, as seen in Panel B. Also, the yellow line predicting the "unique sequence" are striking different between both panels as well.

So, what could explain these distributions, especially in panel A, where three peaks are observed? Is it peak 1-2 (34.3 and 68.6), one very broad peak? If yes, what could explain, or are 1-2 the Heterozygous/Homozygous peaks, then what is peak 3 representing?

I appreciate leads to understand these plots and get a better idea of the nature of the data.

mschatz commented 11 months ago

It is very, very surprising to me that there are no kmers with coverage above 255. With a mode coverage of about 50x, this means that there are no repeats that occur more than 5 times in the genome which is shocking for an insect. Can you try running kmc on the raw HiFI data? Perhaps something went off with the kraken filtering

Good luck

Mike

On Tue, Jul 25, 2023 at 9:21 PM amv33576 @.***> wrote:

Hello,

I have HiFi data for a diploid insect, with an estimated genome size of ~425 Mb according to flow cytometry data. I used Kraken to filter reads and make sure there were no adapters in the reads. Then, I used kmc for the counts and get the frequency data:

kmc -k21 -t20 -m64 out.fastq 21mer ./kmc_tmp/ for the counts kmc_tools transform 21mer histogram 21mer.histo -cx255

I used cx 255, because I noticed after that number, everything was zero.

Finally, I run in the conda environment: genomescope2 -i 21mer.histo -o gs.kraken.def -k 21 --testing -m 1000. That resulted in panel A below. The predicted genome size seems close to what I expected, but genomescope calculated kcov=34.3 leaves me with three peaks. [image: genoscope2_output] https://user-images.githubusercontent.com/7957695/256071045-1d738edb-a75e-44dd-aaf8-aed5f6b1f49c.png

If I force the program to ignore the first peak and set -l 50, then I get ~ half of the genome size, as seen in Panel B. Also, the yellow line predicting the "unique sequence" are striking different between both panels as well.

So, what could explain these distributions, especially in panel A, where three peaks are observed? Is it peak 1-2 (34.3 and 68.6), one very broad peak? If yes, what could explain, or are 1-2 the Heterozygous/Homozygous peaks, then what is peak 3 representing?

I appreciate leads to understand these plots and get a better idea of the nature of the data.

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/103, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP34YFVLE37JKIB2XE3S3XSBWI5ANCNFSM6AAAAAA2X3E5VY . You are receiving this because you are subscribed to this thread.Message ID: @.***>

amv33576 commented 11 months ago

Hi, yes I did, and it is the same result. I found that strange, especially because at 255 there is the value of 5988575, and after that is all zeros. The command that I used was:

kmc -k21 -t20 -m64 hifi_reads.fastq Ai.raw.21mer kmc_tmp kmc_tools transform Ai.raw.21mer histogram Ai.raw.21mer.histo -cx1000

Do you have any thoughts or suggestions regarding this behavior? and also what could explain the three peaks?

Thanks

tbenavi1 commented 11 months ago

Hello, by default the maximum value of a counter for KMC is 255. This can be changed with the "-cs" command. So I would run the following commands:

kmc -k21 -t20 -m64 -ci1 -cs10000 hifi_reads.fastq Ai.raw.21mer kmc_tmp
kmc_tools transform Ai.raw.21mer histogram Ai.raw.21mer.histo -cx10000
tbenavi1 commented 11 months ago

As for the weird peaks, it looks like there may be some contamination in your data. But, just to be sure, can you run KMC and genomescope on your raw reads (before filtering and removing adapters). Thanks.

amv33576 commented 11 months ago

Thank you @tbenavi1. I tried the parameters that you suggested, and the plots look the same: Raw data http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=s0ReVnMpsVb7tG5aGvYK

Filter data http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=Own9I3y5QTBuPkx35UWL

Ai.raw.21mercs.histo.txt Ai.fil.21mercs.histo.txt

mschatz commented 11 months ago

We are getting there -- Notice now the histogram goes up to 10,000.

Can you try again with an even larger limit:

kmc -k21 -t20 -m64 -ci1 -cs1000000 hifi_reads.fastq Ai.raw.21mer kmc_tmp kmc_tools transform Ai.raw.21mer histogram Ai.raw.21mer.histo -cx1000000

Thanks!

Mike

On Fri, Aug 4, 2023 at 1:42 PM amv33576 @.***> wrote:

Thank you @tbenavi1 https://github.com/tbenavi1. I tried the parameters that you suggested, and the plots look the same: Raw data

http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=s0ReVnMpsVb7tG5aGvYK

Filter data

http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=Own9I3y5QTBuPkx35UWL

Ai.raw.21mercs.histo.txt https://github.com/schatzlab/genomescope/files/12263249/Ai.raw.21mercs.histo.txt Ai.fil.21mercs.histo.txt https://github.com/schatzlab/genomescope/files/12263250/Ai.fil.21mercs.histo.txt

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/103#issuecomment-1665968425, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP343WEGJJGCDR72AH7ELXTUX6RANCNFSM6AAAAAA2X3E5VY . You are receiving this because you commented.Message ID: @.***>

amv33576 commented 11 months ago

I've noticed that the tail is becoming longer, and there seem to be more zeros appearing after 25000. However, what intrigues me the most are the strange peaks I'm seeing. The raw data doesn't differ much from the filtered data for which I filtered potential contamination from other species or indexes.

Do you have any advice?

Thank you very much for your help! Anyi

Ai.raw.21mercs.histo.txt

mschatz commented 11 months ago

Can you please post the new histograms using a cutoff of 1000000?

Thanks! Mike

On Fri, Aug 4, 2023 at 5:07 PM amv33576 @.***> wrote:

I've noticed that the tail is becoming longer, and there seem to be more zeros appearing after 25000. However, what intrigues me the most are the strange peaks I'm seeing. The raw data doesn't differ much from the filtered data for which I filtered potential contamination from other species or indexes.

Do you have any advice?

Thank you very much for your help! Anyi

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/103#issuecomment-1666181871, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP342I26Q5TUGUIA3MONDXTVQCNANCNFSM6AAAAAA2X3E5VY . You are receiving this because you commented.Message ID: @.***>

amv33576 commented 11 months ago

Yes, here is with the raw data:

http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=P294wASEojZCDeqJfiaf

Thanks,

Anyi

mschatz commented 11 months ago

Sorry for the delay. It looks like the peaks are overdispersed which is confusing the modeling. Did you have to use one of the ultra-low input sequencing kits? Sometimes the extra rounds of PCR lead to biases like this. Fortunately it should still assemble reasonably well. Have you assembled it? What does fastqc report for duplicate reads?

Good luck! Mike

On Wed, Aug 9, 2023 at 8:51 PM amv33576 @.***> wrote:

Yes, here is with the raw data:

http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=P294wASEojZCDeqJfiaf

Thanks,

Anyi

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/103#issuecomment-1672374679, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP3467ASIHASLCPCRJJ73XUQWCLANCNFSM6AAAAAA2X3E5VY . You are receiving this because you commented.Message ID: @.***>