schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0
253 stars 56 forks source link

Appropriate value for the Max kmer coverage #22

Open Silverfoxcome opened 5 years ago

Silverfoxcome commented 5 years ago

Hi! I'm working with illumina data PE reads of 151 bp of a repeat-rich maize genome (diploid). The genome size estimate is 2.3 Gb (maize B73v4).

I used the following command in Jellyfish:

> jellyfish count -m 21 -s 100M -t 10 -C <(zcat ZM1_R2.fastq.gz) <(zcat ZM1_R1.fastq.gz)
> jellyfish histo mer_counts.jf -o histo.txt

Then I uploaded this 21 kmer length count histogram to genomescope. I changed the default parameters of Read length to 151 bp, but I left the Max kmer coverage parameter with its default value (1000). The results I got were this:

https://ibb.co/fpzpT0P

https://ibb.co/yXmpC3n

And

Results k = 21

property min -- max
Heterozygosity: 1.22337% -- 1.26647%
Genome Haploid Length: 1,116,998,872 bp -- 1,123,182,422 bp
Genome Repeat Length: 626,415,684 bp -- 629,883,434 bp
Genome Unique Length: 490,583,188 bp -- 493,298,988 bp
Model Fit: 87.1894% -- 98.1013%
Read Error Rate: 0.831279% -- 0.831279%

The genome size estimate is 2.3 Gb (maize B73v4) but in genomescope I get 1,123,182,422 bp.

I was wondering if someone could help me set a more appropriate value for the Max kmer coverage parameter because like the the creators warn:

High copy-number DNA such as chloroplasts can confuse the model. Set a max kmer coverage to avoid this. Default is -1 meaning no filter.

And I suppose I will have this problem of High copy-number DNA because my reads are from maize (with lots of chloroplasts DNA).

Thank you a lot for your kind help in advance :)

mschatz commented 5 years ago

Since maize is such a repetitive genome, yes you will need to raise the max kmer limit to be much higher to capture the full genome size. You will probably also need to rerun the jellyfish histo command as well, since that truncates the distribution at 10,000. Can you try again using a limit of 1,000,000? That should capture everything, and then if you see an overabundance of high frequency kmers from the chloroplasts or other sequences you can filter them out afterwards.

Good luck!

Mike

Silverfoxcome commented 5 years ago

Since maize is such a repetitive genome, yes you will need to raise the max kmer limit to be much higher to capture the full genome size. You will probably also need to rerun the jellyfish histo command as well, since that truncates the distribution at 10,000. Can you try again using a limit of 1,000,000? That should capture everything, and then if you see an overabundance of high frequency kmers from the chloroplasts or other sequences you can filter them out afterwards.

Good luck!

Mike

Thanks for your answer Mike! I'll re-run the Jellyfish histo command like you say :+1: My teacher also commented on this xD Just to have an idea, how much should I raise my max kmer limit? Should I use 1000 or more?

Thanks in advance for the help :)

mschatz commented 5 years ago

You will need to both raise the limit in jellyfish and then raise the max kmer limit in GenomeScope. I would use 1,000,000 for both as a place to start. Good luck

Mike

Silverfoxcome commented 5 years ago

You will need to both raise the limit in jellyfish and then raise the max kmer limit in GenomeScope. I would use 1,000,000 for both as a place to start. Good luck

Mike

Thanks for your help! I'll do it as soon as I can :D !

bmansfeld commented 4 years ago

Hi @mschatz and @Silverfoxcome, Thanks for you discussion about this here. I've been trying to estimate the genome size of Cassava a highly heterozygous plant with a moderately sized genome. Our flow cytometry methods show an estimated genome size of somewhere between 690-750Mb, depending on the genotype. I tried several kmer sizes but all estimate a much lower genome size (350-500Mb). Following what was stated in the FAQ and in this thread, I opted for a higher max histogram value. Setting the kmer limit to 1M as suggested here finally gets the GS at 720Mb. Cassava should not have as repetitive a genome as maize however so I wonder about the relevancy here. Some input would be appreciated. My analysis results are here: http://genomescope.org/analysis.php?code=Sjb2et6gHw5KaP2xg0Zs Mike, could you advise about filtering pastid sequences post estimation? Also a side question once I'm here: I've been using only R1 files from the PE seq. I couldn't really find any info on including both files in jellyfish. Any advice is extremely welcome. Thanks in advance, Ben

mschatz commented 4 years ago

Hi Ben

Glad raising the max kmer limit resolved the issue. In the log-log plot, it looks like you do have some unusual high frequency kmers at about 1e3 and 1e4 that are likely to be organelles, but it doesnt sound like there are inflating the overall genome size too much. Not sure what you have available, but you could try mapping the raw reads to existing references to pull them out. You repetitiveness is higher than some other species, but Ive seen worse- this seems to be similar to pear ( http://qb.cshl.edu/genomescope/analysis.php?code=example3). But given the heterozygosity and the repetitiveness this is going to be a really tough assembly with just paired end data. Your best bet is likely Platanus ( https://www.nature.com/articles/s41467-019-09575-2) but if at all possible, Id recommend long reads - lately we have had really good luck with PacBio HiFi reads for genomes like this

Good luck!

Mike

On Tue, Nov 5, 2019 at 3:18 PM Ben Mansfeld notifications@github.com wrote:

Hi @mschatz https://github.com/mschatz and @Silverfoxcome https://github.com/Silverfoxcome, Thanks for you discussion about this here. I've been trying to estimate the genome size of Cassava a highly heterozygous plant with a moderately sized genome. Our flow cytometry methods show an estimated genome size of somewhere between 690-750Mb, depending on the genotype. I tried several kmer sizes but all estimate a much lower genome size (350-500Mb). Following what was stated in the FAQ and in this thread, I opted for a higher max histogram value. Setting the kmer limit to 1M as suggested here finally gets the GS at 720Mb. Cassava should not have as repetitive a genome as maize however so I wonder about the relevancy here. Some input would be appreciated. My analysis results are here: http://genomescope.org/analysis.php?code=Sjb2et6gHw5KaP2xg0Zs Mike, could you advise about filtering pastid sequences post estimation? Also a side question once I'm here: I've been using only R1 files from the PE seq. I couldn't really find any info on including both files in jellyfish. Any advice is extremely welcome. Thanks in advance, Ben

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/22?email_source=notifications&email_token=AABP34YVVGGRNB5LI3XGPXLQSHIHZA5CNFSM4H7H2ZAKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDEFWJY#issuecomment-550001447, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP343ZEHARHE2CP4EVQL3QSHIHZANCNFSM4H7H2ZAA .

bmansfeld commented 4 years ago

Hey Mike, Thanks for answering so quickly, I really appreciate it! In the Pear example it looks like while 1e6 kmers were included in the histo, you still had the analysis cutoff at 1e3. How would that affect the results? Yeah we are assembling the genome with PacBio etc. but I wanted to get a heterozygosity and genome size estimate using our PE data. My second question was more about counting the kmers. Does it matter whether or not we include both of the paired reads? Or is it fine to use just one of the pair? and if I should use both do I just cat them together? Thanks again, Ben

mschatz commented 4 years ago

In retrospect we are probably underestimating the genome size for Pear by limiting ourselves to kmers occurring below 1000 times.

The main reason for including both of the paired reads is to increase the coverage so you get better defined peaks. But you already have nicely resolved peaks so it should not be necessary to do both

Good luck

Mike

On Wed, Nov 6, 2019 at 12:31 PM Ben Mansfeld notifications@github.com wrote:

Hey Mike, Thanks for answering so quickly, I really appreciate it! In the Pear example it looks like while 1e6 kmers were included in the histo, you still had the analysis cutoff at 1e3. How would that affect the results? Yeah we are assembling the genome with PacBio etc. but I wanted to get a heterozygosity and genome size estimate using our PE data. My second question was more about counting the kmers. Does it matter whether or not we include both of the paired reads? Or is it fine to use just one of the pair? and if I should use both do I just cat them together? Thanks again, Ben

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/22?email_source=notifications&email_token=AABP345HB4P4I3E7TFSXFQDQSL5QXA5CNFSM4H7H2ZAKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDHLGBQ#issuecomment-550417158, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP34YR3BAEVHKVLUK7PGLQSL5QXANCNFSM4H7H2ZAA .

bmansfeld commented 4 years ago

Thanks for clearing that up mike! Appreciate all the help. Ben