Open Silverfoxcome opened 5 years ago
Since maize is such a repetitive genome, yes you will need to raise the max kmer limit to be much higher to capture the full genome size. You will probably also need to rerun the jellyfish histo command as well, since that truncates the distribution at 10,000. Can you try again using a limit of 1,000,000? That should capture everything, and then if you see an overabundance of high frequency kmers from the chloroplasts or other sequences you can filter them out afterwards.
Good luck!
Mike
Since maize is such a repetitive genome, yes you will need to raise the max kmer limit to be much higher to capture the full genome size. You will probably also need to rerun the jellyfish histo command as well, since that truncates the distribution at 10,000. Can you try again using a limit of 1,000,000? That should capture everything, and then if you see an overabundance of high frequency kmers from the chloroplasts or other sequences you can filter them out afterwards.
Good luck!
Mike
Thanks for your answer Mike! I'll re-run the Jellyfish histo command like you say :+1: My teacher also commented on this xD Just to have an idea, how much should I raise my max kmer limit? Should I use 1000 or more?
Thanks in advance for the help :)
You will need to both raise the limit in jellyfish and then raise the max kmer limit in GenomeScope. I would use 1,000,000 for both as a place to start. Good luck
Mike
You will need to both raise the limit in jellyfish and then raise the max kmer limit in GenomeScope. I would use 1,000,000 for both as a place to start. Good luck
Mike
Thanks for your help! I'll do it as soon as I can :D !
Hi @mschatz and @Silverfoxcome, Thanks for you discussion about this here. I've been trying to estimate the genome size of Cassava a highly heterozygous plant with a moderately sized genome. Our flow cytometry methods show an estimated genome size of somewhere between 690-750Mb, depending on the genotype. I tried several kmer sizes but all estimate a much lower genome size (350-500Mb). Following what was stated in the FAQ and in this thread, I opted for a higher max histogram value. Setting the kmer limit to 1M as suggested here finally gets the GS at 720Mb. Cassava should not have as repetitive a genome as maize however so I wonder about the relevancy here. Some input would be appreciated. My analysis results are here: http://genomescope.org/analysis.php?code=Sjb2et6gHw5KaP2xg0Zs Mike, could you advise about filtering pastid sequences post estimation? Also a side question once I'm here: I've been using only R1 files from the PE seq. I couldn't really find any info on including both files in jellyfish. Any advice is extremely welcome. Thanks in advance, Ben
Hi Ben
Glad raising the max kmer limit resolved the issue. In the log-log plot, it looks like you do have some unusual high frequency kmers at about 1e3 and 1e4 that are likely to be organelles, but it doesnt sound like there are inflating the overall genome size too much. Not sure what you have available, but you could try mapping the raw reads to existing references to pull them out. You repetitiveness is higher than some other species, but Ive seen worse- this seems to be similar to pear ( http://qb.cshl.edu/genomescope/analysis.php?code=example3). But given the heterozygosity and the repetitiveness this is going to be a really tough assembly with just paired end data. Your best bet is likely Platanus ( https://www.nature.com/articles/s41467-019-09575-2) but if at all possible, Id recommend long reads - lately we have had really good luck with PacBio HiFi reads for genomes like this
Good luck!
Mike
On Tue, Nov 5, 2019 at 3:18 PM Ben Mansfeld notifications@github.com wrote:
Hi @mschatz https://github.com/mschatz and @Silverfoxcome https://github.com/Silverfoxcome, Thanks for you discussion about this here. I've been trying to estimate the genome size of Cassava a highly heterozygous plant with a moderately sized genome. Our flow cytometry methods show an estimated genome size of somewhere between 690-750Mb, depending on the genotype. I tried several kmer sizes but all estimate a much lower genome size (350-500Mb). Following what was stated in the FAQ and in this thread, I opted for a higher max histogram value. Setting the kmer limit to 1M as suggested here finally gets the GS at 720Mb. Cassava should not have as repetitive a genome as maize however so I wonder about the relevancy here. Some input would be appreciated. My analysis results are here: http://genomescope.org/analysis.php?code=Sjb2et6gHw5KaP2xg0Zs Mike, could you advise about filtering pastid sequences post estimation? Also a side question once I'm here: I've been using only R1 files from the PE seq. I couldn't really find any info on including both files in jellyfish. Any advice is extremely welcome. Thanks in advance, Ben
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/22?email_source=notifications&email_token=AABP34YVVGGRNB5LI3XGPXLQSHIHZA5CNFSM4H7H2ZAKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDEFWJY#issuecomment-550001447, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP343ZEHARHE2CP4EVQL3QSHIHZANCNFSM4H7H2ZAA .
Hey Mike, Thanks for answering so quickly, I really appreciate it! In the Pear example it looks like while 1e6 kmers were included in the histo, you still had the analysis cutoff at 1e3. How would that affect the results? Yeah we are assembling the genome with PacBio etc. but I wanted to get a heterozygosity and genome size estimate using our PE data. My second question was more about counting the kmers. Does it matter whether or not we include both of the paired reads? Or is it fine to use just one of the pair? and if I should use both do I just cat them together? Thanks again, Ben
In retrospect we are probably underestimating the genome size for Pear by limiting ourselves to kmers occurring below 1000 times.
The main reason for including both of the paired reads is to increase the coverage so you get better defined peaks. But you already have nicely resolved peaks so it should not be necessary to do both
Good luck
Mike
On Wed, Nov 6, 2019 at 12:31 PM Ben Mansfeld notifications@github.com wrote:
Hey Mike, Thanks for answering so quickly, I really appreciate it! In the Pear example it looks like while 1e6 kmers were included in the histo, you still had the analysis cutoff at 1e3. How would that affect the results? Yeah we are assembling the genome with PacBio etc. but I wanted to get a heterozygosity and genome size estimate using our PE data. My second question was more about counting the kmers. Does it matter whether or not we include both of the paired reads? Or is it fine to use just one of the pair? and if I should use both do I just cat them together? Thanks again, Ben
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/22?email_source=notifications&email_token=AABP345HB4P4I3E7TFSXFQDQSL5QXA5CNFSM4H7H2ZAKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDHLGBQ#issuecomment-550417158, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP34YR3BAEVHKVLUK7PGLQSL5QXANCNFSM4H7H2ZAA .
Thanks for clearing that up mike! Appreciate all the help. Ben
Hi! I'm working with illumina data PE reads of 151 bp of a repeat-rich maize genome (diploid). The genome size estimate is 2.3 Gb (maize B73v4).
I used the following command in Jellyfish:
Then I uploaded this 21 kmer length count histogram to genomescope. I changed the default parameters of Read length to 151 bp, but I left the Max kmer coverage parameter with its default value (1000). The results I got were this:
https://ibb.co/fpzpT0P
https://ibb.co/yXmpC3n
And
The genome size estimate is 2.3 Gb (maize B73v4) but in genomescope I get 1,123,182,422 bp.
I was wondering if someone could help me set a more appropriate value for the Max kmer coverage parameter because like the the creators warn:
And I suppose I will have this problem of High copy-number DNA because my reads are from maize (with lots of chloroplasts DNA).
Thank you a lot for your kind help in advance :)