underestimated genome size

roblanf commented 7 years ago

This is more a usage question than an issue. I have a converged run, based on ~30x 100bp PE sequencing. The genome size estimate I get is ~527MB. From related spp and flow cytometry I know the real genome size is closer to ~650MB.

In the plot I notice a little coverage hump right before my cutoff at 1000. But changing the coverage cutoff does little to bring the genome size estimate closer to what I'm 99% sure is right.

Any thoughts on:

Settings I might alter to make the analysis better
Whether I might be limited by coverage here (seems 30x is near the boundary of what you suggest)
If genome size is lower than expected, whether this suggests anything about other estimates being a little off too.

(I can add more data if necessary...)

plot log

mschatz commented 7 years ago

When we see humps of high frequency kmers (like your near 1000x and near 10000x) we often see those kmers are not part of the nuclear genome and instead come from mitochondrial sequences, chloroplast sequences, phiX spike ins, or other technical artifacts. If you want to investigate them you can extract just those kmers using jellyfish, and then align/blast them to different databases to see what they are. Unfortunately every project is a little bit different so will require some investigation.

Cheers

Mike

roblanf commented 7 years ago

Thanks for the tip - particularly extracting the kmers of interest. The other thing that occurred to me is that this is a repeat-rich genome (~40%) so it might be that a cutoff of 1000 ends up excluding a fair number of kmers from real nuclear sequences. I know from mapping that the coverage of the chloroplast genome is ~9000 per sample, so I should probably set the cutoff a little higher. I guess on one level it becomes very difficult (impossible?) to distinguish kmers from e.g. chloroplasts from those that are from common repeats. Final option would be to split the reads before input, and exclude reads that map to chloroplasts / mitochondria from close relatives (or just build rough versions of these de-novo and remove reads that way), then use a much higher cutoff.

Great software either way. Hugely appreciate the effort that has gone in to this.

On 1 April 2017 at 00:40, Michael Schatz notifications@github.com wrote:

Closed #4 https://github.com/schatzlab/genomescope/issues/4.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/4#event-1024101351, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2pE7fSVQ33IlNVew-_TfXKs16qmgD6ks5rrQJcgaJpZM4Mvk_Q .

-- Rob Lanfear Ecology, Evolution, and Genetics, The Australian National University, Canberra

www.robertlanfear.com

mschatz commented 7 years ago

Yes, there is not a universal algorithm for deciding which kmers are from the nuclear genome and which are from the organelles (or other artifacts), at least nothing that I can see. If you know the organelle genomes, then yes, you can map the reads and recompute the histogram excluding any read from the organelle, but obviously that only works if you already know what to exclude.

Good luck

Mike

On Fri, Mar 31, 2017 at 5:16 PM, roblanf notifications@github.com wrote:

Thanks for the tip - particularly extracting the kmers of interest. The other thing that occurred to me is that this is a repeat-rich genome (~40%) so it might be that a cutoff of 1000 ends up excluding a fair number of kmers from real nuclear sequences. I know from mapping that the coverage of the chloroplast genome is ~9000 per sample, so I should probably set the cutoff a little higher. I guess on one level it becomes very difficult (impossible?) to distinguish kmers from e.g. chloroplasts from those that are from common repeats. Final option would be to split the reads before input, and exclude reads that map to chloroplasts / mitochondria from close relatives (or just build rough versions of these de-novo and remove reads that way), then use a much higher cutoff.

Great software either way. Hugely appreciate the effort that has gone in to this.

On 1 April 2017 at 00:40, Michael Schatz notifications@github.com wrote:

Closed #4 https://github.com/schatzlab/genomescope/issues/4.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/4#event-1024101351, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2pE7fSVQ33IlNVew-_ TfXKs16qmgD6ks5rrQJcgaJpZM4Mvk_Q .

-- Rob Lanfear Ecology, Evolution, and Genetics, The Australian National University, Canberra

www.robertlanfear.com

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/4#issuecomment-290832248, or mute the thread https://github.com/notifications/unsubscribe-auth/AAL987U6s1suaTgP20uZ6aykrkMUEqnnks5rrW0rgaJpZM4Mvk_Q .

schatzlab / genomescope

underestimated genome size #4