schneebergerlab / findGSE

findGSE is a tool for estimating size of (heterozygous diploid or homozygous) genomes by fitting k-mer frequencies iteratively with a skew normal distribution model.

Is findGSE sensitive to PCR repeats and organelle genomes? #6

Closed Caoyu819 closed 3 years ago

Caoyu819 commented 3 years ago

Hi~ Dr. Sun, I have estimated the genome size of a tree, Platycarya strobilacea, using findGSE and another tool, GenomeScope 2.0. When I compared the results of the two tools, I found that findGSE always gives a much larger value than GenomeScope 2.0 (please see the attached file for details). I am puzzled and hope you can help.

Both tools were given the same clean data, obtained after adapter removal. The k-mer depth histogram was computed with KMC using the following two command lines (k=21 as an example):

  1. kmc -k21 -t10 -m20 -ci1 -cs10000 @Pstr.list Pstr tmpDir_ Pstr
  2. kmc_tools transform Pstr histogram Pstr.histo -cx10000

I also noticed that in section 2.3 of your original findGSE paper (Pre-processing of real reads and selecting size of k), you mention filtering out reads that were duplicated by PCR amplification or that are similar to mitochondrial, chloroplast or phiX genomes. By setting an upper bound on k-mer frequency (-cx10000 in kmc), I think I should already have filtered out most of the reads produced by these artifacts.

So I wonder whether findGSE is sensitive to reads duplicated by PCR amplification and to organelle reads. Should I filter duplicated reads from the clean data before k-mer counting? Finally, could you share your detailed procedure for filtering such artifactual reads? I would be very grateful if that is possible.

Thank you for reading the question and I’m looking forward to your reply.

Best wishes for you~

You can also contact me via the email (caoyuchn@yeah.net), thanks again~

Yu Cao
Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Science, Beijing Normal University, China

Attachment: compare_findGSE-genomeScope

HeQSun commented 3 years ago

Hi 曹昱,

Thanks for your interest in using findGSE.

--

I think it is risky to filter k-mers purely based on k-mer coverage, because the genome itself might be highly repetitive. For instance, a centromeric repeat can occur in millions of copies, and so can its k-mers. If such repetitive k-mers are filtered out, the genome size will be underestimated.
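To make the risk concrete, here is a minimal sketch of the effect, assuming the simplest possible estimator (total k-mer occurrences divided by the single-copy k-mer depth). This is not the skew normal fit that findGSE performs; the toy histogram, the depth of 50 and the function name `genome_size` are all made up for illustration.

```python
def genome_size(hist, depth, max_count=None):
    """Naive estimate: total k-mer occurrences / single-copy k-mer depth.

    hist maps a k-mer count to the number of distinct k-mers with that count.
    If max_count is given, k-mers seen more often than that are dropped.
    """
    total = sum(
        count * n_kmers
        for count, n_kmers in hist.items()
        if max_count is None or count <= max_count
    )
    return total / depth


# Hypothetical toy histogram (not real data), single-copy depth = 50.
depth = 50
hist = {
    50: 1_000_000,  # single-copy k-mers
    500: 20_000,    # 10-copy repeats
    50_000: 100,    # 1000-copy repeats (e.g. centromeric)
}

full = genome_size(hist, depth)
capped = genome_size(hist, depth, max_count=10_000)
print(f"no cutoff: {full:.0f}  cutoff at 10000: {capped:.0f}")
```

In this toy case the cutoff removes only 100 distinct k-mers, yet the estimate drops from 1.3 Mb to 1.2 Mb, because each of those k-mers stands for 1000 genomic copies.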

I have some questions regarding your results:

  1. What is the sequencing depth of your sample? What does the k-mer frequency distribution look like?
  2. What parameters did you use for both tools? For GenomeScope, did you set an upper bound on k-mer coverage? If so, what is the cutoff?
  3. It would be good if you could share some output PDFs from both tools, for example for k=21.

Also, all tools are affected by PCR amplification and organelle sequencing, because such unwanted reads influence the shape of the k-mer frequency distribution (and thus the average k-mer coverage) and the total number of genomic k-mers, which are the two key parameters for determining genome size.
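As a rough illustration of how those two parameters come out of a histogram file, the sketch below reads a two-column histogram (k-mer count, number of distinct k-mers), takes the highest peak above a low-count cutoff as the average k-mer coverage, and sums the remaining k-mers. It assumes whitespace-separated columns; `min_count=5` is an arbitrary, hypothetical cutoff for error k-mers, and the toy data are invented. It is a peak-based heuristic only, not the model fitting done by findGSE or GenomeScope, and in a heterozygous genome the highest peak may be the heterozygous one.

```python
import io


def read_histogram(handle):
    """Parse lines of 'count <whitespace> n_distinct_kmers' into tuples."""
    rows = []
    for line in handle:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        rows.append((int(fields[0]), int(fields[1])))
    return rows


def estimate_from_peak(rows, min_count=5):
    """Return (peak depth, total k-mers, naive genome size).

    min_count is a hypothetical cutoff that discards low-count k-mers,
    which are mostly sequencing errors.
    """
    kept = [(count, n) for count, n in rows if count >= min_count]
    peak_depth = max(kept, key=lambda row: row[1])[0]
    total_kmers = sum(count * n for count, n in kept)
    return peak_depth, total_kmers, total_kmers / peak_depth


# Toy histogram (not real data); with a real file, pass open("Pstr.histo").
toy = io.StringIO("1\t900000\n2\t50000\n29\t8000\n30\t10000\n31\t8000\n60\t500\n")
rows = read_histogram(toy)
peak_depth, total_kmers, size = estimate_from_peak(rows)
print(peak_depth, total_kmers, size)
```

Extra k-mers from PCR duplicates or organelle reads enter `total_kmers` directly and can also shift or distort the peak, which is why filtering those reads before counting matters for any estimator built on these two quantities.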

You can test and try the attached filtering pipeline for filtering non-genomic reads.

Best, Hequan