schneebergerlab / findGSE

findGSE is a tool for estimating size of (heterozygous diploid or homozygous) genomes by fitting k-mer frequencies iteratively with a skew normal distribution model.
31 stars 10 forks source link

offset_count error #1

Closed whc2 closed 5 years ago

whc2 commented 5 years ago

Hello, developers.

I would like to use your tool, but there is an error. I simulated 45X human genome by pirs:

/software/bin/pirs simulate  -l 150 -m 350 -x 45 -s /software/install/pirs_simple/Profiles/humNew.PE150.matrix -i ../GCF_000001405.38_GRCh38.p12_genomic.fna.pure -o ./Homo_sapiens_sim_ill_45X

and then counted kmer frequency. Using the command line to estimate genome size

findGSE(histo="Homo_sapiens_ill_sim_40x.freq.stat", sizek=17, exp_hom = 60, outdir="findGSE_k17_het60")

findGSE gives:

findGSE initialized...
   Info: histo file provided as Homo_sapiens_ill_sim_40x.freq.stat.4
   Info: size k set as 17
   Info: output folder set as findGSE_k17_4het60
   Info: expected coverage of homozygous k-mers set as 60
   Info: het observed set as true ==> heterozygous fitting asked.

Iterative fitting process for sample Homo_sapiens_ill_sim_40x.freq.stat.4 started...
    Size 17 fitting for het k-mers
    Warning: only one peak in k-mer freq with given -exp_hom,
                  determined as hom-peak!
            -exp_hom needs to be increased
                  if you think it is a het peak.
    Info: het_peak_pos for het fitting:  17
    Info: hom_peak_pos for hom fitting:  34
    Info: het_xfit_left  for het fitting:  7
    Info: het_xfit_right for het fitting:  25
    Size 17 at itr 1
    Info: min_valid_pos:  17
    Info: signal error border:  14
    Info: hom_xfit_left  for hom fitting at itr  1 :  24
    Info: hom_xfit_right for hom fitting at itr  1 :  44
       Fitting has to be repeated: sizek 17 at itr 1
    Info: hom_xfit_left  for hom fitting at itr  1 :  24
    Info: hom_xfit_right for hom fitting at itr  1 :  44
    Size 17 at itr 2
    Info: hom_xfit_left  for hom fitting at itr  2 :  38
    Info: hom_xfit_right for hom fitting at itr  2 :  137
    Size 17 at itr 3
    Info: hom_xfit_left  for hom fitting at itr  3 :  75
    Info: hom_xfit_right for hom fitting at itr  3 :  209
    Size 17 at itr 4
    Info: hom_xfit_left  for hom fitting at itr  4 :  123
    Info: hom_xfit_right for hom fitting at itr  4 :  247
    Size 17 at itr 5
    Info: hom_xfit_left  for hom fitting at itr  5 :  48
    Info: hom_xfit_right for hom fitting at itr  5 :  181
    Size 17 at itr 6
    Info: hom_xfit_left  for hom fitting at itr  6 :  103
    Info: hom_xfit_right for hom fitting at itr  6 :  209
    Warning: data does not follow assumed distribution anymore at itr 6, fitting stopped.
Iterative fitting done.

    Info: het_fitting_delta optimized as 8.28687852269064

Error in if (offset_count[off_i] > 0) { :
  missing value where TRUE/FALSE needed

If I don't set exp_hom, a very small estimated genome size 2,368,967,846 bp is get. Could you please help me to get the reasons that the tool estimate a much smaller genome size, compared to a well-known 3G.

I also tested ERR1025644 data used in your paper, and find that if I don't set exp_hom:

findGSE(histo = "human_find.freq.stat.his", sizek = 17, outdir = "findGSE_k17")
# then produced a genome size that is much smaller than 3G
Genome size estimate for human_find.freq.stat.his: 2,454,814,208 bp.

If I set exp_hom = 36:

findGSE(histo = "human_find.freq.stat.his", sizek = 17, exp_hom = 36, outdir = "findGSE_k17_het")
# then the genome size is close to 3G
Genome size estimate for human_find.freq.stat.his: 2,847,027,979 bp.

I find that exp_hom have a big influence on the final genome size. Please give me some clues why the error occurs and help me resolve my simulated data estimation problems.

[Uploading Homo_sapiens_ill_sim_45x.freq.stat.his.txt…]()

HeQSun commented 5 years ago

Hi Hengchao,

when you simulated reads, it seems from your cmds that you did not combine a "mutated" genome with the original one as one new file, right? If not, you should have simulated reads from a "homozygous" genome.

In this case, if you provided "exp_hom", meaning that you forced findGSE to estimate genome size for a "heterozygous" genome which was indeed homozygous, then unexpected error might have been encountered.


Regarding the lower estimation (for the simulated genome and ERR1025644 when not using exp_hom for findGSE).

The reason should be that jellyfish or the software you used for kmer histogram generation was not set up properly to get highly repetitive kmers counted exactly, especially you used a very small kmer size of 17.

For example, the last few lines of your histogram file are "... 65534 1 65535 41833"

There was much info getting truncated with the last line "65535 41833".

For human genome, I used cmds like below (larger -m and -h) in case you want to repeat

_sample=ERR1025644 sizek=21 zcat ${sample}_1.fastq.gz ${sample}2.fastq.gz | jellyfish count /dev/fd/0 -C -o ${sample}${sizek}mer -m ${sizek} -t 1 -s 5G; jellyfish histo -h 3000000 -o ${sample}${sizek}mer.histo ${sample}${sizek}mer

Below is my histogram from the above cmd and provided to findGSE, getting 3.010 Gb estimate:

ERR1025644_21mer_example.txt

ERR1025644_21mer_findGSE.pdf

Best, Hequan

whc2 commented 5 years ago

Thanks a lot. I tested your kmer distribution table and use heterozygous mode to finally get an accurate estimation.

I will increase high frequency to test on my simulated data.

Best, Hengchao