whc2 closed this issue 5 years ago
Hi Hengchao,
When you simulated the reads, it seems from your commands that you did not combine a "mutated" genome with the original one into a single new file, right? If you did not, then you simulated reads from a "homozygous" genome.
In that case, providing "exp_hom" forces findGSE to estimate the genome size as if the genome were "heterozygous" when it is in fact homozygous, so an unexpected error may well be encountered.
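To make the "combine a mutated genome with the original one" step concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `mutate` and `write_diploid_fasta` are hypothetical helper names, only SNPs are introduced (no indels), and the default SNP rate of 0.001 is an arbitrary placeholder, not a value from this thread.

```python
import random


def mutate(seq, snp_rate, rng):
    """Return a copy of seq with random single-base substitutions.

    Only uppercase A/C/G/T are mutated; N and soft-masked (lowercase)
    bases are copied unchanged. Assumption: SNPs only, no indels.
    """
    bases = "ACGT"
    out = []
    for base in seq:
        if base in bases and rng.random() < snp_rate:
            # Pick a different base so every hit is a real substitution.
            out.append(rng.choice([b for b in bases if b != base]))
        else:
            out.append(base)
    return "".join(out)


def write_diploid_fasta(path, name, seq, snp_rate=0.001, seed=1):
    """Write the original sequence and a mutated copy into one FASTA file."""
    rng = random.Random(seed)
    alt = mutate(seq, snp_rate, rng)
    with open(path, "w") as fh:
        fh.write(f">{name}_hap1\n{seq}\n")  # original haplotype
        fh.write(f">{name}_hap2\n{alt}\n")  # mutated haplotype
    return alt
```

Reads simulated from such a two-haplotype file would contain heterozygous kmers, which is the situation where giving exp_hom to findGSE makes sense.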
Regarding the lower estimate (for the simulated genome, and for ERR1025644 when exp_hom is not given to findGSE):
The likely reason is that jellyfish, or whichever software you used to generate the kmer histogram, was not set up to count highly repetitive kmers exactly, especially since you used a very small kmer size of 17.
For example, the last few lines of your histogram file are "... 65534 1 65535 41833"
A lot of information has been truncated into the last line, "65535 41833".
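A quick way to spot this kind of truncation is to compare the last bin of the histogram with the one before it. The Python sketch below is a heuristic only: `read_histo` and `looks_truncated` are hypothetical helper names, the two-column "multiplicity count" layout is assumed from the lines quoted above, and the ratio of 10 is an arbitrary threshold.

```python
def read_histo(lines):
    """Parse 'multiplicity count' pairs, one pair per line."""
    histo = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            histo.append((int(fields[0]), int(fields[1])))
    return histo


def looks_truncated(histo, ratio=10):
    """Heuristic: the last bin holds far more kmers than the bin before it.

    A counter that saturates piles every kmer above the cap into the
    final bin, which shows up as a sudden spike at the end of the file.
    """
    if len(histo) < 2:
        return False
    (_, prev_n), (_, last_n) = histo[-2], histo[-1]
    return last_n > ratio * max(prev_n, 1)


# The two final lines quoted from the histogram in this thread.
histo = read_histo(["65534 1", "65535 41833"])
print(looks_truncated(histo))  # True
```

If the check fires, the histogram should be regenerated with a higher upper limit before it is passed to findGSE.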
For the human genome, I used commands like the ones below (larger -m and -h), in case you want to repeat this:
sample=ERR1025644
sizek=21
zcat ${sample}_1.fastq.gz ${sample}_2.fastq.gz | jellyfish count /dev/fd/0 -C -o ${sample}${sizek}mer -m ${sizek} -t 1 -s 5G
jellyfish histo -h 3000000 -o ${sample}${sizek}mer.histo ${sample}${sizek}mer
Below is my histogram from the above commands; provided to findGSE, it gives an estimate of 3.010 Gb:
Best, Hequan
Thanks a lot. I tested your kmer distribution table and used heterozygous mode, and finally got an accurate estimate.
I will increase the upper limit for high-frequency kmers and test again on my simulated data.
Best, Hengchao
Hello, developers.
I would like to use your tool, but I get an error. I simulated a 45X human genome with pirs:
and then counted the kmer frequencies. Using the command line to estimate the genome size,
findGSE gives:
If I don't set exp_hom, I get a very small estimated genome size of 2,368,967,846 bp. Could you please help me understand why the tool estimates a much smaller genome size than the well-known 3 Gb?
I also tested the ERR1025644 data used in your paper, and found the following if I don't set exp_hom:
If I set exp_hom = 36:
I find that exp_hom has a big influence on the final genome size. Please give me some clues as to why the error occurs, and help me resolve the estimation problem with my simulated data.
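As a rough illustration of why a histogram with a capped upper bin tends to give a smaller genome size, the sketch below uses the naive estimate "total kmers divided by kmer depth". This is not findGSE's fitting procedure, only a back-of-the-envelope model; `naive_genome_size` and `cap_histo` are hypothetical helper names and the toy numbers are made up for the example.

```python
def naive_genome_size(histo, depth):
    """Naive estimate: total counted kmers divided by the kmer depth."""
    total_kmers = sum(mult * n for mult, n in histo)
    return total_kmers / depth


def cap_histo(histo, cap):
    """Mimic a saturating counter: multiplicities above cap land in bin cap."""
    capped = {}
    for mult, n in histo:
        key = min(mult, cap)
        capped[key] = capped.get(key, 0) + n
    return sorted(capped.items())


# Toy data: 1000 unique kmers at depth 10, plus one repeat kmer seen 500 times.
full = [(10, 1000), (500, 1)]
print(naive_genome_size(full, depth=10))                  # 1050.0
print(naive_genome_size(cap_histo(full, 100), depth=10))  # 1010.0
```

The repeat kmer contributes 500 counts in the full histogram but only 100 once it is capped, so the repetitive part of the genome is under-counted and the estimate shrinks.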