jts / sga

de novo sequence assembler using string graphs
http://genome.cshlp.org/content/22/3/549
237 stars 82 forks source link

sga preqc findSingleCopyPeak finds the wrong peak #166

Closed melop closed 2 years ago

melop commented 2 years ago

Hello, I am using preqc to estimate genome size from short reads. I have two closely related species, where preqc ran successfully on species A. The other species (B) was estimated to have an unrealistically large genome size. Inspecting the verbose log indicates that preqc findSingleCopyPeak step is always finding the global peak at 2, while visual inspection of the histogram clearly indicates the peak being much higher. The count at 2 is indeed high, but this was true for both species.

The only difference is that species B is very inbred. Is preqc trying to identify the heterozygous peak and when failed it treated the peak at 2 as the homozygous?

Below is from the log output for species B: findSingleCopyPeak -- KmerDistribution: Kmer coverage histogram cov count 2 236516 4 13650 6 4796 8 5382 10 8252 12 13059 14 20493 16 31618 18 43539 20 59245 22 75816 24 91866 26 109875 28 124727 30 139129 32 148889 34 156623 36 158641 38 157506 40 154437 42 147793 44 139691 46 130025 48 117405 50 106349 52 94378 54 83132 56 72591 58 61858 60 53093 62 44952 64 38051 66 32654 68 26545 70 22273 ... 9996 11 9998 5 10000 7

10000 92288 Local min: 1 Global peak: 2

jts commented 2 years ago

Hi,

I noticed that all the values printed are even numbers, which shouldn't be the case. This can indicate that one of the fastq files was duplicated in the input - can you check if this might have happened?

Jared

melop commented 2 years ago

Yes indeed that was the reason, thank you!!

Ray

jts commented 2 years ago

Great, glad that fixed it.