gymreklab / GangSTR

A tool for profiling long STRs from short reads
GNU General Public License v2.0

Help with understanding GGL values #111

Closed hchetia closed 3 years ago

hchetia commented 3 years ago

Each entry in the VCF file has a set of GGL (GangSTR genotype likelihood) values. What do these scores mean?

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT WGS

chr1 14070 . cctccctccctc . . . END=14081;RU=cctc;PERIOD=4;REF=3;GRID=1,6;STUTTERUP=0.05;STUTTERDOWN=0.05;STUTTERP=0.9;EXPTHRESH=-1 GT:DP:Q:REPCN:REPCI:RC:ENCLREADS:FLNKREADS:ML:INS:STDERR:QEXP:GGL 0/0:4:0.493238:3,3:3-3,3-3:3,1,0,0:3,3:NULL:19.5377:379.445,151.3:0,0:-1,-1,-1:-15.8183,-13.4988,-12.6286,-9.55703,-9.43554,-8.48511,-13.4849,-12.6063,-9.40548,-12.5884,-15.7571,-13.424,-9.50102,-13.4195,-15.7181,-16.4592,-13.4957,-9.48244,-13.4981,-16.4644,-18.6745

nmmsv commented 3 years ago

Those are the genotype likelihoods for the allele pairs in the grid. Here the grid covers 1 to 6 copies (GRID=1,6). This is the description from the README:

Genotype Likelihood of all pairs of alleles in the search space. Formatted similar to standard GL fields but with allele space defined by the INFO/GRID field

The following function in the code creates the GGL string for each call, and it shows how the values are ordered: https://github.com/gymreklab/GangSTR/blob/786ed5c3c036f44e598d58cd49fafbf543cc11e7/src/vcf_writer.cpp#L191
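As a rough illustration, here is a minimal Python sketch that maps each GGL value of the record above back to its allele pair. It assumes the standard VCF GL ordering (for allele indices j <= k, the pair sits at position k*(k+1)/2 + j), applied to the copy numbers listed in INFO/GRID. That ordering is an assumption based on the README wording, not read from the source; the linked `vcf_writer.cpp` is the authority. The helper name `parse_ggl` is made up for this example.

```python
def parse_ggl(ggl: str, grid: str) -> dict:
    """Map each allele pair (in repeat copies) to its GGL value.

    ASSUMPTION: standard VCF GL ordering over the alleles lo..hi from
    INFO/GRID, i.e. pair (j, k) with j <= k sits at index k*(k+1)/2 + j.
    """
    lo, hi = (int(x) for x in grid.split(","))
    alleles = list(range(lo, hi + 1))
    values = [float(x) for x in ggl.split(",")]

    # A grid of n alleles has n*(n+1)/2 unordered pairs.
    expected = len(alleles) * (len(alleles) + 1) // 2
    if len(values) != expected:
        raise ValueError(f"expected {expected} GGL values, got {len(values)}")

    table = {}
    for k, b in enumerate(alleles):
        for j, a in enumerate(alleles[: k + 1]):
            table[(a, b)] = values[k * (k + 1) // 2 + j]
    return table


# Values copied from the record in this thread.
GRID = "1,6"
GGL = (
    "-15.8183,-13.4988,-12.6286,-9.55703,-9.43554,-8.48511,-13.4849,"
    "-12.6063,-9.40548,-12.5884,-15.7571,-13.424,-9.50102,-13.4195,"
    "-15.7181,-16.4592,-13.4957,-9.48244,-13.4981,-16.4644,-18.6745"
)

table = parse_ggl(GGL, GRID)
best = max(table, key=table.get)
print(len(table), best, table[best])  # 21 (3, 3) -8.48511
```

Under this ordering the largest value lands on the pair (3, 3), which agrees with REPCN=3,3 in the same record. A grid with a single allele would give exactly one pair and therefore one GGL value.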

Hope this helps.

hchetia commented 3 years ago

Hi @nmmsv, thanks for the reply. It would be so helpful if you could help me understand the following:

The reference allele is CCTCTTCTCCTC. The genotype mentioned is 0/0 meaning homozygous, am I right?

Can GGL scores help us prioritize which genotype call is more likely to be true, and is there a recommended threshold for that? For the example shown above, how should I use the GGL scores to judge the call, given that it has multiple GGL values? Most of my genotype calls have a single GGL value, which is why this call in particular confused me.

Thanks, Hasna

nmmsv commented 3 years ago

Yes, that is correct.

You can make some judgements about how confident the call is based on the GGL field. The GGL field lets you plot the genotype likelihood plane over all possible allele pairs in the search range (grid), so you can compare the likelihood at the maximum-likelihood genotype with the likelihoods of the rest. A similar calculation is made when computing the Q score (posterior probability of the call), so Q can be used to achieve a similar goal. If you only have one value in your range, GGL will only have one value as well.
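To make that comparison concrete, here is a hedged Python sketch for the record in this thread. It rests on several assumptions that are not confirmed by the GangSTR source: GGL values are log10 likelihoods in standard GL order (the best GGL value, 8.48511 in magnitude, times ln 10 is approximately the reported ML of 19.5377, which is consistent with a log10 scale), and the prior is flat over ordered allele pairs, so heterozygous pairs are weighted twice. The function names are invented for illustration, and this is not presented as the way GangSTR computes Q.

```python
GRID = "1,6"
GGL = (
    "-15.8183,-13.4988,-12.6286,-9.55703,-9.43554,-8.48511,-13.4849,"
    "-12.6063,-9.40548,-12.5884,-15.7571,-13.424,-9.50102,-13.4195,"
    "-15.7181,-16.4592,-13.4957,-9.48244,-13.4981,-16.4644,-18.6745"
)


def ggl_pairs(ggl, grid):
    """Pair each GGL value with its alleles (ASSUMES standard GL order)."""
    lo, hi = (int(x) for x in grid.split(","))
    values = [float(x) for x in ggl.split(",")]
    pairs = [(a, b) for b in range(lo, hi + 1) for a in range(lo, b + 1)]
    if len(pairs) != len(values):
        raise ValueError("GGL length does not match the grid")
    return dict(zip(pairs, values))


def runner_up_gap(table):
    """Difference (in log10 units) between the best and second-best pair."""
    ranked = sorted(table.values(), reverse=True)
    return ranked[0] - ranked[1]


def flat_prior_posterior(table, call):
    """Posterior of `call` under a flat prior over ORDERED allele pairs.

    ASSUMPTION: values are log10 likelihoods; heterozygous pairs count
    twice because (a, b) and (b, a) are the same genotype.
    """
    top = max(table.values())  # subtract the max for numerical stability
    weights = {
        pair: (1 if pair[0] == pair[1] else 2) * 10 ** (value - top)
        for pair, value in table.items()
    }
    return weights[call] / sum(weights.values())


table = ggl_pairs(GGL, GRID)
best = max(table, key=table.get)
gap = runner_up_gap(table)
posterior = flat_prior_posterior(table, best)
print(best, round(gap, 3), round(posterior, 3))
```

For this record the best pair is (3, 3), the gap to the runner-up is about 0.92 log10 units (a likelihood ratio of roughly 8), and the posterior under these assumptions comes out close to the reported Q of 0.493238. A small gap or a low posterior suggests that neighbouring genotypes explain the reads almost as well as the reported call.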

Hope this helps. Nima

hchetia commented 3 years ago

Thank you @nmmsv.