samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
650 stars 240 forks source link

Average TsTv Ratio Calculation #1526

Closed UxxUnet closed 3 years ago

UxxUnet commented 3 years ago

I'm using bcftools analysing each chromosome block. However, there're some 0s in transition and transversion within each block. So I wonder how does the bcf_tools deal with those samples with 0 transition or tranversion when it calculates the average TsTv ratio.

image

I calculated myself by simply ignoring those samples but I got a different average TsTv ratio from what bcftools got.

pd3 commented 3 years ago

I don't understand the question. Are you asking if bcftools calculates the overall ts/tv from averaging averages from each block, or something else? The ts/tv calculation is simple enough tstv = tv ? ts/tv : 0

UxxUnet commented 3 years ago

Sorry. I mean those with 0 transversions. And yes, averaging averages from each block. So, when ts =0, ts/tv=0, then how about tv =0 or ts and tv are both 0, what's the value and how does the bcf_tools deal with those values when it calculates the average TsTv ratio?

pd3 commented 3 years ago

We do not average averages, but work with total numbers, if that's what you mean. If the question is simply about how to prevent the division by zero, the code fragment above shows how it's done: if tv=0, tstv is set to 0. It's a bit unfortunate that bioinformatics often decides to go with fractions instead of proportions, unbound values could be easily avoided.

dianacornejo commented 3 years ago

@pd3 some other questions related to titv ratio calculation. I created a MWE to make everything more clear. In the first case there are 14 variants, with 10 transitions and 4 transversions in 2 samples. The global titv is 2.5 and the per sample titv is 0 given that both samples are RefHom. test1.vcf

# TSTV, transitions/transversions:
# TSTV  [2]id   [3]ts   [4]tv   [5]ts/tv    [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
TSTV    0   10  4   2.50    10  4   2.50
# PSC, Per-sample counts. Note that the ref/het/hom counts include only SNPs, for indels see PSI. The rest include both SNPs and indels.
# PSC   [2]id   [3]sample   [4]nRefHom  [5]nNonRefHom   [6]nHets    [7]nTransitions [8]nTransversions   [9]nIndels  [10]average depth   [11]nSingletons [12]nHapRef [13]nHapAlt [14]nMissing
PSC 0   smpl1   14  0   0   0   0   0   0.0 0   0   0   0
PSC 0   smpl2   14  0   0   0   0   0   0.0 0   0   0   0

The second example has the exact same number of variants (14) but in this case sample 1 and sample 2 are Het for the transitions and each of them is het for different transversions. Te global titv is 2.5 (same as above) but the per sample titv is 10/2=5 test.vcf

# TSTV, transitions/transversions:
# TSTV  [2]id   [3]ts   [4]tv   [5]ts/tv    [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
TSTV    0   10  4   2.50    10  4   2.50
# PSC, Per-sample counts. Note that the ref/het/hom counts include only SNPs, for indels see PSI. The rest include both SNPs and indels.
# PSC   [2]id   [3]sample   [4]nRefHom  [5]nNonRefHom   [6]nHets    [7]nTransitions [8]nTransversions   [9]nIndels  [10]average depth   [11]nSingletons [12]nHapRef [13]nHapAlt [14]nMissing
PSC 0   smpl1   2   0   12  10  2   0   0.0 2   0   0   0
PSC 0   smpl2   2   0   12  10  2   0   0.0 2   0   0   0

The confusion here, is that the global titv does not take into consideration whether the samples have or not the nonREF allele so even if all of the samples in the vcf are 0/0 those variants get counted in the titv ratio. In the PSC each ti or tv is only counted if present in the sample. The other part of this issue would be when there are multiallelic variants and we use left-normalization first and then calculate the titv (the values would be much lower) since all of these variants are counted multiple times. Additionally, a problem I'm facing when analyzing data is that after filtering using DP, GQ, allelic imbalance the titv ratios do not seem to improve when using bcftools. Since I'm working with UKBB the data is divided by blocks per chromosome and therefore the QC is done in each one of them. A question would also be, how does this impact overall titv ratio?

For comparison and completeness I ran the same MWE using snpEff and these are the results test1.vcf

TS/TV stats:
Sample ,smpl1,smpl2,Total
Transitions ,0,0,0
Transversions ,0,0,0
Ts/Tv ,NaN,NaN,NaN

test.vcf

TS/TV stats:
Sample ,smpl1,smpl2,Total
Transitions ,10,10,20
Transversions ,2,2,4
Ts/Tv ,5.000,5.000,5.000

This to me seems a more accurate calculation. Any thoughts on this matter are welcome! Thanks test.vcf.gz test1.vcf.gz

pd3 commented 3 years ago

The program calculates ts/tv for each position globally based on the REF and ALT columns, regardless of samples, and this is the TSTV data block in the output. Then it calculates ts/tv for each sample individually, these are the PSC counts.

If you need anything else, you should apply the -i/-e filtering expressions to work only with sites you are interested in. For the region blocks, you should run stats on each individually, sum together the counts, only then calculate ts/tv.

Regarding MNVs (I am assuming that's what you were referring to when speaking about multiallelics?) you should use bcftools norm --atomize to break it into constituents as stats only takes the first base.

dianacornejo commented 3 years ago

@pd3 thanks for your answer. I was wondering if you have any idea which is the right way to calculate titv ratio, that is if on the variant level without taking into account genotypes or if only the variants for which there's a sample with the alternative allele should be considered in the calculation? Do you know any reference that address this issue. I'm a little bit confused on which is the right way of doing it. Thank you

pd3 commented 3 years ago

It depends what you want,bcftools is already trying to do the right things and outputs per-sample ts/tv as well as overall ts/tv. If the callset contains many mapping and alignment artefacts, the ts/tv will in both cases be smaller, towards the value 0.5 of completely random and nonbiological substitutions.