skip sites if one sample in the pair is missing data?

goeckeritz commented 10 months ago

Hi Petr,

I have a seemingly simple question so I may be missing something rather obvious...

I have a large number of sample comparisons I'm giving to bcftools' gtcheck, and most samples are low coverage. So, I need to tell gtcheck to skip comparing sites if data is missing in either sample in the pair. Right now it's giving me huge discordance values and I think it's due to this issue.

Is there a simple way to have those be skipped?

Thanks so much!

Kindly, Charity

pd3 commented 10 months ago

Hi, it depends what version of bcftools and what options you are using. Assuming you are working with the latest version, missing data should be skipped automatically and not contribute to the discordance score. One only needs to be careful and make sure the program uses GT vs PL as desired - the default is --use PL,GT, meaning the genotype likelihoods (PL) are taken from the query file to match against genotypes (GT) in the -g genotype file. Best is to test on a single site to check if the program is doing what you think it should be doing...

goeckeritz commented 10 months ago

Hi Petr,

Thanks for getting back to me. Yes, I'm using the latest - version 1.18. I don't have PL in my vcfs, but I have GL (which gtcheck told me it doesn't support)... I used freebayes v 1.3.1 to call variants. Based on the output file, it looks like it is comparing GT to GT. I guess I'm confused why that wouldn't have worked though. Here are the first lines in my output file:

This file was produced by bcftools (1.18+htslib-1.18), the command line was: bcftools gtcheck -g Anna_05X002_filtered.vcf.gz Anna_05X001_filtered.vcf.gz and the working directory was: /cluster/home/cgoeckeritz/SNP_calling/vcfs/filtered

INFO Time required to process one record .. 0.000005 seconds INFO sites-compared 1508653 INFO sites-skipped-no-match 8550972 INFO sites-skipped-multiallelic 7186 INFO sites-skipped-monoallelic 0 INFO sites-skipped-no-data 0 INFO sites-skipped-GT-not-diploid 0 INFO sites-skipped-PL-not-diploid 0 INFO sites-used-PL-vs-PL 0 INFO sites-used-PL-vs-GT 0 INFO sites-used-GT-vs-PL 0 INFO sites-used-GT-vs-GT 1508653 DC, discordance:

query sample
genotyped sample
discordance (either an abstract score or number of mismatches, see -e/-u in the man page for details; smaller is better)
negative log of HWE probability at matching sites (rare genotypes matches are more informative, bigger is better)
number of sites compared (bigger is better) DC [2]Query Sample [3]Genotyped Sample [4]Discordance [5]-log P(HWE) [6]Number of sites compared DC /cluster/home/cgoeckeritz/trimmomatic/paired/NGmerged/all_reads/case3/Anna_05X001 /cluster/home/cgoeckeritz/trimmomatic/paired/NGmerged/all_reads/case3/Anna_05X002 3.482769e+06 2.348309e+07 1508653

I did normalize the data with this command beforehand: bcftools norm -m - -Oz --threads 72 all_malus_samples.vcf.gz -o all_malus_samples_normalized.vcf.gz

Though I'm not sure how that would be messing it up. Any ideas what could be going on? Why wouldn't GT v GT work?

pd3 commented 10 months ago

The program seems to be doing something and has compared many sites

INFO sites-used-GT-vs-GT 1508653

I can't say if the values are realistic or not, but they as well could be. Since you are running with the default -e 40, every mismatching site adds 9.903588. In your case per-site average is 3.482769e+06/1508653 = 2.3 which suggests approximately one quarter of sites have a mismatching genotype. As I am typing this I see there are improvements that can be made to the calculation, but overall it seems like the program is probably doing the right thing.

goeckeritz commented 10 months ago

Hi Petr,

I decided to subset the data and compare the GTs manually between two 'genotypes' (They're really the same genotype, the 05X sample is just the same genotype subsetted to ~0.5X coverage) to help me figure out what is going on.

I filtered for sites where either genotype was missing in this sample pair, so ended up comparing 44 sites. All of them, as expected, were the same. (0/0 == 0/0, 0/1 == 0/1, 1/1 == 1/1, etc).

Here is the output file: This file was produced by bcftools (1.18+htslib-1.18), the command line was: bcftools gtcheck ss3_norm.vcf and the working directory was: /cluster/home/cgoeckeritz/SNP_calling/vcfs/filtered

INFO Time required to process one record .. 0.000120 seconds INFO sites-compared 44 INFO sites-skipped-no-match 0 INFO sites-skipped-multiallelic 0 INFO sites-skipped-monoallelic 0 INFO sites-skipped-no-data 0 INFO sites-skipped-GT-not-diploid 0 INFO sites-skipped-PL-not-diploid 0 INFO sites-used-PL-vs-PL 0 INFO sites-used-PL-vs-GT 0 INFO sites-used-GT-vs-PL 0 INFO sites-used-GT-vs-GT 44 DC, discordance:

query sample
genotyped sample
discordance (either an abstract score or number of mismatches, see -e/-u in the man page for details; smaller is better)
negative log of HWE probability at matching sites (rare genotypes matches are more informative, bigger is better)
number of sites compared (bigger is better) DC [2]Query Sample [3]Genotyped Sample [4]Discordance [5]-log P(HWE) [6]Number of sites compared DC Vallee_05X009 Vallee_orig 8.800440e-03 6.034261e+02 44

Is it odd that I'm still getting a positive Discordance, even if it is a lot smaller?

For a sanity check, I did also run gtcheck while the ./. values were left in this subset of data. Here's those results:

This file was produced by bcftools (1.18+htslib-1.18), the command line was: bcftools gtcheck ss2_norm.vcf and the working directory was: /cluster/home/cgoeckeritz/SNP_calling/vcfs/filtered

INFO Time required to process one record .. 0.000186 seconds INFO sites-compared 13552 INFO sites-skipped-no-match 0 INFO sites-skipped-multiallelic 0 INFO sites-skipped-monoallelic 0 INFO sites-skipped-no-data 0 INFO sites-skipped-GT-not-diploid 0 INFO sites-skipped-PL-not-diploid 0 INFO sites-used-PL-vs-PL 0 INFO sites-used-PL-vs-GT 0 INFO sites-used-GT-vs-PL 0 INFO sites-used-GT-vs-GT 13552 DC, discordance:

query sample
genotyped sample
discordance (either an abstract score or number of mismatches, see -e/-u in the man page for details; smaller is better)
negative log of HWE probability at matching sites (rare genotypes matches are more informative, bigger is better)
number of sites compared (bigger is better) DC [2]Query Sample [3]Genotyped Sample [4]Discordance [5]-log P(HWE) [6]Number of sites compared DC Vallee_05X009 Vallee_orig 8.800440e-03 6.034261e+02 44

So, the same results - yay! Looks like it did successfully skip any sites with ./. GTs. Still unsure why I get a seemingly different behavior when I give it a --pairs-file though... in fact, at some point during my regular runs with the full data and ~80 'genotypes', I gave it a pair of the exact two samples in my pairs file -- like Vallee_orig v Vallee_orig. Here's the first bit of results from that run:

This file was produced by bcftools (1.18+htslib-1.18), the command line was: bcftools gtcheck --pairs-file subset2_comparisons.txt subset2_malus_samples_normalized.vcf.gz and the working directory was: /cluster/home/cgoeckeritz/SNP_calling/vcfs/filtered

INFO Time required to process one record .. 0.000004 seconds INFO sites-compared 304020345 INFO sites-skipped-no-match 0 INFO sites-skipped-multiallelic 0 INFO sites-skipped-monoallelic 0 INFO sites-skipped-no-data 0 INFO sites-skipped-GT-not-diploid 0 INFO sites-skipped-PL-not-diploid 0 INFO sites-used-PL-vs-PL 0 INFO sites-used-PL-vs-GT 0 INFO sites-used-GT-vs-PL 0 INFO sites-used-GT-vs-GT 304020345 DC, discordance:

query sample
genotyped sample
discordance (either an abstract score or number of mismatches, see -e/-u in the man page for details; smaller is better)
negative log of HWE probability at matching sites (rare genotypes matches are more informative, bigger is better)
number of sites compared (bigger is better) DC [2]Query Sample [3]Genotyped Sample [4]Discordance [5]-log P(HWE) [6]Number of sites compared DC Vallee_orig Vallee_orig 1.271508e+04 1.199361e+09 63572208 DC Vallee_orig Vallee_05X009 5.042337e+03 1.387580e+05 7563 DC Vallee_orig Vallee_05X013 5.091840e+03 1.380434e+05 7494

So why would I get such high discordance when comparing the calls with.. itself? I'm not sure why providing it a pairs file would lead to this behavior but this is the main difference between these runs as far as I can tell.. I guess the last run is using a compressed file too but I wouldn't have expected something like that to cause issues either. hmmm...

Anywho, if you had any insight, that would be much appreciated! Thanks so much for your time.

Kindly, Charity

pd3 commented 10 months ago

You are running the program in the mode which compares GT vs GT. The output mentions that the interpretation of the discordance score depends on the value of -e, --error-probability which, by default, is 40. In other words, the program considers the possibility that even the matching genotype could have occurred by chance as a result of a genotyping error. If you run with -e 0, the score should be identical to the number of mismatches.

pd3 commented 10 months ago

As I am typing this I see there are improvements that can be made to the calculation, but overall it seems like the program is probably doing the right thing.

Just pushed the commit https://github.com/samtools/bcftools/commit/39a81bea6a7616e11248e0ce038a4fe67438159b that improves the model. The option --error-probability is newly interpreted as the probability of erroneous allele rather than genotype which allows the calculation of the discordance score to consider the probability of genotyping error to be different for HOM and HET genotypes, i.e. P(0/1|dsg=0) > P(1/1|dsg=0).

I believe this can be now considered as resolved, please shout if not.

samtools / bcftools

skip sites if one sample in the pair is missing data? #2032