samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
675 stars 240 forks source link

genotype concordance and dosage r-squared in bcftools stats #1200

Open prasundutta87 opened 4 years ago

prasundutta87 commented 4 years ago

Hi,

I have two VCF files and want to calculate genotype concordance on a per samples basis. The two VCF files contain only the genotype values that were imputed based on two different approaches and hence I would like to check the genotype concordance rate. I am dealing with only biallelic SNVs and using bcftools stats (v1.6) in order to do calculate genotype concordance rate.

I am aware that I have to specifically look at 'GCsS, Genotype concordance by sample (SNPs)' (Correct me if I am wrong). How can I interpret the table? Can you guide me to proper documentation if it is described anywhere?

I also wanted to get more information about 'dosage r-squared'. What is that value and how is it calculated and interpreted?

Regards, Prasun

pd3 commented 4 years ago

The output is not described properly anywhere, the idea was to make it self-explanatory, more or less.

- sample
- non-reference discordance rate .. the explanation is elsewhere in the file, search for NRD
- RR Hom matches  .. this stands for "number of sites where both samples have GT=0/0"
- RA Het matches    .. both samples have GT=0/1 (at biallelic sites)
- AA Hom matches
- RR Hom mismatches .. one sample has GT=0/0 the other something else
- RA Het mismatches
- AA Hom mismatches
- dosage r-squared   ..  Pearson's r^2 on dosage (number of non-ref alleles in genotype)

Your version of bcftools is rather old by the way.

prasundutta87 commented 4 years ago

Thanks for the answer Petr..the 1.6v of bcftools does not have the definition of dosage r-squared...will try using v1.9. So, I just need to get an idea of the genotype concordance rate and I believe in the 'GCsS, Genotype concordance by sample (SNPs)' section, I should typically look for a low non-reference discordance rate and a high dosage r-squared rate?

janinajeff commented 4 years ago

Hi both, I am trying to do something similar however I am not getting the table from the output. Can you share the commands you use when doing this? Janina

montenegrina commented 4 years ago

Thanks for the answer Petr..the 1.6v of bcftools does not have the definition of dosage r-squared...will try using v1.9. So, I just need to get an idea of the genotype concordance rate and I believe in the 'GCsS, Genotype concordance by sample (SNPs)' section, I should typically look for a low non-reference discordance rate and a high dosage r-squared rate?

Hi

can you please tell me how to assess the output of bcftools stats -s samples old.vcf.gz new.vcf.gz > cmp.stats from GCsS, Genotype concordance by sample (SNPs)' section ? if it is by dosage r-squared how to determine that?

I also did:

plot-vcfstats cmp.stats --no-PDF -p plots/

Can you please tell me which plots are relevant for determining genotype concordance between samples?

Thanks Ana

winni2k commented 4 years ago

Hi everybody, I came across this issue while trying to see if a bug had already been reported, but I think I can help out a little.

can you please tell me how to assess the output of bcftools stats -s samples old.vcf.gz new.vcf.gz > cmp.stats from GCsS, Genotype concordance by sample (SNPs)' section ?

What I am pretty sure about:

  1. Each line starting with GCsAF is the result of binning SNPs by non-reference allele frequency.
  2. Dosage for a genotype is calculated by the formula: GP[RA-HET] + 2*GP[AA-HOM] Example 1: a genotype with RR,RA,AA genotype probabilities 0.2,0.3,0.5 has dose 0.3 + 2*0.5 = 1.3 Example 2: a genotype with only GT field of 0/1 has dose 1 Example 3: a genotype with only GT field of 1/1 has dose 2

What I am less sure about:

  1. In the column [10]dosage r-squared, the values are probably calculated by creating a vector of all dosages for the SNPs in the bin across all samples in the first VCF, and the same vector in the second VCF, and calculating the Pearson correlation coefficient between those two vectors.

This is what my co-authors and I describe as "aggregate R2" in the methods section of https://doi.org/10.1038/ng.3643, reproduced below:

Imputation was carried out using IMPUTE2 (ref. 7), which chooses a custom reference panel for each study individual for each 2-Mb segment of the genome. We set the khap parameter of IMPUTE2 to 1,000. All other parameters were set to default values. We stratified imputed variants into allele frequency bins and calculated the squared correlation between the imputed allele dosages at variants in each bin and the masked CG genotypes (called aggregate R2 values in Fig. 1). The non-reference allele frequency for each SNP was calculated from HRC release 1 genotype likelihoods at sites with MAC ≥5. Figure 1 shows the results for the Illumina Omni1M chip. Supplementary Figures 3 and 4 show the results from the Illumina CoreExome chip and the Illumina Omni5M chip, respectively.

Questions I have:

  1. Which input VCF is used to calculate the AF that is used for binning? Is it perhaps the average of the first and second VCF?
ttbek commented 2 years ago

@winni2k I may not be reading the code correctly, but I'm pretty sure it is doing empirical (i.e. genotyped) comparison only. That is, it uses GT (actual genotypes) only and not GP (genotype probabilities, usually from imputation).

It is indeed referred to as aggregate in the code:


typedef struct
{
    uint64_t gt2gt[5][5];   // number of RR->RR, RR->RA, etc. matches/mismatches; see type2stats
    /*
        Pearson's R^2 is used for aggregate R^2 
        y, yy .. sum of dosage and squared dosage in the query VCF (second file)
        x, xx .. sum of squared dosage in the truth VCF (first file)
        n     .. number of genotypes
     */
    double y, yy, x, xx, yx, n;
}
gtcmp_t;

For your question, unfortunately I cannot seem to quickly follow what is happening with the AF, I don't think it seems like averaging offhand though, but I may just not be seeing it yet. It looks like it might be calculated more than once per se, as I see reference to what I think are allele frequency stats and to sample stats (and for both SNPs and Indels, maybe 4 versions total, but maybe more, 4 per file combination perhaps, since giving two files does the intersection and the complements), does it give both a binned and per sample output? Unfortunately I couldn't examine the two file version of the output due to a segfault. The binned version may be based on a single AF determined by the file combination, that is, by the new AN and AC in the new intersection file for instance. I suppose this would be the same as an average, just not explicit.

Oddly, I get a segfault when I give two files, but I should open a different issue for that (if warranted after I investigate more myself). The version is: bcftools 1.10.2-128-gd1b74cd Using htslib 1.10.2-137-g34ba6c7 Copyright (C) 2019 Genome Research Ltd. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.

Either file works fine with stats on its own. I have not yet tested if this is for any two files I give, or just these two.

winni2k commented 2 years ago

I'm not sure I follow what you mean by "actual" genotype. Some genotype imputation tools set the GT field from the GP field. In those cases, GT becomes the maximum a posteriori estimate.

ttbek commented 2 years ago

@winni2k Sorry, yes they do. My internal convention of GT as genotyped slipped out because I don't usually use the imputed GT for comparison, though of course that is what we would carry forward for association testing. What I should have said is that comparing GT genotyped and GT imputed is more discrete than using GP genotyped (just a re-encoding of GT genotyped) and GP imputed, or using the dosages (computed from GP I mean).

That is, GP genotyped adds no extra information (compute from GT genotyped), and neither does GT imputed (compute from GP imputed), so I tend to work with these lately (where my GT is always the genotyped GT) and didn't think enough before I typed.

winni2k commented 2 years ago

Interesting, in my previous collaborations we have always carried the dosage forward for association testing instead of the GT, as selecting the MAP from the GPs reflects an information loss.

ttbek commented 2 years ago

@winni2k My apologies again. You are correct. I should stop mixing myself up. We are doing association from GT because those files are from whole-genome sequencing and are not imputed at all (separate project from the imputation comparison stuff), not because we are using it over GP. Using GP derived dosage for association testing should be better as you say. Don't know what I was thinking that day.