rwdavies / QUILT

GNU General Public License v3.0
45 stars 10 forks source link

GT/GP discrepancy and INFO_SCORE reliability #17

Open SABiagini opened 2 years ago

SABiagini commented 2 years ago

Hi, I'm writing to report a couple of issues I noticed after imputing my data with QUILT.

My observations are based on the imputation of a sample for which I also have the high coverage data.

1) After the imputation I found that in some cases there is a discrepancy between the imputed GT and the most likely GP value.

For example:

chr1 3944190 chr1_3944190_C_T C T GT:GP:DS:HD 0|0:0.004,0.995,0.001:0.997:0.001,0.009

In this case, the GT has been imputed as 0|0 but the GP that supports this observation is 0.004, while the GP for the heterozygous genotype is 0.995, which indeed corresponds to the true genotype also found in the high coverage data.

Looking at the imputed data, I can see that 16022 sites with an imputed GT 0/1 have a GP>0.9 for the 0/0 genotype (which indeed is the true genotype reported in the high coverage data for all those sites).

The same can be observed for 4414 sites with an imputed GT 0/1 that have a GP>0.9 for the 1/1 genotype (which is also the true genotype reported in the high coverage data for all those sites).

2) Another thing I noticed is that the info scores are not correctly associated with the imputation quality. In many cases where the info score is 1 (which I assumed was the best quality score possible) the called GT is not the correct one, especially in relation with the heterozygous sites.

For example, in my sample more than the 90% of the heterozygous sites with an info score equal to 1 are incorrect.

More specifically:

Number of total imputed sites with an INFO_SCORE=1 :

0/0 = 30059423
0/1 = 4665
1/1 = 358627

Of these, the correctly imputed ones are:

0/0 = 30042105
0/1 = 413
1/1 = 357882

And the ones that where not correctly imputed are:

0/0 = 17318
0/1 = 4252
1/1 = 745

So, this behavior is affecting all the categories but more specifically the heterozygous sites.

The percentage of wrongly imputed genotypes showing an INFO_SCORE=1 is thus:

0/0 = 0.06% 
0/1 = 91.15%    
1/1 = 0.21%

While for the GT/GP discrepancy I can fix it with bcftools, what I observed with the info score is challenging the reliability of this parameter as a good post imputation quality filter.

I don't know if you have already observed these two behaviors before, but I would truly appreciate your help to understand the reasons behind them.

Thank you so much

rwdavies commented 2 years ago

Hi,

Thanks for the detailed message.

So as a quick refresher, QUILT works by Gibbs sampling where draws are over read labels. Given read labels, it is possible to impute using the full reference panel. Each draw generates an estimate of the genotype probabilities (the GP), which can be combined easily because they are unphased. Each draw also generates phased haplotype dosages, which can be turned into haplotypes easily (by rounding), but can't easily be combined (what to do in a phase switch error for instance).

So what is done is that GP is made by combining over multiple replicates (7 I think for the default), whereas afterwards, a single additional round of imputation is performed on a set of read labels initialized (and then sampled) to try and have minimal phase switch errors. This additional round is used to make the GT.

Normally these are very similar (I would hope this is the case?), especially for non-rare SNPs. But sometimes particularly for rare SNPs there might be discrepancies. I have to admit I'm surprised by the scale of the discrepancies. What scale are you reporting here e.g. this is 16000 SNPs out of how many?

This is a good point nonetheless and I should go back and think about it more. Thanks for bringing it to my attention. One option is to force the GT to agree with the GP, the simplest way being to set to missing those that don't agree (like the ones you highlight, but probably more), or otherwise, to try and phase the GP from the final phasing round.

For the second issue, the INFO score is based on the GP so shouldn't suffer from the problems above. As an aside, for STITCH, which imputes without a reference but builds its own, anecdotally, the INFO score is a very good predictor, better than anything else I've looked at. But it build its own reference, so high INFO scores require both confident imputation and inferred reference haplotypes in which alleles are uniquely placed onto reference haplotypes. But for QUILT, a high INFO score indicates confidence in the imputation, but says less about the reference quality. Basically, if 1 reference haplotype is a good match, it will always impute those SNPs confidently. So rare SNPs where imputation is confident will have high INFO scores but could be incorrect if the underlying reference panels are wrong. So could be worth double checking the reference panel, maybe some phasing of some rare alleles is incorrect.

It's also worth noting that it doesn't look like your samples have many true het sites? These look nearly fully inbred? (Though I'm not sure if the ratio above holds for SNPs with an info score not equal to 1). So imputed het sites might truly mostly be false. QUILT might have just struggled with those regions, or the hets could be concentrated near where the method switches from one reference haplotype to another.

Thanks again for the message, and let me know,

Thanks, Robbie

SABiagini commented 2 years ago

Dear Robbie,

thank you so much for taking the time to answer me.

As a general update I can tell you that I went a little more deep into the data, and as a general summary these are my observations:

GT  Total_Imputed_Sites Correct Incorrect   MAF<=0.05   Force_Gt_to_GP  max(GP)>=0.99   %_GT/GP_Discrepancy
0/0 32782515    32607234    175281  46456   32656372    32678982    0.07%
0/1 1871908 1656928 214980  23941   1752755 1768717 0.90%
1/1 1111876 1053617 58259   1195    1079411 1082002 0.24%

First of all, I observed that the GT/GP discrepancy affected both rare and non-rare sites (that seem to be the most affected); column "MAF<=0.05" is indeed reporting how many rare sites are present out of the "Incorrect" ones.

The strategy I adopted is to force the GT to agree with the GP using bcftools +tag2tag plugin (in column "Force_Gt_to_GP" is reported the number of total sites after this procedure). However, it is not always possible to move in this direction since there are cases when it is impossible to force this change. I identified eight cases when this happens, for example when the GP of the assigned GT is identical to one of the other two GP and lower than the other one; in this case the site will be stuck in the assigned GT.

However, I quantified these variants that cannot be fixed and checked whether a filter on the GP values could reduce their impact. In column "max(GP)>=0.99" I reported how many sites will remain after the GP filter, and in the last column I quantified the percentage of incorrectly imputed genotypes that will remain. Overall, they represent the 0.12% of all the correctly imputed sites plus those that pass the correction filters, which is not bad.

Of course, I'm in an exploratory phase and I'm using a sample for which I also have the high coverage data, so I can check the truth. But in the real application I'm working with thousands of similar samples for which I have no high coverage data. So, with these tests I made I can only state that there is a margin of error that I don't know how to justify.

Regarding the INFO score, from what you say I understand that with QUILT it can be used to check how confident the imputation was but that the reference panel can affect how correct it is. I'm using the HRC panel that also includes the 1000 genomes samples and so far I haven't noticed anything suspicious. Besides the GT/GP discrepancy I showed you, other statistics are quite good (the imputation accuracy is really good).

I think this post imputation strategy I tested could be the right one but I'm still struggling in justifying why after QUILT I have to do so. Also, even if bcftools +tag2tag is a good option, unfortunately it removes the phase, which is quite annoying.

I truly thank you for your help and support.

Simone

rwdavies commented 2 years ago

Hi Simone,

Thanks for this further data. OK, so humans with HRC, I can take a look on my end. What coverage are you looking at?

Interesting about the GT/GP discrepancy being both rare and non-rare, I'll take a look at that.

Thanks, Robbie

rwdavies commented 2 years ago

Hey,

So I wanted to build a test case locally on my end to explore this and my first attempt I'm not seeing this issue (with 1X data). I'm using HRC with NA12878 from normal short read Illumina.

On a 2 Mbp region with 500kbp buffer, I see perfect correlation between integer-coded genotyped defined 3 ways defined in the following three ways (1: the current GT entry 2: rounding the dosage 3: argmax on the genotype posterior)

gt1 <- as.integer(hap1) + as.integer(hap2) 
gt2 <- round(ds * 2)
gt3 <- apply(results[, c("GP1", "GP2", "GP3")], 1, which.max) - 1

Now this could be somewhat of a fluke, but can you give me a bit more detail about what you're running, or if you're able, try checking one of your regions with the code below (can't seem to be able to attach code anymore?), to double check we're defining things in broadly the same way?

Thanks, Robbie

DIR=/well/davies/users/dcc832/single_imp/hrc_impute_test_package/
mkdir ${DIR}
cd ${DIR}

echo /well/davies/users/dcc832/single_imp/2020_10_25/bams/NA12878.NA12878NYG.cov.1.0.chr20.downOnly.bam > bamlist.txt

POSFILE=/well/davies/users/dcc832/single_imp/2020_10_25/genotypes/pos.NA12878.chr20.20000001.22000000.txt

## echo -e "NA12878\tNA12878\tNA12878" > ${DIR}phase.txt
echo -e "NA12878" > ${DIR}col
gunzip -c /well/davies/users/dcc832/single_imp/2020_10_25/genotypes/phase.NA12878.chr20.20000001.22000000.phased.vcf.gz| grep -v "#" | cut -f10 >> ${DIR}col
cp ${DIR}col ${DIR}phase.txt
##paste -d "\t" ${DIR}col ${DIR}col ${DIR}col >> ${DIR}phase.txt
PHASEFILE=${DIR}phase.txt

rm -r -f quilt_output
~/proj/QUILT/QUILT.R \
--outputdir=quilt_output \
--chr=chr20 \
--regionStart=20000001 \
--regionEnd=22000000 \
--buffer=500000 \
--bamlist=bamlist.txt \
--posfile=${POSFILE} \
--phasefile=${PHASEFILE} \
--reference_haplotype_file=/well/davies/shared/hrc_ega/sinan_2020_06_20/hg38_liftover_chr20.hap.gz \
--reference_legend_file=/well/davies/shared/hrc_ega/sinan_2020_06_20/hg38_liftover_chr20.legend.gz \
--genetic_map_file=~/proj/QUILT/package_2021_01_15A/CEU-chr20-final.b38.txt.gz \
--nGen=100 \
--save_prepared_reference=TRUE
library("testthat")
library("QUILT")
dir <- "~/proj/QUILT/"
setwd(paste0(dir, "/QUILT/R"))
a <- dir(pattern = "*.R")
b <- grep("~", a)
if (length(b) > 0) {
    a <- a[-b]
}
o <- sapply(a, source)

setwd("/well/davies/users/dcc832/single_imp/hrc_impute_test_package/")

posfile <- "/well/davies/users/dcc832/single_imp/2020_10_25/genotypes/pos.NA12878.chr20.20000001.22000000.txt"
## so - load phased data, load data for a run, i.e. GP, GT, try some things!
out <- get_and_validate_pos_gen_and_phase(
    posfile = posfile,
    genfile = "",
    phasefile = "phase.txt",
    chr = chr,
    verbose = TRUE
)
pos <- out$pos
gen <- out$gen
phase <- out$phase
truth_haps <- phase[, 1, ]

vcf <- fread("gunzip -c quilt_output/quilt.chr20.20000001.22000000.vcf.gz", data.table = FALSE)

## super hacky but get in an easier form
results <- t(sapply(strsplit(vcf[, "NA12878"], ":"), I))
results <- cbind(substr(results[, 1], 1, 1), substr(results[, 1], 3, 3), results[, -1])
x <- t(sapply(strsplit(results[, 3], ","), I))
results <- cbind(results[, 1:2], x, results[, 4:5])
x <- t(sapply(strsplit(results[, 7], ","), I))
results <- cbind(results[, 1:6], x)
results <- matrix(as.numeric(results), nrow  = nrow(results), ncol = ncol(results))
colnames(results) <- c("GT1", "GT2", "GP1", "GP2", "GP3", "DS", "HD1", "HD2")
head(results)

load("quilt_output/RData/QUILT_prepared_reference.chr20.20000001.22000000.RData")

robbie_f <- function(hap1, hap2, ds) {
    x1 <- calculate_pse_and_r2_during_gibbs(
        inRegion2 = rep(TRUE, nrow(results)),
        hap1 = hap1,
        hap2 = hap2,
        truth_haps = truth_haps[inRegion2, ],
        af = af[inRegion2],
        verbose = FALSE
    )
    names(x1)[1] <- "r2_gt"
    ## r2 from dosage
    w <- (inRegion2)
    g <- truth_haps[inRegion2, 1] + truth_haps[inRegion2, 2]
    r2 <-  round(cor((ds) - 2 * af[inRegion2], g - 2 * af[inRegion2], use = "pairwise.complete.obs") ** 2, 3)
    ## how many sites disagree?
    gt1 <- as.integer(hap1) + as.integer(hap2)
    gt2 <- round(ds * 2)
    gt3 <- apply(results[, c("GP1", "GP2", "GP3")], 1, which.max) - 1
    print(table(gt1, gt2))
    print(table(gt1, gt3))
    c(x1, r2_ds = r2)
}

## current approach with GT
robbie_f(
    hap1 = results[, "GT1"],
    hap2 = results[, "GT2"],
    ds = results[, "DS"]
)

I get these results

    gt2
 gt1     0     2     4
   0 24904     0     0
   1     0  1165     0
   2     0     0   600
    gt3
 gt1     0     1     2
   0 24904     0     0
   1     0  1165     0
   2     0     0   600
 r2_gt   pse  disc r2_ds
 0.993 0.200 0.500 0.994
SABiagini commented 2 years ago

Hi Robbie, Thank you for running this test. I will run your code on my data as soon as I can. One note: usually, I don't use the phasefile not even for the samples for which I have high coverage data, is it a problem? I don't see the reason to use it since for the real data I won't have the chance to provide it anyway.

In the meanwhile, let me tell you some detail about what I'm running:

MYTMPDIR="/dodrio/scratch/projects/2021_058/QUILT/output/NIPT/"$chr"/temp."$chr.$window""
trap 'rm -rf -- "$MYTMPDIR"' EXIT

/dodrio/scratch/projects/2021_058/QUILT/QUILT/QUILT.R \
--outputdir=/dodrio/scratch/projects/2021_058/QUILT/ref/ref_${chr} \
--output_filename=/dodrio/scratch/projects/2021_058/QUILT/output/NIPT/$chr/${chr}.$window.$start.$end.vcf.gz \
--chr=${chr} \
--regionStart=${start} \
--regionEnd=${end} \
--buffer=250000 \
--tempdir=/dodrio/scratch/projects/2021_058/QUILT/output/NIPT/$chr/temp.${chr}.$window \
--bamlist=/dodrio/scratch/projects/2021_058/QUILT/bam_files/bamlist \
--posfile=/dodrio/scratch/projects/2021_058/QUILT/posfile/posfile_${chr} \
--nCores=5

Where $chr, $window, $start, and $end are variables passed to the script through a file I generate for the scheduler. I'm imputing batches of 1250 samples and, depending on the queuing system I can have them fully imputed within 5-6 days. So far, I have imputed 15000 of them but I still have 60000 to go. I hope these information are enough to give a more clear view on my work. Please, let me know if you need more details. Best, S.

rwdavies commented 2 years ago

OK thanks that all seems very sensible. I'll run some 0.1X data now. re: phasefile, you are right that this won't affect imputation quality, it's just to quickly look at the results.

rwdavies commented 2 years ago

OK after trying both 0.1X and removing NA12878 from the reference panel (of course!), I see some discrepancies. I'll take a deeper look later today or tomorrow.

rwdavies commented 2 years ago

This is what I see for 0.1 and 1X, where pse is normal phase switch error (%), disc is related to PSE, and then r2 is the normal r2, p is the proportion (%) of integer coded GT flavours different from the truth, and c* is the proportion (%) of integer coded GT flavours that disgree with each other, where suffixes are 1: the current GT entry 2: rounding the dosage 3: argmax on the genotype posterior). r2 on its own is normal r2 (i.e. un-rounded dosage vs truth genotypes). So for lower coverage definitely some disagreements between the phasing derived GT and the GT derived from dosage/genotype posteriors. I'll see what I can do about it later today / tomorrow

        0.1   1.0
 pse  0.100 0.000
 disc 2.100 0.500
 r2   0.961 0.995
 r2.1 0.969 0.995
 r2.2 0.930 0.995
 r2.3 0.930 0.995
 p.1  0.154 0.026
 p.2  0.349 0.026
 p.3  0.345 0.026
 c.12 0.315 0.007
 c.13 0.319 0.007
 c.23 0.011 0.000
SABiagini commented 2 years ago

Many thanks for looking into it :)

rwdavies commented 2 years ago

OK so I tried the obvious thing, which is where the arg-max of the genotype posterior disagrees with the current GT (GT-GP) (based on rounding the haplotype dosages), if the current GP-GT is 0 or 1, then I make the phased-GT be 0|0 or 1|1, while if the current GP-GT is 1, I set the haplotype with the larger haplotype dosage to have a 1 and then other a 0. Doing this (with suffix N below) on this test region I see the below - i.e. phasing doesn't seem negatively affected. Seems OK but needs more data to test properly. I'll try to run this on more regions tomorrow

        0.1   1.0  0.1N  1.0N
 pse  0.100 0.000 0.100 0.000
 disc 2.100 0.500 4.100 0.400
 r2   0.961 0.995 0.961 0.995
 r2.1 0.969 0.995 0.930 0.995
 r2.2 0.930 0.995 0.930 0.995
 r2.3 0.930 0.995 0.930 0.995
 p.1  0.154 0.026 0.345 0.026
 p.2  0.349 0.026 0.349 0.026
 p.3  0.345 0.026 0.345 0.026
 c.12 0.315 0.007 0.011 0.000
 c.13 0.319 0.007 0.000 0.000
 c.23 0.011 0.000 0.011 0.000
rwdavies commented 2 years ago

Sorry I left this for so long, lots of teaching / admin lately. I checked a run I did on just chromosome 20 for 20 Mbp, and for example 1.0X, I see the following

where QUILT is the current approach where phase (0 vs 1) is based on rounding the haplotype dosages of the final iteration where QUILT-A is the proposed alternate approach defined above which forces the phased integer genotypes to agree with the argmax of the genotype posterior

Method  Input   pse pse2    disc
Optimal Illumina ht 0.01    0.95    1.13
Optimal Illumina    0.01    1.03    1.08
Optimal ONT 0   1.67    2.13
QUILT   Illumina ht 0.11    1.54    1.88
QUILT   Illumina    0.08    1.85    2.45
QUILT   ONT 0.1 3.27    5.25
QUILT-A Illumina ht 0.12    1.38    1.64
QUILT-A Illumina    0.12    1.58    2.05
QUILT-A ONT 0.16    3.2 4.65
GLIMPSE Illumina    0.14    2.13    2.57
GLIMPSE ONT 0.68    11.01   18.78

where disc = % discrepant sites i.e. percent of sites that are het in truth that are not het in test where pse2 = pse comparing all test sites (i.e. after lining up first het, % of sites where after taking say vec = the difference between first test hap and first truth hap, there is a difference adjacent values of vec where pse1 = pse comparing only test het sites (traditional pse)

So everything looks fine, similarly it looks good for other coverages as well. By the combined pse2 approach, this approach looks better with QUILT-A Illumina of 1.58 vs QUILT of 1.85 and GLIMPSE (one of the original versions, not sure if current) of 2.13.

I'll go ahead then later today I hope and make sure this is tested, then push into master and eventually release a new version.