How do I interpret SNP2HLA output?

rwdavies commented 3 years ago

Hi,

I tried running a modified version of the README suggestion for HLA-TAPAS (code below), which gave the following example output for one individual for HLA-A (code to reproduce below as as well as log). Can you provide guidance on how to interpret this, if I'm specifically interested in 4-digit HLA types?

So these are the VCF entries for the first imputed sample for HLA-A

HLA_A*01        1|0:1.05:0.02,0.91,0.07
HLA_A*01:01:01:01       1|0:1.04:0.03,0.91,0.07
HLA_A*01:02     0|0:0.01:0.99,0.01,0
HLA_A*01:83:02  0|0:0.01:0.99,0.01,0
HLA_A*02        0|0:0.04:0.97,0.03,0
HLA_A*02:01:01:01       0|0:0.02:0.98,0.02,0
HLA_A*02:22:01:01       0|0:0.01:0.99,0.01,0
HLA_A*03        0|0:0.12:0.88,0.12,0
HLA_A*03:01:01:01       0|0:0.12:0.88,0.12,0
HLA_A*11        0|0:0.08:0.92,0.08,0
HLA_A*11:01:01:01       0|0:0.07:0.93,0.07,0
HLA_A*11:50Q    0|0:0.02:0.98,0.02,0
HLA_A*24        0|1:0.67:0.33,0.67,0
HLA_A*24:02:01:01       0|1:0.61:0.39,0.61,0
HLA_A*24:02:05  0|0:0.01:0.99,0.01,0
HLA_A*24:07:01  0|0:0.03:0.97,0.03,0
HLA_A*24:17     0|0:0.01:0.99,0.01,0
HLA_A*24:86N    0|0:0.01:0.99,0.01,0
HLA_A*36        0|0:0.01:0.99,0.01,0
HLA_A*36:01     0|0:0.01:0.99,0.01,0
HLA_A*68        0|0:0.01:0.99,0.01,0
HLA_A*68:02:01:01       0|0:0.01:0.99,0.01,0

My intuition is to a) disregard the 2-digit rows b) collapse >4 digit to 4-digit types c) sum 4-digit types. This gives something like the following, if I then sum by the second entry in (the DS = dosage) of the VCF

 1   HLA_A*01:01 1.04
 2   HLA_A*24:02 0.62
 3   HLA_A*03:01 0.12
 4   HLA_A*11:01 0.07
 5   HLA_A*24:07 0.03
 6   HLA_A*02:01 0.02
 7  HLA_A*11:50Q 0.02
 8   HLA_A*01:02 0.01
 9   HLA_A*01:83 0.01
 10  HLA_A*02:22 0.01
 11  HLA_A*24:17 0.01
 12 HLA_A*24:86N 0.01
 13  HLA_A*36:01 0.01
 14  HLA_A*68:02 0.01

My interpretation is that the most likely genotype for this sample would be HLA_A*01:01, HLA_A*24:02, with some sort of confidence (here perhaps confidence = (1.04 + 0.62) / (sum of the rest) = 0.83? If this is right, is there a recommended default confidence threshold (0.5? 0.9?)

Can you also confirm, for the imputation, that the only thing that matters from the reference samples is the *.phased.vcf.gz file? This is what this code seems to show

Thanks, Robbie

Code I used

set -e

rm -r -f test_target
mkdir test_target
##cp SNP2HLA/example/1958BC.vcf.gz test_target/
cp SNP2HLA/example/1958BC.bed test_target/
cp SNP2HLA/example/1958BC.bim test_target/
cp SNP2HLA/example/1958BC.fam test_target/

rm -r -f test_reference
mkdir test_reference
## cp resources/1000G.bglv4.bgl.phased.vcf.gz test_reference/
## cp resources/1000G.bglv4.bim test_reference/
## cp resources/1000G.bglv4.FRQ.frq test_reference/
## cp resources/1000G.bglv4.markers test_reference/
cp resources/1000G.bglv4.bgl.phased.vcf.gz test_reference/
touch test_reference/1000G.bglv4.bim
touch test_reference/1000G.bglv4.FRQ.frq
touch test_reference/1000G.bglv4.markers

rm -r -f test_output
mkdir -p test_output

python -m SNP2HLA \
   --target test_target/1958BC \
   --reference test_reference/1000G.bglv4 \
   --out test_output/IMPUTED.1958BC \
   --nthreads 4 \
   --mem 16g

gunzip -c test_output/IMPUTED.1958BC.bgl.phased.vcf.gz | grep -F "HLA_A*" | cut -f3,10 | grep -v "0|0:0:1,0,0"

Output of code run

Namespace(dependency='dependency/', mem='16g', niterations=5, nthreads=4, out='test_output/IMPUTED.1958BC', reference='test_reference/1000G.bglv4', target='test_target/1958BC', tolerated_diff=0.15)
SNP2HLA: Performing HLA imputation for dataset test_target/1958BC
- Java memory = 16gb
[1] Extracting SNPs from the MHC.
[2] Performing SNP quality control.
[3] Converting PLINK to BEAGLE format.
[4] Converting BEAGLE to VCF format.
[5] Performing HLA imputation.
HLA_A*01        1|0:1.05:0.02,0.91,0.07
HLA_A*01:01:01:01       1|0:1.04:0.03,0.91,0.07
HLA_A*01:02     0|0:0.01:0.99,0.01,0
HLA_A*01:83:02  0|0:0.01:0.99,0.01,0
HLA_A*02        0|0:0.04:0.97,0.03,0
HLA_A*02:01:01:01       0|0:0.02:0.98,0.02,0
HLA_A*02:22:01:01       0|0:0.01:0.99,0.01,0
HLA_A*03        0|0:0.12:0.88,0.12,0
HLA_A*03:01:01:01       0|0:0.12:0.88,0.12,0
HLA_A*11        0|0:0.08:0.92,0.08,0
HLA_A*11:01:01:01       0|0:0.07:0.93,0.07,0
HLA_A*11:50Q    0|0:0.02:0.98,0.02,0
HLA_A*24        0|1:0.67:0.33,0.67,0
HLA_A*24:02:01:01       0|1:0.61:0.39,0.61,0
HLA_A*24:02:05  0|0:0.01:0.99,0.01,0
HLA_A*24:07:01  0|0:0.03:0.97,0.03,0
HLA_A*24:17     0|0:0.01:0.99,0.01,0
HLA_A*24:86N    0|0:0.01:0.99,0.01,0
HLA_A*36        0|0:0.01:0.99,0.01,0
HLA_A*36:01     0|0:0.01:0.99,0.01,0
HLA_A*68        0|0:0.01:0.99,0.01,0
HLA_A*68:02:01:01       0|0:0.01:0.99,0.01,0

rwdavies commented 3 years ago

If there is some sort of parser that already does this that you can point me to that'd be great (e.g. takes as input the phased output file, and generates something like one row per sample, one column per HLA-region, with entry the most likely single genotype, and a confidence measure)

yluo86 commented 3 years ago

Hi @rwdavies, sorry about the the slow response - I have only recently realise that i need to turn the watch on before getting notified with all open issues. You are right in interpreting the result. As for filtering - we usually filter out variants with imputation r2 < 0.5 and minor allele frequency < 1% (depends on the size of you dataset).

All required information for downstream association analysis is included in the *.phased.vcf.gz file.

immunogenomics / HLA-TAPAS

How do I interpret SNP2HLA output? #6