huguesrichard / Allopipe

AlloPipe is a computational method to assess the alloreactivity expected from a donor/recipient transplantation pair
MIT License
0 stars 0 forks source link

AMS for multiallelic sites with same gnomad AF may be overestimated #45

Open aheinzel opened 1 month ago

aheinzel commented 1 month ago

This issue is related to (https://github.com/huguesrichard/Allopipe/issues/44) and refers to the special case where all alt alleles of a multiallelic site have the same gnomad AF (should also work when gnomad AF is only reported for one of the multiple alt alleles - not tested). Such variants will pass the restriction of only having a single AF value present and thus will get scored.

Currently VEP CSQ info parsing strategy will aggregate all alt amino acids irrespective of an individuals genotype. In case of a multiallelic sites different amino acids may be reported for each of the alt alleles. For better illustration I include an example for chr1 11828561 chr1_11828561_G_C;chr1_11828561_G_A G C,A. The following INFO field contains CSQ annotation for both alt alleles

AF=0.00021,0.00021;AQ=49,45;CSQ=C|missense_variant|MODERATE|CLCN6|ENSG00000011021|Transcript|ENST00000312413|protein_coding|11/22||||1159|992|331|C/S|tGt/tCt|rs745893586&COSV100411854&COSV56742154&COSV56743438||1||HGNC|HGNC:2024|3.977e-06|0|0|0|0|0|0|0.000163|0|uncertain_significance|0&1&1&1|1&1&1&1,A|missense_variant|MODERATE|CLCN6|ENSG00000011021|Transcript|ENST00000312413|protein_coding|11/22||||1159|992|331|C/Y|tGt/tAt|rs745893586&COSV100411854&COSV56742154&COSV56743438||1||HGNC|HGNC:2024|3.977e-06|0|0|0|0|0|0|0|3.266e-05||0&1&1&1|1&1&1&1,C|missense_variant|MODERATE|CLCN6|ENSG00000011021|Transcript|ENST00000346436|protein_coding|12/23||||1130|1058|353|C/S|tGt/tCt|rs745893586&COSV100411854&COSV56742154&COSV56743438||1||HGNC|HGNC:2024|3.977e-06|0|0|0|0|0|0|0.000163|0|uncertain_significance|0&1&1&1|1&1&1&1,A|missense_variant|MODERATE|CLCN6|ENSG00000011021|Transcript|ENST00000346436|protein_coding|12/23||||1130|1058|353|C/Y|tGt/tAt|rs745893586&COSV100411854&COSV56742154&COSV56743438||1||HGNC|HGNC:2024|3.977e-06|0|0|0|0|0|0|0|3.266e-05||0&1&1&1|1&1&1&1,C|downstream_gene_variant|MODIFIER|CLCN6|ENSG00000011021|Transcript|ENST00000376490|protein_coding_CDS_not_defined||||||||||rs745893586&COSV100411854&COSV56742154&COSV56743438|343|1||HGNC|HGNC:2024|3.977e-06|0|0|0|0|0|0|0.000163|0|uncertain_significance|0&1&1&1|1&1&1&1,A|downstream_gene_variant|MODIFIER|CLCN6|ENSG00000011021|Transcript|ENST00000376490|protein_coding_CDS_not_defined||||||||||rs745893586&COSV100411854&COSV56742154&COSV56743438|343|1||HGNC|HGNC:2024|3.977e-06|0|0|0|0|0|0|0|3.266e-05||0&1&1&1|1&1&1&1,C|downstream_gene_variant|MODIFIER|CLCN6|ENSG00000011021|Transcript|ENST00000376491|protein_coding_CDS_not_defined||||||||||rs745893586&COSV100411854&COSV56742154&COSV56743438|343|1||HGNC|HGNC:2024|3.977e-06|0|0|0|0|0|0|0.000163|0|uncertain_significance|0&1&1&1|1&1&1&1,A|downstream_gene_variant|MODIFIER|CLCN6|ENSG00000011021|Transcript|ENST00000376491|protein_coding_CDS_not_defined||||||||||rs745893586&COSV100411854&COSV56742154&COSV56743438|343|1||HGNC|HGNC:2024|3.977e-06|0|0|0|0|0|0|0|3.266e-05||0&1&1&1|1&1&1&1,C|intron_variant&non_coding_transcript_variant|MODIFIER|CLCN6|ENSG00000011021|Transcript|ENST00000376492|protein_coding_CDS_not_defined||11/11||||||||rs745893586&COSV100411854&COSV56742154&COSV56743438||1||HGNC|HGNC:2024|3.977e-06|0|0|0|0|0|0|0.000163|0|uncertain_significance|0&1&1&1|1&1&1&1,A|intron_variant&non_coding_transcript_variant|MODIFIER|CLCN6|ENSG00000011021|Transcript|ENST00000376492|protein_coding_CDS_not_defined||11/11||||||||rs745893586&COSV100411854&COSV56742154&COSV56743438||1||HGNC|HGNC:2024|3.977e-06|0|0|0|0|0|0|0|3.266e-05||0&1&1&1|1&1&1&1,C|missense_variant|MODERATE|CLCN6|ENSG00000011021|Transcript|ENST00000376496|protein_coding|12/22||||1061|1058|353|C/S|tGt/tCt|rs745893586&COSV100411854&COSV56742154&COSV56743438||1||HGNC|HGNC:2024|3.977e-06|0|0|0|0|0|0|0.000163|0|uncertain_significance|0&1&1&1|1&1&1&1,A|missense_variant|MODERATE|CLCN6|ENSG00000011021|Transcript|ENST00000376496|protein_coding|12/22||||1061|1058|353|C/Y|tGt/tAt|rs745893586&COSV100411854&COSV56742154&COSV56743438||1||HGNC|HGNC:2024|3.977e-06|0|0|0|0|0|0|0|3.266e-05||0&1&1&1|1&1&1&1,C|missense_variant&NMD_transcript_variant|MODERATE|CLCN6|ENSG00000011021|Transcript|ENST00000400892|nonsense_mediated_decay|12/27||||1090|1058|353|C/S|tGt/tCt|rs745893586&COSV100411854&COSV56742154&COSV56743438||1||HGNC|HGNC:2024|3.977e-06|0|0|0|0|0|0|0.000163|0|uncertain_significance|0&1&1&1|1&1&1&1,A|missense_variant&NMD_transcript_variant|MODERATE|CLCN6|ENSG00000011021|Transcript|ENST00000400892|nonsense_mediated_decay|12/27||||1090|1058|353|C/Y|tGt/tAt|rs745893586&COSV100411854&COSV56742154&COSV56743438||1||HGNC|HGNC:2024|3.977e-06|0|0|0|0|0|0|0|3.266e-05||0&1&1&1|1&1&1&1,C|upstream_gene_variant|MODIFIER|CLCN6|ENSG00000011021|Transcript|ENST00000494028|protein_coding_CDS_not_defined||||||||||rs745893586&COSV100411854&COSV56742154&COSV56743438|1288|1||HGNC|HGNC:2024|3.977e-06|0|0|0|0|0|0|0.000163|0|uncertain_significance|0&1&1&1|1&1&1&1,A|upstream_gene_variant|MODIFIER|CLCN6|ENSG00000011021|Transcript|ENST00000494028|protein_coding_CDS_not_defined||||||||||rs745893586&COSV100411854&COSV56742154&COSV56743438|1288|1||HGNC|HGNC:2024|3.977e-06|0|0|0|0|0|0|0|3.266e-05||0&1&1&1|1&1&1&1

To make this more readable I restrict the info to relevant fields. The following can be reproduced with the commands provided:

#Att make sure that delim for cut (-d) is tab
cat ../../test_case_multi_allelic_same_af/indiv1.vcf \
   | tail -1 \
   | cut -d "   " -f8 \
   | sed -E 's/.*CSQ=([^;]+).*/\1/' \
   | tr "," "\n" \
   | sort \
   | cut -d "|" -f1,2,16,24
A|downstream_gene_variant||3.977e-06
A|downstream_gene_variant||3.977e-06
A|intron_variant&non_coding_transcript_variant||3.977e-06
A|missense_variant&NMD_transcript_variant|C/Y|3.977e-06
A|missense_variant|C/Y|3.977e-06
A|missense_variant|C/Y|3.977e-06
A|missense_variant|C/Y|3.977e-06
A|upstream_gene_variant||3.977e-06
C|downstream_gene_variant||3.977e-06
C|downstream_gene_variant||3.977e-06
C|intron_variant&non_coding_transcript_variant||3.977e-06
C|missense_variant&NMD_transcript_variant|C/S|3.977e-06
C|missense_variant|C/S|3.977e-06
C|missense_variant|C/S|3.977e-06
C|missense_variant|C/S|3.977e-06
C|upstream_gene_variant||3.977e-06

An individual with genotype 0/1 would be carrying both the G and the C allele --> AAs: C and S but not Y. However, when this individual is scored against a 0/0 GT the AMS is calculated to be 2. In the mismatch table (has been transposed for readability) we can see that the individual 0/1 is thought to also have the Y AA which however would only result out of the G>A substitution.

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

CHROM | 1 -- | -- POS | 11828561 ID_x | chr1_11828561_G_C;chr1_11828561_G_A REF | G ALT | C,A QUAL_x | 49 FILTER_x | . FORMAT_x | ['GT', 'DP', 'AD', 'GQ', 'PL', 'RNC'] GT_x | 0/0 GQ_x | 48 AD_x | 16,0,0 phased_x | 0/0 DP_x | 16 TYPE_x | homozygous missense_variant_x | 8 transcripts_x | ENST00000312413,ENST00000346436,ENST00000376490,ENST00000376491,ENST00000376492,ENST00000376496,ENST00000400892,ENST00000494028 genes_x | ENSG00000011021 aa_REF | C aa_ALT | S,Y gnomADe_AF_x | 3.98E-06 aa_ref_indiv_x | C aa_alt_indiv_x |   aa_indiv_x | C ID_y | chr1_11828561_G_C;chr1_11828561_G_A QUAL_y | 49 FILTER_y | . FORMAT_y | ['GT', 'DP', 'AD', 'GQ', 'PL', 'RNC'] GT_y | 0/1 GQ_y | 43 AD_y | 28,39,0 phased_y | 0/1 DP_y | 67 TYPE_y | heterozygous missense_variant_y | 8 transcripts_y | ENST00000312413,ENST00000346436,ENST00000376490,ENST00000376491,ENST00000376492,ENST00000376496,ENST00000400892,ENST00000494028 genes_y | ENSG00000011021 gnomADe_AF_y | 3.98E-06 aa_ref_indiv_y | C aa_alt_indiv_y | S,Y aa_indiv_y | C,S,Y diff | S,Y mismatch | 2 mismatch_type | heterozygous

Minimal test case for reproducibility (files are attached):

python ams_pipeline.py -f -n ma -p i1i2 --min_dp 0 --min_ad 0 --min_gq 0 --gnomad_af 0.000000000001 ../../test_case_multiallelic_same_af/indiv1.vcf ../../test_case_multiallelic_same_af/indiv2.vcf rd
aheinzel commented 1 month ago

Actually multiallelic sites may give rise to too high or too low AMS. The case in which a multiallelic site results in a too high AMS has been described above. If we compare at the same site to heterozygous individuals where one has GT 0/1 and the other one GT 0/2 a mismatch of zero is reported for AMS. The reason is that both of the heterozygous individuals are thought to have all three variants of the protein, the wildtype with C and the two mutated forms with either S or Y.

For better illustration I include the indiv tables generated by AlloPipe <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

CHROM | 1 -- | -- POS | 11828561 ID | chr1_11828561_G_C;chr1_11828561_G_A REF | G ALT | C,A QUAL | 49 FILTER | . FORMAT | ['GT', 'DP', 'AD', 'GQ', 'PL', 'RNC'] GT | 0/1 GQ | 43 AD | 28,39,0 phased | 0/1 DP | 67 TYPE | heterozygous missense_variant | 8 transcripts | ENST00000312413,ENST00000346436,ENST00000376490,ENST00000376491,ENST00000376492,ENST00000376496,ENST00000400892,ENST00000494028 genes | ENSG00000011021 aa_REF | C aa_ALT | S,Y gnomADe_AF | 3.98E-06 aa_ref_indiv | C aa_alt_indiv | S,Y aa_indiv | C,S,Y

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

CHROM | 1 -- | -- POS | 11828561 ID | chr1_11828561_G_C;chr1_11828561_G_A REF | G ALT | C,A QUAL | 49 FILTER | . FORMAT | ['GT', 'DP', 'AD', 'GQ', 'PL', 'RNC'] indiv2 | 0/2:52:33,0,19:43:45,990,990,0,990,46:.. GT | 0/2 GQ | 43 AD | 33,0,19 phased | 0/2 DP | 52 TYPE | heterozygous missense_variant | 8 transcripts | ENST00000312413,ENST00000346436,ENST00000376490,ENST00000376491,ENST00000376492,ENST00000376496,ENST00000400892,ENST00000494028 genes | ENSG00000011021 aa_REF | C aa_ALT | S,Y gnomADe_AF | 3.98E-06 aa_ref_indiv | C aa_alt_indiv | S,Y aa_indiv | C,S,Y

Command used for the run:

python ams_pipeline.py -f -n ma -p i1i2 --min_dp 0 --min_ad 0 --min_gq 0 --gnomad_af 0.000000000001 ../../test_case_multiallelic_same_af/indiv2.vcf ../../test_case_multiallelic_same_af/indiv3.vcf rd

Files for testing purposes are added to this post