Unambiguous nomenclature for bialleleic Amino Acids?

brettva commented 3 years ago

Thank you for making this amazing resource.

In the imputed files produced from running SNP2HLA, it seem not possible to tell what the alleles are referring to for bialleleic amino acids.

So if you run the example:

python -m SNP2HLA \
    --target SNP2HLA/example/1958BC \
    --reference resources/1000G.bglv4 \
    --out MySNP2HLA/IMPUTED.1958BC \
    --nthreads 2 \
    --mem 4g

and inspect:

zcat MySNP2HLA/IMPUTED.1958BC.bgl.phased.vcf.gz | grep "AA_A_80_29910771_exon2" | less -S

which gives this:

6 29910771 AA_A_80_29910771_exon2 T A . PASS AR2=0.78;DR2=0.86;AF=0.12;IMP

As far as I an tell there is no way to tell what REF and ALT are for that example site, and any other bialleleic amino acid like it from this file alone.

I think I realize that you can make a script to parse the dictionaries to determine that this example bialleleic site is probably referring to these two amino acids:

'I' and 'T'

Please correct me if I am wrong, but if the imputed files are separated from the dictionaries, I don't know if anyone would be able to unambiguously determine this.

One thought I was having was to regenerate the reference panel so that bialleleic sites get the same treatment as multiallelics, where there is a new record for each amino acid at the given site, like this:

6       29910784        AA_A_80_29910771_exon2_I        T       A       .       PASS    AF=0.88 
6       29910785        AA_A_80_29910771_exon2_T        T       A       .       PASS    AF=0.12

This way you actually know what amino acids the bialleleic site is referring without doing a bunch of extra work. From a code perspective seems pretty easy to do. Am I missing anything here that would lead to problems? Is there a reason why bialleleics were not represented this way? Please let me know and thank you again!

brettva commented 3 years ago

@WansonChoi Hi I see a lot of the commits are from you, are you maintaining this project or do you know who is? I am wondering who best to go to with this possible issue. Thanks!

WansonChoi commented 3 years ago

@brettva Hi I checked this issue last night. I didn't realize github doesn't notify me unless referenced by '@' though new issue is opened. Above all, I do appreciate your interest in HLA-TAPAS.

As to the issue, I think the '*.bim' file of the example reference panel('resources/1000G.bglv4.bim') for the SNP2HLA example command contains wrong data. The ATtrick sholdn't have been applied to any markers in the BIM file. It's applied to only multi-allelic markers in VCF file. In other words, MakeReference(v2) generates BIM file of reference panel which has markers with original allele characters (i.e. ATtrick is not applied) while ATtrick is applied to only multi-allelic markers in (phased or imputed) VCF file. I originally intended an user can figure out which original allele characters are for 'A' and 'T' for a marker in VCF by checking BIM file of reference panel. You can check this if you run the example command of 'MakeReference' module and inspect the bim file in its output.

However, I agree an user might need output VCF file that has original allele characters instead of 'A' and 'T'. I should add a script for this parsing job(undoing ATtrick in VCF) as you suggested. Thank you for your good point.

brettva commented 3 years ago

@WansonChoi Greetings,! you probably know this you hit "watch" on repo homepage it should notify you of issues.

Thanks for responding. In our case we are a genomics initiative doing imputation that will be distributed to users via the VCF, so allowing our users to find allele from VCF alone would be extremely useful, but we could always distribute the bim too if the A,P status is encoded in there.

So your approach would take out the A,T coding in the VCF for bialleleic sites and let it stay A,P in the VCF? If so I think that might cause issues for downstream applications (eg converting to plink or parsing with high level VCF tools)

My experimenting so far has been to change line 120 and 233 MakeReference/src/encodeVariants.py to be >=2 instead of 2 so biallelics are basically split the same way as multiallelic are and this preserves the A,T hack. I have no idea if this reasonable or not though or if there are unintended consequences and I think this might actually result in two peaks in association tests every time a bialleleic site is significant, but I haven't tested.

One other thing, would it be possible to include the input needed to completely regenerate the full 1000 genomes panel resources/1000G.bglv4.bgl.phased.vcf.gz? Is see the subset, but it would be nice for sanity checks to have the data to recreate the full thing if its not too big.

Thank you!

WansonChoi commented 3 years ago

@brettva

So your approach would take out the A,T coding in the VCF for bialleleic sites and let it stay A,P in the VCF? If so I think that might cause issues for downstream applications (eg converting to plink or parsing with high level VCF tools)

Actually, this is one of reasons why I chose to leave imputed VCF file without undoing ATtrick. Most of VCF related software packages permit only A, C, G, T and '.'(something for NA value). It means Undoing ATtrick to imputed VCF file not only can help user's downstream anlaysis but also can hinder user's downstream file processing. For example, I sometimes need to convert imputed VCF file to Beagle file with 'vcf2beagle.jar (https://faculty.washington.edu/browning/beagle_utilities/utilities.html)' but this software can't operate if there is an allele character other than A, C, G, T and '.'. If an user needs to perform at least one (or more) additional downstream file processing, I'd recommend applying undoing ATtrick at the last step of downstream analysis.

My experimenting so far has been to change line 120 and 233 MakeReference/src/encodeVariants.py to be >=2 instead of 2 so biallelics are basically split the same way as multiallelic are and this preserves the A,T hack. I have no idea if this reasonable or not though or if there are unintended consequences and I think this might actually result in two peaks in association tests every time a bialleleic site is significant, but I haven't tested.

Nice manipulation! However, I suggest you stick to the original approach for bi-allelic markers. Most of genomic software are dedicated to processing bi-allelic markers and multi-allelic markers are often not allowed (as far as I know). The idea to separately represent alleles of multi-allelic locus with binary markers was introduced for this and has been applied to only multi-allelic markers. Also, a Bi-allelic marker is actually a binary marker itself. Applying this idea to a bi-allelic marker is redundant.

One other thing, would it be possible to include the input needed to completely regenerate the full 1000 genomes panel resources/1000G.bglv4.bgl.phased.vcf.gz? Is see the subset, but it would be nice for sanity checks to have the data to recreate the full thing if its not too big.

SNP data from here(https://www.internationalgenome.org/data) and HLA type data from here(https://www.internationalgenome.org/category/hla/) can be used. You can create a new 1000 Genome reference panel with your own file-preprocessing and QCs.

Also, Because these data are not small and users have to check citation information, I'm not going to add these data into this repository directly.

brettva commented 3 years ago

@WansonChoi

Sorry to bug you.

I downloaded SNP data from here(https://www.internationalgenome.org/data) and HLA type data from here(https://www.internationalgenome.org/category/hla/)

So the call set of resources/1000G.bglv4.bgl.phased.vcf.gz is not based on HLA*LA inferences from Luo et al?

Also MakeReference/example/g1k_subset_snps.bim is essentially the list of SNPS that were input for creating resources/1000G.bglv4.bgl.phased.vcf.gz? These sites are all referring to forward strand on GRCh37, correct? It seems so from my checks, but if you remember please let me know if this is the case. Thank you!

WansonChoi commented 3 years ago

@brettva

Sorry for my one confusion. I thought I introduced the 'resources/1000G.bglv4.bgl.phased.vcf.gz' data, but Yang Luo did it (https://github.com/immunogenomics/HLA-TAPAS/commit/db6dec600fd544e5151cf2be8790784fde94a507#diff-41d311a605520fc7b8b9a980a79437a26e26cebcbfdd569dd78d30f7cb3e7237). I was confused because I have created a similar 1000G reference panel before with those data that I mentioned.

So, I think you should ask this issue to Yang.

I'll make some revision to what I mentioned as to what input data was used or which HLA type resolution was used for the 'resources/1000G.bglv4.bgl.phased.vcf.gz' data.

brettva commented 3 years ago

@WansonChoi I have been in contact with Yang about this, I appreciate that you made a response here. Thank you

immunogenomics / HLA-TAPAS

Unambiguous nomenclature for bialleleic Amino Acids? #3