rwdavies / QUILT

QUILT: Low coverage whole genome sequence imputation with large reference panels
https://www.nature.com/articles/s41588-021-00877-0
GNU General Public License v3.0

Make HLA supplementary file #8

Open Zepeng-Mu opened 2 years ago

Zepeng-Mu commented 2 years ago

Hi, I'm wondering how I can make a --quilt_hla_supplementary_info_file for hg19. I guess I can liftOver the file provided in the GitHub repo, but I'd also like to add more genes, like c('A','B','C','DPA1','DPB1','DQA1','DQB1','DRB1').

Thanks so much!

rwdavies commented 2 years ago

So this file https://github.com/rwdavies/QUILT/blob/master/hla_ancillary_files/quilt_hla_supplementary_info.txt contains three columns. Let's take a look at the help entry in QUILT_HLA_prepare_reference.R

Path to file with supplementary information about the genes, necessary for proper conversion. File is tab separated with header, with 3 columns. First (allele) is the allele that matches the reference genome. Second (genome_pos) is the position of this allele in the reference genome. Finally, the strand (strand) (options 1 or -1).
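For anyone assembling such a file by hand, here is a minimal sketch of a format check based only on the help entry above. It assumes the header names are exactly `allele`, `genome_pos` and `strand`; the function name `read_supplementary_info` and the example rows are hypothetical placeholders, not values from the real file or part of QUILT.

```python
import csv
import io

# Column names as described in the help entry; assumed to be the exact header.
EXPECTED_COLUMNS = ["allele", "genome_pos", "strand"]


def read_supplementary_info(handle):
    """Parse a tab-separated supplementary info file and check its layout.

    Hypothetical helper for illustration only; it is not part of QUILT.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    # Data starts on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        strand = int(row["strand"])
        if strand not in (1, -1):
            raise ValueError(f"line {line_number}: strand must be 1 or -1")
        rows.append(
            {
                "allele": row["allele"],
                "genome_pos": int(row["genome_pos"]),
                "strand": strand,
            }
        )
    return rows


# Placeholder content: these alleles and positions are made up.
example = "allele\tgenome_pos\tstrand\nX*00:00\t1000\t1\nY*00:00\t2000\t-1\n"
rows = read_supplementary_info(io.StringIO(example))
for row in rows:
    print(row)
```

A check like this only confirms the layout. It does not verify that each allele really matches the reference genome, which is the hard part.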

I'll check with Simon Myers, who I'm pretty sure made the file, about any tips for making a new one, and get back to you. I can imagine how to make a new one, it is just a lot of checking (especially the first column, which could be automated, but is a hassle). Sorry otherwise for my slow reply, my daughter is home from nursery sick so I am slower at replying to emails / answering github issues. Thanks, Robbie

danielanach commented 2 years ago

@Zepeng-Mu I am making a supplementary table now for a couple of other genes I needed, and it looks like the information needed for the rest of the genes is listed in the IMGT/HLA database here: https://www.ebi.ac.uk/ipd/imgt/hla/help/genomics.html

Specifically:

although these are all in hg38, and I'm not 100% sure the alleles that match the reference are the same between the reference versions.

danielanach commented 2 years ago

Just one more update on this, I ended up not being able to make reference files for other HLA genes because they are not available in the 1000G reference file 20181129_HLA_types_full_1000_Genomes_Project_panel.txt, which only includes HLA typing for HLA-A, B, C, DRB1, and DQB1.
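To make the gap concrete, here is a small sketch comparing the genes requested at the top of this thread with the genes typed in that panel file. The variable names are illustrative. Both gene lists are copied from the comments above, not read from the file itself.

```python
# Genes requested in the original question.
requested = ["A", "B", "C", "DPA1", "DPB1", "DQA1", "DQB1", "DRB1"]

# Genes with HLA typing in the 1000G panel file, as reported above.
typed_in_panel = {"A", "B", "C", "DRB1", "DQB1"}

# Split the request into genes that can and cannot be built from this panel.
usable = [gene for gene in requested if gene in typed_in_panel]
missing = [gene for gene in requested if gene not in typed_in_panel]

print("usable:", usable)    # usable: ['A', 'B', 'C', 'DQB1', 'DRB1']
print("missing:", missing)  # missing: ['DPA1', 'DPB1', 'DQA1']
```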

rwdavies commented 2 years ago

Hi, so I changed the code to completely remove the dependencies for these files, and to start from more obvious dependencies. The code is written and runs to completion. I had meant to re-run the pipeline I used in the paper with a few different versions of the reference package to make sure I hadn't broken any functionality before pushing. Let me get back to you on this.

Zepeng-Mu commented 2 years ago

Sounds great! I would like to try the newer version with fewer dependencies when it's available.

Zepeng-Mu commented 2 years ago

Hello, I'm trying to use the new 1.0.3 version to prepare an HLA reference. I found that many of the GRCh38 files from 1000G either have no counterpart in GRCh37 or are very hard to find. I'm wondering whether it's possible to build a reference file using GRCh37? Thanks!

rwdavies commented 2 years ago

Hi both,

I've had a busy start to term so got really behind on things including this. I'm getting back up on my non-teaching activities now.

I have now properly pushed the new version of the code to the repository. I used it to build a new reference package, which is on the main QUILT HLA page https://github.com/rwdavies/QUILT/blob/master/README_QUILT-HLA.md#paragraph-reference-packages

I tested it versus the old version, and it worked, though performance in some alleles is a bit down for non-Europeans. I think this is because I'm using 1000 Genomes rather than HRC for the imputation, but I need to check. I had thought that wouldn't make much of a difference, but I want to go back and properly benchmark that now.

Best, Robbie