saigegit / SAIGE

Development for SAIGE and SAIGE-GENE(+)
GNU General Public License v3.0

Producing groupFiles from entire chromosome data #104

Open Ankeet3 opened 1 year ago

Ankeet3 commented 1 year ago

Hi

I have been using SAIGE-GENE+ (specifically the GBAT test) extensively for the past couple of months, and I now want to run the GBAT analysis with my own groupFiles. I am generating these by parsing the genomic data from UKBB's latest 500K release. The data is released per chromosome, so I generally produce one groupFile per chromosome. I am doing this because I use a different pLoF definition from the one behind the default SAIGE groupFiles provided in the documentation.

My procedure is to annotate the entire chromosome with VEP and then group the variants by gene symbol (one of the VEP output fields). The problem is that VEP assigns gene symbols per transcript, so some variants end up with the wrong gene symbol, which leads to wrong groupFiles and therefore wrong GBAT results. How should I annotate my variants with the appropriate gene symbol so that I can produce an accurate groupFile? Any help would be greatly appreciated.

evatosco commented 1 year ago

Hi!

I don't know if this is exactly what you mean, but I think the annotation field you are looking for is "SYMBOL" in the VEP fields of your VCF file. As far as I know, "Gene" is the Ensembl gene ID (the transcript ID is in "Feature"), while "SYMBOL" gives the HGNC name of the gene. In my experience, it shows the same name for every transcript of that gene, even when a variant is annotated against multiple transcripts.

Hope it helps!
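
For anyone landing here later, the grouping step discussed above could be sketched roughly as follows. This is only an illustration under several assumptions, not SAIGE's or VEP's own tooling. It assumes VEP was run with `--tab --symbol --canonical`, so each row carries `SYMBOL` and `CANONICAL` columns. It assumes the `Uploaded_variation` column already holds variant IDs in the form your SAIGE version expects. It also assumes a group file made of a `var` line and an `anno` line per gene, so check the SAIGE documentation for the exact layout. The function names and the `ANNOTATION_LABELS` mapping are hypothetical. Keeping only canonical-transcript rows is one possible way to get a single gene symbol per variant and gene; it is not necessarily the right choice for every analysis.

```python
import csv
from collections import defaultdict

# Hypothetical mapping from VEP consequence terms to group-file labels.
# This is an assumption: replace it with your own pLoF definition.
ANNOTATION_LABELS = {
    "stop_gained": "lof",
    "frameshift_variant": "lof",
    "splice_donor_variant": "lof",
    "splice_acceptor_variant": "lof",
    "missense_variant": "missense",
    "synonymous_variant": "synonymous",
}


def read_vep_tab(handle):
    """Yield one dict per row of VEP tab output, skipping '##' metadata lines."""
    lines = (line for line in handle if not line.startswith("##"))
    for row in csv.DictReader(lines, delimiter="\t"):
        # The header line starts with '#', so strip it from the column names.
        yield {key.lstrip("#"): value for key, value in row.items()}


def build_groups(rows):
    """Return {gene symbol: {variant ID: label}} from canonical-transcript rows."""
    groups = defaultdict(dict)
    for row in rows:
        if row.get("CANONICAL") != "YES":
            continue  # ignore annotations against non-canonical transcripts
        symbol = row.get("SYMBOL", "-")
        if symbol in ("", "-"):
            continue  # VEP writes '-' when no symbol is available
        # Use the first consequence term on the row that has a label.
        label = next(
            (ANNOTATION_LABELS[term]
             for term in row["Consequence"].split(",")
             if term in ANNOTATION_LABELS),
            None,
        )
        if label is None:
            continue  # consequence not of interest for this group file
        # Keep the first label seen for a variant within a gene.
        groups[symbol].setdefault(row["Uploaded_variation"], label)
    return groups


def format_group_file(groups):
    """Render a 'var' line and an 'anno' line per gene (assumed layout)."""
    lines = []
    for symbol in sorted(groups):
        variant_ids = list(groups[symbol])
        labels = [groups[symbol][variant_id] for variant_id in variant_ids]
        lines.append(" ".join([symbol, "var", *variant_ids]))
        lines.append(" ".join([symbol, "anno", *labels]))
    return "\n".join(lines) + "\n"
```

Note that a variant overlapping the canonical transcripts of two genes will still appear in both groups with this approach. Whether that is desirable depends on the analysis.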