The variant_id in GWAS Summary data can be CHR:POS:REF:ALT instead of rsid

Alina-Song commented 2 years ago

Hello,

Many GWAS summary statistics files do not provide an rsid and instead identify variants as CHR:POS:REF:ALT (e.g. 1:817213:T:G). I would like to know whether I can use CHR:POS:REF:ALT instead of the rsid for harmonization, imputation, and subsequent analysis. Any ideas or advice would be greatly appreciated.

Thank you, Alina

Fnyasimi commented 2 years ago

Hi @Alina-Song, you need rsids for harmonization and imputation. For PrediXcan and S-PrediXcan you can use either the rsid or a varID in the format CHR_POS_REF_ALT_build, which must be on the same genome build as the models.
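To illustrate the varID format, here is a minimal Python sketch that rewrites a CHR:POS:REF:ALT identifier as CHR_POS_REF_ALT_build. The function name `to_var_id`, the `chr` prefix, and the default `b38` suffix are assumptions for illustration and are not part of MetaXcan. The function only changes the string; it does not lift positions over to another build, so the input positions must already be on the build named in the suffix.

```python
def to_var_id(variant: str, build: str = "b38", add_chr_prefix: bool = True) -> str:
    """Rewrite 'CHR:POS:REF:ALT' as 'CHR_POS_REF_ALT_build' (string change only).

    Hypothetical helper: it does NOT convert coordinates between genome builds.
    """
    parts = variant.strip().split(":")
    if len(parts) != 4:
        raise ValueError(f"expected CHR:POS:REF:ALT, got {variant!r}")
    chrom, pos, ref, alt = parts
    if not pos.isdigit():
        raise ValueError(f"position is not an integer in {variant!r}")
    # Assumption: the model's varIDs carry a 'chr' prefix.
    if add_chr_prefix and not chrom.lower().startswith("chr"):
        chrom = "chr" + chrom
    return "_".join([chrom, pos, ref.upper(), alt.upper(), build])


if __name__ == "__main__":
    print(to_var_id("1:817213:T:G"))
```

Whether the resulting identifier matches a model still depends on the positions and the allele order agreeing with those stored in the model database.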

Alina-Song commented 2 years ago

Thanks a lot, @Fnyasimi. So this means I can use a varID in the format CHR_POS_REF_ALT, built on GRCh37, for harmonization and imputation (to get the panel_variant_id, for example chr1_54490_G_A_b38), and then run S-PrediXcan.

Fnyasimi commented 2 years ago

For the harmonization and imputation step you need rsids, which let you generate the panel_variant_id by matching against the GTEx mapping file.
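As a rough picture of what matching on rsid means, here is a minimal Python sketch of a lookup from rsid to panel_variant_id. The function name, the in-memory dictionary, and the identifiers in it are toy placeholders, not entries from the real GTEx mapping file, and this is not the harmonization code itself.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Toy mapping (made-up values): rsid -> panel_variant_id.
TOY_RSID_TO_PANEL: Dict[str, str] = {
    "rs_example_1": "chr1_1000_A_G_b38",
    "rs_example_2": "chr1_2000_C_T_b38",
}


def add_panel_variant_id(
    rsids: Iterable[str], rsid_to_panel: Dict[str, str]
) -> List[Tuple[str, Optional[str]]]:
    """Pair each rsid with its panel_variant_id, or None when it is not in the mapping."""
    return [(rsid, rsid_to_panel.get(rsid)) for rsid in rsids]


if __name__ == "__main__":
    for rsid, panel_id in add_panel_variant_id(
        ["rs_example_1", "rs_not_in_panel"], TOY_RSID_TO_PANEL
    ):
        print(rsid, panel_id)
```

Variants that come back with None would have no panel_variant_id under this scheme.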

Alina-Song commented 2 years ago

Hi, sorry to bother you, @Fnyasimi. When I used the sample data (cad.add.160614.website), I found that the results of harmonization (the full Harmonization, not Quick Harmonization) and imputation were the same whether the markername column contained an rsid (input format 1) or chr:pos:A1:A2 (input format 2). The S-PrediXcan results were also consistent. I therefore guess that when liftover is used to convert the SNP positions in order to obtain the panel_variant_id, the rsid is not needed; only the CHR, position, A1, and A2 of the SNP are. Since much of my data does not provide rsids directly, I am eager to find out whether I can use format 2 to run the following S-PrediXcan code on my own GWAS summary data.

input format 1:

markername1 chr bp_hg19 effect_allele noneffect_allele effect_allele_freq median_info beta se_dgc p_dgc ...
rs143225517 1 751756 C T 0.158264 0.92 0.013006 0.017324 0.4528019
rs3094315 1 752566 A G 0.763018 1 -0.005243 0.0157652 0.7394597
...

input format 2:

markername2 chr bp_hg19 effect_allele noneffect_allele effect_allele_freq median_info beta se_dgc p_dgc ...
1:751756:T:C 1 751756 C T 0.158264 0.92 0.013006 0.017324 0.4528019
1:752566:G:A 1 752566 A G 0.763018 1 -0.005243 0.0157652 0.7394597
...
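The two formats differ only in the markername column; the chromosome, position, and allele columns are identical. To make the guess concrete, here is a minimal Python sketch of matching a GWAS row to a panel entry by chromosome, position, and alleles instead of by rsid. It is an illustration of the idea only, not the actual gwas_parsing.py logic. The function name, the toy panel, the assumption that positions are already on the panel's build, and the convention that the panel's second allele is the effect allele are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Key: (chromosome, position, non_effect_allele, effect_allele)
Key = Tuple[str, int, str, str]

# Toy panel with a made-up entry; positions assumed to be on the panel's build.
TOY_PANEL: Dict[Key, str] = {
    ("chr1", 1000, "A", "G"): "chr1_1000_A_G_b38",
}


def match_by_position(
    chrom: str,
    pos: int,
    effect_allele: str,
    non_effect_allele: str,
    panel: Dict[Key, str],
) -> Optional[Tuple[str, int]]:
    """Return (panel_variant_id, sign), or None when the variant is absent.

    sign is +1 when the alleles are in the panel's orientation and -1 when
    they are swapped, in which case the effect size would need its sign flipped.
    """
    direct = (chrom, pos, non_effect_allele, effect_allele)
    swapped = (chrom, pos, effect_allele, non_effect_allele)
    if direct in panel:
        return panel[direct], 1
    if swapped in panel:
        return panel[swapped], -1
    return None


if __name__ == "__main__":
    print(match_by_position("chr1", 1000, "G", "A", TOY_PANEL))
    print(match_by_position("chr1", 1000, "A", "G", TOY_PANEL))
    print(match_by_position("chr1", 9999, "A", "G", TOY_PANEL))
```

Note that in the format 2 rows above the marker lists the non-effect allele before the effect allele (1:751756:T:C with effect allele C), so any key built from the marker string has to respect that order.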

code:

python $GWAS_TOOLS/gwas_parsing.py \
  -gwas_file $DATA/gwas_try/cad_snp.txt \
  -liftover $DATA/liftover/hg19ToHg38.over.chain.gz \
  -snp_reference_metadata $DATA/reference_panel_1000G/variant_metadata.txt.gz METADATA \
  -output_column_map markername variant_id \
  -output_column_map noneffect_allele non_effect_allele \
  -output_column_map effect_allele effect_allele \
  -output_column_map beta effect_size \
  -output_column_map p_dgc pvalue \
  -output_column_map chr chromosome \
  --chromosome_format \
  -output_column_map bp_hg19 position \
  -output_column_map effect_allele_freq frequency \
  --insert_value sample_size 184305 --insert_value n_cases 60801 \
  -output_order variant_id panel_variant_id chromosome position effect_allele non_effect_allele frequency pvalue zscore effect_size standard_error sample_size n_cases \
  -output $OUTPUT/harmonized_gwas/cad_snp_harmo.txt.gz

for sub_batch in {0..9}; do
  GWAS_TOOLS=/home/sln/summary-gwas-imputation-master/src
  DATA=/home/sln/summary-gwas-imputation-master/sample_data/data
  OUTPUT=/home/sln/summary-gwas-imputation-master/sample_data/test
  python $GWAS_TOOLS/gwas_summary_imputation.py \
    -by_region_file $DATA/eur_ld.bed.gz \
    -gwas_file $OUTPUT/harmonized_gwas/cad_snp_harmo.txt.gz \
    -parquet_genotype $DATA/reference_panel_1000G/chr1.variants.parquet \
    -parquet_genotype_metadata $DATA/reference_panel_1000G/variant_metadata.parquet \
    -window 100000 \
    -parsimony 7 \
    -chromosome 1 \
    -regularization 0.1 \
    -frequency_filter 0.01 \
    -sub_batches 10 \
    -sub_batch $sub_batch \
    --standardise_dosages \
    -output $OUTPUT/summary_imputation/cad_snp_harmo_chr1_sb$sub_batch.txt.gz
done

python $GWAS_TOOLS/gwas_summary_imputation_postprocess.py \
  -gwas_file $OUTPUT/harmonized_gwas/cad_snp_harmo.txt.gz \
  -folder $OUTPUT/summary_imputation \
  -pattern cad_snp_harmo.* \
  -parsimony 7 \
  -output $OUTPUT/processed_summary_imputation/imputed_cad_snp_harmo.txt.gz

python $METAXCAN/SPrediXcan.py \
  --gwas_file $OUTPUT/processed_summary_imputation/imputed_cad_snp_harmo.txt.gz \
  --snp_column panel_variant_id \
  --effect_allele_column effect_allele \
  --non_effect_allele_column non_effect_allele \
  --zscore_column zscore \
  --model_db_path $DATA/models/eqtl/mashr/mashr_Whole_Blood.db \
  --covariance $DATA/models/eqtl/mashr/mashr_Whole_Blood.txt.gz \
  --keep_non_rsid --additional_output --model_db_snp_key varID \
  --throw \
  --output_file $OUTPUT/spredixcan/imputed_cad_snp_harmo_Whole_Blood.csv
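One way to check that two S-PrediXcan runs (for example, one per input format) really agree is to compare the per-gene z-scores in their output files. Here is a minimal Python sketch of that comparison. It assumes the output CSV has `gene` and `zscore` columns; the function names and the toy rows in the demo are made up for illustration.

```python
import csv
import io
import math
from typing import Dict, TextIO


def load_zscores(handle: TextIO) -> Dict[str, float]:
    """Read gene -> zscore from a CSV, skipping rows whose zscore is not a finite number."""
    zscores: Dict[str, float] = {}
    for row in csv.DictReader(handle):
        try:
            value = float(row["zscore"])
        except ValueError:
            continue  # e.g. 'NA'
        if math.isfinite(value):
            zscores[row["gene"]] = value
    return zscores


def max_abs_difference(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Largest absolute z-score difference over genes present in both runs (nan if none)."""
    shared = a.keys() & b.keys()
    if not shared:
        return math.nan
    return max(abs(a[gene] - b[gene]) for gene in shared)


if __name__ == "__main__":
    # Toy stand-ins for two result files; with real files use open(path, newline="").
    run1 = load_zscores(io.StringIO("gene,zscore\ngeneA,1.5\ngeneB,NA\n"))
    run2 = load_zscores(io.StringIO("gene,zscore\ngeneA,1.5\ngeneB,0.2\n"))
    print(sorted(run1.keys() ^ run2.keys()))  # genes scored in only one run
    print(max_abs_difference(run1, run2))
```

Comparing the set of genes as well as the z-scores matters, because a run that silently drops variants can lose genes while the remaining z-scores still match.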

hakyimlab / MetaXcan