Problems with input data

Alina-Song commented 2 years ago

Hi,Sorry to bother you @Fnyasimi. when I used the sample data(cad.add.160614.website), I found that the results of harmonization(Harmonization, not Quick Harmonization) and imputation were the same whether the markername entered was an rsid (input format 1) or chr:pos:A1:A2 (input format 2). The results of S-Predixcan were also consistent. Therefore, I guess when you use liftover to convert position information of the SNP in order to get panel_variant_id , you don't need the rsid, only need the CHR, Position, A1, A2 of the SNP. Since a lot of the data I'm using is not directly given to the rsid, I am eager to determine if I can use format 2 to run the following S-Predixcan code on my own GWAS summary data.

input format 1: markername1 chr bp_hg19 effect_allele noneffect_allele effect_allele_freq median_info beta se_dgc p_dgc ... rs143225517 1 751756 C T 0.158264 0.92 0.013006 0.017324 0.4528019 rs3094315 1 752566 A G 0.763018 1 -0.005243 0.0157652 0.7394597 ...

input format 2: markername2 chr bp_hg19 effect_allele noneffect_allele effect_allele_freq median_info beta se_dgc p_dgc ... 1:751756:T:C 1 751756 C T 0.158264 0.92 0.013006 0.017324 0.4528019 1:752566:G:A 1 752566 A G 0.763018 1 -0.005243 0.0157652 0.7394597 ...

code: python $GWAS_TOOLS/gwas_parsing.py -gwas_file $DATA/gwas_try/cad_snp.txt -liftover $DATA/liftover/hg19ToHg38.over.chain.gz -snp_reference_metadata $DATA/reference_panel_1000G/variant_metadata.txt.gz METADATA -output_column_map markername variant_id -output_column_map noneffect_allele non_effect_allele -output_column_map effect_allele effect_allele -output_column_map beta effect_size -output_column_map p_dgc pvalue -output_column_map chr chromosome --chromosome_format -output_column_map bp_hg19 position -output_column_map effect_allele_freq frequency --insert_value sample_size 184305 --insert_value n_cases 60801 -output_order variant_id panel_variant_id chromosome position effect_allele non_effect_allele frequency pvalue zscore effect_size standard_error sample_size n_cases -output $OUTPUT/harmonized_gwas/cad_snp_harmo.txt.gz

for sub_batch in {0..9}; do GWAS_TOOLS=/home/sln/summary-gwas-imputation-master/src DATA=/home/sln/summary-gwas-imputation-master/sample_data/data OUTPUT=/home/sln/summary-gwas-imputation-master/sample_data/test python $GWAS_TOOLS/gwas_summary_imputation.py -by_region_file $DATA/eur_ld.bed.gz -gwas_file $OUTPUT/harmonized_gwas/cad_snp_harmo.txt.gz -parquet_genotype $DATA/reference_panel_1000G/chr1.variants.parquet -parquet_genotype_metadata $DATA/reference_panel_1000G/variant_metadata.parquet -window 100000 -parsimony 7 -chromosome 1 -regularization 0.1 -frequency_filter 0.01 -sub_batches 10 -sub_batch $sub_batch --standardise_dosages -output $OUTPUT/summary_imputation/cad_snp_harmo_chr1_sb$sub_batch.txt.gz done

python $GWAS_TOOLS/gwas_summary_imputation_postprocess.py -gwas_file $OUTPUT/harmonized_gwas/cad_snp_harmo.txt.gz -folder $OUTPUT/summary_imputation -pattern cad_snp_harmo.* -parsimony 7 -output $OUTPUT/processed_summary_imputation/imputed_cad_snp_harmo.txt.gz

python $METAXCAN/SPrediXcan.py --gwas_file $OUTPUT/processed_summary_imputation/imputed_cad_snp_harmo.txt.gz --snp_column panel_variant_id --effect_allele_column effect_allele --non_effect_allele_column non_effect_allele --zscore_column zscore --model_db_path $DATA/models/eqtl/mashr/mashr_Whole_Blood.db --covariance $DATA/models/eqtl/mashr/mashr_Whole_Blood.txt.gz --keep_non_rsid --additional_output --model_db_snp_key varID --throw --output_file $OUTPUT/spredixcan/imputed_cad_snp_harmo_Whole_Blood.csv

Alina-Song commented 2 years ago

I sincerely look forward to your reply and I hope you can give me some suggestions when you have time @Fnyasimi .

Mia1037 commented 2 years ago

hi, I just started running S-Predixcan. According to this tutorial , the SNP format I input is also CHR: POS :A1:A1, and everything runs smoothly without any error. It seems that rsid is required for Quick Harmonization. I don't know if my understanding is right, but I still need the author to confirm @hakyim @Fnyasimi.

Fnyasimi commented 2 years ago

The harmonization step requires this argument to be present -snp_reference_metadata $DATA/reference_panel_1000G/variant_metadata.txt.gz METADATA. Your summary stats genome build should be the same as the reference metadata or you need to do a liftover. The metadata file contains information used to generate the panel_variant_id which matches with the ones in the model. The matching is done using chr and pos and it also changes the sign of the zscores in cases where the ref and alt are flipped.

hakyimlab / MetaXcan

Problems with input data #163