hakyimlab / MetaXcan

MetaXcan software and manuscript

Predict.py with imputation data #160

Open Hannah430 opened 2 years ago

Hannah430 commented 2 years ago

Hi,

I have been trying to run PrediXcan on data imputed with TOPMed. However, the program has been running for many days, and I am wondering whether this is normal or whether I did something wrong.

The imputed data is already in the hg38 build. Here is an example of an input variant ID: chr21:10017726:G:A

Here is the script:

python3 ./Predict.py \
  --model_db_path $MODEL/mashr_Brain_Frontal_Cortex_BA9.db \
  --model_db_snp_key varID \
  --vcf_genotypes $DATA/chr*.clean.noMono.vcf.gz \
  --vcf_mode imputed \
  --on_the_fly_mapping METADATA "{}_{}_{}_{}_b38" \
  --prediction_output $RESULTS/predict.txt \
  --prediction_summary_output $RESULTS/predict_summary.txt \
  --verbosity 9 \
  --throw

I am also wondering if there is a way to see how many SNPs were used by the PrediXcan model. I did a trial run with data from one chromosome, and the _summary.txt shows "NA" for the majority of the predicted genes.

Thank you! Hannah

Fnyasimi commented 2 years ago

Running for many days is already a problem; kindly check the log for errors. Ideally it should take a short time to run, and the software reports the % of the model's SNPs used for prediction.
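To check SNP usage independently of the run, one option is to compare the model's variant IDs with the IDs available in the genotype data. The following is only a rough sketch: it assumes the model .db file is an SQLite database with a `weights` table containing `gene` and `varID` columns, and `per_gene_coverage` is a hypothetical helper, not part of MetaXcan.

```python
import sqlite3

def per_gene_coverage(db_path, genotype_ids, key_column="varID"):
    """Return {gene: (n_model_snps, n_found_in_genotypes)}.

    Assumes (not confirmed in this thread) that the model database has a
    `weights` table with a `gene` column and the chosen key column.
    """
    if key_column not in ("varID", "rsid"):
        raise ValueError("key_column must be 'varID' or 'rsid'")
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(f"SELECT gene, {key_column} FROM weights").fetchall()
    finally:
        con.close()
    available = set(genotype_ids)
    coverage = {}
    for gene, snp in rows:
        total, found = coverage.get(gene, (0, 0))
        # Count every model SNP, and separately those present in the genotypes.
        coverage[gene] = (total + 1, found + (snp in available))
    return coverage
```

A gene whose second number is 0 has no model SNPs in the genotype data, which is one plausible reason for "NA" entries when only a single chromosome is supplied.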

Hannah430 commented 2 years ago

Hi,

So far, the log just says that it is processing the VCF files. I'm wondering if it is okay to input genotype data as many files, split by chromosome number. I am also wondering if --on_the_fly_mapping METADATA "{}_{}_{}_{}_b38" is correct for my case (an example of my input: chr21:10017726:G:A).

Thank you! Hannah

Fnyasimi commented 2 years ago

You can input the genotype files for each chromosome using a wildcard, as you have done in your command above. You can use the on-the-fly mapping argument to reconstruct the varID from the respective columns in the VCF file.
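As an illustration of the varID reconstruction described above, the sketch below fills the template from the thread with the four fields of the example ID. It assumes the placeholders are filled in the order chromosome, position, reference allele, alternate allele; the function names are hypothetical and this is not Predict.py's own code.

```python
def build_var_id(chrom, pos, ref, alt, template="{}_{}_{}_{}_b38"):
    """Fill the mapping template with chromosome, position, ref and alt."""
    return template.format(chrom, pos, ref, alt)

def var_id_from_colon_id(snp_id, template="{}_{}_{}_{}_b38"):
    """Convert an ID such as 'chr21:10017726:G:A' into the template's form."""
    chrom, pos, ref, alt = snp_id.split(":")
    return build_var_id(chrom, int(pos), ref, alt, template)

print(var_id_from_colon_id("chr21:10017726:G:A"))  # chr21_10017726_G_A_b38
```

If the chromosome field in the VCF is written without the "chr" prefix (for example "21"), the same template would give "21_10017726_G_A_b38". In that case a template such as "chr{}_{}_{}_{}_b38" would be needed for the result to match IDs of the form chr21_10017726_G_A_b38.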

Hannah430 commented 2 years ago

Hi,

I have tried your suggestions, but they still haven't fixed the lengthy running time. I am wondering whether this is normal, and how I can fix it if it is not.

Also, while the program is running, there is a message indicating that the index file is older than the data file. Will this cause a problem?

Thank you! Hannah
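On the index message: a warning of this kind usually means the index next to a .vcf.gz file has an earlier modification time than the file itself, for instance after the data file was copied or rewritten. The sketch below is one way to list such files. It assumes the indexes sit beside the data files with a `.tbi` or `.csi` suffix, and `stale_indexes` is a hypothetical helper.

```python
import glob
import os

def stale_indexes(pattern, index_suffixes=(".tbi", ".csi")):
    """Return (data_file, index_file) pairs where the index is older.

    Assumes each index is named <data_file> + suffix, e.g. x.vcf.gz.tbi.
    """
    stale = []
    for vcf in sorted(glob.glob(pattern)):
        for suffix in index_suffixes:
            index = vcf + suffix
            # Only existing indexes are compared; a missing index is not reported.
            if os.path.exists(index) and os.path.getmtime(index) < os.path.getmtime(vcf):
                stale.append((vcf, index))
    return stale
```

If any pairs are reported, rebuilding the index (for example with `tabix -f -p vcf <file>`) should make the timestamps consistent again. Whether the stale index is related to the long running time is not established in this thread.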