opain / GenoPred

Genotype-based Prediction (GenoPred)
https://opain.github.io/GenoPred/
GNU General Public License v3.0
61 stars 20 forks source link

Questions about generating GeRS using FUSION_ref_scorer.R #75

Closed jianvhuang closed 1 year ago

jianvhuang commented 1 year ago

Hi Oliver,

I am trying to construct GeRS using the method you described in your paper Imputed gene expression risk scores: a functionally informed component of polygenic risk published on Human Molecular Genetics.

I obtained GTEx8 tissue-specific SNP-weights from FUSION webpage and performed tissue-specific TWAS using their R code FUSION.assoc_test.R.

I have also used your R code FUSION_score_file_creator.R to convert the FUSION SNP-weights into tissue-specific SCORE files. However, I got stuck at the FUSION_ref_scorer.R step.

1) According to the example code on your GitHub page [link], I believe I have everything ready except I am not sure what the pigz binary files should be. I have the pigz directory at ./pigz-2.7/, is this what I should put after --pigz ?

2) In both FUSION_score_file_creator.R and FUSION_ref_scorer.R, the below line is used to read to weight file. pos$FILE<-paste0(opt$weights_dir,'/',pos$PANEL,'/',pos$PANEL,'/',sub('.*/','',pos$WGT)) Do you intentionally include two pos$PANEL? I have to remove one for the code to work.

3) I got error messages running the below code

Rscript ./FUSION_ref_scorer.R  \
--ref_plink_chr /fs1/storedir/huangjian//analysis/AAA-data/1000G/LDREF/EAS/1000G.EAS.hm3.   \
--weights /fs1/storedir/huangjian//analysis/eQTL/GTEx8/precomputed_eQTL_weights/GTExv8.ALL.Whole_Blood.nofilter.pos \ 
--plink plink1.9   \
--pigz /fs1/storedir/huangjian//analysis/AAA-script/pigz-2.7/   \
--score_files /fs1/storedir/huangjian//analysis/eQTL/GTEx8/precomputed_eQTL_scores/SCORE_FILES_Whole_Blood   \
--output /fs1/storedir/huangjian//analysis/eQTL/GTEx8/1000G_ref_scores/1000G_EAS_hm3-Whole_Blood 

I got error messages like this one, and the output was only a lot of ENSGxxx.profile_mini files with only the gene ID in the file. cannot open file /fs1/storedir/huangjian//analysis/eQTL/GTEx8/1000G_ref_scores/ENSG00000215914.4.profile for reading (No such file or directory)

Any suggestions will be helpful. Thank you very much.

opain commented 1 year ago

Hello Jian Huang,

Thank you for your interest in the work.

  1. The --pigz parameter should be the complete path to the pigz binary file (i.e. including the pigz binary not just the directory it is in). Assuming the binary file is named 'pigz', your --pigz parameter should be '/fs1/storedir/huangjian//analysis/AAA-script/pigz-2.7/pigz'

  2. This depends on how you store the FUSION SNP-weights and the .pos file you are using. I have data for each PANEL in a separate folder, and when I specify this structure when I create the file 'All_tissues.pos' by adding an extra '$PANEL/' to the WGT column. For example a weight file is here 'SNP-weights/Whole_Blood/Whole_Blood/Whole_Blood.ENSG00000273355.1.wgt.RDat', and the .pos file WGT column would show 'Whole_Blood/Whole_Blood/Whole_Blood.ENSG00000273355.1.wgt.RDat'. So when I run FUSION_score_file_creator.R, I specify:

--weights All_tissues.pos \ --weights_dir SNP-weights \

Sorry if this is a little clunky. If editing the script makes it work for your set up, that is a fine solution.

  1. That error indicates it cannot find the file, suggesting the previous line failed. Is there a log file produced by plink called '/fs1/storedir/huangjian//analysis/eQTL/GTEx8/1000G_ref_scores/ENSG00000215914.4.log'. This would help a lot. Otherwise, perhaps double check the required score file exists. i.e. '/fs1/storedir/huangjian//analysis/eQTL/GTEx8/precomputed_eQTL_scores/SCORE_FILES_Whole_Blood/ENSG00000215914.4.SCORE', and if it does, look at it to see whether the file looks as it should. You could also try running the plink command on the command line to get the output printed to your screen. I.e.:
plink1.9 \
--bfile /fs1/storedir/huangjian//analysis/AAA-data/1000G/LDREF/EAS/1000G.EAS.hm3.1 \
--extract /fs1/storedir/huangjian//analysis/eQTL/GTEx8/precomputed_eQTL_scores/SCORE_FILES_Whole_Blood/ENSG00000215914.4.snplist \
--allow-no-sex \
--score /fs1/storedir/huangjian//analysis/eQTL/GTEx8/precomputed_eQTL_scores/SCORE_FILES_Whole_Blood/ENSG00000215914.4.SCORE 1 2 4 \
--out /fs1/storedir/huangjian//analysis/eQTL/GTEx8/1000G_ref_scores/ENSG00000215914.4

Let me know how you get on.

Ollie

jianvhuang commented 1 year ago

I can run the Rscript ./FUSION_ref_scorer.R after changing the--plink plink1.9 to --plink plink2. It should be a problem with the plink version in my system.

Running the below script directly also gives me the correct output (a logfile and a .sscore file).

plink2 \
--bfile /fs1/storedir/huangjian//analysis/AAA-data/1000G/LDREF/EAS/1000G.EAS.hm3.1 \
--extract /fs1/storedir/huangjian//analysis/eQTL/GTEx8/precomputed_eQTL_scores/SCORE_FILES_Whole_Blood/ENSG00000215914.4.snplist \
--allow-no-sex \
--score /fs1/storedir/huangjian//analysis/eQTL/GTEx8/precomputed_eQTL_scores/SCORE_FILES_Whole_Blood/ENSG00000215914.4.SCORE 1 2 4 \
--out /fs1/storedir/huangjian//analysis/eQTL/GTEx8/1000G_ref_scores/ENSG00000215914.4

The output for each gene includes a logfile, a .profile_mini file, and a .sscore file.

  1. Is it normal for some genes to have all NA values in the input ENSGxxxxxx.SCORE file? reference score for these genes cannot be calculated because there are no SNPs in the input .SCORE file.
  2. For some genes, the same error I mentioned is still returned butFUSION_ref_scorer.R still generate valid output. And the input .SCORE file for these genes are also valid (with four columns: SNP, A1, A2, SCORE)
  3. I also have a methodology question about using the reference sample. You mentioned in your paper that, "GeRS and PRS were calculated within a reference-standardized framework, whereby the resulting PRS and GeRS are not influenced by target sample specific properties including availability of variants and measurements of LD and allele frequency." I understand this can maintain consistency across GeRS in different target samples. But I wonder how important it is to use the LD and allele frequency from a reference sample (e.g., 1000G). Wouldn't it be better (more relevant) to use the LD and allele frequency from the target sample?

Thank you.

opain commented 1 year ago
  1. If the score file is all NA, it is probably worth checking the original weights file used (.Rdat). I would read the weights file into R and see what is going on.
  2. Try the advice I gave before, i.e. look at the log file produced by plink, or if that is not present try running the plink command on the command line to see what is causing the error.
  3. You can use any reference you like. I think being consistent is useful. I don't think there is a benefit to using the target sample as the reference, unless your target sample recapitulates the LD in the original data used to generate the SNP-weights (expression weights or GWAS effect sizes). The LD reference should match the training data, not the target.
jianvhuang commented 1 year ago

Thank you for the advice.

The main reason that I had .SCORE files with all NA rows and empty .SCORE files from FUSION_score_file_creator.R is that I used the nofilter.pos from GTEx8 SNP-weights. For these genes, the wgt.matrix of the best methods, as selected in your script best = which.min(cv.performance[2,]), contains all zero or all NA values.

Once I used the .pos files that only include genes with significant heritability to generate the score using FUSION_score_file_creator.R, most of these empty or NA .SCORE files were excluded from the analysis. However, there is one gene (ENSG00000120910.14) returning an empty .SCORE file based on GTEx8 Whole_Blood SNP-weights. The wgt.matrix of the best methods (in this case "enet") was zero for all SNPs.

However, a Prediction_failed.txt was returned. The error reported is "Error: No valid entries in --score file." After running the plink command directly for a few genes included in the Prediction_failed.txt, I found that the error plink returned was --score file entries were skipped due to mismatching allele codes. And by comparing the SNPs allele codes in the 1000G bim files with those in the RDat files, I found that the allele codes were coded differently in these files. Between 18 to 38 genes were reported in the Prediction_failed.txt for Whole_Blood and 13 brain-related tissues from GTEx8. I don't think FUSION_ref_scorer.R includes a step to harmonise these mismatched allele codes, but did you do anything about this in your work? It seems to me that it would be easier to change the allele codes in the RDat files (to match the 1000G bim files) if I want to make sure no genes were to be excluded due to mismatching allele codes. But I might have to flip the sign of the SNP-weights values if I am flipping A1 and A2.

For example,

# Whole_Blood .SCORE file for ENSG00000269713.7 (only two SNPs were reported for this gene)
           V1        V2   V3      V4
1: rs682536      A    G     -0.0288
2: rs12124527  A    G     -0.2501

# bim file of 1000G (I am using the EAS pop)
# a subset of SNPs available reported in the .SCORE file for ENSG00000269713.7
bim[V2%in%score3$V1,]
     V1         V2           V3          V4        V5 V6
1:    1    rs682536       0     144954851  T  C
2:    1    rs12124527   0     144896319  C  T

Also, it is best to stay with plink/1.9 because FUSION_ref_scorer.R take results from --score. The output formats of --score in plink/1.9 and plink/2 are different.

Thank you very much.