gusevlab / fusion_twas

methods for functional summary-based imputation
http://gusevlab.org/projects/fusion/
GNU General Public License v3.0
74 stars 43 forks source link

GEUV.compute_weights.sh #69

Closed alyssacl closed 3 months ago

alyssacl commented 3 months ago

Hello, I am testing out the GEUV.compute_weights.sh in just the 1000G.EUR population and keep running into issues.

It never creates a .hsq and gives me this error, indicating that the tmp file tmp.cv.BSLMM.param.txt doesn't exist and indeed it's never created.

The files in the .tmp folder it does create include: .bed/.bim/.fam .tmp.bed/tmp.bim/tmp.fam .tmp.cv.bed / .tmp.cv.bim / .tmp.cv.fam

Error in file(file, "rt") : cannot open the connection Calls: weights.bslmm -> read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file 'tmp/1_10/GD462.GeneQuantRPKM.50FN.samplename.resk10.auto.txt.ENSG00000152931.6.tmp.cv.BSLMM.param.txt': No such file or directory Execution halted

Script below.

#!/bin/sh
### -n requests total cores. In this case, we are requesting 2 cores
#SBATCH -n 2
### -N requests nodes. In this case, we are requesting 1 node
#SBATCH -N 1
### sends the stdout output of the script to a file appended with the job ID
#SBATCH --output job%j.out
### sends the stderr output of the script to a file appended with the job ID
#SBATCH --error job%j.err

cd /work/claygila/InterLymph/TWAS/GEU_weights
##Update test myk code
module load python3/anaconda/2021.11
source activate /home/claygila/.conda/envs/R_env
module load plink

# MAKE SURE FUSION.compute_weights.R IS IN YOUR PATH
# FILL IN THESE PATHS
GCTA="/work/claygila/InterLymph/TWAS/GEU_weights/gcta_nr_robust"
PLINK="/work/claygila/InterLymph/TWAS/GEU_weights/plink_linux_x86_64_20231211/plink"
GEMMA="/work/claygila/ENVs/gemma_env/bin/gemma"
# ALTERNATIVELY: ENSURE THAT plink, gcta, gemma CAN BE CALLED FROM PATH AND REMOVE --PATH_* FLAGS BELOW
# PATH TO DIRECTORY CONTAINING LDREF DATA (FROM FUSION WEBSITE or https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2)
LDREF="/work/claygila/InterLymph/TWAS/GEU_weights/LDREF"
# THIS IS USED TO RESTRICT INPUT SNPS TO REFERENCE IDS ONLY

# PATH TO GEUVADIS GENE EXPRESSION MATRIX:
PRE_GEXP="GD462.GeneQuantRPKM.50FN.samplename.resk10.auto.txt"
# GEUVADIS DATA WAS DOWNLOADED FROM https://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-1/files/analysis_results/
# GENERATE ID FILE
cat $PRE_GEXP | head -n1 | tr '\t' '\n' | tail -n+5 | awk '{ print $1,$1 }' > ${PRE_GEXP}.ID

# PATH TO PREFIX FOR GEUVADIS GENOTYPES SPLIT BY CHROMOSOME
# SUBSAMPLE THESE TO THE LDREF SNPS FOR EFFICIENCY
PRE_GENO="1000G.EUR"

# PATH TO OUTPUT DIRECTORY (population-specific subdirs will be made)
OUT_DIR="./WEIGHTS"

# ROWS IN THE MATRIX TO ANALYZE (FOR BATCHED RUNS)
BATCH_START=1
BATCH_END=10

# --- BEGIN SCRIPT:

NR="${BATCH_START}_${BATCH_END}"
mkdir --parents tmp/$NR
mkdir --parents hsq/$NR
mkdir --parents out/$NR
# THIS IS DIRECTORY WHERE THE OUTPUT WILL GO:
mkdir $OUT_DIR

# Loop through each gene expression phenotype in the batch
cat $PRE_GEXP | awk -vs=$BATCH_START -ve=$BATCH_END 'NR > s && NR <= e' |  while read PARAM; do

# Get the gene positions +/- 500kb
CHR=`echo $PARAM | awk '{ print $3 }'`
P0=`echo $PARAM | awk '{ print $4 - 0.5e6 }'`
P1=`echo $PARAM | awk '{ print $4 + 0.5e6 }'`
GNAME=`echo $PARAM | awk '{ print $1 }'`

OUT="tmp/$NR/$PRE_GEXP.$GNAME"

echo $GNAME $CHR $P0 $P1

# Pull out the current gene expression phenotype
echo $PARAM | tr ' ' '\n' | tail -n+5 | paste $PRE_GEXP.ID - > $OUT.pheno

# Get the locus genotypes for all samples and set current gene expression as the phenotype
$PLINK --bfile $PRE_GENO.$CHR --pheno $OUT.pheno --allow-no-sex --maf 0.01 --make-bed --out $OUT --keep $OUT.pheno --chr $CHR --from-bp $P0 --to-bp $P1 --extract $LDREF/1000G.EUR.$CHR.bim

# Process all samples together (for reference purposes only since this is mult-ethnic data)
FINAL_OUT="$OUT_DIR/.$GNAME"

# MAKE SURE FUSION.compute_weights.R IS IN YOUR PATH
Rscript FUSION.compute_weights.R --bfile $OUT --tmp $OUT.tmp --out $FINAL_OUT --verbose 0 --save_hsq --hsq_set 0.01 --PATH_gcta $GCTA --PATH_gemma $GEMMA --models blup,lasso,top1,enet

# Append heritability output to hsq file
cat $FINAL_OUT.hsq >> hsq/$NR.hsq

done

The files it generates for each row of the expression files: 1_10

Screenshot 2024-06-15 at 12 03 38 PM

Error / Out files: job16547466.out.txt job16547466.err.txt

Thanks for you insights, Alyssa

alyssacl commented 3 months ago

Update: To get this to run I had to add the suggested line below to my working directory and remove blup from model selection.

ln -s ./ output