precimed / gsa-mixer

GNU General Public License v3.0
8 stars 2 forks source link

Interpretation of the results #5

Open humanpaingeneticslab opened 3 months ago

humanpaingeneticslab commented 3 months ago

Hello,

Is this the correct interpretation of the results?

Column output for gsa-mixer include 'enrich' and 'enrich_std'. Was expecting 'enrich_se' and/or 'enrich_pval'.

Can we do: enrich_se = enrich_std / sqrt( 100 ) enrich_Z = enrich / enrich_se enrich_pval = 2*pnorm( -abs( enrich_Z ) )

Fig. 2A of the paper states "with error bars showing Hessian-based standard error of the model estimates"; so these error bars would have been computed like 'enrich_se' above, right? Because the standard deviations were based on "we sample N = 100 realizations ... and report the standard deviations across the realizations".

Ah, I should instead use the 'gsa_mixer_combine.py' script, that does it all? "The main branch of gsa-mixer does not contain the path scripts/gsa_mixer_combine.py." :-(

Solved? "MIXER_FIGURES_PY combine ..."

humanpaingeneticslab commented 2 months ago

[X] ABSENT: scripts/gsa_mixer_combine.py [X] ABSENT: "MIXER_FIGURES_PY combine ..."

Notice that since the reported enrichment values are akin to odds ratios, we have to first take the log of the enrichment values before combining them (mean, standard deviations, etc.). My initial intent was to combine the enrichment values into a meta-analysis across the 20 replicates, with each replicate weighted by its standard error across the 100's of realizations. Unfortunately, since we have to work in the log-transformed space, we cannot simply take the log(enrich_std)... for that, we would need the enrichment values for each realization of each replicate... The R code below is intended as a temporary patch, to explore the results whilst a true solution to combine the replicates is provided.

R --no-save --quiet
options( width=121 )

#----- function to load a replicate file
load_replicate <- function( ith_rep )
{
  strFile <- paste0( 
   "model_rep", ith_rep, ".go_test_enrich.csv" )

  cat( "loading replicate ", ith_rep, "\n", sep="" )
  data <- read.table( 
   strFile,
   sep="\t", header=TRUE,
   as.is=TRUE, stringsAsFactors=FALSE,
   quote="", comment.char="" )

  mini <- data[ , c( "GO", "enrich" ) ]
  #-- enrich ~ odds ratio, so log it!
  mini[,"enrich"] <- log2( mini[,"enrich"] )
  colnames( mini ) <- c( 
    "GO", 
    paste0("log2E_rep",ith_rep) )
  mini
}

#----- load all replicates
data <- load_replicate( 1 )
for( i in 2:20 ) 
  { data <- merge( data, load_replicate( i ), by=1 ) }
nrow( data )
#-- 10474 = 10,474
ncol( data )
#-- 21 = 1 "GO" + 20 rep x 1 cols/rep

#----- compute mean, sd, se, ...
rownames( data ) <- data[,"GO"]
data   <- data[ , -c(1) ]
vec_mu <- apply( data, 1, mean )
vec_sd <- apply( data, 1, sd   )
vec_se <- vec_sd / sqrt( ncol( data ) )
vec_se[ which( 0 == vec_se ) ] <- 1

data[,"beta"] <- vec_mu
data[,"se"  ] <- vec_se
data[,"stat"] <- vec_mu / vec_se
# 95% CI: beta +/- 1.96 * se
# original units of enrichment: 2^beta (since we took log2)
data[,"pval"] <- 2*pnorm( -abs( data[,"stat"] ) )
data[,"padj"] <- p.adjust( data[,"pval"], method="fdr" )
data <- data.frame( 
  GO=rownames( data ), data, stringsAsFactors=FALSE )

#----- save results
#-- we want to sort by decreasing enrichment:
O <- order( data[,"stat"], decreasing=TRUE )
data <- data[ O, ]
head( data )

write.table( data, 
  file="go_test_enrich.csv",
  sep="\t",
  col.names=TRUE, row.names=FALSE, quote=FALSE )
ofrei commented 1 month ago

@humanpaingeneticslab sorry again for all the mess with the codes, and for my slow response.

1) enrich_std in the output files is now replaced with se_enrich. This is how it should be named, because it's a hessian-based standard errors, derived from the log-likelihood function as described in the publication. There is no need for adjustment, i.e. it would be a mistaked to do enrich_se = enrich_std / sqrt( 100 ). Overall the use of SE might be confusing, because it's not a standard error of the mean - it's an uncertainty of a parameter estimate, derived from likelihood function. Generally I think it's well accepted to refer to this as SE (for example LD score regression also provides SEs to its point estimates of h2 and rg, thought hey are not SEs of a mean value of some sort).

2) RE scripts/gsa_mixer_combine.py / MIXER_FIGURES_PY combine ..., this is no longer necessary as GSA-MiXeR procedure do not involve 20 repeats (see point 2 here ). For analyses done with --gsa-mixer-test gsa-mixer-genesetLOO-annot_10mar2023.csv, and also with a MAGMA run done as in scripts/GSA_MIXER.job it might be relevant to run post-processing script https://github.com/precimed/gsa-mixer/blob/main/scripts/process_gsa_mixer_output.py . This will prepare key output tables as in supplementary tables of the manuscript. An example inputs / SupplementaryTable.xlsx output of this script is here: https://github.com/precimed/gsa-mixer/tree/main/out_example

3) I advice against relying on statistical inference based on GSA-MiXeR parameters, i.e. p-values and FDRs computed as follows may not be well calibrated:

data[,"pval"] <- 2*pnorm( -abs( data[,"stat"] ) )
data[,"padj"] <- p.adjust( data[,"pval"], method="fdr" )

This is highlighted in the discussion section:

The GSA-MiXeR tool has some limitations. First, it was not feasible to provide formal p-values due to technical reasons and difficulties in defining the null-hypothesis (see Supplementary Information), thus GSA-MiXeR relies on the MAGMA tool to pre-filter the set of gene-sets for the most conservative analysis. Our exploratory analysis (without pre-filtering by MAGMA) selects gene-sets based on AIC criteria, which does not allow for multiple testing correction; we however confirmed that ranking gene-sets according to GSA-MiXeR fold enrichment is at least as stable as ranking according to conservatively defined MAGMA p-values; additionally, all estimates have SEs to evaluate their uncertainty. Second, SEs are derived from the likelihood function, and may in some cases be not well calibrated, particularly for genes with close to zero heritability estimate. Real-data analysis with GSA-MiXeR is unlikely to be affected by this due to filtering genes on a positive AIC value, which implies sufficient curvature of the log-likelihood around the MLE point and justifies hessian-based SEs estimation.

I'm keeping this ticket open, good to discuss this further as interpretation of GSA_MiXeR results might be tricky due to limitations outlined above!