kkdey / GSSG

Gene Set + S2G strategy annotations analyzed for disease architecture
46 stars 12 forks source link

Please help with gene_scores #6

Closed bitcometz closed 3 years ago

bitcometz commented 3 years ago

hi, I follow the manuscipt to calculate the gene scores:

image

but a few genes (about 188/21787) are quite different from upload results: https://alkesgroup.broadinstitute.org/LDSCORE/Jagadeesh_Dey_sclinker/gene_scores/disease_progression_programs/UC/

these 188 genes are usually > 0 in my reuslt (score1) but 0 in the upload results (score2):

image

image

Please colud you tell me some details about how to calculate the scores?

Thanks !!!

kkdey commented 3 years ago

Hi @bitcometz, you may see some differences because of the differences in your gene universe versus mine (from ExPecto paper, Zhou et al 2018 Nat Genet). Also, we only consider genes with differentially higher expression in disease state versus healthy. Or maybe you can check if there are any bugs in the code for processing the programs. Hard to pinpoint.

bitcometz commented 3 years ago

@kkdey , thanks for your reply. I almost got the result from scanpy and there are very few possibilities that we will go wrong. Can you provide relevant scripts for calculating gene score?

Thanks !!!

fanying2015 commented 3 years ago

Hi @bitcometz, you may see some differences because of the differences in your gene universe versus mine (from ExPecto paper, Zhou et al 2018 Nat Genet). Also, we only consider genes with differentially higher expression in disease state versus healthy. Or maybe you can check if there are any bugs in the code for processing the programs. Hard to pinpoint.

@kkdey I also have questions about the gene score calculation. Does it only use the transformed and normalized p-value as the input gene score for LDSC annotation, as shown at the top of the thread?

I thought the max-min normalization was done across cell types in each scRNA-seq dataset. If so there should be one cell type of value=1 for each gene if the gene is expressed, but no cell (L2) has NCAM1=1 in the ICA_bonemarrow for instance. https://alkesgroup.broadinstitute.org/LDSCORE/Jagadeesh_Dey_sclinker/gene_scores/cell_type_programs/ICA_bonemarrow/

I'm concerned that I missed something for the gene score calculation. I'm not sure whether other information is also incorporated here like the ones you did with the blood annotations.

Below is the R code I used for gene score calculation, adapted from the https://www.nature.com/articles/s41588-019-0456-1 paper cited.

Thank you!

pvalue_log_mat <- apply(pvalue_mat, 2, function(x) -2*log(x)) tmp <- max(pvalue_log_mat[!is.infinite(pvalue_log_mat)]) pvalue_log_mat[is.infinite(pvalue_log_mat)] <- tmp

scale_func <- function(input) { input[input==0] <- NA scaled_output <- (input-min(input, na.rm = T))/(max(input, na.rm = T)-min(input, na.rm = T)) scaled_output[is.na(scaled_output)] <- 0 return(scaled_output) }

scaled_mat_v2 <- t(apply(pvalue_log_mat, 1, function(x) scale_func(x)))

kkdey commented 3 years ago

@fanying2015 Yes the gene score is calculated per cell type by transforming and normalizing the p-values as mentioned above. The gene score of 1 represents cases where we get very high significance for that gene in that cell type. The max-min normalization is done one cell-type at a time to generate the program for that cell type

fanying2015 commented 3 years ago

@fanying2015 Yes the gene score is calculated per cell type by transforming and normalizing the p-values as mentioned above. The gene score of 1 represents cases where we get very high significance for that gene in that cell type. The max-min normalization is done one cell-type at a time to generate the program for that cell type

@kkdey Thank you for the quick response! When you say the normalization was done one cell-type at a time, does it mean it was normalized across all the genes in that cell type? Also, do you have a threshold for 'very high significance'?

Thank you!

kkdey commented 3 years ago

Here for cell type program corresponding to cell type X, it was DE of gene A in cell type X compared to gene A in other cell types. Then that p value from the DE analyzed and transformed in the above way, usually we take 10^-8 and above as very high and close to 1. No normalization was performed across genes. We did investigate normalization across genes and it was much less informative than the approach I mentioned above.

fanying2015 commented 3 years ago

Here for cell type program corresponding to cell type X, it was DE of gene A in cell type X compared to gene A in other cell types. Then that p value from the DE analyzed and transformed in the above way, usually we take 10^-8 and above as very high and close to 1. No normalization was performed across genes. We did investigate normalization across genes and it was much less informative than the approach I mentioned above.

Thank you for the quick response! I tried with a few things. The results are not exactly the same but the correlations are much higher than what I had in the past( > 95%). I'll go ahead and see whether the final output makes sense. Also, thank you for the tutorial with the S2G link. The example_geneset works for me. I'll see how it works with real gene score input.

kkdey commented 3 years ago

@fanying2015 Thats great to hear. Thanks.

FelixTheStudent commented 3 years ago

Hi @fanying2015,

I'm interested in your current R code to transform p-values to gene program scores. The code you posted above makes sense to me, but you mentioned you tried more things in the meantime and now get more reasonable results. Could you share what you have so far, even if it's incomplete, and I see what makes sense in my own hands?

That would be much appreciated! Thanks also @kkdey for the swift responses, and good luck with publishing this nice project!

fanying2015 commented 3 years ago

Hi @fanying2015,

I'm interested in your current R code to transform p-values to gene program scores. The code you posted above makes sense to me, but you mentioned you tried more things in the meantime and now get more reasonable results. Could you share what you have so far, even if it's incomplete, and I see what makes sense in my own hands?

That would be much appreciated! Thanks also @kkdey for the swift responses, and good luck with publishing this nice project!

Hi @FelixTheStudent Below is the key code I used. For some datasets the correlation could be lower. An example zscore file for ICA bone marrow I generated from the step1 code is also attached. Let me know if you find some bugs or ways to improve the results.

Thank you!

ICA_bonemarrow_score.csv

pvalue_mat <- apply(zscore_mat, 2, function(x) pnorm((x), mean=0, sd=1, lower.tail=F)) pvalue_mat[which(zscore_mat <= 0)] <- 1 pvalue_log_mat <- apply(pvalue_mat, 2, function(x) -2*log(x))

scale_func <- function(input) { if(max(input)==Inf & max(input[!is.infinite(input)]) < -2log(1e-8)) { scaled_output <- rep(0, length(input)) scaled_output[which(is.infinite(input))] <- 1 } else if(max(input) < -2log(1e-8)) { max_value=-2*log(1e-8) scaled_output <- (input-min(input, na.rm = T))/(max_value -min(input, na.rm = T)) scaled_output[is.na(scaled_output)] <- 0 } else { max_value=max(input[!is.infinite(input)]) tmp <- max(input[!is.infinite(input)]) input[is.infinite(input)] <- tmp scaled_output <- (input-min(input, na.rm = T))/(max_value -min(input, na.rm = T)) scaled_output[is.na(scaled_output)] <- 0 } return(scaled_output) }

scaled_mat <- t(apply(pvalue_log_mat, 1, function(x) scale_func(x)))

FelixTheStudent commented 3 years ago

Awesome, thanks. I'll look at this soon and will get back to you if I find bugs or improvements.

Best,

Felix

FelixTheStudent commented 3 years ago

Hi @fanying2015 and @kkdey,

Fanying's code runs without error (after adding '' in two places: -2log(p) --> -2log(p)).

What's not clear to me, though, is where the z-scores you use are coming from. The supplied script exports p-values, logFC and the test statistics, right?

Glad if you could comment briefly.

Best,

Felix

FelixTheStudent commented 3 years ago

I assume the Wilcoxon score is the z-score? In that case, I can not reproduce the gene program for Glutamatergic_L2.

image

jupyter_score is what I computed with the jupyter notebook. alkes_score is what I downloaded from the authors.

Any ideas?

Best,

Felix

FelixTheStudent commented 3 years ago

Hi @fanying2015,

I figured out my own way of transforming p-values to gene program scores, and will share it here for completeness.

In the following, brain_score.csv and brain_pval.csv were generated by the supplied python script.

# min_pval has to be set as closely to Jagadeesh's default as possible
# (I'm guessing python's machine epsilon), and I found 1e-8 to work reasonably well.
pval_to_score <- function(pvalues, min_pval=1e-8) {
  x <- pvalues
  # NAs should have 0 in final output, so I set p=1 here:
  x[is.na(x)] <- 1
  # The minmax transformation below produces drastically different results
  # depending on the minimal p-value, so it has to be controlled:
  x <- scales::squish(x, range = c(min_pval, 1), only.finite = FALSE)
  x <- -2 * log(x)                    # follows a Chi-square distribution
  x <- (x-min(x)) / (max(x)-min(x))   # min/max transformation
  return(x)
}

# Underexpressed genes have scores < 0.
score <- "results/playground/gene_programs/celltype_programs/brain_score.csv" %>%
  read_csv %>%
  column_to_rownames("X1") %>% 
  data.frame
pval  <- "results/playground/gene_programs/celltype_programs/brain_pval.csv" %>%
  read_csv %>%
  column_to_rownames("X1") %>% 
  data.frame

# sclinker uses OVERexpression; Set p-values for UNDERexpressed genes to 1:
for(col_id in 1:ncol(pval)) {
  pvalues <- pval[, col_id]
  down <- score[,col_id] < 0
  pvalues[down] <- 1
  pval[, col_id] <- pvalues
}

# convert to gene program score:
gene_scores <- apply(pval, 2, pval_to_score, min_pval=3e-8)

The correlation with the gene program Jagadeesh et al. uploaded (download here) is 0.99: glutamatergic_cor is 0.99.

Best,

Felix

fanying2015 commented 3 years ago

Hi @fanying2015,

I figured out my own way of transforming p-values to gene program scores, and will share it here for completeness.

In the following, brain_score.csv and brain_pval.csv were generated by the supplied python script.

# min_pval has to be set as closely to Jagadeesh's default as possible
# (I'm guessing python's machine epsilon), and I found 1e-8 to work reasonably well.
pval_to_score <- function(pvalues, min_pval=1e-8) {
  x <- pvalues
  # NAs should have 0 in final output, so I set p=1 here:
  x[is.na(x)] <- 1
  # The minmax transformation below produces drastically different results
  # depending on the minimal p-value, so it has to be controlled:
  x <- scales::squish(x, range = c(min_pval, 1), only.finite = FALSE)
  x <- -2 * log(x)                    # follows a Chi-square distribution
  x <- (x-min(x)) / (max(x)-min(x))   # min/max transformation
  return(x)
}

# Underexpressed genes have scores < 0.
score <- "results/playground/gene_programs/celltype_programs/brain_score.csv" %>%
  read_csv %>%
  column_to_rownames("X1") %>% 
  data.frame
pval  <- "results/playground/gene_programs/celltype_programs/brain_pval.csv" %>%
  read_csv %>%
  column_to_rownames("X1") %>% 
  data.frame

# sclinker uses OVERexpression; Set p-values for UNDERexpressed genes to 1:
for(col_id in 1:ncol(pval)) {
  pvalues <- pval[, col_id]
  down <- score[,col_id] < 0
  pvalues[down] <- 1
  pval[, col_id] <- pvalues
}

# convert to gene program score:
gene_scores <- apply(pval, 2, pval_to_score, min_pval=3e-8)

The correlation with the gene program Jagadeesh et al. uploaded (download here) is 0.99: glutamatergic_cor is 0.99.

Best,

Felix

Hi Felix,

Thank you for sharing the code. I'm glad it worked nicely! I would definitely love to try it and see whether it improves the final outcome.

Two quick questions/comments: 1) Why did you use "min_pval=3e-8" in the last line of code? 2) Actually I had some issue choosing between the pvalue output and zscore output. I believe the pvalue output is actually "pvals_adj" from the codes you cited. That's why there's NA values. I'm not quite sure whether they are using real pvalues (which can be converted from zscores) or p.adj. It's more a comment, as long as the codes work.