rcastelo / GSVA

Gene set variation analysis
174 stars 41 forks source link

Result of gsva is all NA #168

Open YihengJiang0912 opened 1 month ago

YihengJiang0912 commented 1 month ago

Dear authors, I am now using ssgsea method in gsva to evaluate the samples' enrichment score in a dataset, but the result is all NA, I would be very grateful if you could give me some guidance on that.

Belowing please kindly find the result: `> dim(expr_m) [1] 20502 10122

ssGSEA_matrix <- gsva(expr = expr_m, gset.idx.list = my_signature, method = 'ssgsea', kcdf = "Gaussian", min.sz > 1, abs.ranking = TRUE) Estimating ssGSEA scores for 14 gene sets. [1] "Calculating ranks..." [1] "Calculating absolute values from ranks..." |=========================================================================================================| 100%

[1] "Normalizing..." Warning messages: 1: In .filterFeatures(expr, method) : 4180 genes with constant expression values throuhgout the samples. 2: In .gsva(expr, mapped.gset.idx.list, method, kcdf, rnaseq, abs.ranking, : Some gene sets have size one. Consider setting 'min.sz > 1'.

ssGSEA_matrix[1:5,1:5] TCGA-OR-A5J1-01 TCGA-OR-A5J2-01 TCGA-OR-A5J3-01 TCGA-OR-A5J5-01 TCGA-OR-A5J6-01 Immune.enrichment.score NA NA NA NA NA Immune.cell.subsets NA NA NA NA NA Immune.signaling.molecules NA NA NA NA NA X13.T.cell.signature NA NA NA NA NA T.cells NA NA NA NA NA `

However, if I only use part of my dataset, I can get the results:

`> ssGSEA_matrix <- gsva(expr = expr_m[,1:100], gset.idx.list = my_signature, method = 'ssgsea', kcdf = "Gaussian", min.sz > 1, abs.ranking = TRUE) Estimating ssGSEA scores for 14 gene sets. [1] "Calculating ranks..." [1] "Calculating absolute values from ranks..." |=========================================================================================================| 100%

[1] "Normalizing..." Warning messages: 1: In .filterFeatures(expr, method) : 571 genes with constant expression values throuhgout the samples. 2: In .gsva(expr, mapped.gset.idx.list, method, kcdf, rnaseq, abs.ranking, : Some gene sets have size one. Consider setting 'min.sz > 1'.

ssGSEA_matrix[1:5,1:5] TCGA-OR-A5J1-01 TCGA-OR-A5J2-01 TCGA-OR-A5J3-01 TCGA-OR-A5J5-01 TCGA-OR-A5J6-01 Immune.enrichment.score 0.07036719 0.04767565 -0.001057149 0.00694892 0.13481281 Immune.cell.subsets -0.24078220 -0.22031222 -0.299557598 -0.29809871 -0.18673913 Immune.signaling.molecules -0.25096208 -0.19650415 -0.319315871 -0.30783535 -0.18980250 X13.T.cell.signature -0.06467369 -0.03327788 -0.166935307 -0.18018616 0.02333241 T.cells
`

axelklenk commented 1 month ago

Dear Yiheng Jiang,

the easiest way to solve your problem probably is to upgrade to the latest version, i.e. R-4.4.0, Bioconductor 3.19 and GSVA 1.52.0 (or later), brandnew and available since last week.

In that latest version, the old API for GSVA that you are currently using is defunct and you would have to change your code to use e.g.:

sp <- ssgseaParam(exprData=expr_m, geneSets=my_signature, minSize=10)
ssGSEA_matrix <- gsva(sp)

In addition, please note that

However, even going back to the latest version accepting your code ( R-4.3.3, Bioconductor 3.18, GSVA 1.50.5) I have tried and could not reproduce your problem using test data -- are there NA values in your data in expr_m[, 101:10122] ? What does e.g. any(is.na(expr_m)) say?

If upgrading doesn't solve your problem and it is not caused by NA values in your data, please let me know and do not forget to include the output of sessionInfo().

bobzimmermann-D4C commented 2 weeks ago

Hi,

I may be having a problem that could be related. Our datasets in particular have different genes missing per-sample, so we cannot eliminate the set of genes that were not measured prior to running ssGSEA, we instead have to work with a sparse matrix. With the files that I've attached, running the following code

source("gene_sets.txt")
source("expr_mat.txt")
params <- GSVA::ssgseaParam(exprData = expr_mat, 
                            geneSets = gene_sets,
                            normalize = TRUE)
res <- GSVA::gsva(params)

results in a matrix that contains only NAs. I trace the issue to the normalization step here:

es <- es[, 1:n, drop=FALSE] / (range(es)[2] - range(es)[1])

Since if there are NAs in the input matrix, range(es) is NA NA, all values in the matrix after normalization become NA. We are currently using the following fix:

es <- es[, 1:n, drop=FALSE] / (range(es, na.rm = TRUE)[2] - range(es, na.rm = TRUE)[1])

Also, we remove NA values from the gene ranking before sending them to the .fastRndWalk function here:

    geneRanking <- order(R[, j], decreasing=TRUE, na.last = NA)
    geneSetsRankIdx <- lapply(match(geneSets, geneRanking), na.omit)

Do you think this makes sense to do?

Thank you!

expr_mat.txt gene_sets.txt session_info.txt

rcastelo commented 1 week ago

Hi, thanks for attaching the data. Looking at it we see that actually most of the data are NA values, for instance:

dim(expr_mat)
[1]  668 1281
naspergene <- rowSums(is.na(expr_mat))
quantile(naspergene, prob=0.5)
   50% 
1246.5 
1247/ncol(expr_mat)
[1] 0.9734582

so, 50% of your rows (genes?) have NA values in 97% of the columns (cells?) . What kind of data is this? Is this single-cell data?

bobzimmermann-D4C commented 1 week ago

Hi,

Thanks for looking into this. This is actually a subset of data from CPTAC. The data are taken from antibody probes, of which there are several per gene, which we aggregate to the gene-level by taking the mean. In fact we wouldn't use this data set in particular, but it served well in illustrating how this can be a problem.

Nevertheless, we can see that this can be a problem taking the setup that is illustrated in the vignette:

set.seed(42)
p <- 10000 ## number of genes
n <- 30    ## number of samples
X <- matrix(rnorm(p*n), nrow=p,
            dimnames=list(paste0("g", 1:p), paste0("s", 1:n)))
gs <- as.list(sample(10:100, size=100, replace=TRUE))
gs <- lapply(gs, \(n, p) paste0("g", sample(1:p, size=n)), p)
names(gs) <- paste0("gs", 1:length(gs))

And adding a more reasonable number of NAs:

na_rate <- 0.01
na_inds <- sample(seq_along(X), na_rate*length(X))
X_1pct <- X
X_1pct[na_inds] <- NA

Running ssGSEA with normalization results in a matrix with only NAs:

param <- GSVA::ssgseaParam(X_1pct, gs)
res <- GSVA::gsva(param)
all(is.na(res))
[1] TRUE

In fact if we use only one NA, we get the same result:

X_1 <- X
X_1[gs[[1]][[1]],1] <- NA
param <- GSVA::ssgseaParam(X_1, gs)
res <- GSVA::gsva(param)
all(is.na(res))
[1] TRUE

Without normalizing the results are different

param <- GSVA::ssgseaParam(X_1pct, gs, normalize = FALSE)
res <- GSVA::gsva(param)
sum(is.na(res))/length(res)
[1] 0.4166667

param <- GSVA::ssgseaParam(X_1, gs, normalize = FALSE)
res <- GSVA::gsva(param)
sum(is.na(res))
[1] 2

With the above proposed fixes, no NAs are generated in the results. Do you agree that this represents a good solution?

Thank you!

rcastelo commented 1 week ago

Hi, so, this is some kind of proteomics data, right? My perception is that the amount of NAs is massive at least in the dataset you attached. Typically for proteomics data, one would discard rows (proteins?) and columns (samples?) with too many NAs, and attempt imputing values to make the data set complete.

The approach you are proposing consists of discarding NAs values altogether and we would be happy to provide some implementation for that approach, probably similar to what you propose, giving some warning message about the presence of NA values. The question here is whether the results make sense. Do you have some way to verify whether the results you obtain after removing NA values make sense?

bobzimmermann-D4C commented 1 week ago

Hi,

Yes, the number of NAs there is greater than normal because I illustrated this with an aggregation of a subset of the data.

Below I illustrate a comparison of using the ssGSEA method (with the above fix) and mean aggregation to determine if the protein expression values are enriched in groups of patients who showed a positive response to a drug in a clinical trial. REACTOME pathways which show association with response with p <= 0.1 are shown on the x axis with mean aggregation (i.e. the mean of all genes in the gene set, dropping NAs) and ssGSEA on the y-axis. We use the signed log10 p-value to show that both the direction of effect and significance of the test are similar.

image

The given positively correlated pathways are in line with published data.

Let me know if you would like me to illustrate this in some other way. Thank you!

Best, Bob

rcastelo commented 1 week ago

Hi, ok, I've opened a new issue, mentioned here above, to implement a missing data policy for ssGSEA. Thanks for the input. I'll let you know once this is part of the software. By now, I think it should be safe for you to use the patch you've implemented.

bobzimmermann-D4C commented 1 week ago

Great! Looking forward.