Open YihengJiang0912 opened 1 month ago
Dear Yiheng Jiang,
the easiest way to solve your problem probably is to upgrade to the latest version, i.e. R-4.4.0, Bioconductor 3.19 and GSVA 1.52.0 (or later), brandnew and available since last week.
In that latest version, the old API for GSVA that you are currently using is defunct and you would have to change your code to use e.g.:
sp <- ssgseaParam(exprData=expr_m, geneSets=my_signature, minSize=10)
ssGSEA_matrix <- gsva(sp)
In addition, please note that
kcdf
and absRanking
apply only to method GSVA and have no effect when using ssGSEA.min.sz
, now renamed to minSize
, must be a numerical value specifying the minimum size for gene sets to use; in the example above I have set it to 10. (I realize that the sentence Consider setting 'min.sz > 1'.
, as used by the warning message, is misleading and we should change it.)However, even going back to the latest version accepting your code ( R-4.3.3, Bioconductor 3.18, GSVA 1.50.5) I have tried and could not reproduce your problem using test data -- are there NA values in your data in expr_m[, 101:10122]
? What does e.g. any(is.na(expr_m))
say?
If upgrading doesn't solve your problem and it is not caused by NA values in your data, please let me know and do not forget to include the output of sessionInfo()
.
Hi,
I may be having a problem that could be related. Our datasets in particular have different genes missing per-sample, so we cannot eliminate the set of genes that were not measured prior to running ssGSEA, we instead have to work with a sparse matrix. With the files that I've attached, running the following code
source("gene_sets.txt")
source("expr_mat.txt")
params <- GSVA::ssgseaParam(exprData = expr_mat,
geneSets = gene_sets,
normalize = TRUE)
res <- GSVA::gsva(params)
results in a matrix that contains only NA
s. I trace the issue to the normalization step here:
es <- es[, 1:n, drop=FALSE] / (range(es)[2] - range(es)[1])
Since if there are NA
s in the input matrix, range(es)
is NA NA
, all values in the matrix after normalization become NA
. We are currently using the following fix:
es <- es[, 1:n, drop=FALSE] / (range(es, na.rm = TRUE)[2] - range(es, na.rm = TRUE)[1])
Also, we remove NA
values from the gene ranking before sending them to the .fastRndWalk
function here:
geneRanking <- order(R[, j], decreasing=TRUE, na.last = NA)
geneSetsRankIdx <- lapply(match(geneSets, geneRanking), na.omit)
Do you think this makes sense to do?
Thank you!
Hi, thanks for attaching the data. Looking at it we see that actually most of the data are NA values, for instance:
dim(expr_mat)
[1] 668 1281
naspergene <- rowSums(is.na(expr_mat))
quantile(naspergene, prob=0.5)
50%
1246.5
1247/ncol(expr_mat)
[1] 0.9734582
so, 50% of your rows (genes?) have NA values in 97% of the columns (cells?) . What kind of data is this? Is this single-cell data?
Hi,
Thanks for looking into this. This is actually a subset of data from CPTAC. The data are taken from antibody probes, of which there are several per gene, which we aggregate to the gene-level by taking the mean. In fact we wouldn't use this data set in particular, but it served well in illustrating how this can be a problem.
Nevertheless, we can see that this can be a problem taking the setup that is illustrated in the vignette:
set.seed(42)
p <- 10000 ## number of genes
n <- 30 ## number of samples
X <- matrix(rnorm(p*n), nrow=p,
dimnames=list(paste0("g", 1:p), paste0("s", 1:n)))
gs <- as.list(sample(10:100, size=100, replace=TRUE))
gs <- lapply(gs, \(n, p) paste0("g", sample(1:p, size=n)), p)
names(gs) <- paste0("gs", 1:length(gs))
And adding a more reasonable number of NAs:
na_rate <- 0.01
na_inds <- sample(seq_along(X), na_rate*length(X))
X_1pct <- X
X_1pct[na_inds] <- NA
Running ssGSEA with normalization results in a matrix with only NAs:
param <- GSVA::ssgseaParam(X_1pct, gs)
res <- GSVA::gsva(param)
all(is.na(res))
[1] TRUE
In fact if we use only one NA, we get the same result:
X_1 <- X
X_1[gs[[1]][[1]],1] <- NA
param <- GSVA::ssgseaParam(X_1, gs)
res <- GSVA::gsva(param)
all(is.na(res))
[1] TRUE
Without normalizing the results are different
param <- GSVA::ssgseaParam(X_1pct, gs, normalize = FALSE)
res <- GSVA::gsva(param)
sum(is.na(res))/length(res)
[1] 0.4166667
param <- GSVA::ssgseaParam(X_1, gs, normalize = FALSE)
res <- GSVA::gsva(param)
sum(is.na(res))
[1] 2
With the above proposed fixes, no NAs are generated in the results. Do you agree that this represents a good solution?
Thank you!
Hi, so, this is some kind of proteomics data, right? My perception is that the amount of NAs is massive at least in the dataset you attached. Typically for proteomics data, one would discard rows (proteins?) and columns (samples?) with too many NAs, and attempt imputing values to make the data set complete.
The approach you are proposing consists of discarding NAs values altogether and we would be happy to provide some implementation for that approach, probably similar to what you propose, giving some warning message about the presence of NA values. The question here is whether the results make sense. Do you have some way to verify whether the results you obtain after removing NA values make sense?
Hi,
Yes, the number of NAs there is greater than normal because I illustrated this with an aggregation of a subset of the data.
Below I illustrate a comparison of using the ssGSEA method (with the above fix) and mean aggregation to determine if the protein expression values are enriched in groups of patients who showed a positive response to a drug in a clinical trial. REACTOME pathways which show association with response with p <= 0.1 are shown on the x axis with mean aggregation (i.e. the mean of all genes in the gene set, dropping NAs) and ssGSEA on the y-axis. We use the signed log10 p-value to show that both the direction of effect and significance of the test are similar.
The given positively correlated pathways are in line with published data.
Let me know if you would like me to illustrate this in some other way. Thank you!
Best, Bob
Hi, ok, I've opened a new issue, mentioned here above, to implement a missing data policy for ssGSEA. Thanks for the input. I'll let you know once this is part of the software. By now, I think it should be safe for you to use the patch you've implemented.
Great! Looking forward.
Dear authors, I am now using ssgsea method in gsva to evaluate the samples' enrichment score in a dataset, but the result is all NA, I would be very grateful if you could give me some guidance on that.
Belowing please kindly find the result: `> dim(expr_m) [1] 20502 10122
[1] "Normalizing..." Warning messages: 1: In .filterFeatures(expr, method) : 4180 genes with constant expression values throuhgout the samples. 2: In .gsva(expr, mapped.gset.idx.list, method, kcdf, rnaseq, abs.ranking, : Some gene sets have size one. Consider setting 'min.sz > 1'.
However, if I only use part of my dataset, I can get the results:
`> ssGSEA_matrix <- gsva(expr = expr_m[,1:100], gset.idx.list = my_signature, method = 'ssgsea', kcdf = "Gaussian", min.sz > 1, abs.ranking = TRUE) Estimating ssGSEA scores for 14 gene sets. [1] "Calculating ranks..." [1] "Calculating absolute values from ranks..." |=========================================================================================================| 100%
[1] "Normalizing..." Warning messages: 1: In .filterFeatures(expr, method) : 571 genes with constant expression values throuhgout the samples. 2: In .gsva(expr, mapped.gset.idx.list, method, kcdf, rnaseq, abs.ranking, : Some gene sets have size one. Consider setting 'min.sz > 1'.