satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.25k stars 904 forks source link

How can I select features to scale using SCTransform? #4877

Closed MrModenait closed 3 years ago

MrModenait commented 3 years ago

Dear colleagues,

I would like to reopen an old issue from January (#3916). Back then, @jgamache014 asked how to obtain all features from datasets in an integrated SCTransformed object for DE analysis. The question was never answered, as @HerpelinckT pointed out that DE analysis should be done on the RNAassay, and the question because obsolete before it was answered. However, in integrating two of my datasets, I would still like to choose specific genes to include in my integrated dataset. Briefly, I need to keep the gene expression data for the famous chemoattractant CCL2 to find which cluster has the highest CCL2 expression. Selecting custom genes to keep used to be possible with ScaleData (as I showed in January). How can I do this with SCTransform? Here's a walkthrough of the problem. Both datasets have 33,538 features in the Counts and the Seurat object (using min.cells = 0 for CreateSeuratObject), and CCL2 is included in these. I then proceed to run SCTransform on the list:

SCT_Dataset_List <- list(1,2) #Prepare new list
for (i in 1:length(Dataset_List)) {
  SCT_Dataset_List[[i]] <- SCTransform(Dataset_List[[i]], 
                                           assay = 'RNA',
                                           new.assay.name = 'SCT',
                                           vars.to.regress = c('percent.mt', 'nFeature_RNA', 'nCount_RNA', "cc_difference"),
                                           return.only.var.genes = F)
}

After SCTransform, the first dataset has 12,499 genes while the second dataset has 9,106 genes, counted using rownames, and CCL2 is missing in both. The hurdle can be overcome by setting residual.features = all_genes in SCTransform, where all_genes is a character vector of my 33,538 features calculated using:

all_genes <- sort(intersect(rownames(COJEC_Dataset_List[[1]]),
                            rownames(COJEC_Dataset_List[[2]])))

This yields 33,538 genes in the SCTransformed list, however trouble begins as I integrate the data:

Features <- SelectIntegrationFeatures(object.list = SCT_Dataset_List, nfeatures = 3000)
SCT_Dataset_List <- PrepSCTIntegration(object.list = SCT_Dataset_List, 
                                           anchor.features = Features, 
                                           verbose = T)
Anchors <- FindIntegrationAnchors(object.list = SCT_Dataset_List, 
                                      normalization.method = "SCT", 
                                      anchor.features = Features, verbose = T)
Integrated <- IntegrateData(anchorset = Anchors, normalization.method = "SCT", 
                                  features.to.integrate = all_genes, verbose = T,
                                  new.assay.name = "Integrated")

The integration fails and the message I get is:

sct.model: model1
Calculating residuals of type pearson for 9499 genes
|===========================| 100%
sct.model: model1
Calculating residuals of type pearson for 6106 genes
|===========================| 100%
Merging dataset 1 into 2
Extracting anchors for merged samples
Finding integration vectors
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 't': invalid character indexing
In addition: Warning messages:
1: The following requested features are not present in any models: A1CF, A2ML1, A2ML1-AS1, A2ML1-AS2...

CCL2 is one of the genes in the long list at the bottom that was "not present in any models". It is worth noting that if I set features.to.integrate = NULL, the integration succeeds but I only obtain 3000 genes of which CCL2 is not included. I would like to make it clear that I want my integration to be determined by the default anchor/variable features, I just want to carry CCL2 along to plot it on the final UMAP using FeaturePlot. Any suggestions on how I may overcome the error message, or how Imay achieve the same result?

Best Regards, Gustav C.

saketkc commented 3 years ago

SCTransform performs a filtering by default it only keeps genes that are expressed in >3 cells. What is the expression pattern of CCL2 in your case (for both the objects)? Also, nFeature_RNA and nCount_RNA need (and should) not be regressed given the SCT model implicitly accounts for these.

MrModenait commented 3 years ago

Dear @saketkc,

It turns out the question was once again made irrelevant as I opted for query-based integration from Mapping and annotating query datasets, and I soon discovered that not a single cell in the dataset expressed the CCL2 gene. Does your answer imply that if, after running SCTransform with residual.features = NULL and after integrating with features.to.integrate = all_genes, a gene is missing from the scale.data slot of the SCT assay, that gene was expressed in 3 cells or fewer?

Best Regards, Gustav C.

P.S. As for vars.to.regress, I was not aware of this: thank you for pointing it out!

saketkc commented 3 years ago

Does your answer imply that if, after running SCTransform with residual.features = NULL and after integrating with features.to.integrate = all_genes, a gene is missing from the scale.data slot of the SCT assay, that gene was expressed in 3 cells or fewer?

Yes, but only if return.only.var.genes=F is set. Otherwise, it will only use the variable genes to create the scale.data slot.

mea2712 commented 3 years ago

Hi @saketkc , I'm running into a similar problem where a would like to pass a SCT normalized and integrated matrix into other methods. However I would like to keep all genes in the original counts matrices. Is there a way to disable the SCTransform() filtering of genes expressed in fewer than (I believe default) 5 cells? What I'm doing:

seu_obj_list<-lapply(seu_obj_list, function(x){ temp<-SCTransform(x, return.only.var.genes=FALSE, method="offset", min_cells=0); return(temp)})

FTI<-lapply(seu_obj_list, function(x) rownames(x[["SCT"]]@scale.data)) FTI<-unique(unlist(FTI))

The above returns a seurat object list where the '[["SCT"]]@scale.data' slot has the same dimensions as the raw counts matrices. But, when I try to integrate both data-sets I get an error:

feat<-SelectIntegrationFeatures(seu_obj_list) seu_obj_list<-PrepSCTIntegration(seu_obj_list, anchor.features=feat) seu_anchors<-FindIntegrationAnchors(seu_obj_list,feat,normalization.method="SCT") seu_combined<-IntegrateData(seu_anchors, normalization.method="SCT",features.to.integrate=FTI)

Gives me: ... Merging dataset 2 into 1 Extracting anchors for merged samples Finding integration vectors Error in h(simpleError(msg,call)) : error in evaluating the argument 'x' in selecting a method for function 't': invalid character indexing In addtion: Warning messages: 1: Residuals not computed for the following requested features: ...... And I get 2000 features with no residuals Any ideas? Thanks!

saketkc commented 3 years ago

I am not sure if retaining those genes help given they are expressed in so few cells, but you can do `SCTransform(object, method="glmGamPoi", min_cells=1).

Also we show in our recent preprint that offset model does not lead to optimal VST, curious about your choice of model here.

Do all the objects in your object list have the same genes to start off with? I would suggest you create a new issue to follow up.

mea2712 commented 3 years ago

@saketkc Sorry for the delayed response. I wasn't tagged so your answer never popped up

I chose the offset method based on their paper https://www.biorxiv.org/content/10.1101/2020.12.01.405886v3.abstract However, thank you for pointing me out to your pre-print!

All objects in my list have the exact same rows. I will create a new issue Thanks

ggruenhagen3 commented 10 months ago

I wanted to chime in here for people who are interested in setting variable features in SCTransform. There is an argument in the SCTransform function called return.only.var.genes that you can set to get the scaled data for all features that SCTransform calculates residuals for (ie the same genes in the data slot). SCTransform calculates residuals for all genes expressed in more than 3 cells, I think. So this solution does not work if you want to use a gene expressed in 3 or less cells as a varaible feature, but I think that would be a bad idea anyway. Once all of these genes are returned, the scale.data matrix can be subset by the genes you want as variable features and you can set the var.features slot to the genes that you want. Example below where desired_var_features is your vector of desired variable features (using Seurat version 4.3.0.1 and R version 4.3.1).

obj = SCTransform(obj, return.only.var.genes = F)
obj@assays$SCT@scale.data = obj@assays$SCT@scale.data[desired_var_features,]
obj@assays$SCT@var.features = desired_var_features