Genes missing after SCTransform and integration - how best to set # genes retained

Hi,

I am trying to analyze a dataset that includes 10 single cell experiments and about 66,000 cells. The actual set of cells I'm interested in is a small group of those, of about 7,000 cells. With the old method (log normalization) I had created a global UMAP, pulled out the clusters of interest with known marker genes, then reclustered that smaller dataset.

However, I am finding that SCTransform and the downstream integration steps cause me to lose some genes of interest in my smaller dataset, presumably because they aren't in the top 2000/3000 variable genes in the global dataset (whatever the default is, I know it varies by command).

After using the following code, altering the number and then checking for my genes of interest in the features list, I know that I need to increase the # to 3500 for three of my most important missing genes to come back.

# select integration features and prep step
    features<- SelectIntegrationFeatures(pbmc.list,nfeatures = 3500,
                                         assay = NULL,
                                         verbose = TRUE,
                                         fvf.nfeatures = 3500)
    #checking presence of genes of interest in features
    c('Gene1','Gene2','Gene3') %in% features

Questions:

(1) Would you recommend that users take the above approach to ensure that the info about all their genes of interest are captured at the initial, global steps, if they want these genes retained for downstream analyses?

(2) Would it be better to take a mixed approach that might be less computationally expensive, such as the list of 2000 most variable + some genes that are of interest for the smaller subset? In this latter case, I would fear that genes which are important for my subset of cells (but not in my list of known genes of interest) would still be eliminated at the global analysis step.

(3) As a crude final alternative, would you recommend filtering each dataset first using some kind of rule like expression of Gene1>0, then running SCTransform on the filtered individual datasets and proceeding from there?

satijalab / seurat

Genes missing after SCTransform and integration - how best to set # genes retained #9018