satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.31k stars 919 forks source link

Selecting variable genes for integrated analysis (var.genes vs hvg.info) #227

Closed dat4git closed 6 years ago

dat4git commented 7 years ago

According to the Seurat Alignment tutorial, the union of the top 2k variable genes in each dataset is used for CCA.

hvg.seqwell <- rownames(x = head(x = seqwell@hvg.info, n = 2000))
hvg.tenx <- rownames(x = head(x = tenx@hvg.info, n = 2000))
hvg.union <- union(x = hvg.seqwell, y = hvg.tenx)

Why are the top 2k genes from object@hvg.info used instead of the variable genes listed in object@var.genes? Many of the var.genes in the tenx dataset are not present in the top 2k genes in tenx@hvg.info.

length(seqwell@var.genes)
> 763
length(tenx@var.genes)
> 2237
length(hvg.tenx)
>2000

length(setdiff(tenx@var.genes,hvg.tenx))
>1272
satijalab commented 7 years ago

Thanks for the good question. There are multiple ways to call highly variable genes in a dataset. FindVariableGenes uses the method we first applied in the original DropSeq paper, binning genes by abundance and searching for highly variable genes within each bin. However, rownames in hvg.info are simply sorted by the variance/mean ratio, which is a more simple statistic, but usually works quite well for UMI data (as first demonstrated by the original MARS-seq paper).

For the integration procedure, we chose to take the top 2k genes per dataset based on the variance/mean ratio, as it meant that it was easy for each dataset to contribute the same number of genes, and allowed us to choose HVG in an identical way for multiple examples in the manuscript. As you see, the methods for HVG identification don't agree perfectly, but I would expect the downstream results to be similar with either approach.

dat4git commented 7 years ago

Thanks for the detailed response. Would you expect the results to change if one dataset was smaller or noisier than the other? For eg, if the first dataset is a set of 400 cells and the purpose of the integration analysis is to map these cells to a larger reference dataset (comprised of 1000s of cells) that has already been clustered and classified. Conceptually wouldn't it be better to use HVGs preferentially from the reference data to perform the integrated analysis since the HVGs from the sample dataset could have greater variance (which could change the structure of the ref data)?

satijalab commented 6 years ago

This is an interesting question, and I would say there is no consensus in how to handle these situations with large discrepancies in dataset cell number (especially as there may also be differences in dataset depth). Giving specific advice on dataset analysis is a bit beyond the scope of Github issues, so I will close this.