doubts about UMAP with small datasets

MartaBenegas commented 4 years ago

Hi, first of all, thanks for your great work! I'm new to this kind of analysis, so I have a few doubts about interpreting the results; I hope this doesn't bother you too much. I was testing a pipeline with a small SMART-seq2 dataset of 34 cells, which is part of this atlas project. Here is my code:

library(Seurat)
# Load the zUMIs output and extract the exon read-count matrix
raw <- readRDS("/home/marta/Descargas/dge_zumis/smartseq_dm.dgecounts.rds")
dge <- as.matrix(raw$readcount$exon$all)
zumis <- CreateSeuratObject(counts = dge, project = "smartseq_dm_zumis")
zumis <- SCTransform(zumis)
zumis <- RunPCA(zumis, assay = "SCT", npcs = 33, reduction.name = "pca_stc")
# Use dimensions that have significance (use plots to decide).
# Assumption: point FindNeighbors at the PCA stored above; its default is reduction = "pca".
zumis <- FindNeighbors(zumis, reduction = "pca_stc", dims = 1:3, assay = "SCT", graph.name = "SCT_snn")
zumis <- FindClusters(zumis, resolution = 2, graph.name = "SCT_snn")
zumis <- RunUMAP(zumis, graph = "SCT_snn", reduction = "pca_stc", umap.method = "umap-learn")

This gives me the following UMAP:

[image]

As you can see, it is very dispersed, and I really had to force the resolution parameter of FindClusters to get any clusters at all. To test whether this was due to the dataset itself or to its size, I ran the same procedure on a subset of a larger dataset I was also analyzing. This is the UMAP for the entire dataset:

[image]

I chose three cells from each cluster (39 cells in total) and redid the analysis:

[image]

I checked that the original structure is more or less conserved (bearing in mind that a lot of information is lost compared with the original dataset): cells that cluster together in the original dataset are generally grouped in the same cluster in the subset analysis:

[image]

And now a few questions arise:

1. Can I trust the results for the SMART-seq dataset, considering that I had to force the clustering by specifying a resolution of 2?
2. Does it mean anything that cells from different clusters are mixed together in the UMAP visualization? Does it mean that I cannot trust the clustering?
3. Is this type of analysis suited to small datasets?

Besides that, I've noticed that in the clustering tutorial you perform the clustering first and then the UMAP, but in the integrating datasets tutorial you do it the other way around: first the UMAP and then the clustering.

I thought that the UMAP needed the clustering in order to build the representation? If not, what is the difference between these two steps?

Thank you in advance, and I'm sorry for the number of questions. If you can refer me to a paper or tutorial that answers my questions, I'll be happy with that too. Marta.

timoast commented 4 years ago

In general, these methods are not suited to datasets this small. UMAP and Louvain (a community detection method) both rely on the construction of a neighbor graph. This is a k-nearest neighbor graph, where k is typically ~20. Louvain tries to find partitions of the graph that maximize the number of neighbor connections within each partition and minimize the number of connections going outside it. You can imagine, then, that if you have 20 cells and k=20, every cell is a nearest neighbor of every other cell, and the method won't be able to find any meaningful partitions. UMAP has a similar problem: it builds a kNN graph and tries to find a low-dimensional embedding that preserves the high-dimensional distances between cells.
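
The saturation effect described above can be illustrated with a toy base-R sketch. This is not Seurat's implementation: `knn_adjacency` and `edge_density` are hypothetical helpers written for this example, the coordinates are random numbers standing in for PCA embeddings, and each cell is excluded from its own neighbor list.

```r
# Toy illustration in base R; not how Seurat builds its graphs.
set.seed(1)
n_cells <- 20
# Hypothetical embedding: 20 cells x 3 dimensions of random noise
coords <- matrix(rnorm(n_cells * 3), nrow = n_cells)
d <- as.matrix(dist(coords))

# Build a 0/1 adjacency matrix linking each cell to its k nearest
# neighbors (self excluded). k is capped at n - 1, because a cell
# cannot have more neighbors than there are other cells.
knn_adjacency <- function(d, k) {
  n <- nrow(d)
  k <- min(k, n - 1)
  adj <- matrix(0L, n, n)
  for (i in seq_len(n)) {
    nn <- order(d[i, ])            # cells sorted by distance to cell i
    nn <- nn[nn != i][seq_len(k)]  # drop the cell itself, keep the k closest
    adj[i, nn] <- 1L
  }
  adj
}

# Fraction of all possible directed edges that are present
edge_density <- function(adj) {
  n <- nrow(adj)
  sum(adj) / (n * (n - 1))
}

adj_small_k <- knn_adjacency(d, k = 5)
adj_large_k <- knn_adjacency(d, k = 20)

edge_density(adj_small_k)  # 5/19: the graph still has structure to partition
edge_density(adj_large_k)  # 1: every cell is linked to every other cell
```

With k at or above the number of cells the graph is complete, so no partition is better connected internally than any other. In Seurat the corresponding setting is the `k.param` argument of `FindNeighbors` (default 20), so lowering it is one thing to try with very few cells, although that does not make the result any more reliable.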

MartaBenegas commented 4 years ago

Thanks for your answer, it clarifies a lot. But I still find it hard to understand what I asked in question 4 about the difference between the UMAP and FindNeighbors functions.

Best, Marta.

satijalab / seurat

doubts about UMAP with small datasets #3077