scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Scanpy Clustering #670

Open Khalid-Usman opened 5 years ago

Khalid-Usman commented 5 years ago

Hi,

I have a few queries regarding scanpy.

  1. Since scanpy uses the Louvain and Leiden algorithms for clustering, which optimize the modularity 'Q', how can we access and print the modularity value?

  2. The resolution parameter gives a different number of clusters as we iterate over the suggested range of 0.6-1.2. How do we know the best number of clusters, and how do we choose the optimum value of the 'resolution' parameter?

Thanks, Khalid

bioguy2018 commented 5 years ago

I can only advise you on the second part of your question: there is no rule of thumb for that. I also don't know what exactly you mean by the "best suggested resolution" or how you assessed that. This is a general problem for many unsupervised clustering methods, such as k-means, where the user has to provide the number of clusters, or in this case the resolution, which determines the number of clusters. There are, however, some indirect ways to assess clustering quality, for example the silhouette coefficient, which gives you a score between -1 and 1 that tells you how similar the points within each cluster are. The other possibility is that you already expect a certain number of clusters, so you can optimize the resolution based on your prior knowledge. @falexwolf Out of curiosity, could we integrate methods like the silhouette coefficient into scanpy? That would be cool!

Khalid-Usman commented 5 years ago

Thanks @bioguy2018 for your kind reply. Actually, I was confused because for Pbmc3k, scanpy and the previous version of Seurat report 8 clusters, but the new version of Seurat reports 9. I think it's ultimately based on biological knowledge rather than on optimizing parameters like the resolution.

It will be great to add something like silhouette coefficient in scanpy.

LuckyMD commented 5 years ago

You can easily access the silhouette coefficient via scikit-learn.
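For example, a minimal sketch using scikit-learn on a PCA embedding (the example dataset and the clustering key are just for illustration):

import scanpy as sc
from sklearn.metrics import silhouette_score

# illustrative dataset shipped with scanpy; any AnnData with an embedding
# and a clustering column in .obs works the same way
adata = sc.datasets.pbmc3k_processed()

# score how well the cluster labels separate cells in PCA space
score = silhouette_score(adata.obsm["X_pca"], adata.obs["louvain"])
print(f"mean silhouette coefficient: {score:.3f}")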

I would be hesitant to base the optimal number of clusters on the silhouette coefficient though. The number of clusters typically depends on the biological question of interest; there's not really a single scale at which all biological questions can be answered. That is why you have a resolution parameter that lets you check multiple resolutions. For example, T cells could be taken as one cluster or sub-clustered into CD4+ and CD8+ (which is typically done). This also exposes a problem with the silhouette coefficient: often you have one big cluster of T cells that only reluctantly separates into the CD4+ and CD8+ subtypes (early 10X datasets show this nicely). That clustering will have a lower silhouette coefficient, but it is probably more informative for many people.

bioguy2018 commented 5 years ago

@LuckyMD I am wondering, have you tried this before? Or has someone tried it on the pbmc data with respect to the T cells you mentioned? I agree with you that it cannot be the way to answer biological questions, but in cases where there is no clear biological knowledge, e.g. when looking for new subtypes, it can be an option to get some idea (mathematically).

p.s. It just crossed my mind, so I am adding a technical question here as well 😄 Since we would need to calculate the distance matrix, would the input for the function be adata.X? Should it be the raw data?

grst commented 5 years ago

Some notes/observations from my side on choosing a proper resolution:

[Figure: hyperparameter search for the resolution in steps of 0.005; the plot shows the resolution vs. the detected number of Leiden clusters.]
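A minimal sketch of such a sweep (with coarser steps than in the figure to keep the runtime short; the dataset and key names are only illustrative):

import numpy as np
import scanpy as sc

adata = sc.datasets.pbmc3k_processed()
sc.pp.neighbors(adata)  # Leiden clusters the kNN graph

# cluster at a range of resolutions and record the number of clusters
for res in np.arange(0.2, 2.01, 0.2):
    key = f"leiden_{res:.1f}"
    sc.tl.leiden(adata, resolution=res, key_added=key)
    print(f"resolution={res:.1f}: {adata.obs[key].nunique()} clusters")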

Khalid-Usman commented 5 years ago

@LuckyMD I think we can't use the silhouette coefficient for data like single-cell data, where there is a chance of clusters with very few points that the silhouette score won't detect as separate clusters. E.g. in the Pbmc3k dataset, the 'Megakaryocytes' and 'Dendritic' cell types would not be marked as separate clusters using the coefficient you suggested.

Khalid-Usman commented 5 years ago

@grst Thanks, that seems logical, but:

It is mentioned in the Seurat Pbmc3k example that the best resolution parameter is 0.6-1.2, yet you used lower values and got more clusters. Maybe that is because I didn't explore the random seed in Leiden. In Louvain and Leiden we usually optimize the 'modularity' value for a given resolution. What if we instead calculate the modularity for a range of resolutions and then pick the resolution where the modularity is maximal? Is this ok? But I also think that modularity increases when we have a small number of clusters. Any suggestion?
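For concreteness, a sketch of that idea using leidenalg directly on the kNN graph (the graph construction from adata.obsp["connectivities"] and all names are illustrative, not scanpy's own API):

import igraph as ig
import leidenalg
import scanpy as sc

adata = sc.datasets.pbmc3k_processed()
sc.pp.neighbors(adata)

# build an igraph graph from the weighted kNN connectivities
adj = adata.obsp["connectivities"]
sources, targets = adj.nonzero()
g = ig.Graph(
    n=adj.shape[0],
    edges=list(zip(sources, targets)),
    edge_attrs={"weight": adj[sources, targets].A1},
)

# record the modularity of the partition obtained at each resolution
for res in [0.4, 0.6, 0.8, 1.0, 1.2]:
    part = leidenalg.find_partition(
        g,
        leidenalg.RBConfigurationVertexPartition,
        weights="weight",
        resolution_parameter=res,
    )
    mod = g.modularity(part.membership, weights="weight")
    print(f"resolution={res}: {len(part)} clusters, modularity={mod:.3f}")

As you note, plain modularity tends to favour coarser partitions, so the maximum over resolutions should be treated with caution rather than as the "correct" number of clusters.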

grst commented 5 years ago

It is mentioned in the Seurat Pbmc3k example that the best resolution parameter is 0.6-1.2, yet you used lower values and got more clusters. Maybe that is because I didn't explore the random seed in Leiden.

I suspect this is because I processed the dataset with sc.tl.diffmap. The random seed tends to make only a minor difference (+/- 1 cluster). As far as I can tell, the resolution parameter really is dataset-dependent. But maybe someone with better knowledge of the actual algorithm can comment on this.

bioguy2018 commented 5 years ago

@Khalid-Usman Of course you can use the silhouette coefficient for any kind of clustering in principle. You just need to choose the "best option" depending on your dataset, which again depends on what you are interested in, and then you can validate it by looking at your cluster markers. I am actually very curious to see the T-cell case that was mentioned here. You can also have a look here: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set Again, I would like to mention that such control parameters are mathematical methods to assess your clustering quality; they may have nothing to do with the biological side, and they can actually be helpful when you have no clue about the number of clusters to look for, so that you rely only on the mathematics. Your question is really topic-specific: it depends on what you are looking for, in which setting you want to assess your data, and whether you already have an estimate or not. There are also other ways to look at the quality of the clustering, as grst mentioned, but again it really depends on what you are looking into.

Siliegia commented 4 years ago

It would be nice if scanpy could provide the maximum modularity obtained by the Louvain and Leiden algorithms.

gokceneraslan commented 4 years ago

@Siliegia I actually needed this many times. Here is a PR: https://github.com/theislab/scanpy/pull/819

chris-rands commented 3 years ago

Is there anything like clustree in python that integrates nicely with scanpy?

xguse commented 3 years ago

Is there anything like clustree in python that integrates nicely with scanpy?

I have resorted to writing a small Rscript that takes a saved adata.h5ad file as input, loads it using reticulate, runs clustree, and saves the figure. I then run the script from a notebook using invoke.run from the invoke package, and load the output figure as an image in the notebook.

Here is the script I use in case it helps:


suppressPackageStartupMessages({
    library(reticulate)
    library(SingleCellExperiment)
    library(glue)
    library(clustree)
})
sc <- import("scanpy")

args <- commandArgs(trailingOnly = TRUE)
H5AD_PATH = args[1]
OUT_PATH = args[2]

print(glue("H5AD_PATH: {H5AD_PATH}"))
print(glue("OUT_PATH: {OUT_PATH}"))

load_adata = function(h5ad_path) {
    adata <- sc$read_h5ad(h5ad_path)

    return(adata)
}

# count how many clustering columns (obs keys starting with "leiden") are present
count_clusterings = function(adata){
    # Ryan suggests:
    # length(grep("leiden",names(adata$obs)))

    clusterings = c()
    for (x in adata$obs_keys()){
        if (startsWith(x, "leiden")){
            clusterings = append(clusterings, x)
        }
    }

    return(length(clusterings))
}

# scale the figure height with the number of clusterings (minimum 8 inches)
set_fig_dimensions = function(num_clusterings){
    width = 10
    height = (0.6 * num_clusterings)

    if (height < 8){
        height = 8
    }

    # set the inline display size for the notebook; the figure itself is
    # written to disk later with ggsave()
    options(repr.plot.width = width, repr.plot.height = height)

    return(list(width=width,height=height))
}

adata = load_adata(h5ad_path=H5AD_PATH)

dims = set_fig_dimensions(num_clusterings = count_clusterings(adata))

g = clustree(
    x = adata$obs,
    prefix = "leiden_"
    # all other clustree() arguments are left at their defaults; see ?clustree
) # + scale_color_brewer(palette = "Set3")

g

ggsave(
    OUT_PATH, 
    width = dims$width, 
    height = dims$height, 
    dpi = 100
)

I then run it like this from the notebook: [screenshot of the notebook cell]
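For reference, a sketch of what that notebook cell might look like (the script and file names here are hypothetical):

from invoke import run
from IPython.display import Image

# run the R script above: Rscript <script> <input .h5ad> <output figure>
run("Rscript plot_clustree.R adata.h5ad clustree.png", hide=True)

# display the saved figure inline
Image("clustree.png")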

LuckyMD commented 3 years ago

@lazappi this comment was made for you to answer ;)

lazappi commented 3 years ago

I don't have a lot to add. As far as I know, there's no native Python implementation of clustering trees. If you want to use the R {clustree} package, you will need to transfer your data from Python to R in some way. Using straight {reticulate} to read a .h5ad file like you have here is one option, but there are packages that will do it for you, including {zellkonverter}, {anndata} and {SeuratDisk}. Once you have a SingleCellExperiment or Seurat object, you can plug that directly into {clustree}. It should also be possible to call {clustree} from Python using anndata2ri, but I'm not sure of the details of how to do that.

If you only want a basic clustering tree, you could just transfer the clustering assignments (by saving them to CSV, for example). That would probably be easier/quicker than transferring the whole dataset, but you would lose the opportunity to overlay other information such as marker gene expression (which is often really helpful), unless you plan ahead and append that to whatever you save to disk.
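A sketch of that CSV route on the Python side (the column prefix and file names are illustrative); on the R side, the resulting data frame can be read with read.csv() and passed straight to clustree() with the matching prefix:

import scanpy as sc

adata = sc.read_h5ad("adata.h5ad")  # hypothetical file

# export only the clustering columns, e.g. leiden_0.4, leiden_0.6, ...
cols = [c for c in adata.obs.columns if c.startswith("leiden_")]
adata.obs[cols].to_csv("clusterings.csv")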

Sorry, that was longer than I thought 😸! Hope it helps.

chris-rands commented 3 years ago

Thank you both for taking the time to answer. I was hoping there was a Python port/re-implementation of clustree, but this is clearly not the case. Nonetheless, the workarounds you suggest are helpful.

lvxhnat commented 1 year ago

Hopefully I am not too out of date to ask this question. Extending this discussion, I was wondering how a few of you @bioguy2018 @Khalid-Usman @LuckyMD calculate silhouette scores for your graphs? The simplest way I can think of to obtain the vectors required for the calculation would be to use the rows of the adjacency matrix as vectors. However, I quickly run into memory issues on large datasets with >= 100K nodes (each vector would contain 100K elements); I couldn't even load the matrix into memory to perform any form of dimensionality reduction.

LuckyMD commented 1 year ago

Hi @lvxhnat,

You can check how we use silhouette scores here: https://github.com/theislab/scib
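The general idea there is to compute the score on a low-dimensional embedding (e.g. PCA) rather than on the n_cells x n_cells adjacency matrix, which avoids the memory blow-up. A sketch (file and key names are hypothetical; sklearn's sample_size subsamples cells, since the silhouette computation is quadratic in the number of points):

import scanpy as sc
from sklearn.metrics import silhouette_score

adata = sc.read_h5ad("large_dataset.h5ad")  # hypothetical file

# score on the ~50-dimensional PCA embedding instead of the full
# adjacency matrix, and subsample to keep the pairwise distances tractable
score = silhouette_score(
    adata.obsm["X_pca"],
    adata.obs["leiden"],
    sample_size=10_000,
    random_state=0,
)
print(score)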