scverse / squidpy

Spatial Single Cell Analysis in Python
https://squidpy.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
clustering function for features #246

Closed hspitzer closed 2 years ago

hspitzer commented 3 years ago

When writing tutorials, I find myself defining the same clustering function in several notebooks.

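import anndata as ad
import pandas as pd
import scanpy as sc


def cluster_features(features: pd.DataFrame, like=None):
    """Calculate Leiden clustering of features.

    Use `like` to keep only the feature columns whose name contains that substring.
    """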
    # filter features
    if like is not None:
        features = features.filter(like=like)
    # create temporary adata to calculate the clustering
    adata = ad.AnnData(features)
    # adata.var_names_make_unique()
    # important - feature values are not scaled, so need to scale them before PCA
    sc.pp.scale(adata)
    # calculate leiden clustering
    sc.pp.pca(adata, n_comps=min(10, features.shape[1] - 1))
    sc.pp.neighbors(adata)
    sc.tl.leiden(adata)

    return adata.obs["leiden"]

This essentially does scaling + PCA + neighbors + Leiden on a set of features. I was wondering if we should include this in squidpy as a convenience function (maybe made a bit more general)? Or should we rather leave this sort of function outside of squidpy? Is there a way I can avoid defining the same function in several notebooks? @giovp
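For reference, the `like` argument above is handed to pandas `DataFrame.filter`, which keeps the columns whose label contains the given substring. A minimal sketch, where the feature names and values are made up for illustration and only stand in for real feature columns:

```python
import pandas as pd

# Toy feature table: one row per observation, one column per feature.
# Column names and values are hypothetical.
features = pd.DataFrame(
    {
        "summary_mean": [0.1, 0.4, 0.3],
        "texture_contrast": [1.2, 0.9, 1.1],
        "texture_homogeneity": [0.7, 0.8, 0.6],
        "segmentation_area": [12.0, 7.0, 9.0],
    },
    index=["spot_0", "spot_1", "spot_2"],
)

# For a DataFrame, filter(like=...) selects columns by substring match.
texture = features.filter(like="texture")
print(list(texture.columns))
```

All rows are kept; only the set of feature columns shrinks, which is what lets the same function cluster on, say, texture features alone.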

giovp commented 3 years ago

Good point. The problem is that there are many parameters that should be exposed: while it's true that it's a simple function, it wraps quite complex processing steps, and it's key that the user can change their parameters.

I think a better option would be a function that takes an adata_parent and a key in obsm, and returns an adata_child with the same obs, var as adata_parent.

This is, btw, closely related to the biggest problem of having multimodal data in anndata 😅 and we would not be the only ones facing this...

hspitzer commented 3 years ago

Yes, I agree that this wraps quite complicated processing steps. Maybe they should be explicitly visible to the user. It's just that moving the obsm back and forth is a bit ugly.

Ok, sure so you are proposing a function moving obsm to X, right? So this would translate to:

adata_features = move_obsm(adata, key="features")
sc.pp.scale(adata_features)
sc.pp.pca(adata_features)
sc.pp.neighbors(adata_features)
sc.tl.leiden(adata_features)

and then you can use adata_features directly for sc.pl.spatial because it already contains the gene clusters. Yeah, that could work.

To deal with features efficiently though, I need some sort of mechanism to select which rows of obsm to move (I do that with the like parameter in the function above).

giovp commented 3 years ago

> Ok, sure so you are proposing a function moving obsm to X, right? So this would translate to:

yes, something like that.

> and then you can use adata_features directly for sc.pl.spatial because it already contains the gene clusters. Yeah, that could work.

yes indeed, in that case you'd also have to copy over adata.uns for images and related metadata

> To deal with features efficiently though, I need some sort of mechanism to select which rows of obsm to move (I do that with the like parameter in the function above).

these features are what is moved into adata.X, right? Wouldn't it work to just move everything?

hspitzer commented 3 years ago

> > To deal with features efficiently though, I need some sort of mechanism to select which rows of obsm to move (I do that with the like parameter in the function above).

> these features are what is moved into adata.X, right? Wouldn't it work to just move everything?

I usually extract all features at once because this is more efficient. In some of the tutorials, though, I show the clustering for only a subset of the features (e.g. only segmentation features or only texture features). For this we need a way to filter the pandas table. I can also do that manually, but at that point there is no need for me to use such an extraction function at all.

My point is that I'd like to keep the example notebooks as short as possible, and was wondering if we could make some utility functions that do these steps for us.

giovp commented 3 years ago

OK yes, then making an extractor similar to what we already have might make sense, I think. Maybe the extractor we have can be modified? I also understand now about selecting specific features.

hspitzer commented 3 years ago

Yeah, it would be nice to use the extractor for this, but currently sq.pl.extract does obsm -> obs. We are talking about obsm -> X. I'm not sure if it's best practice to put these two different functionalities in one function? We could have a "destination" argument that can be either obs or X?

giovp commented 3 years ago

> I'm not sure if it's best practice to put these two different functionalities in one function? We could have a "destination" argument that can be either obs or X?

I like this idea!

giovp commented 3 years ago

I think this is now done with extract and several tutorials, will close this.

hspitzer commented 3 years ago

Is it? Does extract now also extract obsm -> X? Would still be great to have. Not super urgent though.

giovp commented 3 years ago

It would be cool to have a multiplex partition based on layers/obsm, see https://github.com/theislab/scanpy/issues/1818