scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Returning cluster assignments as str conflicts with matplotlib color sequences #1030

Open dburkhardt opened 4 years ago

dburkhardt commented 4 years ago

Currently, sc.tl.louvain etc. return cluster assignments as a Categorical with dtype str, resulting in an incompatibility with matplotlib color sequences. For example, the following code raises a ValueError:

import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt

adata = sc.AnnData(np.random.normal(size=(100,2)))
sc.pp.neighbors(adata)
sc.tl.louvain(adata)
plt.scatter(adata.X[:,0], adata.X[:,1], c=adata.obs['louvain'])

The error is: ValueError: RGBA values should be within 0-1 range. Funnily enough, this used to work due to a bug in matplotlib that was fixed in https://github.com/matplotlib/matplotlib/pull/13913.

Note, the following code works as intended:

plt.scatter(adata.X[:,0], adata.X[:,1], c=adata.obs['louvain'].astype(int))

I would have submitted a PR changing this behavior had I not noticed that returning cluster assignments as str is explicitly checked here:

https://github.com/theislab/scanpy/blob/78125e6355c0cd2c4ae930495829282eea6f4a52/scanpy/tools/_utils_clustering.py#L11-L23

This brings up a larger design question in scanpy / anndata: Why are arrays of numerics routinely converted to strings representing numbers?

In https://github.com/theislab/anndata/issues/311 I found a case where converting arrays of numerics to strings creates a bug when assigning to AnnData obsm with DataFrames with a RangeIndex. In that case, I understand there's a desire to avoid ambiguity in positional vs label indexing, but that issue was solved in pandas with the .loc and .iloc conventions. Why not carry that forward?

In this case, why not just return cluster assignments as arrays of numerics as is done in sklearn.cluster?

I think following these conventions will make both tools much more accessible to the general Python data science community.

LuckyMD commented 4 years ago

Hey! So one reason I can think of why it's important that .obs covariates are strings is that matplotlib will assume that numerical covariates lie on a continuous scale, and will therefore colour them with a continuous colour scale and provide the corresponding colour bar. Typically that is not what you want for louvain clusters. These are inherently categorical, so the conversion to string is used to further convert to pd.Categorical via sanitize_anndata().
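
For illustration, a minimal sketch of the behaviour described above (the obs columns `as_int` and `as_cat` are invented for this example):

```python
import numpy as np
import pandas as pd
import scanpy as sc

adata = sc.AnnData(np.random.normal(size=(100, 2)))
adata.obs["as_int"] = np.random.randint(0, 3, 100)                     # numeric -> continuous colour scale + colour bar
adata.obs["as_cat"] = pd.Categorical(adata.obs["as_int"].astype(str))  # categorical of strings -> discrete palette + legend

sc.pl.scatter(adata, x="0", y="1", color="as_int")
sc.pl.scatter(adata, x="0", y="1", color="as_cat")
```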

From my point of view the .loc and .iloc convention isn't particularly intuitive for new users, so I wouldn't be in favour of that setup. I'm not sure I see the issue with converting numerical values to strings if what you are using these as are labels, and thus categories (e.g. obs_names or other). Integers are after all values which have an inherent ordering and a defined distance, which is not a characteristic you would assign to an index.

scottgigante commented 4 years ago

So it appears to me that the difference between the discrete and continuous colours is purely an internal scanpy decision. Plotting with matplotlib and a pd.Categorical returns the same error as before.

[screenshot: the same ValueError raised by matplotlib]

An alternative would be to explicitly return a categorical from the clustering function, i.e. rather than ensuring that the clustering returns an array of str, ensure that it returns a categorical where the categories are ints.

Categorical (string) output: scanpy works, matplotlib errors:

![image](https://user-images.githubusercontent.com/8499679/73891608-a1ee1e80-4842-11ea-97b8-16c4618a894f.png)

Integer output: matplotlib works, scanpy mistakenly uses continuous colormap:

![image](https://user-images.githubusercontent.com/8499679/73891676-bd592980-4842-11ea-8043-5ed74693ee28.png)

Categorical (integer) output: both work.

![image](https://user-images.githubusercontent.com/8499679/73891704-ccd87280-4842-11ea-91c1-445b1574d812.png)
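
For reference, a rough sketch of that third case, a categorical whose categories are ints (this is not what scanpy currently returns; the data here is made up):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

X = np.random.normal(size=(100, 2))
labels = pd.Series(np.random.randint(0, 3, 100), dtype="category")  # categories are ints

# matplotlib can map the numeric categories to colours ...
plt.scatter(X[:, 0], X[:, 1], c=labels)
# ... and scanpy's plotting sees a Categorical and uses a discrete palette.
```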

ivirshup commented 4 years ago

In https://github.com/theislab/anndata/issues/311 I found a case where converting arrays of numerics to strings creates a bug when assigning to AnnData obsm with DataFrames with a RangeIndex. In that case, I understand there's a desire to avoid ambiguity in positional vs label indexing, but that issue was solved in pandas with the .loc and .iloc conventions. Why not carry that forward?

Also, the index of a dataframe in obsm must match obs_names.

An alternative would be to explicitly return a categorical from the clustering function, i.e. rather than ensuring that the clustering returns an array of str, ensure that it returns a categorical where the categories are ints.

We do explicitly return a categorical, it's just a categorical of strings. There are a few reasons for this:


About plt.scatter(adata.X[:,0], adata.X[:,1], c=adata.obs['louvain']), I think most of the scientific python ecosystem would like it if c could be categorical, and that it would mean categorical palette would be used. There's even an issue by our own @flying-sheep about it https://github.com/matplotlib/matplotlib/issues/6214, but it sounds like it's not gonna happen anytime soon.

dburkhardt commented 4 years ago

Oh sorry, in my original post I actually pasted the wrong snippet as the example that works. Matplotlib apparently already supports categoricals as long as the categories are numeric. For example, the following code works as intended:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = np.random.normal(size=(100,2))
colors = pd.Series(data[:,0], dtype='category')
sizes = pd.Series(data[:,1], dtype='category')

plt.scatter(data[:,0], data[:,1], c=colors, s=sizes)

I made a note of this https://github.com/matplotlib/matplotlib/issues/6214

Thanks @ivirshup for pointing me to https://github.com/theislab/anndata/issues/35 and https://github.com/theislab/anndata/issues/31. I'm not convinced that positional vs label indexing is so complicated to understand that people will find scanpy difficult to use if you start adopting an iloc vs loc syntax. I agree that it makes the learning curve a little steeper, but it enables greater compatibility with the ecosystem of data science tools in Python. It looks like there are some strong opinions here though, and I don't want to start a flame war. Scanpy is an excellent piece of software, and I greatly appreciate the work that goes into it.

Responding to @LuckyMD, I again would just point out that returning cluster labels as ints is the standard for sklearn, and I would urge that scanpy serve as an access point to single cell analysis both for biologists and also for data science / machine learning researchers. Biologists will likely stick to using scanpy's plotting functions where you can handle default color maps for things that appear to be labels. We do this kind of checking in scprep: https://github.com/KrishnaswamyLab/scprep/blob/09de1bf41c4b42d331b29a4493c436110b641e07/scprep/plot/scatter.py#L206-L253.

However, machine learning researchers likely have their own preferred plotting tools in matplotlib or seaborn, may be trying to compare clustering results from scanpy against results from sklearn.cluster, or otherwise want to fit scanpy into their analysis pipelines; for them, turning arrays of numerics into arrays of strings causes headaches that make the tools less accessible. The argument that matplotlib's default colormap for numerics is continuous seems less important than making scanpy compatible with the larger ecosystem of data science tools in Python.

Finally, I will note that in Python, strings also have a defined ordering, even if you might not think of them that way. Although in some respects the question "Is '1' less than 'a'?" is nonsensical, this is a well-defined comparison in Python:

In [1]: '1' < 'a'
Out[1]: True

Again, I want to emphasize that I really love what has been done with scanpy / anndata so far. We use it in various places in our single cell workshop (https://krishnaswamylab.org/workshop), and I rely on the implementations of louvain / paga / dpt for my research. I bring up these issues here because I think changing some of these conventions could result in greater widespread adoption that I would love to see for scanpy and anndata.

LuckyMD commented 4 years ago

Hey @dburkhardt,

I'm definitely not perceiving this as a flame war, so no worries ;). It's good to discuss priorities like this. You are right, however, that I have a strong opinion about this... and may have been a bit German/direct in how I presented it ^^. To be honest, I'm not a huge fan of the .loc and .iloc conventions in the first place, and I frequently get frustrated with how slicing works in numpy as well. So I have always been quite happy with AnnData's solution here. There should, however, be better documentation around these decisions/conventions. My feeling is that people who are familiar with pandas and numpy conventions won't have a hard time coming to grips with the conventions in AnnData, but that wouldn't necessarily be the case if the choices were switched around.

dburkhardt commented 4 years ago

Great! Glad to have the discussion. I think there's a lot to talk about here, and it seems like a lot of it circles around how scanpy / anndata should interact with the greater ecosystem of tools for data analysis in Python.

I think there are conventions in the numpy / pandas / sklearn / matplotlib ecosystems that result in a steep learning curve. I am very supportive of making that learning curve more accessible. I think it's great to provide helper functions that "just work"; I think of the filtering, normalization, and plotting functions especially. I am also very much in favor of accessible tutorials, documentation, and workshops that make using these tools approachable for a lay audience that may not understand the distinctions between various APIs. I've relied heavily on these kinds of resources as I've learned how to program within this ecosystem, and I've seen how helpful they can be for new users.

What I find less desirable here is introducing incompatibilities or breaking conventions used in the broader numpy / pandas / sklearn / matplotlib ecosystems in order to lower the barrier to entry for scanpy. I agree with you that using numerics to represent clusters is counter-intuitive when these integers actually represent discrete labels. However, I don't find this a compelling reason to break the convention used in the broader data analysis ecosystem. If I want to compare louvain to spectral clustering in Python, I need to use scanpy and sklearn together, and I want this to "just work". I agree with you that iloc vs loc indexing is not straightforward for lay users, but I think it's a mistake to change the convention for how one indexes positionally vs by label, especially when the underlying data structure is often a dataframe. Instead of breaking these conventions, I would love to see the tool "just work" and make sure the tutorials and documentation make the conventions exceedingly clear for new users.

I'm not sure what's the best way to resolve this, because I think this line of thinking results in a couple larger design questions for scanpy as well. For example, should AnnData objects be valid input for numpy ufuncs? I.e. should the following code work?

import numpy as np
import pandas as pd
import scanpy as sc

data = pd.DataFrame(np.random.normal(size=(100,2)))
adata = sc.AnnData(data)
np.sqrt(adata)

Currently this raises a TypeError. Why shouldn't this "Just work"? What about the convention of returning a copy by default, instead of modifying objects in place? I can't think of many other Python toolkits that don't return a copy when you perform some operation on a data object.

I would really love to use scanpy / anndata more in my day to day work. Right now, this lack of compatibility is the hurdle that prevents that. I disagree that people who are familiar with pandas and numpy will have an easy time coming to grips with leveraging a tool that doesn't interact well with the greater ecosystem of data analysis tools in Python. I think these users are more likely to use scanpy / anndata only to get access to the methods that are only implemented in scanpy, and then return to the ecosystem of tools that all work together. I can only speak to my experience, but this is how I use scanpy. If scanpy were to adopt greater inter-compatibility, I would be happy both to use it more and also to help contribute to its development and documentation.

I'm excited to hear your thoughts!

LuckyMD commented 4 years ago

I can understand your thought process behind facilitating the integration of anndata into the broader ecosystem and I can also understand the frustration. I don't think the integration is quite as bad as you suggest though. adata.X is still a numpy.ndarray and can be used as such, exactly as adata.var and adata.obs are dataframes. The only issue is when you require the object to work as a whole data structure in a particular function. I'm not the most experienced numpy user, but from what I've seen, you would typically expect any numpy function that you apply to an anndata object to be applied to adata.X, and not to require information from other parts of the object. Or am I missing a use case here? So the only change would then be that adata = np.sqrt(adata) would need to become adata.X = np.sqrt(adata.X).

Furthermore, it's not entirely clear what a numpy function applied to an AnnData object should do. np.min() could be on adata.X or any column in .obs or .var. You can call it on the columns in the pandas dataframes already via pandas conventions... which makes a bit more sense to me.
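
For context, that is roughly how it looks today (a small sketch; the obs column `n_counts` is made up):

```python
import numpy as np

np.min(adata.X)               # minimum over the data matrix
adata.obs["n_counts"].min()   # minimum of a metadata column, via pandas conventions
```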

Regarding the slicing conventions... @ivirshup has mentioned a few reasons why things are sliced as they are in scanpy. What would your suggestion look like? loc and iloc work for adata.obs and adata.var atm. Would you forbid an adata['Cell A',:]?

scottgigante commented 4 years ago

I think pandas provides a good template for the question of np.min(adata). np.min(df) gives the minimum value stored in the dataframe, not the minimum value in the Index (aka obs) or Columns (aka var). Given AnnData is basically a way of storing data and metadata associated with both the rows and columns of that data, it goes without saying (in my opinion at least) that numerical methods applied to adata should be applied to adata.X.

Re: slicing, I think it makes sense to have explicit slicing for one or the other (i.e. loc and iloc) and then a default slicing (i.e. adata['Cell A',:]) which accepts both position-based and name-based keys when the two are unambiguous. It wouldn't be hard to include a check that says: if the names are a) integers and b) not simply a RangeIndex (i.e. names and positions coincide), then throw a warning or an error asking the user to specify whether they mean name or position.
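
A rough sketch of what that ambiguity check could look like (entirely hypothetical; this is not AnnData's API):

```python
import pandas as pd

def resolve_key(key, names: pd.Index):
    """Decide whether an integer key can be used unambiguously on this axis."""
    if pd.api.types.is_integer(key) and pd.api.types.is_integer_dtype(names):
        # Integer labels that are not a plain RangeIndex: positions and names diverge.
        if not isinstance(names, pd.RangeIndex):
            raise ValueError(
                "Integer key is ambiguous for an integer-labelled axis; "
                "use an explicit label- or position-based accessor."
            )
    return key
```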

LuckyMD commented 4 years ago

So the question is just about passing adata or adata.X then? This doesn't sound like a particularly difficult problem to solve and wouldn't require structural changes in AnnData.

Indeed one could check whether an index is in .obs_names or .var_names. @ivirshup is just busy trying to speed up slicing in AnnData though, so maybe he can comment on this.

scottgigante commented 4 years ago

I feel like np.min(adata) is more emblematic of the issue at hand here, which is how hard we should work to integrate with the rest of the Python ML / data science ecosystem, e.g. matplotlib.pyplot.scatter. My personal view is that anything that works with a pandas DataFrame should also work with an AnnData object; if you make it harder for people to work with AnnData than with the most obvious competing data structure, they will simply use that other object.

LuckyMD commented 4 years ago

I still fail to see where it is harder to work with AnnData than with pandas. But maybe I'm the wrong person to comment on this, as I don't do as much matplotlib plotting (more seaborn and scanpy). Also, pandas is an inherently simpler structure than AnnData, so it's not really a competing project from my point of view. We have to worry about scaling in several dimensions, which is quite different from pandas.

dburkhardt commented 4 years ago

So I think the issue here isn't that AnnData is harder to work with than pandas; it's that several API choices in scanpy / AnnData introduce incompatibility with other tools in the ecosystem, which is generally undesirable. I can understand that if you just look at scanpy and AnnData as standalone packages for single cell analysis in Python, then this doesn't seem like a big deal. However, I think these tools, especially AnnData, have the potential to serve the broader Python data analysis community. scanpy might be limited to people who are exclusively looking at single cell data, but AnnData definitely has utility outside of single cell (which I thought was why the documentation doesn't discuss scRNA-seq much).

The good news is with most of these, relatively simple changes would make these tools all inter-compatible in ways that "just work." Among these changes are:

  1. Return cluster labels as ints
  2. Support non-string indexes (and adopt loc vs iloc)
  3. Support ufuncs with AnnData
  4. (maybe) Return copies of input for most scanpy functions

Now I'm not saying there aren't reasons for keeping the conventions that have been selected, but it's definitely true that these conventions are different from the conventions in numpy, pandas, and sklearn. I think where Scott and I are coming from is the perspective that unless it would be unbearably difficult to keep to those conventions, it's generally better to stick to conventions used in the larger data analysis ecosystem.

I'm not sure I agree that pd.DataFrame and AnnData don't compete when it comes to people who are doing single cell analysis in Python. What do you mean by "have to worry about scaling in several dimensions"?

I think sparse DataFrames with a MultiIndex are similar to AnnData objects. It's just that AnnData objects have a more consistent API for supporting sparse data structures, that having the obs and var annotations be DataFrames is more convenient and efficient than a MultiIndex for slicing, and that AnnData has a handy uns slot for other miscellany, which is just helpful.

ivirshup commented 4 years ago

It's cool to see people put so much thought into the discussion here! I think a lot of the ideas here are things that have been tossed around in the past for anndata. There's a lot here, so I'm just going to respond to @dburkhardt's points for now.

I agree with you that using numerics to represent clusters is counter-intuitive when these integers actually represent discrete labels. However, I don't find this to be a compelling reason to break the convention used in the broader data analysis ecosystem. If I want to compare louvain to spectral clustering in Python, I need to use scanpy and sklearn and I want this to "just work".

  1. Return cluster labels as ints

I'm not sure using strings breaks any compatibility. Doesn't scikit-learn work fine with strings representing categories?

Example of sklearn working with string categories:

```python
from sklearn import metrics
import numpy as np
from string import ascii_letters

x = np.random.randint(0, 10, 50)
y = np.array(list(ascii_letters))[np.random.randint(0, 10, 50)]
metrics.adjusted_rand_score(x, y)
```

but I think it's a mistake to change the convention for how one indexes positionally vs using labels

  1. Support non-string indexes (and adopt loc vs iloc)

I don't think the conventions are so set in stone. Numpy behaves differently than pandas, which behaves differently than xarray. I personally like the conventions of DimensionalData.jl, but think xarray is likely the direction we'll head.

  1. Support ufuncs with AnnData

What does np.log1p(adata) return? Is it the whole object? Do we want to copy the whole object just to update values in X?

I think probably not. I also think AnnData <-> pd.DataFrame is the wrong analogy. In my view, an AnnData object is a collection of arrays, more akin to an xarray.Dataset, Bioconductor SummarizedExperiment, or an OLAP cube.

I think a syntax that could work better would be something like:

adata.apply_ufunc(np.log1p, in="X", out="X")
adata.apply_ufunc(np.log1p, in=("layers", "counts"), out=("layers", "log_counts"))

As an aside, I think we could do something similar with sklearn style transformers, i.e.

clf = SVC.fit(labelled, X=("obsm", "X_pca"), y="leiden")
clf.predict(unlabelled, X=("obsm", "X_pca"), key_added="transferred_labels")

  1. (maybe) Return copies of input for most scanpy functions

I think a core advantage of scanpy over the bioconductor ecosystem is the performance. If we always returned copies by default, a lot of that would go away.
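
For context, the current convention already lets you opt into a copy where you want one (a small sketch, assuming an existing adata):

```python
import scanpy as sc

sc.pp.log1p(adata)                          # default: modifies adata in place
adata_log = sc.pp.log1p(adata, copy=True)   # opt in to a copy when you want one
```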

dburkhardt commented 4 years ago

Hi, I just wanted to bring this back up again because I've been logging some of the issues I've encountered. It seems we're at a bit of a philosophical divide, so perhaps it's best for me to just register the use cases where AnnData / scanpy personally cause me friction:

Instead of pasting all errors, I'm just going to paste code blocks I wish worked. Note, these are actual use cases I have regularly encountered.

1. Cannot pass AnnData to numpy or sklearn operators

import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import decomposition, cluster

data = np.random.normal(size=(100,10))
adata = sc.AnnData(data)

# All of the following raise errors
np.sqrt(adata)
adata[:, adata.var_names[0:3]] - adata[:, adata.var_names[3:6]]

adata.obsm['X_PCA'] = decomposition.PCA(2).fit_transform(adata)

To answer the question above, I think it should return the whole AnnData object, like how DataFrames return themselves. I don't know if we think it should "update" the original AnnData. I'm also confused by how this results in a performance decrease: if I do adata = np.sqrt(adata), then isn't this the same footprint as modifying in place? If I do adata_sq = np.sqrt(adata), then my intention is to duplicate the adata object, and I would like AnnData to respect that intention.

2. Requirement to use .var_vector or .obs_vector for single columns

# This works as expected
adata[:, adata.var_names[0:3]]

# I wish this did as well.
adata[:, adata.var_names[0]]

3. .var_vector doesn't return a Series

pdata = pd.DataFrame(data)
# Returns series
pdata[0]

# Returns ndarray (likewise for .var_vector)
adata.obs_vector(adata.var_names[0])

4. Clusters as categories creates confusing scatterplots

sc.pp.neighbors(adata)
sc.tl.leiden(adata)

plt.scatter(adata.obs['leiden'], adata.X[:,0])

Produces the following plot. I would like it to have order 0-5 by default

[screenshot: scatter plot with the 'leiden' categories shown out of numeric order]

5. Cannot pass clusters to c parameter in plt.scatter

I would like this to just work. Instead it throws a huge error.

plt.scatter(adata.X[:,0], adata.X[:,1], c=adata.obs['leiden'])

6. Clusters as categories frustrate subclustering

I understand this is a niche application, but like 4 and 5, this would be fixed by matching the output of sklearn.cluster operators.

sc.pp.neighbors(adata)
sc.tl.leiden(adata)

cluster_zero = adata[adata.obs['leiden'] == '0']
sub_clusters = cluster.KMeans(n_clusters=2).fit_predict(cluster_zero.X)

# Here I'm trying to break up cluster '0' into subclusters with
# new names that don't clash with the existing clusters.
# However, np.max() and the + operator aren't well defined for
# categoricals of strings
sub_clusters = sub_clusters + np.max(adata.obs['leiden'])

dburkhardt commented 4 years ago

@ivirshup I'd be interested to hear more about the differences between DataFrames and xarrays. The website for xarray makes it sound like they're very similar.

Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xarray to the pandas.DataFrame.

Is there a dichotomy here? What differentiates AnnData from Datasets?

ivirshup commented 4 years ago

1. Passing anndata objects to numpy and sklearn operators

I think this would be great! This would be easy to implement if Python had generic functions. This is kinda something that's being worked on for numpy, but the assumptions a ufunc has about its input data do not match what an AnnData object is.
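
For background, NumPy's dispatch protocol looks roughly like this on a toy wrapper (a sketch only; this is not AnnData, and it sidesteps the questions below about what should be returned and copied):

```python
import numpy as np

class UFuncWrapper:
    """Toy container that forwards ufuncs to its .X attribute."""

    def __init__(self, X):
        self.X = np.asarray(X)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if method != "__call__":
            return NotImplemented
        # Unwrap any wrapped inputs, apply the ufunc to the underlying arrays, rewrap.
        arrays = [i.X if isinstance(i, UFuncWrapper) else i for i in inputs]
        return UFuncWrapper(ufunc(*arrays, **kwargs))

np.sqrt(UFuncWrapper(np.arange(4.0))).X  # array([0., 1., 1.414..., 1.732...])
```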

I've worked on a side project of just wrapping the sklearn transformers so you can pass anndata objects, and could try and get that cleaned up for use if it'd be valuable.


I'm not really sure what you expect this line to do though:

adata[:, adata.var_names[0:3]] - adata[:, adata.var_names[3:6]]

I would probably throw an error for that, since the var names wouldn't be the same. It's also not obvious to me which arrays would be subtracted (all of them? some of them?).

If this is meant to do:

adata[:, adata.var_names[0:3]].X - adata[:, adata.var_names[3:6]].X

I don't think that's so much more work.

I think it should return the whole AnnData object, like how DataFrames return themselves. I don't know if we think it should "update" the original AnnData. I'm also confused by how this results in a performance decrease?

If it should return the whole object, but not update the original, then all of the values from the original need to be copied to prevent unintentional modification. This is really expensive for large objects, which single cell datasets often are.

For your example of adata = np.sqrt(adata) vs adata_sq = np.sqrt(adata), there's no way for us to tell which of those statements was made while evaluating np.sqrt. That would require the ability to overload assignment, and for python to have different evaluation rules.

2. Requirement to use .var_vector or .obs_vector for single columns

Is what you're saying that you want: adata[:, adata.var_names[0]].X to be one dimensional?

This used to be the behaviour, but it got confusing quickly. Suddenly, adata.X could be a different shape from adata. I would recommend reading the issues that were opened about this on anndata for more context. Here's one of the main ones: https://github.com/theislab/anndata/issues/145.

Another issue is that scipy.sparse has no such thing as a 1-dimensional sparse array. This is a long standing problem, which I'll write a bit more about in the context of xarray.

3. .var_vector doesn't return a Series

I remember thinking about this as a possibility. IIRC I decided against it because I just as frequently wanted some other column, like "gene_symbols", as the index. I could see adding this as an option via a keyword argument now. But maybe you just want to use sc.get.obs_df?
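
For reference, a quick sketch of what sc.get.obs_df gives you (assuming adata already has a "leiden" column):

```python
import scanpy as sc

# A DataFrame indexed by obs_names, with one column per requested key
df = sc.get.obs_df(adata, keys=[adata.var_names[0], "leiden"])
df[adata.var_names[0]]  # a pandas Series
```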

4. Clusters as categories creates confusing scatterplots

Well, there is no order to the categories, so I guess I see why matplotlib wouldn't plot those in sorted order, but I agree it's a little counter-intuitive. It seems like more of a matplotlib issue to me though.

5. Cannot pass clusters to c parameter in plt.scatter

Use one of these?

sc.pl.scatter(pbmc, x=pbmc.var_names[0], y=pbmc.var_names[1], color="leiden")   

import seaborn as sns
sns.scatterplot(pbmc.X[:, 0], pbmc.X[:, 1], hue=pbmc.obs["leiden"])

Categorical values for scatter plots are a known issue for matplotlib, as I linked to above. Their current behaviour if you pass a numeric valued categorical (regardless of whether it's ordered) is to use a continuous color palette, which in my opinion is easily misleading.

6. Clusters as categories frustrate subclustering

We've used a different convention for subclustering, which is actually the reason we use strings. We're assuming you're breaking a cluster or set of clusters into smaller ones, so the new id is appended to the old one. I believe there's a tutorial with this somewhere. Do you know where this was @LuckyMD?

Basically, I'd say do something more like:

from collections.abc import Iterable
import numpy as np
import pandas as pd
from sklearn import cluster

def kmeans_subcluster(adata, orig_key, orig_clusters, key_added):
    if isinstance(orig_clusters, str) or not isinstance(orig_clusters, Iterable):
        orig_clusters = [orig_clusters]

    subset = adata[adata.obs[orig_key].isin(orig_clusters)]
    sub_clustering = cluster.KMeans(n_clusters=2).fit_predict(subset.X)

    # Make new clustering

    new_clustering = adata.obs[orig_key].copy()
    # Make new names
    new_clusters = ",".join(orig_clusters) + "-" + pd.Series(np.unique(sub_clustering), dtype=str)
    new_clustering.cat.add_categories(new_clusters, inplace=True)
    new_clustering[subset.obs_names] = new_clusters[sub_clustering]

    # Add back to adata
    adata.obs[key_added] = new_clustering
Or if you wanted something more generic:

```python
from typing import Callable, Collection
from anndata import AnnData


def subcluster(
    cluster_func: Callable[[AnnData], Collection],
    adata,
    orig_key,
    orig_clusters,
    key_added,
):
    """
    Params
    ------
    cluster_func
        Function that produces a clustering of observations from an anndata object.
    adata
    orig_key
        Key in adata.obs for original clustering
    orig_clusters
        Set of clusters to subset to before clustering.
    key_added
        Key in obs to add resulting clustering to.
    """
    if isinstance(orig_clusters, str) or not isinstance(orig_clusters, Iterable):
        orig_clusters = [orig_clusters]
    subset = adata[adata.obs[orig_key].isin(orig_clusters)]
    sub_clustering = (
        pd.Categorical(cluster_func(subset))
        .map(lambda x: ",".join(orig_clusters) + "-" + str(x))
    )
    assert not (adata.obs[orig_key].isin(sub_clustering.categories)).any()
    # Create new cluster assignment
    new_clustering = (
        adata.obs[orig_key].astype("category", copy=True)
        .cat.add_categories(sub_clustering.categories)
    )
    new_clustering[subset.obs_names] = sub_clustering.astype(str)
    # Add back to adata
    adata.obs[key_added] = new_clustering


from functools import partial

subcluster_kmeans = partial(
    subcluster, lambda x: cluster.KMeans(n_clusters=2).fit_predict(x.X)
)
```

This will still work if you've already assigned some labels too.

ivirshup commented 4 years ago

Xarray and anndata

Theoretically, AnnData objects are kind of like a special case of xarray.Dataset. While AnnData objects have an obs and a var dimension, an xarray.Dataset can have any number of dimensions. AnnData objects just specialize to the 2d case.
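
To make the analogy concrete, a loose sketch of an AnnData-like layout as an xarray.Dataset (purely illustrative; the names are made up):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"X": (("obs", "var"), np.random.normal(size=(100, 10)))},
    coords={
        "obs": [f"cell_{i}" for i in range(100)],
        "var": [f"gene_{i}" for i in range(10)],
        "leiden": ("obs", np.random.randint(0, 3, 100)),  # an obs-aligned annotation
    },
)
```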

I think it would make a lot of sense to eventually have anndata and scanpy be based on xarray, or something like it. In practice there are a number of difficulties here. The biggest one is support for sparse data, and I'll briefly point out a couple others.

Sparse arrays

I could probably rant about this for a while, since it's always a problem. Efficient processing of scRNA-seq data needs sparse matrices. The only fully featured sparse array library in Python right now is scipy.sparse. All of its sparse arrays follow only the np.matrix interface, which is deprecated. This means that they only kind-of work like arrays, and need to be special-cased pretty frequently.
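
A couple of examples of the np.matrix-style behaviour in question (a quick sketch):

```python
import numpy as np
from scipy import sparse

a = sparse.csr_matrix(np.eye(3))
a * a          # matrix multiplication, unlike elementwise `*` on ndarrays
a.multiply(a)  # the elementwise product needs a dedicated method
a[0]           # slicing a row still gives a 2-D (1, 3) matrix; there is no 1-D sparse vector
```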

xarray seems to work pretty well with a number of different array types, as long as they act like np.ndarrays. They have explicit support for pydata/sparse, but that library isn't well supported by the rest of the ecosystem, probably because it doesn't have CSR or CSC matrices yet. This leaves xarray with a level of sparse array support that isn't usable for us.

Pairwise arrays and other weird behaviour

LuckyMD commented 4 years ago

I believe there's a tutorial with this somewhere. Do you know where this was @LuckyMD?

I'm not so familiar with the scanpy tutorials, but I do show sub-clustering in the single-cell-tutorial notebook here

dburkhardt commented 4 years ago

Thanks for the long response @ivirshup!

For 1., I think a ufunc should always act on adata.X, and I want it to return the adata object with the sqrt applied to adata.X. Adding support for the sklearn operators would be great.

For the second part, my intention is for the result to be adata[:, adata.var_names[0:3]].X - adata[:, adata.var_names[3:6]].X, and it's fine with me if the var_names are lost so long as the obs_names are kept. If they're not the same shape, then I would expect the same error as pandas throws.

For 2., I think it's okay if you return a dense 1-d array when I access a single column vector. I don't understand where the confusion comes in with adata.X changing shape when you access a single column, but that hasn't been an issue for me.

For the rest, I hope you can survey the community to figure out how rare my use-cases are. I would like scanpy / anndata to fit into my existing workflow that I picked up while learning matplotlib / pandas / numpy. I want slicing an AnnData to behave like slicing a DataFrame; I want clusters to be ints; I want to apply a transformation to a data-container and get the whole container returned with the transformation applied to the values.

I can come up with workarounds for all of the choices you've made here. That's not the issue. I raised this comment because these workarounds add overhead to getting my work done. I'm not going to change my workflow to match your design choices where they diverge from the APIs for sklearn / numpy / pandas etc. I know I'm not the only one with these wants (e.g. @scottgigante has similar frustrations), but I don't know how prevalent these frustrations are.

I think at the end of the day, my concern here boils down to what infrastructure you put in place to make sure the needs of the community are balanced with the intentions of the developers. I think the efforts by cellxgene are a great model for this, and I would happily get involved in figuring out the best way to incorporate community feedback into the development of scanpy / anndata.

All this said, your tools do provide a bunch of amazing functionality that I rely on for my PhD. I really appreciate all the effort you've put in. I especially love how easy it is to run louvain / leiden, and how supportive you've been to people adding external tools to scanpy so they can be made accessible to the broader community of single cell users in Python. Thanks!