scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.92k stars 600 forks

How to merge sub-clusters or rename the categories with identical cluster names? #925

Closed selifeski closed 4 years ago

selifeski commented 4 years ago

Hi,

To get an in-depth understanding, I set the resolution high for louvain clustering, but now I cannot merge subclusters. When I try to rename the categories so that several share the same cluster name, I get an error about the names not being unique. I could not find a working merge_clusters function. Is anyone else having the same issue? I would appreciate any help. Thanks!

fidelram commented 4 years ago

The clusters are simply annotations stored in the adata.obs pandas DataFrame. Thus, to merge clusters you can create a new column containing your merged clusters. For example:

old_to_new = dict(
    old_cluster1='new_cluster1',
    old_cluster2='new_cluster1',
    old_cluster3='new_cluster2',
)
adata.obs['new_clusters'] = (
    adata.obs['old_clusters']
    .map(old_to_new)
    .astype('category')
)
flying-sheep commented 4 years ago

For general help like this, please go to https://scanpy.discourse.group/

This is also what the issue template says. How could we have made the text more clear so that you’d have found your way there?

ivirshup commented 4 years ago

@fidelram, I like that! I had been struggling to come up with a concise way of doing this. I wonder if we can make that more concise. Here's one where the mapping can be defined inline, and you don't have to define relationships for the ones that stay the same:

adata.obs['new_clusters'] = (
    adata.obs["old_clusters"]
    .map(lambda x: {"a": "b"}.get(x, x))
    .astype("category")
)
fidelram commented 4 years ago

@ivirshup I like that.

ivirshup commented 4 years ago

Here's a related question, if I want to make a labelling which includes a subset of clusters from a few different solutions, is there a concise way to write that? I.e. I want clusters 1,2, and 3 from clustering A, and clusters 4 and 5 from clustering B.

fidelram commented 4 years ago

I think you will need two steps: one to get clusters 1, 2, and 3 from clustering A, and another for the rest.
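A minimal sketch of those two steps, using a plain pandas DataFrame as a stand-in for adata.obs. The column names clustering_A and clustering_B, the "A_"/"B_" prefixes, and the example labels are all assumptions made for illustration; they do not come from scanpy.

```python
import pandas as pd

# Hypothetical stand-in for adata.obs with two clustering columns.
obs = pd.DataFrame(
    {
        "clustering_A": pd.Categorical(["1", "2", "3", "6", "6", "1"]),
        "clustering_B": pd.Categorical(["4", "4", "5", "4", "5", "5"]),
    }
)

# Clusters to take from clustering A; every other cell falls back to clustering B.
keep_from_a = ["1", "2", "3"]

# Prefix the labels so that names from A and B cannot collide.
labels_a = "A_" + obs["clustering_A"].astype(str)
labels_b = "B_" + obs["clustering_B"].astype(str)

# Step 1: mark the cells that belong to a wanted cluster of A.
use_a = obs["clustering_A"].isin(keep_from_a)

# Step 2: keep A's label for those cells and B's label for the rest.
obs["combined"] = labels_a.where(use_a, labels_b).astype("category")
```

Note that a cell outside A's clusters 1 to 3 receives whatever label it has in B, so it is worth checking that only the intended B clusters (4 and 5 in the question) show up in the result.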

danielg52 commented 4 years ago

Can this method apply to Leiden clustering as well? I recapitulated the above code in my program, and my new cluster column returned only NaNs. What should the "old_cluster1" side of the structure look like when I am trying to make that dictionary?

Thanks
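One likely cause, assuming the failing code looked like the snippets above: scanpy stores Leiden and Louvain labels as strings ('0', '1', ...), so a dictionary with integer keys, or with the placeholder names from the example, matches nothing and .map() returns NaN for every cell. The sketch below uses a plain pandas Series as a hypothetical stand-in for adata.obs["leiden"], with made-up cell type names.

```python
import pandas as pd

# Hypothetical stand-in for adata.obs["leiden"]: string labels stored as a categorical.
leiden = pd.Series(pd.Categorical(["0", "1", "2", "1", "0"]))

int_keys = {0: "T cells", 1: "B cells", 2: "B cells"}
str_keys = {"0": "T cells", "1": "B cells", "2": "B cells"}

# Integer keys never match the string labels, so every value becomes NaN.
all_missing = leiden.map(int_keys).isna().all()

# String keys match; clusters "1" and "2" are merged into "B cells".
merged = leiden.map(str_keys).astype("category")
```

Printing adata.obs["leiden"].cat.categories shows the exact labels that the dictionary keys have to match.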

zhanglab2008 commented 3 years ago

I have the same issue as danielg52 above: my new cluster column contains only NaNs.

auesro commented 3 years ago

Just to answer those who, like me, are beginners in Python: the solution provided by @ivirshup works perfectly (for louvain and leiden, of course, and for any other adata.obs column that you want to remap):

adata.obs['new_clusters'] = (
    adata.obs["old_clusters"]
    .map(lambda x: {"a": "b"}.get(x, x))
    .astype("category")
)

Where "a" is the name of the category you want to change, and "b" is its new name. If you want to change more categories, simply add more entries to the dictionary, like:

adata.obs['new_clusters'] = (
    adata.obs["old_clusters"]
    .map(lambda x: {"a": "b", "c": "d"}.get(x, x))
    .astype("category")
)

@fidelram's answer does not work as written in this specific case because the louvain (or leiden) categories are named '0', '1', '2', '3', '4', and such names cannot be used as keyword arguments to dict(): writing dict(0='X') raises SyntaxError: keyword can't be an expression. A dict literal such as {'0': 'X'} has no such restriction.

Hope this helps,

Best,

A

liliay commented 2 years ago

Hi guys,

Thank you for sharing your code and explanation. What if I want to rename multiple clusters ["a","c","d"] to "b" ? I have tried a list of elements to change as a key, but it does not work for me.

Thanks in advance for your reply
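One way to do this, sketched on a plain pandas Series standing in for adata.obs["old_clusters"] (the labels are made up for illustration): a list cannot be used as a dictionary key, but it can be expanded into one entry per old name.

```python
import pandas as pd

# Hypothetical stand-in for adata.obs["old_clusters"].
old = pd.Series(pd.Categorical(["a", "b", "c", "d", "e"]))

# A list cannot be a dict key, so expand it into one entry per old name.
to_merge = ["a", "c", "d"]
old_to_new = {name: "b" for name in to_merge}

# Labels that are not in the mapping ("b" and "e") are kept unchanged by .get(x, x).
new = old.map(lambda x: old_to_new.get(x, x)).astype("category")
```

The same dictionary can be extended with further groups, for example by merging in a second comprehension for another target name.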

lamdan2 commented 2 years ago

The code below worked for me. I think the difference is the dict syntax: a dict literal accepts numeric keys, which dict(...) keyword arguments do not. Notice that I am also merging clusters by assigning them the same name.

old_to_new = {
    0: 'Astrocytes 1',
    1: 'Glutamatergic neurons 1',
    2: 'Astrocytes 2',
    3: 'Oligodendrocytes 1',
    4: 'Inhibitory neurons 1',
    5: 'Glutamatergic neurons 2',
    6: 'Oligodendrocytes 1',
    7: 'Unknown',
    8: 'OPCs',
    9: 'Glutamatergic neurons 3',
    10: 'Microglia',
    11: 'Inhibitory neurons 1',
    12: 'Tanycytes',
    13: 'Endothelial',
    14: 'Astrocytes 3',
    15: 'Oligodendrocytes 1',
    16: 'Inhibitory neurons 2',
    17: 'T cells',
    18: 'Oligodendrocytes 2',
}
adata.obs['annotation'] = (
    adata.obs['seurat_clusters']
    .map(old_to_new)
    .astype('category')
)

flying-sheep commented 2 years ago

I think anndata’s rename_categories should accept non-unique values as argument. Then one could simply do things like

cluster_markers = {
    'CD4 T': {'IL7R'},
    'CD14+\nMonocytes': {'CD14', 'LYZ'},
    'B': {'MS4A1'},
    'CD8 T': {'CD8A'},
    'NK': {'GNLY', 'NKG7'},
    'FCGR3A+\nMonocytes': {'FCGR3A', 'MS4A7'},
    'Dendritic': {'FCER1A', 'CST3'},
    'Mega-\nkaryocytes': {'PPBP'},
}
marker_matches = sc.tl.marker_gene_overlap(adata, cluster_markers)
adata.rename_categories('leiden', marker_matches.idxmax())

As it stands, things like the pbmc3k tutorial are super flaky because they hardcode things like this.

LuckyMD commented 2 years ago

Cool use of .idxmax() here, @flying-sheep! I would still inspect manually though ;).

tingxie2020 commented 2 years ago

Hi guys,

I have the same question as liliay above: how can I rename multiple clusters ["a","c","d"] to "b"? Does anyone have a solution? Please let me know. Thank you.

jonrot1906 commented 1 year ago

Regarding lamdan2's mapping above: for me, adding quotation marks to the cluster IDs (so that the keys are strings) did the trick. I then just did adata.obs["celltype"] = adata.obs.leiden.map(old_to_new).