scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.9k stars 597 forks source link

cluster average expression #181

Open wangjiawen2013 opened 6 years ago

wangjiawen2013 commented 6 years ago

Dear, Can you add a function to calculate the average expression of each gene in each cluster ?

wangjiawen2013 commented 6 years ago

And, I want change the default colors of the tsne plot, but I don't know the relationship between the parameters 'palette' and 'color_map' and I have tried some settings, but they don't work, Can you help me ?

falexwolf commented 6 years ago

Hi! You can always change the colors of a categorical annotation by directly modifying, e.g. adata.uns['louvain_colors']. It's a bug that palettes doesn't change it when the colors field is present in adata.uns. I'll try to fix this today.

How do you want the mean for each cluster to be visualized or stored? If you run pl.paga(adata, color='mygene') the mean per cluster is plotted.

wangjiawen2013 commented 6 years ago

Some times we need to know the average expression of each gene in each cluster, it's better to store it in a pandas dataframe with rows corresponding to each gene and columns corresponding to each cluster, then we can export it to a csv file.

LuckyMD commented 6 years ago

I have written a function to do this... I could add it to scanpy. But for the moment, this is it:

def marker_gene_expression(anndata, marker_dict, gene_symbol_key=None, partition_key='louvain_r1'):
    """A function go get mean z-score expressions of marker genes
    # 
    # Inputs:
    #    anndata         - An AnnData object containing the data set and a partition
    #    marker_dict     - A dictionary with cell-type markers. The markers should be stores as anndata.var_names or 
    #                      an anndata.var field with the key given by the gene_symbol_key input
    #    gene_symbol_key - The key for the anndata.var field with gene IDs or names that correspond to the marker 
    #                      genes
    #    partition_key   - The key for the anndata.obs field where the cluster IDs are stored. The default is
    #                      'louvain_r1' """

    #Test inputs
    if partition_key not in anndata.obs.columns.values:
        print('KeyError: The partition key was not found in the passed AnnData object.')
        print('   Have you done the clustering? If so, please tell pass the cluster IDs with the AnnData object!')
        raise

    if (gene_symbol_key != None) and (gene_symbol_key not in anndata.var.columns.values):
        print('KeyError: The provided gene symbol key was not found in the passed AnnData object.')
        print('   Check that your cell type markers are given in a format that your anndata object knows!')
        raise

    if gene_symbol_key:
        gene_ids = anndata.var[gene_symbol_key]
    else:
        gene_ids = anndata.var_names

    clusters = anndata.obs[partition_key].cat.categories
    n_clust = len(clusters)
    marker_exp = pd.DataFrame(columns=clusters)
    marker_exp['cell_type'] = pd.Series({}, dtype='str')
    marker_names = []

    z_scores = sc.pp.scale(anndata, copy=True)

    i = 0
    for group in marker_dict:
        # Find the corresponding columns and get their mean expression in the cluster
        for gene in marker_dict[group]:
            ens_idx = np.in1d(gene_ids, gene) #Note there may be multiple mappings
            if np.sum(ens_idx) == 0:
                continue
            else:
                z_scores.obs[ens_idx[0]] = z_scores.X[:,ens_idx].mean(1) #works for both single and multiple mapping
                ens_idx = ens_idx[0]

            clust_marker_exp = z_scores.obs.groupby(partition_key)[ens_idx].apply(np.mean).tolist()
            clust_marker_exp.append(group)
            marker_exp.loc[i] = clust_marker_exp
            marker_names.append(gene)
            i+=1

    #Replace the rownames with informative gene symbols
    marker_exp.index = marker_names

    return(marker_exp)

You just need to add a python dictionary called 'marker_dict' where keys are strings and values are lists of strings. I think it's explained in the function itself as well though.

aditisk commented 5 years ago

@LuckyMD Have you added the function above to Scanpy yet ? If yes, what is it called ? Thanks, Aditi

LuckyMD commented 5 years ago

Hi, I haven't added it yet. But I have a new impetus to do so due to some reviewer's comments... so I will submit a pull request soon ;).

ekernf01 commented 5 years ago

When you add this, or if you have added it already, would you mention the new name on this thread? It's hard for me to know where to look in the docs. Thanks!

LuckyMD commented 5 years ago

It's still not in a PR. I actually forgot about it as well atm... sorry. For now just copy the code above, please.

ivirshup commented 5 years ago

As a general approach to this kind of problem, I write functions like this:


def grouped_obs_mean(adata, group_key, layer=None, gene_symbols=None):
    if layer is not None:
        getX = lambda x: x.layers[layer]
    else:
        getX = lambda x: x.X
    if gene_symbols is not None:
        new_idx = adata.var[idx]
    else:
        new_idx = adata.var_names

    grouped = adata.obs.groupby(group_key)
    out = pd.DataFrame(
        np.zeros((adata.shape[1], len(grouped)), dtype=np.float64),
        columns=list(grouped.groups.keys()),
        index=adata.var_names
    )

    for group, idx in grouped.indices.items():
        X = getX(adata[idx])
        out[group] = np.ravel(X.mean(axis=0, dtype=np.float64))
    return out

Swapping out the last 8 lines or so depending on what I'm calculating. To use a set of marker genes I'd call it as grouped_obs_mean(adata[:, marker_genes], ...).

At some point we might have groupby for AnnDatas, but that'll require figuring out how to be consistent about the returned type.

yeswzc commented 9 months ago

As a general approach to this kind of problem, I write functions like this:

def grouped_obs_mean(adata, group_key, layer=None, gene_symbols=None):
    if layer is not None:
        getX = lambda x: x.layers[layer]
    else:
        getX = lambda x: x.X
    if gene_symbols is not None:
        new_idx = adata.var[idx]
    else:
        new_idx = adata.var_names

    grouped = adata.obs.groupby(group_key)
    out = pd.DataFrame(
        np.zeros((adata.shape[1], len(grouped)), dtype=np.float64),
        columns=list(grouped.groups.keys()),
        index=adata.var_names
    )

    for group, idx in grouped.indices.items():
        X = getX(adata[idx])
        out[group] = np.ravel(X.mean(axis=0, dtype=np.float64))
    return out

Swapping out the last 8 lines or so depending on what I'm calculating. To use a set of marker genes I'd call it as grouped_obs_mean(adata[:, marker_genes], ...).

At some point we might have groupby for AnnDatas, but that'll require figuring out how to be consistent about the returned type.

Thanks. But need to make a tiny amendment to make it work now:

out[group] = np.ravel(X.mean(axis=0, dtype=np.float64)).tolist()