theislab / scvelo

RNA Velocity generalized through dynamical modeling
https://scvelo.org
BSD 3-Clause "New" or "Revised" License

`color_map` not considered for categorical data #720

Open WeilerP opened 2 years ago

WeilerP commented 2 years ago

When plotting data colored by categorical/string values (e.g. with scv.pl.scatter, scv.pl.velocity_embedding_stream, etc.), the color_map argument is ignored. If the colors have not yet been defined in .uns, the tab10 color map is always used.

import scvelo as scv

adata = scv.datasets.simulation(alpha=5, beta=0.5, gamma=0.2, n_obs=100)
# Assign five string labels in consecutive blocks of 20 cells
# (positional, so it does not depend on the type of the obs index).
adata.obs['cell_type'] = [f'type {i // 20}' for i in range(adata.n_obs)]

scv.pl.scatter(adata, basis='0', color='cell_type', color_map='Set3')

Versions

```pytb
scvelo==0.2.4.dev48+gba75c4a.d20210719  scanpy==1.7.2  anndata==0.7.6  loompy==3.0.6  numpy==1.20.3  scipy==1.6.3  matplotlib==3.4.2  sklearn==0.24.2  pandas==1.2.4
```

WeilerP commented 2 years ago

As a workaround, the data can be stored using numerical values. In this case, specifying color_map works.
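Another possible workaround follows from the note in the issue description that the default is only used when no colors are defined in .uns: the colors can be taken from a color map and stored there before plotting. The sketch below is an illustration under assumptions, not verified against every scVelo plotting function. It assumes the scanpy-style convention that the colors for an .obs column `cell_type` live in `adata.uns['cell_type_colors']` as a list of color strings, one per category in category order. The helper name `colors_from_cmap` is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only needed for headless runs
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex


def colors_from_cmap(n_categories, cmap_name="Set3"):
    """Return one hex color per category, taken from a matplotlib color map.

    Intended for qualitative maps such as Set1, Set2 or Set3; for continuous
    maps the first few entries would be almost indistinguishable.
    """
    cmap = plt.get_cmap(cmap_name)
    # Integer inputs index directly into the color map's lookup table;
    # the modulo wraps around if there are more categories than colors.
    return [to_hex(cmap(i % cmap.N)) for i in range(n_categories)]


cell_type_colors = colors_from_cmap(5, "Set3")

# With an AnnData object `adata` at hand, the assumed usage would be:
# adata.obs["cell_type"] = adata.obs["cell_type"].astype("category")
# adata.uns["cell_type_colors"] = cell_type_colors
# scv.pl.scatter(adata, basis="0", color="cell_type")
```

Unlike the numerical-ID route, this keeps the category names, so the legend would still show the original labels.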

wes-lewis commented 2 years ago

> As a workaround, the data can be stored using numerical values. In this case, specifying color_map works.

I'm not sure I understand how this would be the desired behavior. I understand that using numbers 1:n instead of cluster names seems to allow color_map to work as desired, but then adding the legend (either on top of or beside the data) would likewise output the numbers 1:n instead of the cluster labels. Wouldn't the desired behavior be for color maps to still work with a categorical variable, so that qualitative maps like Set1, Set2 and Set3 can be used for categorical variables?

Also, could you provide code as to how one would define the colors manually, as you've mentioned in #719?

WeilerP commented 2 years ago

@wes-lewis, yes it should also work with categorical/string data. Since it does not at the moment, though, you could use the proposed workaround. I think there is a matplotlib function to update the legend of a plot but you'd have to investigate this yourself. Regarding how to define the numerical IDs: You can simply do

adata.obs['cell_type_num_id'] = adata.obs['cell_type'].replace({old_value_0: new_value_0, ...})
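For illustration, the numerical IDs and the legend relabeling could look roughly as follows. This is a plain pandas/matplotlib sketch on toy data, not scVelo's API: the integer IDs are pandas' categorical codes, and the legend entries are rebuilt from the scatter artist via `legend_elements()`. The toy coordinates and variable names are made up, and whether the axes returned by scVelo's plotting functions can be relabeled the same way has not been tested here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only needed for headless runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for adata.obs['cell_type']: five labels, 20 cells each
cell_type = pd.Series([f"type {i // 20}" for i in range(100)], dtype="category")

# One integer ID per cell; id_to_label remembers which label each ID stands for
cell_type_num_id = cell_type.cat.codes.astype(int)
id_to_label = dict(enumerate(cell_type.cat.categories))

# Made-up 2D coordinates in place of an embedding
rng = np.random.default_rng(0)
xy = rng.normal(size=(100, 2))

fig, ax = plt.subplots()
points = ax.scatter(xy[:, 0], xy[:, 1], c=cell_type_num_id.to_numpy(), cmap="Set3")

# legend_elements() yields one handle per distinct ID in ascending order;
# the numeric labels it proposes are replaced by the category names.
handles, _ = points.legend_elements()
labels = [id_to_label[i] for i in sorted(id_to_label)]
legend = ax.legend(handles, labels, title="cell_type")
```

Keeping `id_to_label` around is the design choice that matters here: once the column is numeric, it is the only record of which number stands for which cluster.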

therealgenna commented 2 years ago

It's been a while since I did that, so it may be irrelevant. But here's what I did:

# https://seaborn.pydata.org/tutorial/color_palettes.html
import pandas as pd
import scvelo as scv
import seaborn as sns

# `obj` (an AnnData object) and `name` (a title string) are assumed to be defined already.
my_palette = 'Spectral'

# Create a custom ordered category type for stage_group, for better coloring
# (categories should be the real names present in stage_group):
stage_group_dtype = pd.CategoricalDtype(categories=['A', 'B', 'C'], ordered=True)

obj.obs['stage_group'] = obj.obs['stage_group'].astype(stage_group_dtype)

scv.pl.velocity_embedding_stream(
    obj, basis='umap', title=name + ': stage_group',
    legend_loc='right margin', color='stage_group',
    palette=sns.color_palette(palette=my_palette, n_colors=len(obj.obs['stage_group'].unique())),
)
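The remark that the categories should be the real names present in stage_group matters because of how pandas performs the conversion: any value not listed in the CategoricalDtype becomes NaN, and the declared order (rather than alphabetical order) becomes the category order. A small pandas-only sketch with made-up values shows both effects:

```python
import pandas as pd

# Made-up labels; "D" is deliberately not declared in the dtype below
stage_group = pd.Series(["B", "A", "C", "D", "A"])

stage_group_dtype = pd.CategoricalDtype(categories=["C", "B", "A"], ordered=True)
converted = stage_group.astype(stage_group_dtype)

# Category order follows the declaration, not alphabetical order
category_order = list(converted.cat.categories)

# Values missing from the declared categories are silently turned into NaN
n_missing = int(converted.isna().sum())

# Checking before converting avoids losing cells unnoticed
undeclared = sorted(set(stage_group) - set(stage_group_dtype.categories))
```

Running a check like `undeclared` before the `astype` call is a cheap way to catch misspelled or forgotten category names.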