Open brianpenghe opened 4 years ago
There’s no way that many clusters will be discernible by color, you need to find another way to get the information you are interested in. If you just want a pretty picture without use, you can of course set palette
to a list that contains >100 colors.
Maybe I was not clear. The goal was not to use a different color for each cell type. The goal was similar to drawing a world map where adjacent clusters have different colors. It doesn't matter if colors are repeated as long as they don't visually merge adjacent clusters. Setting palette abolished the gray dominance and temporarily solved the problem for some local regions. But I guess to have perfectly separated coloring I need to manually pick which color is for which cluster.
Yes, I think there should be algorithms for that. This is a graph coloring problem: You can make a graph from your clustering, where clusters become nodes and touching clusters get an edge between them.
If you find a nice way to implement this, please consider adding it in a PR.
On more issue to consider: entities on maps tend to be contiguous. The set of cells in a cluster do not have to be adjacent. How can it be clear two non-adjacent cells are from the same cluster if colors can be repeated?
One idea: 1) Cluster the graph with leiden 2) Coarsen the graph (collapse cells in a single cluster into super nodes) 3) Assign each supernode a color -- adjacent supernodes in the coarsened graph cannot be the same color 4) Assign all cells in each cluster that cluster's color.
That way you'd probably get nice-looking patches of colors and wouldn't run into the issue @ivirshup mentioned.
I would be very interested in this functionality as well - even with 20 or so cell types the UMAPs are often unreadable.
Shouldn't the clusters be already predefined (e.g. cell types, etc. used for colouring)? Also, shouldn't one strive that neighbouring clusters are not only coloured with different colour, but also the colours to be as much apart in the spectrum as possible. E.g. orange and brown are easy to distinguish on legend, but when you have overlapping clusters they can not be distinguished anymore.
Also, I think there is something odd with the current default colour palette choice - the colours are more distinct with more clusters than less clusters, see below: 24 clusters: 33 clusters:
On more issue to consider: entities on maps tend to be contiguous. The set of cells in a cluster do not have to be adjacent. How can it be clear two non-adjacent cells are from the same cluster if colors can be repeated?
It won't be as a big problem for two different clusters to have the same colors because Scanpy already uses very similar or identical colors when cell type number is high. The primary goal for using different colors is to separate clusters that sit next to each other on a dimension-reduced 2D map. Hopefully the world map color problem can be solved by the Scanpy team.
One idea:
- Cluster the graph with leiden
- Coarsen the graph (collapse cells in a single cluster into super nodes)
- Assign each supernode a color -- adjacent supernodes in the coarsened graph cannot be the same color
- Assign all cells in each cluster that cluster's color.
That way you'd probably get nice-looking patches of colors and wouldn't run into the issue @ivirshup mentioned.
that sounds reasonable
Just as an updated thought on this, I don't think using the connectivity graph is the most straightforward way to approach this. We don't really care about closeness of points on the manifold, we care about closeness of the points on the plot.
I think you'd want to constrain color assignment by points on the plot. This has the side benefit of being more widely applicable, since it doesn't require the plot to be connected to some graph representation.
You could generate a connectivity graph from the UMAP projection using the euclidean distance metric and then do the graph coloring.
Here’s one potential solution. Not sure if it works as intended or not. The graph_coloring
function returns a numpy array of normalized RGB values for each cell in adata
. Sorry for the ugly list comprehensions at the end — I can clean it up if this is a solution that is worth pursuing.
1) generate a nearest neighbor graph from the UMAP coordinates using the euclidean distance metric 2) coarsen the graph using provided cluster assignments 3) do the graph coloring, resulting in N distinct colors 4) generate N visually distinct colors 5) map the colors to each cell
Sources: Choosing N visually distinct colors Graph coloring
# Choosing N distinct colors
from typing import Iterable, Tuple
import colorsys
import itertools
from fractions import Fraction
def zenos_dichotomy() -> Iterable[Fraction]:
"""
http://en.wikipedia.org/wiki/1/2_%2B_1/4_%2B_1/8_%2B_1/16_%2B_%C2%B7_%C2%B7_%C2%B7
"""
for k in itertools.count():
yield Fraction(1,2**k)
def fracs() -> Iterable[Fraction]:
"""
[Fraction(0, 1), Fraction(1, 2), Fraction(1, 4), Fraction(3, 4), Fraction(1, 8), Fraction(3, 8), Fraction(5, 8), Fraction(7, 8), Fraction(1, 16), Fraction(3, 16), ...]
[0.0, 0.5, 0.25, 0.75, 0.125, 0.375, 0.625, 0.875, 0.0625, 0.1875, ...]
"""
yield Fraction(0)
for k in zenos_dichotomy():
i = k.denominator # [1,2,4,8,16,...]
for j in range(1,i,2):
yield Fraction(j,i)
# can be used for the v in hsv to map linear values 0..1 to something that looks equidistant
# bias = lambda x: (math.sqrt(x/3)/Fraction(2,3)+Fraction(1,3))/Fraction(6,5)
HSVTuple = Tuple[Fraction, Fraction, Fraction]
RGBTuple = Tuple[float, float, float]
def hue_to_tones(h: Fraction) -> Iterable[HSVTuple]:
for s in [Fraction(6,10)]: # optionally use range
for v in [Fraction(8,10),Fraction(5,10)]: # could use range too
yield (h, s, v) # use bias for v here if you use range
def hsv_to_rgb(x: HSVTuple) -> RGBTuple:
return colorsys.hsv_to_rgb(*map(float, x))
flatten = itertools.chain.from_iterable
def hsvs() -> Iterable[HSVTuple]:
return flatten(map(hue_to_tones, fracs()))
def rgbs() -> Iterable[RGBTuple]:
return map(hsv_to_rgb, hsvs())
def rgb_to_css(x: RGBTuple) -> str:
uint8tuple = map(lambda y: int(y*255), x)
return "rgb({},{},{})".format(*uint8tuple)
def css_colors() -> Iterable[str]:
return map(rgb_to_css, rgbs())
# Graph coloring
def color_nodes(nnm):
color_map = {}
graph={}
for i in range(nnm.shape[0]):
graph[i] = list(nnm[i].nonzero()[0])
# Consider nodes in descending degree
for node in sorted(graph, key=lambda x: len(graph[x]), reverse=True):
neighbor_colors = set(color_map.get(neigh) for neigh in graph[node])
color_map[node] = next(
color for color in range(len(graph)) if color not in neighbor_colors
)
return color_map
def graph_coloring(adata, cluster_key):
import scanpy as sc
import numpy as np
import pandas as pd
thr = 0.0
sc.pp.neighbors(adata,metric='euclidean',use_rep='X_umap',key_added='umap_knn')
nnm = adata.obsp['umap_knn_connectivities']
cl = np.array(list(adata.obs[cluster_key]))
clu,iv = np.unique(cl,return_inverse=True)
A = pd.Series(data = np.arange(clu.size),index = clu)
x,y = nnm.nonzero()
pairs,counts = np.unique(np.vstack((cl[x],cl[y])).T,return_counts=True,axis=0)
nnm_coarse = np.zeros((clu.size,)*2)
nnm_coarse[A[pairs[:,0]].values,A[pairs[:,1]].values] = counts
nnm_coarse = nnm_coarse / nnm_coarse.sum(1)[:,None]
nnm_coarse[nnm_coarse<thr]=0
color_map = color_nodes(nnm_coarse)
N = np.unique(list(color_map.values())).size
colors = list(itertools.islice(css_colors(), N))
colors = np.vstack([[int(x)/255 for x in colors[i].split('(')[-1].split(')')[0].split(',')] for i in range(len(colors))])
d = dict(zip(list(np.arange(N)),colors))
colors = dict(zip(list(color_map.keys()),[d[x] for x in list(color_map.values())]))
colors = np.vstack([colors[i] for i in np.arange(nnm_coarse.shape[0])])[iv]
return colors
wow. I was thinking...maybe the PAGA graph can be used as the simplified graph for color assignment. But will try your solution here!
@atarashansky, that's great! I've built on this a bit, with some points about graph construction
To that end, here's some code for building the cluster connectivity graph from pixel space:
Results:
tf.Images(
tf.shade(pts, color_key=louvain_colors_current, name="Current scanpy coloring (38 unique colors out of 100)"),
tf.shade(pts, color_key=palette, name="Graph coloring (25 unique colors)"),
tf.shade(pts, color_key=louvain_colors_old, name="Old scanpy coloring (20 colors w/ repeats)"),
)
Additional plot using color generating code from above:
I think it's good, but there's room for improvement. I was initially worried about this example since the cluster connectivity graph is highly connected, but this seems to have worked out alright. Overplotting (points sitting on top of eachother) is still a problem, but I think that's a separate problem from choosing colors.
I've been thinking that it might be worth starting a package for dealing with common issues in plotting single cell data. Largely involving color assignment and overplotting. I think this should be, or at least start out as, separate from scanpy since there are a number of dependencies I think are useful here, which aren't required for scanpy. Plus being able to iterate quickly would be nice. I've started collecting some notebooks on this here.
I tried your codes. It nicely separated neighbor clusters. However, it seemed that some colors have been used multiple times while other colors were less favored. Is that true?
Yeah, looking at the code, the color used will just be the first item in the color list that hasn't been assigned to a neighbor yet. Maybe you could replace color selection to pick from a queue prioritizing less used colors? This would be less likely to converge to a minimal number of colors though. Might need to consider a different algorithm for this property.
Do you think you could share some part of the object you're plotting? I think it would make for a good use case to play around with plotting.
OK I'll prepare an object.
Here's a toy example. The embedding and category are both there. Thanks colortest.h5ad.zip
I was trying to plot a tsne to visualize all my clusters. But since I have 100+ clusters the pl.umap function gave me all grey.
Is there a way to assign colors to them?