Use graph coloring algorithm for a “world map” avoiding to color neighboring clusters in the same color

brianpenghe commented 4 years ago

I was trying to plot a tsne to visualize all my clusters. But since I have 100+ clusters the pl.umap function gave me all grey.

Is there a way to assign colors to them?

flying-sheep commented 4 years ago

There’s no way that many clusters will be discernible by color, you need to find another way to get the information you are interested in. If you just want a pretty picture without use, you can of course set palette to a list that contains >100 colors.

brianpenghe commented 4 years ago

Maybe I was not clear. The goal was not to use a different color for each cell type. The goal was similar to drawing a world map where adjacent clusters have different colors. It doesn't matter if colors are repeated as long as they don't visually merge adjacent clusters. Setting palette abolished the gray dominance and temporarily solved the problem for some local regions. But I guess to have perfectly separated coloring I need to manually pick which color is for which cluster.

flying-sheep commented 4 years ago

Yes, I think there should be algorithms for that. This is a graph coloring problem: You can make a graph from your clustering, where clusters become nodes and touching clusters get an edge between them.

If you find a nice way to implement this, please consider adding it in a PR.

ivirshup commented 4 years ago

On more issue to consider: entities on maps tend to be contiguous. The set of cells in a cluster do not have to be adjacent. How can it be clear two non-adjacent cells are from the same cluster if colors can be repeated?

atarashansky commented 4 years ago

One idea: 1) Cluster the graph with leiden 2) Coarsen the graph (collapse cells in a single cluster into super nodes) 3) Assign each supernode a color -- adjacent supernodes in the coarsened graph cannot be the same color 4) Assign all cells in each cluster that cluster's color.

That way you'd probably get nice-looking patches of colors and wouldn't run into the issue @ivirshup mentioned.

Hrovatin commented 4 years ago

I would be very interested in this functionality as well - even with 20 or so cell types the UMAPs are often unreadable.

Shouldn't the clusters be already predefined (e.g. cell types, etc. used for colouring)? Also, shouldn't one strive that neighbouring clusters are not only coloured with different colour, but also the colours to be as much apart in the spectrum as possible. E.g. orange and brown are easy to distinguish on legend, but when you have overlapping clusters they can not be distinguished anymore.

Also, I think there is something odd with the current default colour palette choice - the colours are more distinct with more clusters than less clusters, see below: 24 clusters: 33 clusters:

brianpenghe commented 4 years ago

On more issue to consider: entities on maps tend to be contiguous. The set of cells in a cluster do not have to be adjacent. How can it be clear two non-adjacent cells are from the same cluster if colors can be repeated?

It won't be as a big problem for two different clusters to have the same colors because Scanpy already uses very similar or identical colors when cell type number is high. The primary goal for using different colors is to separate clusters that sit next to each other on a dimension-reduced 2D map. Hopefully the world map color problem can be solved by the Scanpy team.

brianpenghe commented 4 years ago

One idea:

Cluster the graph with leiden

Coarsen the graph (collapse cells in a single cluster into super nodes)

Assign each supernode a color -- adjacent supernodes in the coarsened graph cannot be the same color

Assign all cells in each cluster that cluster's color.

That way you'd probably get nice-looking patches of colors and wouldn't run into the issue @ivirshup mentioned.

that sounds reasonable

ivirshup commented 3 years ago

Just as an updated thought on this, I don't think using the connectivity graph is the most straightforward way to approach this. We don't really care about closeness of points on the manifold, we care about closeness of the points on the plot.

I think you'd want to constrain color assignment by points on the plot. This has the side benefit of being more widely applicable, since it doesn't require the plot to be connected to some graph representation.

atarashansky commented 3 years ago

You could generate a connectivity graph from the UMAP projection using the euclidean distance metric and then do the graph coloring.

atarashansky commented 3 years ago

Here’s one potential solution. Not sure if it works as intended or not. The graph_coloring function returns a numpy array of normalized RGB values for each cell in adata. Sorry for the ugly list comprehensions at the end — I can clean it up if this is a solution that is worth pursuing.

1) generate a nearest neighbor graph from the UMAP coordinates using the euclidean distance metric 2) coarsen the graph using provided cluster assignments 3) do the graph coloring, resulting in N distinct colors 4) generate N visually distinct colors 5) map the colors to each cell

Sources: Choosing N visually distinct colors Graph coloring


# Choosing N distinct colors
from typing import Iterable, Tuple
import colorsys
import itertools
from fractions import Fraction

def zenos_dichotomy() -> Iterable[Fraction]:
    """
    http://en.wikipedia.org/wiki/1/2_%2B_1/4_%2B_1/8_%2B_1/16_%2B_%C2%B7_%C2%B7_%C2%B7
    """
    for k in itertools.count():
        yield Fraction(1,2**k)

def fracs() -> Iterable[Fraction]:
    """
    [Fraction(0, 1), Fraction(1, 2), Fraction(1, 4), Fraction(3, 4), Fraction(1, 8), Fraction(3, 8), Fraction(5, 8), Fraction(7, 8), Fraction(1, 16), Fraction(3, 16), ...]
    [0.0, 0.5, 0.25, 0.75, 0.125, 0.375, 0.625, 0.875, 0.0625, 0.1875, ...]
    """
    yield Fraction(0)
    for k in zenos_dichotomy():
        i = k.denominator # [1,2,4,8,16,...]
        for j in range(1,i,2):
            yield Fraction(j,i)

# can be used for the v in hsv to map linear values 0..1 to something that looks equidistant
# bias = lambda x: (math.sqrt(x/3)/Fraction(2,3)+Fraction(1,3))/Fraction(6,5)

HSVTuple = Tuple[Fraction, Fraction, Fraction]
RGBTuple = Tuple[float, float, float]

def hue_to_tones(h: Fraction) -> Iterable[HSVTuple]:
    for s in [Fraction(6,10)]: # optionally use range
        for v in [Fraction(8,10),Fraction(5,10)]: # could use range too
            yield (h, s, v) # use bias for v here if you use range

def hsv_to_rgb(x: HSVTuple) -> RGBTuple:
    return colorsys.hsv_to_rgb(*map(float, x))

flatten = itertools.chain.from_iterable

def hsvs() -> Iterable[HSVTuple]:
    return flatten(map(hue_to_tones, fracs()))

def rgbs() -> Iterable[RGBTuple]:
    return map(hsv_to_rgb, hsvs())

def rgb_to_css(x: RGBTuple) -> str:
    uint8tuple = map(lambda y: int(y*255), x)
    return "rgb({},{},{})".format(*uint8tuple)

def css_colors() -> Iterable[str]:
    return map(rgb_to_css, rgbs())

# Graph coloring
def color_nodes(nnm):
    color_map = {}
    graph={}
    for i in range(nnm.shape[0]):
        graph[i] = list(nnm[i].nonzero()[0])

    # Consider nodes in descending degree 
    for node in sorted(graph, key=lambda x: len(graph[x]), reverse=True):
        neighbor_colors = set(color_map.get(neigh) for neigh in graph[node])
        color_map[node] = next( 
            color for color in range(len(graph)) if color not in neighbor_colors
        )
    return color_map

def graph_coloring(adata, cluster_key):
    import scanpy as sc
    import numpy as np
    import pandas as pd

    thr = 0.0

    sc.pp.neighbors(adata,metric='euclidean',use_rep='X_umap',key_added='umap_knn')
    nnm = adata.obsp['umap_knn_connectivities']

    cl = np.array(list(adata.obs[cluster_key]))

    clu,iv = np.unique(cl,return_inverse=True)
    A = pd.Series(data = np.arange(clu.size),index = clu)

    x,y = nnm.nonzero()
    pairs,counts = np.unique(np.vstack((cl[x],cl[y])).T,return_counts=True,axis=0)

    nnm_coarse = np.zeros((clu.size,)*2)

    nnm_coarse[A[pairs[:,0]].values,A[pairs[:,1]].values] = counts
    nnm_coarse = nnm_coarse / nnm_coarse.sum(1)[:,None]
    nnm_coarse[nnm_coarse<thr]=0

    color_map = color_nodes(nnm_coarse)
    N = np.unique(list(color_map.values())).size
    colors = list(itertools.islice(css_colors(), N))
    colors = np.vstack([[int(x)/255 for x in colors[i].split('(')[-1].split(')')[0].split(',')] for i in range(len(colors))])
    d = dict(zip(list(np.arange(N)),colors))
    colors = dict(zip(list(color_map.keys()),[d[x] for x in list(color_map.values())]))
    colors = np.vstack([colors[i] for i in np.arange(nnm_coarse.shape[0])])[iv]
    return colors

brianpenghe commented 3 years ago

wow. I was thinking...maybe the PAGA graph can be used as the simplified graph for color assignment. But will try your solution here!

ivirshup commented 3 years ago

@atarashansky, that's great! I've built on this a bit, with some points about graph construction

The problems isn't really nearest K neighbors, it's radius neighbors
Embedding position is relevant, but I think position in pixel space is more important in this case. You could imagine one dimension of your embedding plot having a range of (0, 100) while the other has (0, 1), but displayed on a square plot. Euclidean distance for neighboring points would not be particularly effective here.

To that end, here's some code for building the cluster connectivity graph from pixel space:

Code:

```python # imports import datashader as ds from datashader import transfer_functions as tf import matplotlib.pyplot as plt from natsort import natsorted import scanpy as sc import numpy as np import pandas as pd import xarray as xr from sklearn.neighbors import RadiusNeighborsTransformer # example data # 1.3 million cells, 38 clusters by louvain adata = sc.read("/Users/isaac/data/10x_mouse_13MM_processed.h5ad", backed="r") # Previous colors louvain_colors_old = dict( zip( adata.obs["louvain"].cat.categories, adata.uns["louvain_colors"] ) ) del adata.uns["louvain_colors"] louvain_colors_current = sc.pl._tools.scatterplots._get_palette(adata, values_key="louvain") df = sc.get.obs_df( adata, ["louvain"], obsm_keys=[("X_umap", 0), ("X_umap", 1)] ) # pixel x pixel x n_clusters array (500, 500, 38) # Each value is the count of cells in the cluster at this pixel pts = ( ds.Canvas(500, 500) .points(df, "X_umap-0", "X_umap-1", agg=ds.count_cat("louvain")) ) def pts_to_coords(pts: xr.DataArray) -> pd.DataFrame: """ Turning pixel data into "long" sparse format. One entry for each cluster, for each pixel it would occur in. """ coords = pd.DataFrame(np.argwhere(np.asarray(pts)), columns=["x", "y", "cat"]) coords["cat"] = pd.Categorical.from_codes(coords["cat"], categories=list(pts.coords.values())[2]) return coords def color_nodes(graph: dict) -> dict: """Graph coloring algorithm From @atarashansky: https://github.com/theislab/scanpy/issues/1366#issuecomment-763066249 """ color_map = {} for node in sorted(graph, key=lambda x: len(graph[x]), reverse=True): neighbor_colors = set(color_map.get(neigh) for neigh in graph[node]) color_map[node] = next( color for color in range(len(graph)) if color not in neighbor_colors ) return color_map def neighboring_clusters(pts: xr.DataArray) -> dict: """From array of (pixel, pixel, cluster) find which clusters neighbor eachother. """ graph = {} coords = pts_to_coords(pts) cat = coords["cat"] radius_neighbor = RadiusNeighborsTransformer(metric="euclidean", radius=1) radius_neighbor.fit(coords.values[:, :2]) g = radius_neighbor.radius_neighbors_graph() for k, v in coords.groupby("cat").indices.items(): neighbors = np.unique(g[v].indices) graph[k] = natsorted(pd.unique(cat[neighbors])) return graph graph = neighboring_clusters(pts) palette = {k: sc.pl.palettes.default_28[v] for k, v in color_nodes(graph).items()} ```

Results:

tf.Images(
    tf.shade(pts, color_key=louvain_colors_current, name="Current scanpy coloring (38 unique colors out of 100)"),
    tf.shade(pts, color_key=palette, name="Graph coloring (25 unique colors)"),
    tf.shade(pts, color_key=louvain_colors_old, name="Old scanpy coloring (20 colors w/ repeats)"),
)

Additional plot using color generating code from above:

code

```python from matplotlib.colors import to_hex colors = list(itertools.islice(rgbs(), len(set(color_map.values())))) tarashansky_palette = {k: to_hex(colors[v]) for k, v in color_map.items()} tf.shade(pts, color_key=tarashansky_palette) ```

I think it's good, but there's room for improvement. I was initially worried about this example since the cluster connectivity graph is highly connected, but this seems to have worked out alright. Overplotting (points sitting on top of eachother) is still a problem, but I think that's a separate problem from choosing colors.

Even if they do not touch, can we make it so clusters close to each-other are less likely to get similar colors? This would start becoming more of an optimization problem, and more complicated.
Picking a color palette where all colors are very visible is important. (maybe adding a light border when there's nothing in the background? maybe something similar to "player model contrast boost" shaders?)
I think there are some "spurious" connections in the graph. Many clusters have >20 neighbors. I think this has to do with outlier points and dispersed points. Not completely sure how to deal with this. Maybe less of an issue with smaller datasets/ leiden clustering?
When categories are disconnected, how do we indicate they're the same category if unique color is no longer an option? Do we require disconnected categories be uniquely assigned a color? Is this a case for interactivity?

I've been thinking that it might be worth starting a package for dealing with common issues in plotting single cell data. Largely involving color assignment and overplotting. I think this should be, or at least start out as, separate from scanpy since there are a number of dependencies I think are useful here, which aren't required for scanpy. Plus being able to iterate quickly would be nice. I've started collecting some notebooks on this here.

brianpenghe commented 3 years ago

I tried your codes. It nicely separated neighbor clusters. However, it seemed that some colors have been used multiple times while other colors were less favored. Is that true?

ivirshup commented 3 years ago

Yeah, looking at the code, the color used will just be the first item in the color list that hasn't been assigned to a neighbor yet. Maybe you could replace color selection to pick from a queue prioritizing less used colors? This would be less likely to converge to a minimal number of colors though. Might need to consider a different algorithm for this property.

Example

``` A / C - B ``` Lets say we have two colors ["red", "blue"] and we want to assign them. We iterate in order [A, B, C] * If we don't use the priority queue, we'd end up with `{"A": "red", "B": "red", "C": "blue"}` * If we do, we'd fail to converge after ending up at {"A": "red", "B": "blue"}`

Do you think you could share some part of the object you're plotting? I think it would make for a good use case to play around with plotting.

brianpenghe commented 3 years ago

OK I'll prepare an object.

brianpenghe commented 3 years ago

Here's a toy example. The embedding and category are both there. Thanks colortest.h5ad.zip

scverse / scanpy

Use graph coloring algorithm for a “world map” avoiding to color neighboring clusters in the same color #1366