lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License
7.39k stars 803 forks source link

Example Using Custom and External KNN Graph #464

Open turnersr opened 4 years ago

turnersr commented 4 years ago

Hello,

I tried to find an example of using external KNN graphs with UMAP. I wasn't able to find any so I created one and hope people find this snippet of code helpful. It's largely based on the internal APIs and functions within UMAP itself, but it works well for our purposes. An example of how to use the code is at the end and here is the example data from the Iris dataset: approximate-knng.txt

Any feedback would be welcomed.

Take care!

from umap.umap_ import compute_membership_strengths, smooth_knn_dist, make_epochs_per_sample, simplicial_set_embedding, find_ab_params
import scipy

# Load graphs created by NGT
def load_knn_graph(ngt_graph):
    knn_indices = []
    knn_dist = []

    nodes_with_no_edges = set()
    with open(ngt_graph, "r") as f:
        while True:
            line = f.readline().strip("\n").strip()
            if not line:
                break
            token = line.split("\t")
            start_node = int(token[0])
            end_nodes = token[1:]

            if end_nodes == []:
                nodes_with_no_edges.add(start_node)
                continue

            knn_entry = []
            knn_entry_dist = []
            for ith in range(0, len(end_nodes), 2):
                try:
                    end_node = int(end_nodes[ith])
                except Exception as e:
                    nodes_with_no_edges.add(start_node)
                    continue
                knn_entry.append(end_node)
                weight = float(end_nodes[ith+1])
                knn_entry_dist.append(weight)
            knn_indices.append(knn_entry)
            knn_dist.append(knn_entry_dist)

    return np.array(knn_indices), np.array(knn_dist)

def knn_fuzzy_simplicial_set(knn_indices, knn_dists, local_connectivity, set_op_mix_ratio, apply_set_operations):

    n_samples = knn_indices.shape[0]
    n_neighbors = knn_indices.shape[1]

    knn_dists = knn_dist.astype(np.float32)

    sigmas, rhos = smooth_knn_dist(
        knn_dists, float(n_neighbors), local_connectivity=float(local_connectivity),
    )

    rows, cols, vals = compute_membership_strengths(
        knn_indices, knn_dists, sigmas, rhos
    )

    result = scipy.sparse.coo_matrix(
        (vals, (rows, cols)), shape=(n_samples, n_samples)
    )
    result.eliminate_zeros()

    if apply_set_operations:
        transpose = result.transpose()

        prod_matrix = result.multiply(transpose)

        result = (
            set_op_mix_ratio * (result + transpose - prod_matrix)
            + (1.0 - set_op_mix_ratio) * prod_matrix
        )

    result.eliminate_zeros()

    return result, sigmas, rhos

def transform(graph, metric="l2", n_components = 2, n_epochs = 500, spread=1.0, min_dist = 0.0, initial_alpha=1.0, negative_sample_rate=5, repulsion_strength=7):
    graph = graph.tocoo()
    graph.sum_duplicates()
    n_vertices = graph.shape[1]

    if n_epochs <= 0:
        # For smaller datasets we can use more epochs
        if graph.shape[0] <= 10000:
            n_epochs = 500
        else:
            n_epochs = 200

    graph.data[graph.data < (graph.data.max() / float(n_epochs))] = 0.0
    graph.eliminate_zeros()

    epochs_per_sample = make_epochs_per_sample(graph.data, n_epochs)

    head = graph.row
    tail = graph.col
    weight = graph.data

    a, b = find_ab_params(spread, min_dist)

    emebedding = simplicial_set_embedding(None, graph, n_components=2, initial_alpha=1.0, a=a, b=b, gamma=repulsion_strength, negative_sample_rate=negative_sample_rate, random_state=np.random, metric=metric, metric_kwds=None, verbose=True, parallel=True, n_epochs=n_epochs, init="random")

    return emebedding

An example usage is a follows:

import numpy as np

knn_indices, knn_dist = load_knn_graph("approximate-knng.txt")

knn_indices = knn_indices - 1

local_connectivity = 1 
apply_set_operations = True
set_op_mix_ratio=1.0       

graph, sigmas, rhos = knn_fuzzy_simplicial_set(knn_indices, knn_dist, local_connectivity, set_op_mix_ratio, apply_set_operations)

emebedding = transform(graph, metric="l2", n_components = 2, n_epochs = 500, spread=1.0, min_dist = 0.0, initial_alpha=1.0, negative_sample_rate=5, repulsion_strength=7)
lmcinnes commented 4 years ago

Thanks. I can potentially add this to the FAQ if you like.

turnersr commented 4 years ago

Oh, that would be wonderful. Thank you! Let me share how I made the graphs, too. I use a tool called NGT. It can export several different graph approximations. Two of the less well known ones are ANNG and CkNN.

https://arxiv.org/abs/1606.02353

https://arxiv.org/abs/1810.07355

A development branch of NGT located at https://github.com/masajiro/NGT/tree/devel . Let's pull it down and install it.

git clone https://github.com/masajiro/NGT.git NGT_DEV 
cd NGT_DEV/
git checkout devel
 /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew install cmake brew install gcc@9
export CXX=/usr/local/bin/g++-9 export CC=/usr/local/bin/gcc-9 
mkdir build 
cd build 
cmake .. 
make 
make install

Once you make index you can export using ngt export-graph -k 15 refined-anng > approximate-knng.graph

atarashansky commented 4 years ago

For the record, I think the implementation of UMAP in scanpy can do this (although initially intended for single-cell data, lol!). You can tell it to layout whatever graph you like into 2D, even those that were not calculated using UMAP's fuzzy simplicial set implementation (such as vanilla k-nearest neighbor graphs).