lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

`transform` after `fit_transform` on large dataset #65

Open bccho opened 6 years ago

bccho commented 6 years ago

Hi there,

I've been experimenting with UMAP as an alternative to t-SNE in some neuroscience applications, and I have been very happy with the combination of mathematical rigor and dedication to software engineering in this project. It's a combination very rarely seen. Keep up the fantastic work!

I'm getting a not-very-descriptive numba error from the transform() function on the 0.3dev branch after fitting the model on a large dataset. See the error message in this separate gist (too long to paste here).

I'm not experienced with numba, so I'm finding the traceback hard to interpret, but based on the MWE below, I think the error has something to do with the fact that pairwise distances are computed only for small datasets.

MWE:

import numpy as np
from umap import UMAP
data = np.random.randn(10000, 10)

# doesn't work
umap_model = UMAP(verbose=True)
embedded = umap_model.fit_transform(data[0:4096, :])
reembedded = umap_model.transform(data[-100:, :]) # throws error

# does work
umap_model = UMAP(verbose=True)
embedded = umap_model.fit_transform(data[0:4095, :])
reembedded = umap_model.transform(data[-100:, :])
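If the hypothesis above is right, the failure boundary falls exactly where UMAP switches neighbor-search strategies by dataset size. A minimal sketch of that size-based dispatch (the 4096 cutoff is inferred from the MWE boundary above, not taken from the umap source, and choose_nn_strategy is a hypothetical helper):

```python
# Hypothetical sketch: small datasets get brute-force pairwise distances,
# larger ones get approximate nearest-neighbor descent. The 4096 threshold
# is an assumption inferred from the MWE; the real umap code may differ.
SMALL_DATA_THRESHOLD = 4096

def choose_nn_strategy(n_samples, threshold=SMALL_DATA_THRESHOLD):
    """Pick a neighbor-search path based on dataset size."""
    return "pairwise" if n_samples < threshold else "nn_descent"
```

Under this reading, fitting 4095 points takes the pairwise path (where transform works), while 4096 points takes the approximate path (where it errors).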

Would you be able to take a look at this, please?

lmcinnes commented 6 years ago

Numba errors are still rather hairy, but I believe the numba team is working to make them more informative. There are a couple of possibilities here, but the short answer is that numba is failing to compile a parallel version of a particular function.

One possibility is that I am making use of some fairly new-ish numba features in 0.3dev, so your version of numba may not be quite new enough. If you have 0.36 or newer I think you should be okay, and the problem is something else. The other possibility is simply a bug on my part. The dev branch is (obviously) still under development and I haven't exercised the transform code recently, so some other subtle change may have broken something. If that's the case I will have to plead for patience -- these things will all get worked out soon (I am hoping to get a 0.3 release out shortly), but it may take a little while.
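Since 0.36 is the suggested floor, a quick way to sanity-check the installed version before digging further (the parsing helper here is just an illustration, not part of umap):

```python
# Illustrative helper: compare a dotted version string against a minimum.
def meets_minimum(version, minimum=(0, 36)):
    """Return True if the major.minor of `version` is at least `minimum`."""
    parts = tuple(int(p) for p in version.split(".")[:2])
    return parts >= minimum

# In practice you would pass the installed version:
#   import numba
#   meets_minimum(numba.__version__)
```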

If you really want to try something now, then at the very least you can simply remove parallel execution from the search. The relevant code is at line 151; you want to change it to something like:

def make_initialized_nnd_search(dist, dist_args):
    @numba.njit()  # parallel=True removed from the decorator
    def initialized_nnd_search(data,
                               indptr,
                               indices,
                               initialization,
                               query_points):

        for i in range(query_points.shape[0]):

            tried = set(initialization[0, i])

            while True:

                # Find smallest flagged vertex
                vertex = smallest_flagged(initialization, i)

                if vertex == -1:
                    break
                candidates = indices[indptr[vertex]:indptr[vertex + 1]]
                for j in range(candidates.shape[0]):
                    if candidates[j] == vertex or candidates[j] == -1 or \
                                    candidates[j] in tried:
                        continue
                    d = dist(data[candidates[j]], query_points[i], *dist_args)
                    unchecked_heap_push(initialization, i, d, candidates[j], 1)
                    tried.add(candidates[j])

        return initialization

    return initialized_nnd_search

bccho commented 6 years ago

I was using numba 0.36.2, and upgrading to 0.38 appears to have resolved the issue. You should probably update the minimum numba requirement. In any case, I eagerly await release 0.3!

lmcinnes commented 6 years ago

Thanks for the feedback on the version you were using (which I think should have worked) and for confirming that upgrading fixed it. I will likely bump the requirements for 0.3. It is getting close; I have some issues with spectral initialization that I really want to improve first (that will help greatly with supervised dimension reduction).