vnmabus / dcor

Distance correlation and related E-statistics in Python
https://dcor.readthedocs.io
MIT License

Numba support #2

asemic-horizon closed this issue 6 years ago

asemic-horizon commented 6 years ago

I'm trying to use distance correlations as a metric for computing UMAP embeddings. This requires Numba support.

Is there a fundamental reason why dcor.distance_correlation can't support Numba, or is it just a matter of going over the code?

vnmabus commented 6 years ago

Can you please clarify what you mean by "can't support Numba"? This package already uses Numba to accelerate its internal computations whenever the "fast distance covariance algorithm" can be used instead of the original O(N^2) algorithm.

asemic-horizon commented 6 years ago

Hi. First let me apologize for the tone of that question. Besides being vague, it comes off as really rude. I don't know how that came from me. I wanted to know if there was something marginal that I could fix and make everything work.

Second, I can't reproduce the problem for arbitrary data. The problem doesn't happen with iris (which is small, 150 rows x 4 cols), but it does repeat with fashion-mnist (which is larger, on the order of tens of thousands of rows and (28*28) columns -- but not huge either.)

I came to believe that this was a dcor problem because I had used other custom metrics successfully, but, as it turns out, not on datasets as large as fashion-mnist.

I'm including some details on reproduction attempts but it's not clearly a dcor problem; it's probably the approximated-nearest-neighbors algorithm UMAP uses.


First, since UMAP is built on nearest neighbors (although it doesn't use the stock/exact algorithm), I tried the following (code 1), which works:

from numba import jit
from sklearn.datasets import load_iris
from sklearn.neighbors import kneighbors_graph

from dcor import distance_correlation

iris = load_iris()

# Turn the correlation into a dissimilarity: high correlation, small distance.
@jit
def distcor(x, y):
    return 1 - distance_correlation(x, y)

# Exact 2-nearest-neighbor graph using the custom metric.
g = kneighbors_graph(iris.data, 2, mode='distance', metric='pyfunc',
                     metric_params={'func': distcor})

Attempting to run this for fashion-mnist takes more than 20 minutes (the exact algorithm is expected to blow up on large datasets anyway); I gave up before any errors came up.

The following (code 2) runs UMAP itself. And works.

from umap import UMAP
embedding = UMAP(metric = distcor, n_neighbors = 4).fit_transform(iris.data)

Code 2 for fashion-mnist fails very loudly.

TypingError: Failed at nopython (nopython frontend)
Invalid usage of type(CPUDispatcher(<function distcor at 0x000001FC83B2B378>)) with parameters (array(float32, 1d, C), array(float32, 1d, C))
 * parameterized
[1] During: resolving callee type: type(CPUDispatcher(<function distcor at 0x000001FC83B2B378>))
[2] During: typing of call at C:\Users\Diego Navarro - FGV\Anaconda3b\lib\site-packages\umap\nndescent.py (65)

File "..\..\Anaconda3b\lib\site-packages\umap\nndescent.py", line 65:
    def nn_descent(
        <source elided>
            for j in range(indices.shape[0]):
                d = dist(data[i], data[indices[j]], *dist_args)
                ^

This is not usually a problem with Numba itself but instead often caused by
the use of unsupported features or an issue in resolving types.

To see Python/NumPy features supported by the latest release of Numba visit:
http://numba.pydata.org/numba-doc/dev/reference/pysupported.html
and
http://numba.pydata.org/numba-doc/dev/reference/numpysupported.html

For more information about typing errors and how to debug them visit:
http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile

If you think your code should work with Numba, please report the error message
and traceback, along with a minimal reproducer at:
https://github.com/numba/numba/issues/new
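The traceback above is the kind of error Numba raises when nopython-compiled code (here UMAP's nn_descent) calls a function it cannot type. A plain @jit wrapper around a Python-level library call is a likely candidate for that. One possible workaround, shown only as a minimal sketch, is to write the metric entirely in operations Numba can compile. The helper names _double_centered and distcor_nopython are hypothetical, the code uses the naive O(n^2) sample estimator rather than dcor's fast algorithm, and the ImportError fallback is only there so the sketch also runs without Numba installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback (assumption for illustration): run as plain Python if Numba is absent.
    def njit(func):
        return func


@njit
def _double_centered(v):
    """Pairwise |v[i] - v[j]| matrix with row, column and grand means removed."""
    n = v.shape[0]
    d = np.empty((n, n), dtype=np.float64)
    for i in range(n):
        for j in range(n):
            d[i, j] = abs(v[i] - v[j])
    # The matrix is symmetric, so row means equal column means.
    row_mean = np.empty(n, dtype=np.float64)
    for i in range(n):
        row_mean[i] = d[i, :].mean()
    grand_mean = row_mean.mean()
    for i in range(n):
        for j in range(n):
            d[i, j] = d[i, j] - row_mean[i] - row_mean[j] + grand_mean
    return d


@njit
def distcor_nopython(x, y):
    """1 - distance correlation of two 1-D vectors of equal length."""
    a = _double_centered(x)
    b = _double_centered(y)
    dcov2 = (a * b).mean()
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0.0:
        # A constant vector has zero distance variance; treat as uncorrelated.
        return 1.0
    return 1.0 - np.sqrt(max(dcov2, 0.0) / denom)


x = np.arange(10.0)
print(distcor_nopython(x, 2.0 * x + 1.0))  # close to 0 for an exact linear relation
```

Such a function could then be passed as the metric in place of distcor. This has not been checked against fashion-mnist, and each call costs O(d^2) time and memory for d columns (d = 784 there), so it may still be too slow in practice.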

So third, I made some experiments with the pynndescent library, which appears to be a close cousin to the nndescent used inside the UMAP library. Surprisingly this doesn't work even for iris.

from pynndescent import NNDescent
index = NNDescent(iris.data, metric=distcor)
u, _ = index.query(iris.data, k=1)

Since UMAP and pynndescent share maintainers, I should probably take that up with them.

vnmabus commented 6 years ago

Ok, thank you for the clarification. I did not see your question as rude at all, but I am not a native speaker. When I read your question I thought that you were asking for GPU or compiled versions of the distance covariance/correlation functions via Numba. I think that would be useful for speeding up computations, but I do not have time to implement it right now. However, if I have understood your answer correctly, your problem lies elsewhere. I hope you can find and fix it easily.