Closed asemic-horizon closed 6 years ago
Can you please clarify what you mean by "can't support Numba"? This package uses Numba to accelerate its internal computations when the "fast distance covariance algorithm" can be used instead of the original O(N^2) algorithm.
Hi. First let me apologize for the tone of that question. Besides being vague, it comes off as really rude. I don't know how that came from me. I wanted to know if there was something marginal that I could fix and make everything work.
Second, I can't reproduce the problem for arbitrary data. The problem doesn't happen with iris (which is small, 150 rows x 4 cols), but it does repeat with fashion-mnist (which is larger, on the order of tens of thousands of rows and (28*28) columns -- but not huge either.)
I came to believe that this was a dcor problem because I had used other custom metrics successfully, but, as it turns out, not on datasets as large as fashion-mnist.
I'm including some details on my reproduction attempts, but it's not clearly a dcor problem; it's probably the approximate nearest-neighbors algorithm that UMAP uses.
First, since UMAP uses the idea of nearest neighbors (although it doesn't use the stock/exact algorithm), we try the following (code 1), which works:
from sklearn.datasets import load_iris
iris = load_iris()
from sklearn.neighbors import kneighbors_graph
from numba import jit
from dcor import distance_correlation
@jit
def distcor(x, y):
    return 1 - distance_correlation(x, y)
g = kneighbors_graph(iris.data, 2, mode='distance', metric='pyfunc',
                     metric_params={'func': distcor})
Attempting to run this for fashion-mnist takes more than 20 minutes (the exact algorithm is expected to explode with large datasets anyway); I gave up before any errors came up.
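A rough back-of-the-envelope estimate shows why the exact approach scales so badly. The sketch below rests on assumptions that are not in the thread: 60,000 rows (the Fashion-MNIST training split) and 784 columns, a brute-force search that evaluates the metric once per pair of rows, and a naive distance correlation that builds two d x d distance matrices per evaluation. The neighbor search that scikit-learn actually selects may evaluate fewer pairs, and `brute_force_cost` is a hypothetical helper.

```python
def brute_force_cost(n_rows, n_cols):
    """Estimate work for an all-pairs search with a naive O(d^2) metric.

    Assumption: one metric call per unordered pair of rows, and each call
    builds two (n_cols x n_cols) pairwise-distance matrices.
    """
    pairs = n_rows * (n_rows - 1) // 2
    ops_per_call = 2 * n_cols * n_cols
    return pairs, pairs * ops_per_call


iris_pairs, iris_ops = brute_force_cost(150, 4)
fm_pairs, fm_ops = brute_force_cost(60_000, 784)  # assumed dataset size

print(f"iris:          {iris_pairs:,} metric calls, {iris_ops:,} distance entries")
print(f"fashion-mnist: {fm_pairs:,} metric calls, {fm_ops:,} distance entries")
```

Under these assumptions the larger dataset needs billions of metric calls, each of them far more expensive than on iris, which is consistent with the run not finishing in 20 minutes.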
The following (code 2) runs UMAP itself, and it works:
from umap import UMAP
embedding = UMAP(metric = distcor, n_neighbors = 4).fit_transform(iris.data)
Code 2 for fashion-mnist fails very loudly.
TypingError: Failed at nopython (nopython frontend)
Invalid usage of type(CPUDispatcher(<function distcor at 0x000001FC83B2B378>)) with parameters (array(float32, 1d, C), array(float32, 1d, C))
* parameterized
[1] During: resolving callee type: type(CPUDispatcher(<function distcor at 0x000001FC83B2B378>))
[2] During: typing of call at C:\Users\Diego Navarro - FGV\Anaconda3b\lib\site-packages\umap\nndescent.py (65)
File "..\..\Anaconda3b\lib\site-packages\umap\nndescent.py", line 65:
def nn_descent(
<source elided>
for j in range(indices.shape[0]):
d = dist(data[i], data[indices[j]], *dist_args)
^
This is not usually a problem with Numba itself but instead often caused by
the use of unsupported features or an issue in resolving types.
To see Python/NumPy features supported by the latest release of Numba visit:
http://numba.pydata.org/numba-doc/dev/reference/pysupported.html
and
http://numba.pydata.org/numba-doc/dev/reference/numpysupported.html
For more information about typing errors and how to debug them visit:
http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile
If you think your code should work with Numba, please report the error message
and traceback, along with a minimal reproducer at:
https://github.com/numba/numba/issues/new
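The traceback appears to say that nn_descent, which is compiled in nopython mode, could not type the call to distcor for 1-D float32 arrays: wrapping dcor's Python-level function in @jit does not make it callable from nopython code. One possible workaround is a metric written only with loops and NumPy operations that nopython mode can handle. The sketch below is an assumption, not dcor's implementation: `_double_center` and `distcor_metric` are hypothetical names, it uses the naive O(d^2) definition of the (biased) sample distance correlation on 1-D vectors, and it falls back to plain Python when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback so the sketch still runs without Numba
    def njit(func):
        return func


@njit
def _double_center(v):
    """Pairwise |v_i - v_j| matrix with row, column and grand means removed."""
    n = v.shape[0]
    d = np.empty((n, n), dtype=np.float64)
    for i in range(n):
        for j in range(n):
            d[i, j] = abs(v[i] - v[j])
    # The matrix is symmetric, so row means equal column means.
    row_mean = np.zeros(n)
    for i in range(n):
        for j in range(n):
            row_mean[i] += d[i, j]
        row_mean[i] /= n
    grand_mean = row_mean.sum() / n
    for i in range(n):
        for j in range(n):
            d[i, j] = d[i, j] - row_mean[i] - row_mean[j] + grand_mean
    return d


@njit
def distcor_metric(x, y):
    """Return 1 - distance correlation of two equal-length 1-D vectors."""
    a = _double_center(x)
    b = _double_center(y)
    dcov2 = (a * b).mean()
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0.0:
        return 1.0  # a constant vector has zero distance variance
    dcor2 = min(max(dcov2 / denom, 0.0), 1.0)  # clamp rounding error
    return 1.0 - np.sqrt(dcor2)


x = np.arange(5.0)
print(distcor_metric(x, 2.0 * x + 1.0))  # close to 0 for a linear relation
```

In principle such a function could be passed as `metric=distcor_metric` to UMAP, but whether UMAP accepts it unchanged has not been verified here, and the quadratic cost per call may still be prohibitive for 784 columns.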
So third, I ran some experiments with the pynndescent library, which appears to be a close cousin of the nndescent code used inside the UMAP library. Surprisingly, this doesn't work even for iris:
from pynndescent import NNDescent
index = NNDescent(iris.data, metric = distcor)
u, _ = index.query(iris.data, k=1)
Since UMAP and pynndescent share maintainers, I should probably take that up with them.
OK, thank you for the clarification. I did not see your question as rude at all, but I am not a native speaker. When I read your question, I thought that you were asking for GPU or compiled versions of the distance covariance/correlation functions via Numba. I think this would be useful for speeding up computations, but I do not have time to implement it right now. However, if I have understood your answer correctly, your problem lay elsewhere. I hope you can find and fix it easily.
I'm trying to use distance correlations as a metric for computing UMAP embeddings. This requires Numba support.
Is there a fundamental reason why dcor.distance_correlation can't support Numba, or is it just a matter of going over the code?