lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License
7.41k stars 805 forks source link

Handle complex solutions for the Dirichlet distance. #928

Open oiao opened 2 years ago

oiao commented 2 years ago

Consider a ll_dirichlet distance calculation for the following two bit vectors:

import numpy as np
from umap.distances import ll_dirichlet
np.random.seed(42)
x = np.random.randint(0,2,100)
y[:] = x
y[-1] = 1 if x[-1]==0 else 0
dist = ll_dirichlet(x,y)

assert not np.isfinite( np.sqrt(a+b) )

dist will be NaN because the sum in https://github.com/lmcinnes/umap/blob/dbd8031485271a5ebda8cdc87026488af3e08dc8/umap/distances.py#L754-L757 is <= 0. What would be a correct way to handle this? Is something like this appropriate here:

    a = 1.0 / n2 * (log_b - log_beta(n1, n2) - (self_denom2 - log_single_beta(n2)))
    b = 1.0 / n1 * (log_b - log_beta(n2, n1) - (self_denom1 - log_single_beta(n1)))
    ab = a+b
    return 0. if ab <= 0 else np.sqrt(ab)

If so, I'd be more than happy to submit a PR.

Full disclosure: I am not too familiar with the Dirichlet distribution and whether it can be used for the distance calculation of bit vectors. I tried looking up the algorithm, as used in this package, but could not find any publication that would propose an implementation that is similar to the current one. Would be great to get any leads on where the implementation is coming from.

Thanks a lot!

lmcinnes commented 2 years ago

A colleague built it for a very specific use case; these days we would mostly just use hellinger distance instead. Still, your suggestion makes good sense, so a PR would be welcome.