Consider a ll_dirichlet distance calculation for the following two bit vectors:
import numpy as np
from umap.distances import ll_dirichlet
np.random.seed(42)
x = np.random.randint(0,2,100)
y[:] = x
y[-1] = 1 if x[-1]==0 else 0
dist = ll_dirichlet(x,y)
assert not np.isfinite( np.sqrt(a+b) )
a = 1.0 / n2 * (log_b - log_beta(n1, n2) - (self_denom2 - log_single_beta(n2)))
b = 1.0 / n1 * (log_b - log_beta(n2, n1) - (self_denom1 - log_single_beta(n1)))
ab = a+b
return 0. if ab <= 0 else np.sqrt(ab)
If so, I'd be more than happy to submit a PR.
Full disclosure: I am not too familiar with the Dirichlet distribution and whether it can be used for the distance calculation of bit vectors. I tried looking up the algorithm, as used in this package, but could not find any publication that would propose an implementation that is similar to the current one. Would be great to get any leads on where the implementation is coming from.
A colleague built it for a very specific use case; these days we would mostly just use hellinger distance instead. Still, your suggestion makes good sense, so a PR would be welcome.
Consider a ll_dirichlet distance calculation for the following two bit vectors:
dist
will beNaN
because the sum in https://github.com/lmcinnes/umap/blob/dbd8031485271a5ebda8cdc87026488af3e08dc8/umap/distances.py#L754-L757 is <= 0. What would be a correct way to handle this? Is something like this appropriate here:If so, I'd be more than happy to submit a PR.
Full disclosure: I am not too familiar with the Dirichlet distribution and whether it can be used for the distance calculation of bit vectors. I tried looking up the algorithm, as used in this package, but could not find any publication that would propose an implementation that is similar to the current one. Would be great to get any leads on where the implementation is coming from.
Thanks a lot!