Handle complex solutions for the Dirichlet distance.

Consider a ll_dirichlet distance calculation for the following two bit vectors:

import numpy as np
from umap.distances import ll_dirichlet
np.random.seed(42)
x = np.random.randint(0,2,100)
y[:] = x
y[-1] = 1 if x[-1]==0 else 0
dist = ll_dirichlet(x,y)

assert not np.isfinite( np.sqrt(a+b) )

dist will be NaN because the sum in https://github.com/lmcinnes/umap/blob/dbd8031485271a5ebda8cdc87026488af3e08dc8/umap/distances.py#L754-L757 is <= 0. What would be a correct way to handle this? Is something like this appropriate here:

    a = 1.0 / n2 * (log_b - log_beta(n1, n2) - (self_denom2 - log_single_beta(n2)))
    b = 1.0 / n1 * (log_b - log_beta(n2, n1) - (self_denom1 - log_single_beta(n1)))
    ab = a+b
    return 0. if ab <= 0 else np.sqrt(ab)

If so, I'd be more than happy to submit a PR.

Full disclosure: I am not too familiar with the Dirichlet distribution and whether it can be used for the distance calculation of bit vectors. I tried looking up the algorithm, as used in this package, but could not find any publication that would propose an implementation that is similar to the current one. Would be great to get any leads on where the implementation is coming from.

Thanks a lot!

lmcinnes / umap

Handle complex solutions for the Dirichlet distance. #928