Closed segasai closed 1 year ago
Here's an analysis of the performance (the number of clusters recovered) for the old vs. the new version of the code, for different numbers of clusters N = 1, 2, 5, 10, 100, 1000.
Here's the code: I generate points around cluster centres placed on a grid, randomly scale each dimension, and randomly rotate the whole point set. I'd say it looks like an improvement.
import itertools
import math
import multiprocessing as mp

import numpy as np
import scipy.linalg

import dynesty.bounding as db


def random_mat(rstate, ndim):
    # random orthogonal (rotation) matrix from the SVD of a random covariance
    x = rstate.normal(size=(ndim, ndim))
    cov = np.cov(x)
    mat = scipy.linalg.svd(cov)[0]
    return mat


def test_number_clusters(ndim, ncens, nper_dim=10, sig=0.01, seed=1):
    # check that we recover close to the correct (and large) number of clusters
    npt = ncens * nper_dim * ndim
    rstate = np.random.default_rng(seed)
    # place the cluster centres on a grid
    nxcens = int(math.ceil(ncens**(1. / ndim)))
    cens = np.zeros((ncens, ndim))
    curiter = itertools.product(*[range(nxcens)] * ndim)
    for i in range(ncens):
        cens[i] = next(curiter)
    # scatter points around the centres
    xs = sig * rstate.normal(size=(npt, ndim)) + cens[np.arange(npt) %
                                                      len(cens), :]
    # randomly scale each dimension
    min_scale = 1e-3
    scales = 10**rstate.uniform(np.log10(min_scale), 0, size=ndim)
    xs = xs * scales[None, :]
    # apply a random rotation
    mat = random_mat(rstate, ndim)
    xs = xs @ mat
    E = db.bounding_ellipsoids(xs)
    return ncens, len(E.ells)


def doall(ntest=10000, nthreads=36):
    rstate = np.random.default_rng(2)
    ndims = rstate.integers(2, 20 + 1, size=ntest)
    ncens = np.array([1, 2, 5, 10, 100, 1000])
    xcens = rstate.integers(0, len(ncens), size=ntest)
    ncens = ncens[xcens]
    nper_dims = rstate.integers(2, 10 + 1, size=ntest)
    res = []
    with mp.Pool(nthreads) as poo:
        for i in range(ntest):
            res.append(
                poo.apply_async(test_number_clusters,
                                (ndims[i], ncens[i], nper_dims[i], 0.01, i)))
        for i in range(ntest):
            if i % 1000 == 0:
                print(i)
            res[i] = res[i].get()
    return res
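For reference, here is a minimal sketch of how one might aggregate the output of doall() into recovered-vs-true counts per N (the grouping and the 10% criterion below are my own choices for illustration, not part of the test above):

import collections

import numpy as np

# each entry of res is a (true_ncens, recovered_ncens) pair returned by doall()
res = doall(ntest=1000, nthreads=4)

by_n = collections.defaultdict(list)
for ntrue, nfound in res:
    by_n[ntrue].append(nfound)

for ntrue in sorted(by_n):
    found = np.array(by_n[ntrue])
    # median recovered count and fraction of runs within 10% of the truth
    frac_ok = np.mean(np.abs(found - ntrue) <= 0.1 * ntrue)
    print(ntrue, np.median(found), frac_ok)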
Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
---|---|---|---
py/dynesty/bounding.py | 17 | 21 | 80.95%
Total: | 21 | 25 | 84.0%

Totals |
---|---
Change from base Build 5143604727: | -0.006%
Covered Lines: | 4169
Relevant Lines: | 4535
After some more thinking/testing, I have …
@joshspeagle, comments?
Merged this, as it has been sitting for too long.
This is a tentative fix to the BIC criterion for ellipsoid splitting introduced in #286. There, a factor of two was skipped in the calculation, making the splitting of ellipsoids harder. Apparently that leads to the significant slowness observed in #440.
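For context, here's my own schematic of why the factor of two matters (a paraphrase, not the exact expression in bounding.py). The standard BIC is k * ln(n) - 2 * ln(L_hat); if the factor of two on the log-likelihood term is skipped, the likelihood gain from a proposed split counts only half as much against the complexity penalty, so splits are accepted less often and the decomposition stays coarser:

import numpy as np

def bic(log_like, n_params, n_points):
    # standard Bayesian information criterion: k * ln(n) - 2 * ln(L_hat)
    return n_params * np.log(n_points) - 2.0 * log_like

def bic_missing_factor(log_like, n_params, n_points):
    # variant with the factor of two on the likelihood term skipped
    return n_params * np.log(n_points) - log_like

# a split is accepted when the split model's criterion is lower than the
# single-ellipsoid one; with the missing factor the same likelihood
# improvement offsets less of the extra-parameter penalty, so fewer
# splits pass, which matches the behaviour described above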
I'm a bit torn on this issue and whether it's the right change. One wouldn't want the results to be that sensitive to this factor of two, and I'm concerned that it may break some other cases...
One thing that this change trips up is the test_ellipsoids test case https://github.com/joshspeagle/dynesty/blob/d9164354421f4a010ee2bc232cb4eda2f4f90a88/tests/test_ellipsoid.py#L265 where I check whether the number of clusters recovered from a 6 x 6 x 6 x 6 grid in 4d is within 10% of 6^4. I don't know if that's a serious concern or not...
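To make the tolerance concrete, the check has roughly this shape (a paraphrase reusing the test_number_clusters helper from the code earlier in this thread, not the verbatim test; with 6^4 = 1296 true clusters, a 10% tolerance allows the recovered count to be off by about 130):

ntrue, nfound = test_number_clusters(ndim=4, ncens=6**4, nper_dim=5, sig=0.01, seed=0)
assert abs(nfound - ntrue) < 0.1 * ntrue  # within 10% of 6**4 = 1296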
Thoughts?
(Sorry for creating a second PR on this; the first one targeted an incorrect branch by mistake.)