vertaix / Vendi-Score


Thoughts on Bootstrapping #8

Open lukepmccombs opened 1 year ago

lukepmccombs commented 1 year ago

I've found that when the dataset is large, and K is expensive to compute or won't fit in my laptop's memory, bootstrapping approximates the Vendi Score reasonably well. For a dataset of 10,000 samples, with each entry of K computed only once:

- 100 samples of 200 datapoints each: 20.96 with std 1.826 in 6.0s
- 30 samples of 1000 datapoints each: 24.72 with std 1.057 in 12.4s
- Raw Vendi Score: 25.857 in 199.3s

As I decrease the sample size, the estimate trends lower with higher variance, no matter the number of bootstraps. For larger sample sizes it works well as a quick-and-dirty estimate, though. Note that I sampled without replacement, since sampling with replacement directly duplicates rows and artificially reduces diversity.
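Roughly, the procedure is the sketch below (`subsample_vendi` is just my illustrative helper; it assumes the full K is already available and indexable):

```python
import numpy as np
from vendi_score import vendi

def subsample_vendi(K, sample_size, n_samples, seed=None):
    """Estimate the Vendi Score of a large similarity matrix K by scoring
    random principal submatrices, sampled without replacement."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_samples):
        idx = rng.choice(len(K), size=sample_size, replace=False)
        scores.append(vendi.score_K(K[np.ix_(idx, idx)]))
    return np.mean(scores), np.std(scores)

# e.g. the "30 samples of 1000 datapoints" setting above:
# mean, std = subsample_vendi(K, sample_size=1000, n_samples=30)
```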

Any thoughts on the efficacy of computing the Vendi Score this way? This isn't a rigorous test by any means, but I thought it might be of interest to others.

sebastianberns commented 1 year ago

Yes, this is to be expected. The smaller a dataset, the easier it is to find (and add to the set) an image that is very dissimilar to all images in the dataset. For example, if you start with one image, almost any other image will increase the diversity of the resulting set of two. Adding a third image is still easy. But imagine a set of 1,000 images: which image would significantly increase the diversity of the set (in terms of VS: be significantly different from all other images)? As the dataset grows, the VS quickly rises and asymptotically approaches its maximum from below.

On another note: how do you compute the similarity matrix K? If it's the dot product, it's faster to compute the VS from the covariance matrix: https://github.com/vertaix/Vendi-Score#usage

```python
vendi.score_dual(X)
```

K will then be a square d x d matrix, where d is the number of embedding dimensions, so its size is independent of the number of samples. I find that this is typically the fastest way.
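For illustration, a minimal sketch of the two routes (the data here is random and I normalize the rows so that the self-similarity is 1; just an example, not your setup):

```python
import numpy as np
from vendi_score import vendi

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64))                  # n x d embedding matrix
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-normalize rows

# Both give the same score (the nonzero eigenvalues of X X^T and X^T X coincide),
# but the dual form only ever eigendecomposes a d x d matrix instead of n x n.
vs_dual = vendi.score_dual(X)        # works with the 64 x 64 covariance
vs_full = vendi.score_K(X @ X.T)     # builds the full 2000 x 2000 kernel
```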

lukepmccombs commented 1 year ago

Unfortunately, my similarity metric is a Gaussian kernel, so I can't formulate it as a dot product. But this was informative nonetheless!

Edit: I stand corrected, there is a dot-product formulation, although it involves an infinite series. I can likely truncate it for a very similar measure, though.
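An alternative to truncating the series that also gives a finite dot-product form is a random-features approximation (random Fourier features); just a sketch I haven't validated against my kernel:

```python
import numpy as np
from vendi_score import vendi

def rff_features(X, gamma, n_features=512, seed=0):
    """Random Fourier features: z(x) @ z(y) approximates exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2 / n_features) * np.cos(X @ W + b)

# Z @ Z.T approximates the Gaussian kernel matrix exp(-dist**2 / sigma**2)
# with gamma = 1 / sigma**2, so the dual form could then be applied:
# Z = rff_features(X, gamma=1 / sigma ** 2)
# vs_approx = vendi.score_dual(Z)
```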

adjidieng commented 1 year ago

Interesting discussion! Yes, you can scale the VS to large datasets by subsampling if you don't have embeddings available. If you do have embeddings, you can use the approach we used in several experiments in the paper, which @sebastianberns pointed out above.

Thank you for highlighting this.

lukepmccombs commented 1 year ago

I'm now looking into a related paper I found while scrounging around the internet. It details a polynomial estimation of the von Neumann entropy that the authors applied to matrices of up to 72,000,000 x 72,000,000! It should be possible to compute that and then convert it to the Vendi Score using the relation in Eq. 2 of your paper. I'm still working out the code, but if it proves fruitful I'll drop it in here.
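Concretely, the conversion I have in mind is just exponentiating the entropy of the trace-normalized kernel (`vendi_from_K` below is my shorthand for the exact relation, for checking against an estimator):

```python
import numpy as np

def vendi_from_K(K):
    """Exact reference: the Vendi Score is exp of the von Neumann entropy of K / n."""
    lam = np.linalg.eigvalsh(K / len(K))
    lam = lam[lam > 0]                      # zero eigenvalues contribute nothing
    return np.exp(-np.sum(lam * np.log(lam)))

# So an estimate S_hat of the von Neumann entropy of K / n gives
# an estimated Vendi Score of np.exp(S_hat).
```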

lukepmccombs commented 1 year ago

I've replicated Table 1 in Python, although in my testing it didn't hold up for matrices that weren't extremely diagonal. There may be something wrong with my math, but I've checked it over several times to no avail.

```python
import numpy as np
import scipy.linalg
from functools import cache
from vendi_score.vendi import entropy_q

@cache
def _a(x0, k):
    # Series coefficients for the polynomial approximation of x * log(x) on (0, x0]
    # (from the referenced paper)
    if k == 0:
        return x0 * (np.log(x0 / 4) + 1)
    if k == 1:
        return x0 / 4 * (2 * np.log(x0 / 4) + 3)
    return x0 * (-1) ** k / (k * (k ** 2 - 1))

def _p(x, n, x0=1):
    # Scalar Clenshaw recurrence (kept for reference; the matrix form below is used)
    b = np.zeros((n+3, len(x)))
    for k in range(n, -1, -1):
        b[k] = _a(x0, k) \
            + 2 * (2 / x0 * x - 1) * b[k+1] \
            - b[k+2]
    return (b[0] - b[2]) / 2

def _pA(A, v, n, x0, gam0):
    # Matrix-form Clenshaw recurrence: evaluates v @ p(A) @ v with only mat-vec products
    y = np.zeros((n+3, len(A)))
    _coeff = (4 / x0 / gam0)
    for k in range(n, -1, -1):
        y[k] = _a(x0, k) * v \
            + _coeff * (A @ y[k+1]) \
            - 2 * y[k+1] - y[k+2]

    # from https://en.wikipedia.org/wiki/Clenshaw_algorithm, extra recurrence case
    return gam0 / 2 * (v @ (y[0] - y[2]))

def estimate_entropy(A, n, p, x0=1, gam0=None):
    # Stochastic trace estimate of the von Neumann entropy of A,
    # degree-n polynomial approximation, confidence level p
    m = len(A)
    if gam0 is None:
        # Rescale by the largest eigenvalue so the spectrum lies in (0, x0]
        gam0 = scipy.linalg.eigh(A, eigvals_only=True, subset_by_index=(m-1, m-1))[0]

    i = 0
    N = 1

    xi_mi = np.inf
    xi_ma = -np.inf

    moving_avg = 0
    while i < N:
        # Rademacher probe vector
        ohm = np.random.choice([-1., 1.], m)
        xi = _pA(A, ohm, n, x0, gam0)

        i += 1
        moving_avg += (xi - moving_avg) / i
        if xi < xi_mi or xi > xi_ma:
            xi_mi = np.min([xi, xi_mi])
            xi_ma = np.max([xi, xi_ma])

            # Update the required number of probes N from the observed spread
            delta = xi_ma - xi_mi + m * x0 * gam0 / (n * (n + 1))
            N = 2 * n ** 2 * (n + 1) ** 2 * delta ** 2 * np.log(2 / (1 - p)) \
                / (m ** 2 * x0 ** 2 * gam0 ** 2)

    # Undo the rescaling: S(A) = -tr(A log A)
    von_entropy = -(moving_avg + np.log(gam0) * np.trace(A))
    tol = (m * x0 * gam0) / (2 * n * (n + 1)) \
        + delta * (np.log(2 / (1-p)) / (2 * i))
    return von_entropy, tol, i

def vendi_entropy(A):
    # Exact reference via full eigendecomposition
    eig = np.linalg.eigvalsh(A)
    return entropy_q(eig)

def stiffness_matrix(m):
    # Tridiagonal test matrix from Table 1 of the referenced paper
    return 2 * np.eye(m) - np.eye(m, k=-1) - np.eye(m, k=1)

if __name__ == "__main__":
    import pandas as pd

    df = pd.DataFrame({
        "m": (10, 50, 100, 500, 1000, 5000),
        "n": (2, 3, 3, 4, 6, 8)
    })

    for idx, row in df.iterrows():
        m, n = row["m"], row["n"]
        A = stiffness_matrix(m)

        df.loc[idx, "exact entropy"] = vendi_entropy(A)
        df.loc[idx, ["estimated entropy", "tolerance", "N"]] \
            = estimate_entropy(A, n, 0.95)

    df["absolute error"] = np.abs(df["exact entropy"] - df["estimated entropy"])
    df["relative error"] = np.abs(df["absolute error"] / df["exact entropy"])
    print(df)
```
```
      m  n  exact entropy  estimated entropy   tolerance     N  absolute error  relative error
0    10  2     -19.232387         -19.688055    3.934814  43.0        0.455667        0.023693
1    50  3     -99.227642         -99.806790   10.031345  43.0        0.579148        0.005837
2   100  3    -199.227470        -199.790557   20.724314  31.0        0.563087        0.002826
3   500  4    -999.227414        -991.291497   63.669972  24.0        7.935916        0.007942
4  1000  6   -1999.227412       -2003.573015   61.088217  23.0        4.345603        0.002174
5  5000  8   -9999.227411       -9994.651389  173.238287  30.0        4.576022        0.000458
```
sebastianberns commented 1 year ago

Nice find, @lukepmccombs. Thanks for sharing the code. How much faster is this approximation than the exact solution?

> although in my testing it didn't hold up for matrices that weren't extremely diagonal

The paper you referenced focuses on real symmetric, positive semidefinite matrices. So, I wouldn’t expect this to work in all cases. Also, the Vendi Score has the same requirement (cf. definition 3.1).

Testing for semidefiniteness either requires computing the eigenvalues (which defeats the purpose of the approximation) or attempting a Cholesky factorisation (which can be a bit tricky). So it is difficult to set up a reliable pre-check. Instead, one has to make sure that the similarity function itself is positive semidefinite.
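For completeness, the kind of Cholesky-based check I mean looks roughly like this (strictly it tests positive definiteness, so singular-but-PSD matrices only pass thanks to the small jitter; a heuristic, not a guarantee):

```python
import numpy as np

def looks_psd(K, jitter=1e-10):
    # Attempt a Cholesky factorization of K plus a tiny diagonal shift.
    try:
        np.linalg.cholesky(K + jitter * np.eye(len(K)))
        return True
    except np.linalg.LinAlgError:
        return False
```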

However, for testing purposes, there’s a sklearn function to generate positive semidefinite matrices.

```python
from sklearn.datasets import make_spd_matrix

A = make_spd_matrix(n_dim=m)
```

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_spd_matrix.html

lukepmccombs commented 1 year ago

I've been testing with symmetric, positive semidefinite matrices, and I think I've narrowed down the cases where it performs poorly. For a polynomial of degree n and an m x m matrix, there is a lower bound on the accuracy of roughly m * (largest eigenvalue) / (n * (n + 1)). I'm not a mathematician, so I can't make a broad prediction, but empirically with the Vendi Score this shows up as poor estimates when the score is much smaller than m. I'm testing with a Gaussian kernel, as I mentioned at the start of the thread.
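A quick way to see when that floor bites (this is just my reading of the m * x0 * gam0 / (n * (n + 1)) term in the tolerance, not a formal bound):

```python
def entropy_error_floor(m, n, lambda_max):
    """Rough floor on the accuracy of the entropy estimate for an m x m matrix,
    degree-n polynomial, and largest eigenvalue lambda_max."""
    return m * lambda_max / (n * (n + 1))

# When this is comparable to log(VS) itself, the exponentiated estimate
# can be off by a large factor, which matches the table below.
```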

As for performance, on my PC it beats the exact calculation at around m = 5000. Time taken increases with both n and p, although I'm unsure of the exact relation. I've done some optimization; the workhorse is the matrix-form Clenshaw algorithm in _pA(...). I'll try to squeeze out some more.

```
       m   n      p  exact vendi  exact time  estimated vendi    tolerance      N  estimate time  absolute error  relative error
0     10   2  0.950     9.223711    0.000708         7.803903     2.270911   11.0       0.001428        1.419808       15.393020
1     50   3  0.950    46.274479    0.000401        46.528787     8.389680   11.0       0.001276        0.254308        0.549565
2    100   3  0.950    80.539342    0.000919        76.815894    20.795618   12.0       0.008455        3.723448        4.623142
3    500   4  0.950   232.521316    0.023487       244.307122   101.030846   15.0       0.006331       11.785807        5.068699
4   1000   6  0.950   290.131693    0.050769       320.586467   114.664005   23.0       0.015712       30.454774       10.496879
5   5000   8  0.950   345.894688    6.466299       532.034511   421.793342   13.0       0.810432      186.139823       53.814016
6  10000   8  0.950   356.653767   56.441622       795.107935  1053.969191   12.0       2.645059      438.454168      122.935522
7  10000  16  0.950   356.523837   50.743377       434.685242   152.074416   29.0      10.182829       78.161406       21.923192
8  10000  16  0.990   355.815377   44.181685       414.606756   144.001012   34.0      11.986754       58.791379       16.523001
9  10000  32  0.999   356.555307   44.006463       373.197696    34.563903  503.0     354.955230       16.642389        4.667548
```
Updated Code:

```python
import numpy as np
from vendi_score.vendi import weight_K, score_K

def _ak(x0, n):
    def _a(x0, k):
        if k == 0:
            return x0 * (np.log(x0 / 4) + 1)
        if k == 1:
            return x0 / 4 * (2 * np.log(x0 / 4) + 3)
        return x0 * (-1) ** k / (k * (k ** 2 - 1))
    a = []
    for k in range(n):
        a.append(_a(x0, k))
    return np.array(a)

def _pA(A, ak_rev, v, x0, gam0):
    y = yk1 = np.zeros(len(A))
    av = ak_rev[:, None] * v[None, :]
    _coeff = (4 / x0 / gam0)
    for kv in av:
        yk1, yk2 = y, yk1
        y = kv + _coeff * (A @ yk1) - 2 * yk1 - yk2

    # from https://en.wikipedia.org/wiki/Clenshaw_algorithm, extra recurrence case
    return gam0 / 2 * (v @ (y - yk2))

def _gershgorin_bound(A):
    # https://en.wikipedia.org/wiki/Gershgorin_circle_theorem
    # https://scicomp.stackexchange.com/questions/1801/gershgorin-circle-theorem-to-estimate-the-eigenvalues
    abs_zero_eye = np.abs(A * (1 - np.eye(len(A))))
    return np.max(np.diag(A) + np.sum(abs_zero_eye, axis=1))

def estimate_entropy(A, n, p, x0=1, lambda0=None):
    m = len(A)
    if lambda0 is None:
        lambda0 = _gershgorin_bound(A)
    gam0 = lambda0 / x0

    i = 0
    N = 1
    ak_rev = _ak(x0, n)[::-1]

    xi_mi = np.inf
    xi_ma = np.NINF

    moving_avg = 0
    while i < N:
        ohm = np.random.choice([-1., 1.], m)
        xi = _pA(A, ak_rev, ohm, x0, gam0)

        i += 1
        moving_avg += (xi - moving_avg) / i
        if xi < xi_mi or xi > xi_ma:
            xi_mi = np.min([xi, xi_mi])
            xi_ma = np.max([xi, xi_ma])

            delta = xi_ma - xi_mi + m * x0 * gam0 / (n * (n + 1))
            N = 2 * n ** 2 * (n + 1) ** 2 * delta ** 2 * np.log(2 / (1 - p)) \
                / (m ** 2 * x0 ** 2 * gam0 ** 2)

    von_entropy = -(moving_avg + np.log(gam0) * np.trace(A))
    tol = (m * x0 * gam0) / (2 * n * (n + 1)) \
        + delta * (np.log(2 / (1-p)) / (2 * i)) ** 0.5
    return von_entropy, tol, i

if __name__ == "__main__":
    import pandas as pd
    from contextlib import contextmanager
    import time
    from scipy.spatial.distance import pdist, squareform

    def gaussian_kernel(data, sigma):
        dist = pdist(data, "euclidean")
        kernel = np.exp(- dist ** 2 / sigma ** 2)
        return squareform(kernel) + np.eye(len(data))

    @contextmanager
    def time_me():
        t = time.time()
        yield lambda: time.time() - t

    df = pd.DataFrame({
        "m": (10, 50, 100, 500, 1000, 5000, 10000, 10000, 10000, 10000),
        "n": (2, 3, 3, 4, 6, 8, 8, 16, 16, 32),
        "p": ((0.95,) * 8) + (0.99, 0.999)
    })

    for idx, row in df.iterrows():
        m, n, p = row["m"].astype(int), row["n"].astype(int), row["p"]
        data = np.random.uniform(-1, 1, (m, 2))
        K = gaussian_kernel(data, 0.1)

        with time_me() as ct:
            df.loc[idx, "exact vendi"] = score_K(K)
            df.loc[idx, "exact time"] = ct()

        with time_me() as ct:
            K_ = weight_K(K)
            vN, tol, N = estimate_entropy(K_, n, p)
            H = np.exp(vN)
            df.loc[idx, ["estimated vendi", "tolerance", "N"]] = H, H * tol, N
            df.loc[idx, "estimate time"] = ct()

    df["absolute error"] = np.abs(df["exact vendi"] - df["estimated vendi"])
    df["relative error"] = np.abs(df["absolute error"] / df["exact vendi"]) * 100
    print(df)
```
lukepmccombs commented 1 year ago

Performance is bottlenecked by the matrix multiplication, so there isn't much more I can do about it. I'm uncertain whether this is really useful, since the errors can be crippling for large arrays. Reducing them to a reasonable level may require such a high n and p that calculating the raw entropy would take just as long. The column-sampling approach may be better (I think bootstrapping amounts to a worse form of it). I'll try to get a run going on my university's cluster on an extra-large array to see.

Edit: Trimmed a single matrix multiplication from _pA(...), so it's a little faster across the board. I also rotate buffers now, which may help with extra-large arrays.

Updated block:

```python
def _pA(A, ak_rev, v, x0, gam0):
    av = ak_rev[:, None] * v[None, :]
    _coeff = (4 / x0 / gam0)
    y, yk1, yk2 = av[0].copy(), np.zeros(len(A)), np.zeros(len(A))
    for kv in av[1:]:
        # Rotate buffers
        yk1, yk2, y = y, yk1, yk2
        np.matmul(A, yk1, out=y)
        y *= _coeff
        y += kv - 2 * yk1 - yk2

    # from https://en.wikipedia.org/wiki/Clenshaw_algorithm, extra recurrence case
    return gam0 / 2 * (v @ (y - yk2))
```

Side note: unlike the plain numpy computation, this algorithm is embarrassingly parallel. The sampling passes can be run on distributed workers, initialized with the input parameters and continually churning out xi estimates until the main thread reaches its stopping condition. With high n and p it may not be faster per unit of compute, but you can leverage HPC much better for big data.
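Something like this, roughly (concurrent.futures as a stand-in for a proper HPC scheduler; `_pA` and the `ak_rev` coefficients are the ones from the updated code above, so this isn't standalone):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

_worker_state = {}

def _init_worker(A, ak_rev, x0, gam0):
    # Each worker is initialized once with the shared inputs.
    _worker_state.update(A=A, ak_rev=ak_rev, x0=x0, gam0=gam0)

def _one_xi(seed):
    # One sampling pass: a Rademacher probe pushed through _pA.
    s = _worker_state
    rng = np.random.default_rng(seed)
    ohm = rng.choice([-1., 1.], len(s["A"]))
    return _pA(s["A"], s["ak_rev"], ohm, s["x0"], s["gam0"])

def parallel_xi_batch(A, ak_rev, x0, gam0, batch=64, workers=8):
    # The main thread keeps requesting batches until its N stopping condition is met.
    with ProcessPoolExecutor(workers, initializer=_init_worker,
                             initargs=(A, ak_rev, x0, gam0)) as pool:
        return list(pool.map(_one_xi, range(batch)))
```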