Yes, this is to be expected. The smaller a dataset, the easier it is to find (and add to the set) an image that is very dissimilar to all images already in it. For example, if you start with one image, almost any other image will increase the diversity of this set of two. Adding a third image is still easy. But imagine a set of 1,000 images: which image would significantly increase the diversity of the set (in VS terms, be significantly different from all other images)? With growing dataset size, the VS quickly rises and asymptotically approaches its maximum from below.
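A quick way to see this shape empirically is a toy experiment with random embeddings (a rough sketch, assuming the score_K helper from the repo README; here the ceiling comes from the embedding dimension rather than from the data itself, but the qualitative behaviour is the same):

import numpy as np
from vendi_score import vendi

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))                 # random stand-in "embeddings"
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit rows, so K has a unit diagonal

for n in (2, 10, 100, 1000):
    K = X[:n] @ X[:n].T                         # cosine-similarity matrix of the first n samples
    print(n, vendi.score_K(K))                  # VS rises quickly, then levels off far below n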
On another note— how do you compute the similarity matrix K? If it’s the dot product, it’s faster to compute VS from the covariance matrix: https://github.com/vertaix/Vendi-Score#usage
vendi.score_dual(X)
K will then be a square d x d matrix, where d is the number of embedding dimensions, so its size stays the same for any number of samples. I find that this is typically the fastest way.
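For concreteness, the two routes should give the same number when the similarity really is a dot product, since X X^T and X^T X share their nonzero eigenvalues (a sketch, assuming the score_K and score_dual helpers from the README):

import numpy as np
from vendi_score import vendi

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 128))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm embeddings

vs_primal = vendi.score_K(X @ X.T)   # n x n similarity matrix (2000 x 2000)
vs_dual = vendi.score_dual(X)        # works from the d x d matrix instead (128 x 128)
print(vs_primal, vs_dual)            # should agree up to numerical error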
Unfortunately, my similarity metric is a Gaussian kernel, so I can't formulate it as a dot product. But this was informative nonetheless!
Edit: I stand corrected; there is a dot-product formulation, although it involves an infinite series. I can likely truncate it, though, for a very similar measure.
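If the series gets unwieldy, one practical shortcut might be an approximate explicit feature map for the Gaussian kernel, e.g. random Fourier features via scikit-learn's RBFSampler (a different truncation than the Taylor series, but it serves the same purpose of turning the similarity into a dot product so the d x d route applies; a sketch, with made-up gamma and n_components):

import numpy as np
from sklearn.kernel_approximation import RBFSampler
from vendi_score import vendi

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 32))

# phi(x) @ phi(y) approximates exp(-gamma * ||x - y||^2); the approximation
# quality improves with n_components.
rff = RBFSampler(gamma=0.5, n_components=256, random_state=0)
Phi = rff.fit_transform(X)     # shape (10000, 256)

# The similarity is now (approximately) a dot product, so the cost depends on
# n_components rather than on the dataset size.
print(vendi.score_dual(Phi))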
Interesting discussion! Yes, you can scale VS to large datasets by subsampling if you don't have embeddings available. If you do have embeddings available, you can use what we used in several experiments in the paper, which @sebastianberns pointed out.
Thank you for highlighting this.
I'm actually now looking into a related paper that I found while scrounging around the internet. It details a polynomial estimation of the von Neumann entropy that the authors applied to matrices of up to 72,000,000 x 72,000,000! It should be possible to compute that and then convert it to the Vendi Score using the relation in Eq. 2 of your paper. I'm still working out the code for it, but if it proves fruitful I'll drop it in here.
I've replicated Table 1 in Python, although in my testing it didn't hold up for matrices that weren't very nearly diagonal. There may be something wrong with my math, but I've checked it over several times to no avail.
import numpy as np
import scipy.linalg
from functools import cache

from vendi_score.vendi import entropy_q


@cache
def _a(x0, k):
    # Coefficients of the Chebyshev expansion used by the polynomial
    # approximation (see the referenced paper); x0 scales the approximation interval.
    if k == 0:
        return x0 * (np.log(x0 / 4) + 1)
    if k == 1:
        return x0 / 4 * (2 * np.log(x0 / 4) + 3)
    return x0 * (-1) ** k / (k * (k ** 2 - 1))


def _p(x, n, x0=1):
    # Scalar Clenshaw evaluation of the degree-n polynomial; kept for reference,
    # estimate_entropy only uses the matrix version _pA below.
    b = np.zeros((n + 3, len(x)))
    for k in range(n, -1, -1):
        b[k] = _a(x0, k) + 2 * (2 / x0 * x - 1) * b[k + 1] - b[k + 2]
    return (b[0] - b[2]) / 2


def _pA(A, v, n, x0, gam0):
    # Matrix-form Clenshaw recurrence: evaluates the quadratic form of the
    # degree-n polynomial of (a rescaled) A against the probe vector v using only
    # matrix-vector products, i.e. without ever forming the polynomial of A.
    # From https://en.wikipedia.org/wiki/Clenshaw_algorithm, extra recurrence case.
    y = np.zeros((n + 3, len(A)))
    _coeff = 4 / x0 / gam0
    for k in range(n, -1, -1):
        y[k] = _a(x0, k) * v \
            + _coeff * (A @ y[k + 1]) \
            - 2 * y[k + 1] - y[k + 2]
    return gam0 / 2 * (v @ (y[0] - y[2]))


def estimate_entropy(A, n, p, x0=1, gam0=None):
    # Randomized estimate of the von Neumann entropy of A, using a degree-n
    # polynomial approximation and Hutchinson-style trace estimation with
    # Rademacher probe vectors; p is the target success probability.
    m = len(A)
    if gam0 is None:
        # Largest eigenvalue of A, used to rescale the spectrum.
        gam0 = scipy.linalg.eigh(A, eigvals_only=True, subset_by_index=(m - 1, m - 1))[0]

    i = 0
    N = 1
    xi_mi = np.inf
    xi_ma = -np.inf  # np.NINF was removed in NumPy 2.0
    moving_avg = 0
    while i < N:
        ohm = np.random.choice([-1.0, 1.0], m)  # Rademacher probe vector
        xi = _pA(A, ohm, n, x0, gam0)
        i += 1
        moving_avg += (xi - moving_avg) / i
        if xi < xi_mi or xi > xi_ma:
            # New extreme sample: widen the observed range and re-derive the
            # required number of probes N from the concentration bound.
            xi_mi = np.min([xi, xi_mi])
            xi_ma = np.max([xi, xi_ma])
            delta = xi_ma - xi_mi + m * x0 * gam0 / (n * (n + 1))
            N = 2 * n ** 2 * (n + 1) ** 2 * delta ** 2 * np.log(2 / (1 - p)) \
                / (m ** 2 * x0 ** 2 * gam0 ** 2)

    # Undo the gam0 rescaling and flip the sign to get the entropy estimate,
    # plus a tolerance combining the polynomial and sampling error terms.
    von_entropy = -(moving_avg + np.log(gam0) * np.trace(A))
    tol = (m * x0 * gam0) / (2 * n * (n + 1)) \
        + delta * (np.log(2 / (1 - p)) / (2 * i))
    return von_entropy, tol, i


def vendi_entropy(A):
    # Exact entropy from the full eigendecomposition, for comparison.
    eig = np.linalg.eigvalsh(A)
    return entropy_q(eig)


def stiffness_matrix(m):
    # Tridiagonal "stiffness" test matrix (the Table 1 example being replicated).
    return 2 * np.eye(m) - np.eye(m, k=-1) - np.eye(m, k=1)


if __name__ == "__main__":
    import pandas as pd

    # Replicate Table 1: exact vs. estimated entropy for growing matrix sizes m
    # and polynomial degrees n.
    df = pd.DataFrame({
        "m": (10, 50, 100, 500, 1000, 5000),
        "n": (2, 3, 3, 4, 6, 8),
    })
    for idx, row in df.iterrows():
        m, n = int(row["m"]), int(row["n"])
        A = stiffness_matrix(m)
        df.loc[idx, "exact entropy"] = vendi_entropy(A)
        df.loc[idx, ["estimated entropy", "tolerance", "N"]] \
            = estimate_entropy(A, n, 0.95)
    df["absolute error"] = np.abs(df["exact entropy"] - df["estimated entropy"])
    df["relative error"] = np.abs(df["absolute error"] / df["exact entropy"])
    print(df)
m n exact entropy estimated entropy tolerance N absolute error relative error
0 10 2 -19.232387 -19.688055 3.934814 43.0 0.455667 0.023693
1 50 3 -99.227642 -99.806790 10.031345 43.0 0.579148 0.005837
2 100 3 -199.227470 -199.790557 20.724314 31.0 0.563087 0.002826
3 500 4 -999.227414 -991.291497 63.669972 24.0 7.935916 0.007942
4 1000 6 -1999.227412 -2003.573015 61.088217 23.0 4.345603 0.002174
5 5000 8 -9999.227411 -9994.651389 173.238287 30.0 4.576022 0.000458
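For reference, converting the estimated entropy into a Vendi Score via the Eq. 2 relation should then just be an exponential. Roughly (K here is a placeholder n x n similarity matrix with unit diagonal, not part of the script above):

K_norm = K / len(K)                                       # eigenvalues now sum to 1
entropy_est, tol, probes = estimate_entropy(K_norm, n=8, p=0.95)
vendi_est = np.exp(entropy_est)                           # VS = exp(entropy) per Eq. 2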
Nice find, @lukepmccombs. Thanks for sharing the code. How much faster is this approximation than the exact solution?
although in my testing it didn't hold up for matrices that weren't extremely diagonal
The paper you referenced focuses on real symmetric, positive semidefinite matrices. So, I wouldn’t expect this to work in all cases. Also, the Vendi Score has the same requirement (cf. definition 3.1).
Testing for semidefiniteness either requires computing the eigenvalues (which defeats the purpose of the approximation) or attempting a Cholesky factorisation (which can be a bit tricky), so it's difficult to set up a reliable pre-check. Instead, one has to make sure that the similarity function itself is positive semidefinite.
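For what it's worth, the Cholesky route would look roughly like this (a heuristic sketch; the helper name is made up, and the jitter is exactly the part that makes it tricky, since a plain Cholesky tests strict positive definiteness and can fail for semidefinite but singular K):

import numpy as np

def is_probably_psd(K, jitter=1e-10):
    # Cholesky succeeds only for positive definite matrices, so add a tiny
    # diagonal jitter to tolerate semidefinite (singular) K. A heuristic, not a proof.
    try:
        np.linalg.cholesky(K + jitter * np.eye(len(K)))
        return True
    except np.linalg.LinAlgError:
        return False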
However, for testing purposes, there's a scikit-learn function to generate random symmetric positive-definite (hence positive semidefinite) matrices.
from sklearn.datasets import make_spd_matrix
A = make_spd_matrix(n_dim=m)
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_spd_matrix.html
I've been testing with symmetric, positive semidefinite matrices; I think I've narrowed down the cases where it performs poorly. For a polynomial of degree n and a matrix of size m x m, the error has a floor of roughly m * (largest eigenvalue) / (n * (n + 1)). I'm not a mathematician, so I can't make a broad prediction, but empirically with the Vendi Score this shows up as poor estimates when score << m. I'm testing with a Gaussian kernel, as I mentioned at the start of the thread.
As for performance, on my PC it beats the exact calculation at around m = 5000. The time taken increases with both n and p, although I'm unsure of the exact relation. I've done some optimization; the workhorse is the matrix-form Clenshaw algorithm in _pA(...). I'll try to figure out some more.
m n p exact vendi exact time estimated vendi tolerance N estimate time absolute error relative error
0 10 2 0.950 9.223711 0.000708 7.803903 2.270911 11.0 0.001428 1.419808 15.393020
1 50 3 0.950 46.274479 0.000401 46.528787 8.389680 11.0 0.001276 0.254308 0.549565
2 100 3 0.950 80.539342 0.000919 76.815894 20.795618 12.0 0.008455 3.723448 4.623142
3 500 4 0.950 232.521316 0.023487 244.307122 101.030846 15.0 0.006331 11.785807 5.068699
4 1000 6 0.950 290.131693 0.050769 320.586467 114.664005 23.0 0.015712 30.454774 10.496879
5 5000 8 0.950 345.894688 6.466299 532.034511 421.793342 13.0 0.810432 186.139823 53.814016
6 10000 8 0.950 356.653767 56.441622 795.107935 1053.969191 12.0 2.645059 438.454168 122.935522
7 10000 16 0.950 356.523837 50.743377 434.685242 152.074416 29.0 10.182829 78.161406 21.923192
8 10000 16 0.990 355.815377 44.181685 414.606756 144.001012 34.0 11.986754 58.791379 16.523001
9 10000 32 0.999 356.555307 44.006463 373.197696 34.563903 503.0 354.955230 16.642389 4.667548
Performance is bottlenecked by matrix multiplication, so there isn't much I can do about it. I'm unsure whether this is really useful, since the errors can be crippling for large arrays. Reducing them to a reasonable level may require such a high n and p that calculating the raw entropy would take just as long. The column-sampling approach may be better (I think bootstrapping amounts to a worse form of it). I'll try to get a run going on my university's cluster for an extra-large array to see.
Edit: I trimmed a single matrix multiplication from _pA(...), so it's a little faster across the board. I also rotate buffers now, which may help with extra-large arrays.
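The rotating-buffers version of the recurrence looks roughly like this (a sketch of the idea rather than the exact edit; it reuses _a from the script above and keeps only three work vectors instead of the full (n + 3, m) array):

def _pA_rotating(A, v, n, x0, gam0):
    m = len(A)
    coeff = 4 / (x0 * gam0)
    y_k1 = np.zeros(m)   # holds y[k + 1]
    y_k2 = np.zeros(m)   # holds y[k + 2]
    for k in range(n, -1, -1):
        y_k = _a(x0, k) * v + coeff * (A @ y_k1) - 2 * y_k1 - y_k2
        if k > 0:
            y_k2, y_k1 = y_k1, y_k   # rotate buffers for the next iteration
    # After the k == 0 iteration, y_k is y[0] and y_k2 is y[2].
    return gam0 / 2 * (v @ (y_k - y_k2))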
Sidenote: unlike the plain NumPy computation, this algorithm is embarrassingly parallel. The sampling passes can be run on distributed workers, each initialized with the input parameters and continually churning out xi estimates until the main thread reaches its stopping condition. With high n and p it may not be faster per unit of compute, but you can leverage HPC much better for big data.
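A rough sketch of what I mean, using multiprocessing (the helper names are made up; it reuses _pA and the (A, n, x0, gam0) parameters from the script above, and on spawn-based platforms it needs to run under the usual __main__ guard):

import numpy as np
from multiprocessing import Pool

def _worker_xi(args):
    # Each worker draws its own Rademacher probe vectors and returns a batch of
    # xi estimates.
    A, n, x0, gam0, n_probes, seed = args
    rng = np.random.default_rng(seed)
    m = len(A)
    return [_pA(A, rng.choice([-1.0, 1.0], m), n, x0, gam0) for _ in range(n_probes)]

def parallel_xi_estimates(A, n, x0, gam0, n_probes=64, n_workers=4):
    # Spread the probes across processes; the caller keeps requesting batches
    # until its stopping condition (the adaptive N bound) is met.
    per_worker = max(1, n_probes // n_workers)
    args = [(A, n, x0, gam0, per_worker, seed) for seed in range(n_workers)]
    with Pool(n_workers) as pool:
        batches = pool.map(_worker_xi, args)
    return np.concatenate(batches)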
I've found that when the dataset is large and expensive to compute, or won't fit in my laptop's memory, bootstrapping can approximate the Vendi Score reasonably well. For a dataset of 10,000 points, computing K only once for each subsample:
100 samples of 200 datapoints each: 20.96 with std 1.826 in 6.0s
30 samples of 1000 datapoints each: 24.72 with std 1.057 in 12.4s
Raw vendi: 25.857 in 199.3s
As I decrease the sample size, the estimate trends lower with higher variance, no matter the number of bootstrap samples. For larger sample sizes it works great as a quick-and-dirty estimate, though. Note that I sampled without replacement, since sampling with replacement directly duplicates rows and artificially reduces diversity.
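The estimate I'm describing is roughly the following (a sketch; subsampled_vendi and k are placeholder names, and it assumes the score_K helper from the repo):

import numpy as np
from vendi_score import vendi

def subsampled_vendi(samples, k, n_subsamples=30, subsample_size=1000, seed=0):
    # Average score_K over random subsets drawn without replacement; K is only
    # ever computed for each subset, never for the full dataset.
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_subsamples):
        idx = rng.choice(len(samples), size=subsample_size, replace=False)
        sub = [samples[i] for i in idx]
        K = np.array([[k(a, b) for b in sub] for a in sub])
        scores.append(vendi.score_K(K))
    return np.mean(scores), np.std(scores)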
Any thoughts on the efficacy of calculating vendi score in this manner? This isn't a rigorous test by any means, but I thought it would be of interest to others.