Closed wegel closed 5 years ago
Hi @wegel, that's a great question. Supposing that you have a budget of k comparisons, there are two ways to go about it:

1. Select all k pairs up front, before observing any comparison outcomes.
2. Select pairs adaptively, using the outcomes observed so far to decide which pair to compare next.
For (1), there is some theory that suggests that you cannot do much better than selecting k pairs uniformly at random among all n² pairs.
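As a minimal sketch of that uniform selection (the function name and sizes here are just for illustration), k pairs can be drawn without materializing all n² candidates:

```python
import random

def sample_pairs(n_items, k, seed=0):
    """Draw k distinct unordered pairs uniformly at random among the
    n_items * (n_items - 1) / 2 possible ones, without enumerating them all."""
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < k:
        i, j = rng.sample(range(n_items), 2)
        pairs.add((min(i, j), max(i, j)))  # canonical order avoids duplicates
    return sorted(pairs)

pairs = sample_pairs(n_items=1000, k=50)
```

The rejection loop is cheap as long as k is much smaller than n², which is exactly the regime of interest here.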
For (2), a.k.a. the active learning setting, there are a number of approaches, but it is still very much an active area of research.
Hope this helps!
Interesting. If I chose (1), how would I know (potentially using choix) that the randomly selected pairs were sufficient to compute the rank of all items with good certainty? Would I just call `choix.ilsr_pairwise` and check whether it raises `RuntimeError: Did not converge after 100 iterations`, and if it did raise the error, select a few more random pairs to compare? How can I determine whether the ranking accuracy has improved at each iteration in that case?
Intuitively, something similar to quicksort seems like a good fit for this problem, so I'll read your paper carefully (and try to understand how I would actually implement it). Thanks!
Regarding your first point, I don't think there is any ideal way to do this. In particular, convergence is not necessarily the right thing to look at. Instead, I would add pairs incrementally and monitor how much the estimate changes at each step, e.g., by computing the distance between successive rankings with `choix.{footrule,kendall}_dist`, or by looking at the norm of the difference between successive parameter vectors. In either case, it is not clear when to stop adding pairs; at some point you will probably just need to decide that the results are "good enough".
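To make the ranking-distance idea concrete, here is a minimal, self-contained version of the Spearman footrule distance (an illustration only, not choix's implementation): it is zero when two parameter vectors induce the same ranking, and grows as the rankings diverge.

```python
import numpy as np

def footrule_dist(params1, params2):
    # Rank of each item under each parameter vector (0 = best item).
    r1 = np.argsort(np.argsort(-np.asarray(params1)))
    r2 = np.argsort(np.argsort(-np.asarray(params2)))
    # Spearman footrule: sum of absolute rank differences.
    return int(np.abs(r1 - r2).sum())

a = [2.0, 0.5, -1.0]
b = [0.5, 2.0, -1.0]  # same items, top two swapped
footrule_dist(a, b)   # -> 2
```

In practice one would re-estimate the parameters after each batch of new pairs and stop once this distance (or the norm of the parameter difference) stabilizes.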
This discussion is interesting for us (tensorflow/minigo). We're using (2) to do dynamic pairing of different NN models, which are very expensive to evaluate (~1-3 minutes per evaluation).
We do a variation on active learning where we select a couple of pairings that maximize expected information gain and evaluate them. We stop not after k comparisons, but once the std_err is below some threshold.
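A cheap stand-in for that selection rule (a sketch under the Bradley-Terry model, not the actual minigo code): a comparison is most informative when its outcome is most uncertain, i.e. when the two items' estimated parameters are closest, so one can greedily pick the adjacent pair with the smallest gap.

```python
import numpy as np

def most_uncertain_pair(params):
    # Under Bradley-Terry, P(i beats j) = sigmoid(params[i] - params[j]),
    # so the outcome is closest to a coin flip when the parameters are
    # nearly equal. Picking the adjacent pair (in sorted order) with the
    # smallest gap is a cheap proxy for expected information gain.
    order = np.argsort(params)
    gaps = np.diff(np.asarray(params)[order])
    k = int(np.argmin(gaps))
    return int(order[k]), int(order[k + 1])

params = np.array([0.1, 2.0, 0.15, -1.0])
most_uncertain_pair(params)  # -> (0, 2): the two closest-rated items
```

A full expected-information-gain criterion would also weigh the posterior variance of each item, but this gap heuristic captures the main intuition.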
I've been evaluating the goodness of the ranking via the inverse of the Hessian (the justification was this post, I think):
hessian = choix.opt.PairwiseFcts(pairs, penalty=.1).hessian(ilsr_param)
std_err = np.sqrt(np.diagonal(np.linalg.inv(hessian)))
@sethtroisi very nice. What you are doing can also be thought of as computing the Laplace approximation of the posterior distribution.
An alternative would be:
mean, cov = choix.ep_pairwise(n_items, pairs, 0.1, model="logit")
std_err = np.sqrt(np.diagonal(cov))
This computes an approximate posterior using the EP algorithm, which is in some cases (marginally) more accurate, but also more expensive, since you need multiple matrix inversions.
Thanks for the links. I'll try it out. We are somewhat sensitive to speed: with 1.5M pairs over ~10K models, it takes several minutes to converge and invert.
A related question I've been curious about: what does `penalty` represent? Is there some reasonable way to choose it?
@sethtroisi `penalty` is simply the L2-regularization hyperparameter. In the inference functions exposed in the top-level package, it is called `alpha`.
In other words: if you minimize `choix.opt.PairwiseFcts(pairs, penalty=x).objective` (which you would do when calling `choix.opt_pairwise(n_items, pairs, alpha=x)`), then you are computing the maximum a-posteriori estimate of the parameters, under a spherical Gaussian prior with variance 1/x.
If you can make a guess about the range of the parameters, you can use that to set `penalty`. Otherwise, you can estimate the best value using cross-validation or similar. Informally, 0.1 seems like a decent default value :-)
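To illustrate the role of the penalty, here is a toy gradient-ascent sketch of that MAP estimate (not choix's actual solver, and the comparison outcomes below are made up): a larger penalty shrinks the estimated parameters toward zero, exactly as a Gaussian prior with variance 1/x would.

```python
import numpy as np

def map_pairwise(n_items, pairs, penalty, steps=2000, lr=0.1):
    """MAP estimate of Bradley-Terry parameters with L2 penalty,
    i.e. a spherical Gaussian prior with variance 1/penalty.
    Plain gradient ascent on the penalized log-likelihood."""
    theta = np.zeros(n_items)
    for _ in range(steps):
        grad = -penalty * theta  # gradient of the Gaussian log-prior
        for winner, loser in pairs:
            # P(winner beats loser) under Bradley-Terry.
            p_win = 1.0 / (1.0 + np.exp(theta[loser] - theta[winner]))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        theta += lr * grad
    return theta

# Made-up outcomes: item 0 mostly beats 1, item 1 mostly beats 2.
pairs = [(0, 1)] * 8 + [(1, 0)] * 2 + [(1, 2)] * 7 + [(2, 1)] * 3
weak = map_pairwise(3, pairs, penalty=0.1)
strong = map_pairwise(3, pairs, penalty=1.0)
# Both recover the ordering 0 > 1 > 2, but norm(strong) < norm(weak):
# a stronger prior means more shrinkage toward zero.
```

The ranking is unchanged between the two settings; only the scale of the parameters (and hence the reported std_err) moves with the penalty.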
@lucasmaystre My math background is more in discrete math, so I'm Googling everything you say to try to understand it. Thanks again for following up.
`ep_pairwise` seems to always throw an error; I wrote up this simple test program to demonstrate.
I'm using this to power the ratings page for Minigo; on this data set (~150,000 games) it ran for >10 minutes before I killed it. I tried with a smaller data set (~10,000 games) but it still
My params have a range of ~25.
- with penalty 0.1, my std_err goes from 0.09 to 1.80
- with penalty 0.05, my std_err goes from 0.11 to 2.19
- with penalty 0.2, my std_err goes from 0.08 to 1.40
I guess I'll spend some time seeing what minimizes error on a holdout game set.
@sethtroisi regarding your test program: some of the pairs you generate on line 16 have the same player twice. This throws `ep_pairwise` off. If you replace line 16 with:
players = random.sample(range(M), 2)
I think it should work.
In general, I would expect `ep_pairwise` to be significantly slower than what you have been using (by a factor of 10x-100x). There are ways to speed this up, but I'm not sure it's worth it: simply using the inverse Hessian at the maximum-likelihood estimate of the parameters, as you have done so far, should be good enough in practice.
@lucasmaystre `random.sample(range(M), 2)` is clearly better; that was a silly oversight on my part.
Re "which is in some cases (marginally) more accurate. But also more expensive, since you need multiple matrix inversions": I had misread the "marginally" as applying to both statements (accuracy and cost).
I converted the script to a Colab (Colab is a Google-hosted, more magical version of Jupyter notebooks) and printed some plots of the shift in ratings, the std_err, and the speed.
TL;DR: `ep_pairwise` gives very similar results but is 20-30x slower, which is too slow for my use case :/
Hi,
I have about 30,000 images that I need to rank on a linear scale (let's say from 1 to 10). The naive approach would dictate a full pairwise comparison (n² comparisons), making the task impossible. Instead, I'm searching for a method that would allow me to do only about n·log(n) comparisons by "wisely" choosing the pairs to compare. Of course, with such a method we can only hope to get "near" the ranking ground truth, but that should be sufficient for my needs.
I'm under the impression that choix could be used to attain my goal, but I can't determine exactly how at this point. Any pointers that might help? Thanks!