lucasmaystre / choix

Inference algorithms for models based on Luce's choice axiom
MIT License

Dynamically choosing the pairs to compare in an experiment to limit the number of comparisons #9

Closed wegel closed 5 years ago

wegel commented 5 years ago

Hi,

I have about 30,000 images that I need to rank on a linear scale (let's say from 1 to 10). The naive approach would be a full pairwise comparison (n(n-1)/2, i.e. roughly 450 million comparisons), which makes the task impossible. Instead, I'm searching for a method that would let me do only about n log n comparisons by "wisely" choosing the pairs to compare. Of course, with such a method we can only hope to get "near" the ground-truth ranking, but that should be sufficient for my needs.

I'm under the impression that choix could be used to attain my goal, but I can't determine how exactly at this point. Any pointers that might help? Thanks!

lucasmaystre commented 5 years ago

Hi @wegel, that's a great question. Supposing that you have a budget of k comparisons, there are two ways to go about it:

  1. selecting all k comparisons before observing any outcomes, or
  2. iteratively and adaptively selecting pairs of items to compare, based on the outcomes of previous comparisons.

For (1), there is some theory that suggests that you cannot do much better than selecting the k pairs uniformly at random among all pairs.
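As a hedged illustration of (1) with choix: the budget k, the alpha value, and the collect_outcome helper below are hypothetical placeholders, not part of the library.

    import random
    import choix

    n_items = 30_000
    k = 500_000  # hypothetical comparison budget, on the order of n log n

    # Sample k pairs uniformly at random (no need to enumerate all ~450M pairs).
    pairs_to_ask = [tuple(random.sample(range(n_items), 2)) for _ in range(k)]

    # collect_outcome(i, j) is a placeholder for the actual annotation step;
    # it should return a (winner, loser) tuple for images i and j.
    data = [collect_outcome(i, j) for i, j in pairs_to_ask]

    # Fit Bradley-Terry parameters; alpha > 0 regularizes the estimate, which
    # helps when the comparison graph is sparse or not strongly connected.
    params = choix.ilsr_pairwise(n_items, data, alpha=0.01)

    # Sorting items by their parameter gives the estimated ranking.
    ranking = sorted(range(n_items), key=lambda i: params[i], reverse=True)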

For (2), a.k.a. the active-learning setting, there are a number of approaches, but it is still very much an active area of research. Some alternatives:

Hope this helps!

wegel commented 5 years ago

Interesting. If I chose (1), how would I know (potentially using choix) that the randomly selected pairs were sufficient to calculate the rank of all items with good certainty? Would I just call choix.ilsr_pairwise and check whether it raises RuntimeError: Did not converge after 100 iterations, and if it did, select a few more random pairs to compare? In that case, how could I tell that the ranking has actually improved with each iteration?

Intuitively, something similar to quicksort seems like a good fit for this problem, so I'll read your paper carefully (and try to understand how I would actually implement it). Thanks!
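(For what it's worth, here is a rough, hypothetical sketch of what a quicksort-style procedure could look like; compare(i, j) stands in for a single human comparison and is not part of choix. With noisy comparisons, a single pass only yields an approximate order.)

    import random

    def noisy_quicksort(items, compare):
        # compare(i, j) is a placeholder: it should return True if item i
        # wins a single (possibly noisy) comparison against item j.
        if len(items) <= 1:
            return list(items)
        pivot = random.choice(items)
        better, worse = [], []
        for item in items:
            if item == pivot:
                continue
            (better if compare(item, pivot) else worse).append(item)
        # On average this uses O(n log n) comparisons.
        return noisy_quicksort(better, compare) + [pivot] + noisy_quicksort(worse, compare)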

lucasmaystre commented 5 years ago

Regarding your first point, I don't think there is any ideal way to do this. In particular, convergence is not necessarily the right thing to look at. What I would look at is as follows:

In either case, it is not clear when to stop adding pairs - at some point you will probably just need to decide that the results are "good enough".
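One hedged, concrete version of "good enough" could look like the sketch below: keep adding batches of random comparisons, refit, and stop once the induced ranking barely moves between batches. The names collect_outcome and sample_random_pairs are hypothetical placeholders for your own data-collection code; the batch size and threshold are arbitrary.

    import numpy as np
    from scipy.stats import kendalltau
    import choix

    n_items = 30_000
    batch_size = 50_000
    data, prev_params = [], np.zeros(n_items)

    while True:
        # Collect one more batch of randomly chosen comparisons.
        data += [collect_outcome(i, j) for i, j in sample_random_pairs(batch_size)]
        params = choix.ilsr_pairwise(n_items, data, alpha=0.01)
        # Stop once the ranking is essentially stable across batches.
        tau, _ = kendalltau(prev_params, params)
        if tau > 0.99:
            break
        prev_params = params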

sethtroisi commented 5 years ago

This discussion is interesting for us (tensorflow/minigo). We're using (2) to do dynamic pairing of different NN models, which are very expensive to evaluate (~1-3 minutes per eval).

We do a variation on active learning where we select a couple of points that maximize expected information gain and evaluate them. We stop not after k comparisons, but once std_err falls below some threshold.
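(This is not minigo's actual criterion, but as a much-simplified illustration of the general idea: under the logit model you could, for instance, pick the candidate pair whose outcome is most uncertain given the current parameter estimates. candidate_pairs below is a hypothetical placeholder.)

    import numpy as np

    def most_uncertain_pair(params, candidate_pairs):
        # Bradley-Terry / logit probability that i beats j.
        def p_win(i, j):
            return 1.0 / (1.0 + np.exp(-(params[i] - params[j])))
        # The closer p is to 0.5, the more uncertain (higher-entropy) the outcome.
        return max(candidate_pairs, key=lambda ij: -abs(p_win(*ij) - 0.5))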

I've been evaluating the goodness of the ranking via the inverse of the Hessian (the justification was this post, I think):

        # ilsr_param: parameters estimated e.g. with choix.ilsr_pairwise(n_items, pairs, alpha=.1)
        hessian = choix.opt.PairwiseFcts(pairs, penalty=.1).hessian(ilsr_param)
        std_err = np.sqrt(np.diagonal(np.linalg.inv(hessian)))

lucasmaystre commented 5 years ago

@sethtroisi very nice. What you are doing can also be thought of as computing the Laplace approximation of the posterior distribution.

An alternative would be:

mean, cov = choix.ep_pairwise(n_items, pairs, 0.1, model="logit")
std_err = np.sqrt(np.diagonal(cov))

This computes an approximate posterior using the EP algorithm, which is in some cases (marginally) more accurate, but also more expensive, since it requires multiple matrix inversions.

sethtroisi commented 5 years ago

Thanks for the links. I'll try it out. We are slightly sensitive to speed: with 1.5M pairs over ~10K models it takes several minutes to converge and invert.

A related question I've been curious about: what does penalty represent? Is there some reasonable way to choose it?

lucasmaystre commented 5 years ago

@sethtroisi penalty is simply the L2-regularization hyperparameter. In the inference functions exposed in the top-level package, it is called alpha.

In other words: if you minimize choix.opt.PairwiseFcts(pairs, penalty=x).objective (which you would do when calling choix.opt_pairwise(n_items, pairs, alpha=x)) then you are computing the maximum a-posteriori estimate of the parameters, under a spherical Gaussian prior with variance 1/x.
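To make the correspondence concrete, a small hypothetical example (the pairs and the value x = 0.1 below are made up):

    import choix
    import choix.opt

    pairs = [(0, 1), (0, 2), (1, 2), (2, 3), (0, 3)]  # (winner, loser) tuples
    x = 0.1

    # MAP estimate under a spherical Gaussian prior with variance 1/x ...
    params = choix.opt_pairwise(4, pairs, alpha=x)

    # ... i.e. the minimizer of this L2-penalized negative log-likelihood.
    fcts = choix.opt.PairwiseFcts(pairs, penalty=x)
    print(params, fcts.objective(params))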

If you can make a guess about the range of the parameters, you can use that to set penalty. Otherwise, you can estimate the best value using cross-validation or similar. Informally, 0.1 seems like a decent default value :-)
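A rough sketch of selecting alpha on held-out data (the split, the candidate values, and the n_items/pairs variables are assumed to exist already):

    import random
    import choix

    random.shuffle(pairs)
    split = int(0.8 * len(pairs))
    train, valid = pairs[:split], pairs[split:]

    # Pick the alpha that maximizes the log-likelihood of the held-out comparisons.
    best_alpha = max(
        [0.01, 0.05, 0.1, 0.2, 0.5],
        key=lambda a: choix.log_likelihood_pairwise(
            valid, choix.opt_pairwise(n_items, train, alpha=a)),
    )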

sethtroisi commented 5 years ago

@lucasmaystre My math background is more in discrete math, so I'm Googling everything you say to try to understand it. Thanks again for following up.

ep_pairwise seems to always throw an error; I wrote up this simple test program to demonstrate.

I'm using this to power the ratings page for Minigo; on this data set (~150,000 games) it ran for >10 minutes before I killed it. I tried with a smaller data set (~10,000 games) but it still


My params have a range of ~25.

  - with penalty 0.1, my std_err values range from 0.09 to 1.80
  - with penalty 0.05, they range from 0.11 to 2.19
  - with penalty 0.2, they range from 0.08 to 1.40

I guess I'll spend some time seeing what minimizes error on a holdout game set.

lucasmaystre commented 5 years ago

@sethtroisi regarding your test program: some of the pairs you generate on line 16 have the same player twice. This throws ep_pairwise off. If you replace line 16 with:

players = random.sample(range(M), 2)

I think it should work.

In general I would expect ep_pairwise to be significantly slower than what you used until now (by a factor of 10x-100x). There are ways to speed this up, but I'm not sure it's worth it - simply using the inverse Hessian at the maximum-likelihood estimate of the parameters, as you have done so far, should be good enough in practice.

sethtroisi commented 5 years ago

@lucasmaystre random.sample(range(M), 2) is clearly better; that was a silly oversight on my part.

> which is in some cases (marginally) more accurate, but also more expensive, since it requires multiple matrix inversions.

I misread the "marginally" as applying to both statements (accuracy and cost).

I converted the script to a Colab (Colab is a Google-hosted, more magical version of Jupyter notebooks) and printed some plots about the shift in ratings, std_err, and speed.

TL;DR: ep_pairwise gives very similar results but is 20-30x slower, which is too slow for my use case :/