lsr_pairwise: highest score seems to be off

fohrloop commented 3 weeks ago

Reproducible example

Using 15 items (14 pairs), ordered in a line. The expectation is that the scores would form a straight line.

import choix
import matplotlib.pyplot as plt

# Each tuple is (winner, loser): item i beats item i + 1.
data = [
    (0, 1),
    (1, 2),
    (2, 3),
    (3, 4),
    (4, 5),
    (5, 6),
    (6, 7),
    (7, 8),
    (8, 9),
    (9, 10),
    (10, 11),
    (11, 12),
    (12, 13),
    (13, 14),
]

scores = choix.lsr_pairwise(n_items=15, data=data, alpha=1e-4)
print("scores: ", scores)

plt.plot(range(len(scores)), scores, marker="o")
plt.show()

This shows:

I would expect the scores to form a continuous straight line, so the first score (item 0, the highest) seems to be off. For example, with ilsr_pairwise you get:

which is a curve with a nonzero second derivative, but it turns into a ~straight line when a very small alpha (1e-23) is used.

Using choix 0.3.5, scipy 1.14.1 on CPython 3.12.6.

lucasmaystre commented 1 week ago

Hi @fohrloop, thanks for reporting this, and apologies for the delay. I will look into this as soon as I get a chance and get back to you.

lucasmaystre commented 2 days ago

I've investigated this issue.

I agree that the result is not very intuitive, but on closer inspection this is actually expected behavior for lsr_pairwise (and likely for the lsr_* algorithms in general). I've verified that the solution you get is the correct one, i.e., it corresponds to the stationary distribution of the Markov chain implied by the data and alpha.
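
To make the Markov-chain reading concrete, here is a minimal NumPy sketch that builds a chain from the data above and solves for its stationary distribution directly. It is not taken from the choix source: the helper name `lsr_scores_sketch` is hypothetical, and the construction (a background rate of alpha between every pair of states, plus 1/2 from each loser to its winner, i.e. uniform initial weights) is an assumption about how such a chain is set up. Under that assumption, nearly all probability mass accumulates on item 0, which never loses, so its log-score sits far above the rest.

```python
import numpy as np


def lsr_scores_sketch(n_items, data, alpha):
    """Log-scores from the stationary distribution of an LSR-style chain.

    Assumption (not confirmed by the thread): every ordered pair of states
    has a background rate `alpha`, and each comparison (winner, loser) adds
    a rate of 1/2 from loser to winner (uniform initial weights).
    """
    rates = alpha * np.ones((n_items, n_items))
    for winner, loser in data:
        rates[loser, winner] += 0.5
    # Turn the rate matrix into a generator: every row must sum to zero.
    np.fill_diagonal(rates, 0.0)
    generator = rates - np.diag(rates.sum(axis=1))
    # Solve pi @ generator = 0 with sum(pi) = 1 by replacing one
    # (redundant) balance equation with the normalization constraint.
    system = generator.T.copy()
    system[-1, :] = 1.0
    rhs = np.zeros(n_items)
    rhs[-1] = 1.0
    pi = np.linalg.solve(system, rhs)
    # Report centered log-probabilities as scores.
    log_pi = np.log(pi)
    return pi, generator, log_pi - log_pi.mean()


lsr_data = [(i, i + 1) for i in range(14)]
lsr_pi, lsr_generator, lsr_scores = lsr_scores_sketch(15, lsr_data, alpha=1e-4)
lsr_gaps = lsr_scores[:-1] - lsr_scores[1:]
print("gap between items 0 and 1:", lsr_gaps[0])
print("gap between items 1 and 2:", lsr_gaps[1])
```

With alpha=1e-4 the first gap is far larger than the second, which is consistent with the first score standing apart from an otherwise smooth curve.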

For ilsr_*, in the specific output you show (for alpha=1e-4), the apparent concavity of the curve is likely because the iterative process stops before it has fully converged. In general, you might also get somewhat unintuitive results from the ilsr_* algorithms, e.g., in the example above when setting alpha to something larger, like 0.1.

For small datasets where some items never "win" or never "lose" (as is the case for items 0 and 14 in your example), I recommend using the opt_* functions instead; I think their output is more intuitive.
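
As a rough illustration of why a regularized maximum-likelihood fit behaves more evenly on this kind of data, the sketch below fits a Bradley-Terry model with an L2 penalty using a damped Newton method in plain NumPy. It does not call choix and is not its implementation: `penalized_ml_sketch` is a hypothetical helper, and the objective (logistic loss on score differences plus (alpha / 2) * ||x||^2) is an assumption about the kind of problem the opt_* functions solve.

```python
import numpy as np


def penalized_ml_sketch(n_items, data, alpha, max_iter=100, tol=1e-8):
    """L2-penalized Bradley-Terry fit via damped Newton (illustrative only)."""
    winners = np.array([w for w, _ in data])
    losers = np.array([l for _, l in data])

    def objective(x):
        diff = x[winners] - x[losers]
        # Negative log-likelihood of the observed outcomes plus the penalty.
        return np.logaddexp(0.0, -diff).sum() + 0.5 * alpha * (x @ x)

    x = np.zeros(n_items)
    for _ in range(max_iter):
        diff = x[winners] - x[losers]
        p_upset = 1.0 / (1.0 + np.exp(diff))  # model P(loser beats winner)
        grad = alpha * x
        np.add.at(grad, winners, -p_upset)
        np.add.at(grad, losers, p_upset)
        if np.linalg.norm(grad) < tol:
            break
        curv = p_upset * (1.0 - p_upset)
        hess = alpha * np.eye(n_items)
        np.add.at(hess, (winners, winners), curv)
        np.add.at(hess, (losers, losers), curv)
        np.add.at(hess, (winners, losers), -curv)
        np.add.at(hess, (losers, winners), -curv)
        step = np.linalg.solve(hess, grad)
        # Backtracking line search: halve the step until the objective
        # decreases sufficiently (or the step becomes negligible).
        t, fx = 1.0, objective(x)
        while objective(x - t * step) > fx - 0.25 * t * (grad @ step) and t > 1e-12:
            t *= 0.5
        x = x - t * step
    return x


ml_data = [(i, i + 1) for i in range(14)]
ml_scores = penalized_ml_sketch(15, ml_data, alpha=1e-4)
ml_gaps = ml_scores[:-1] - ml_scores[1:]
print("scores:", ml_scores)
print("largest / smallest adjacent gap:", ml_gaps.max() / ml_gaps.min())
```

On the chain data the fitted scores decrease with roughly even spacing and are symmetric around the middle item, rather than having one item stand apart.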

fohrloop commented 2 days ago

Thanks a lot @lucasmaystre for taking a look and providing a thorough explanation! I've been using opt_pairwise successfully in my use case, and it's good to know that it is more suitable for smaller (or "sparser") datasets!

lucasmaystre / choix

lsr_pairwise: highest score seems to be off #24
