open-spaced-repetition / srs-benchmark

A benchmark for spaced repetition schedulers/algorithms
https://github.com/open-spaced-repetition/fsrs4anki/wiki

Ebisu? #85

Closed: andymatuschak closed this issue 1 month ago

andymatuschak commented 7 months ago

Thank you for this very interesting analysis! If you all feel inclined to include it, I'd be curious to see how Ebisu compares.

Expertium commented 7 months ago

https://github.com/fasiha/ebisu.js/issues/23 I suggested that, but it seems that LMSherlock and @fasiha just kinda weren't very interested.

fasiha commented 7 months ago

Ebisu author here 👋 https://github.com/fasiha/ebisu.js/issues/23 has the discussion and links to the results. The Ebisu v3 release candidate didn't do well! For separate reasons, I've been working on alternatives to that version and have something I like more (see https://github.com/fasiha/ebisu/issues/66), and I'm happy to support rerunning the benchmarks on that version.

I'm ashamed to admit that I haven't made time to properly understand how the benchmarks here work, and I haven't made time to figure out how to run FSRS etc. on the benchmarks I personally use to compare Ebisu versions. Part of the reason is that Ebisu and its benchmarks handle not just binary quizzes but also binomial, noisy-binary, and passive quizzes, and it hasn't been obvious how to adapt those quiz styles to various other probabilistic SRS systems to ensure we're doing apples-to-apples comparisons.

(Background: I use a focal-loss-ified log likelihood for all these quiz types (see the link above and references therein) because the standard log loss / binary cross-entropy (https://github.com/open-spaced-repetition/srs-benchmark/?tab=readme-ov-file#metrics) was ranking "bad" Ebisu models higher than "good" ones, i.e., higher than the ones I preferred and thought were more accurate.)
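For readers unfamiliar with the distinction, here is a minimal sketch of the two losses in Python. This is the generic focal loss of Lin et al. (2017) with its usual focusing parameter gamma, not necessarily the exact variant Ebisu's benchmarks use:

```python
import numpy as np

EPS = 1e-12  # guard against log(0)

def binary_cross_entropy(p: float, y: int) -> float:
    """Standard log loss for predicted recall probability p and outcome y (1 = pass, 0 = fail)."""
    return -(y * np.log(p + EPS) + (1 - y) * np.log(1 - p + EPS))

def focal_loss(p: float, y: int, gamma: float = 2.0) -> float:
    """Focal loss (Lin et al. 2017): down-weights confident, correct predictions,
    so a flood of easy passes dominates the total loss less. gamma = 0 recovers log loss."""
    pt = p if y == 1 else 1 - p  # probability the model assigned to the observed outcome
    return -((1 - pt) ** gamma) * np.log(pt + EPS)
```

The intuition for why the ranking can flip: review histories are overwhelmingly passes, so plain log loss rewards a model that simply predicts high recall everywhere, while the focal weighting shifts attention to the harder, more informative reviews.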

I know I should just wrap up working on Ebisu v3 and release it so folks can do benchmarks without being confused what version to run šŸ˜“ sorry! Iā€™m hoping to release v3 thisā€¦ year šŸ¤ž

Expertium commented 7 months ago

> Part of the reason is that Ebisu and its benchmarks handle not just binary quizzes but also binomial, noisy-binary, and passive quizzes, and it hasn't been obvious how to adapt those quiz styles to various other probabilistic SRS systems to ensure we're doing apples-to-apples comparisons.

We can benchmark any algorithm as long as it:

1) Uses only interval lengths and grades, and no other information (like the card's text)
2) Outputs a number between 0 and 1 that can be interpreted as a probability
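In other words, the benchmark expects something like the following hypothetical interface (the names and signature here are illustrative, not the benchmark's actual API):

```python
from typing import Protocol, Sequence

class Scheduler(Protocol):
    """Hypothetical interface: only past intervals and grades go in,
    and a predicted recall probability in [0, 1] comes out."""

    def predict_retrievability(
        self,
        intervals: Sequence[float],  # days elapsed before each past review
        grades: Sequence[int],       # Anki grades: 1 = Again, 2 = Hard, 3 = Good, 4 = Easy
        elapsed_days: float,         # days since the most recent review
    ) -> float:
        ...
```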

Also, Anki has 4 grades (answer buttons), so I previously suggested using different values of q0 for each grade. I don't know if that suggestion makes a lot of sense; I only have surface-level knowledge of Ebisu, and Bayesian stuff is pretty arcane to me.
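To make that suggestion concrete (purely illustrative values, not a recommendation and not Ebisu's actual API): in Ebisu's noisy-binary model, q1 is the probability of observing a pass given the item was truly remembered and q0 the probability of observing a pass given it was truly forgotten, so each Anki grade could map to an observation roughly like this:

```python
# Made-up q0 values for illustration only. A larger q0 means a recorded "pass"
# is weaker evidence of true recall, so the Bayesian update toward "remembered"
# is smaller.
ANKI_GRADE_TO_NOISY_QUIZ = {
    1: dict(result=0, q1=1.0, q0=0.00),  # Again: clean failure
    2: dict(result=1, q1=1.0, q0=0.33),  # Hard: pass, but shaky recall
    3: dict(result=1, q1=1.0, q0=0.10),  # Good: pass
    4: dict(result=1, q1=1.0, q0=0.02),  # Easy: pass, very confident recall
}
```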

andymatuschak commented 7 months ago

Oh, perfect! Thanks for sharing that thread. I like Ebisu's approach in principle and am still curious whether its empirical deficits can be overcome. I like that its theory more directly handles issues like, say, the fact that if we're targeting 90% retrievability, we should expect to miss 10% of items even if their underlying stabilities are identical. Most algorithms handle that with an ad-hoc solution (e.g. FSRS's low-pass back to default stability), and maybe that's fine, but Bayesian stats seem like a better approach in principle (though evidently perhaps not in practice!).
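A tiny simulation makes the 90%-target point concrete; this is a sketch that assumes every card is reviewed exactly when its predicted retrievability hits the target:

```python
import numpy as np

rng = np.random.default_rng(0)
n_reviews = 100_000
target_retrievability = 0.9

# Identical stabilities, every review scheduled at exactly 90% predicted
# retrievability: each review is a Bernoulli(0.9) trial, so ~10% still fail.
passes = rng.random(n_reviews) < target_retrievability
print(f"observed failure rate: {1 - passes.mean():.3f}")  # ~0.100
```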

I'll leave this issue open since Ebisu is still not in the official benchmark results for this repo, but feel free to close it if you like, since I got what I wanted! :)

Expertium commented 7 months ago

> if we're targeting 90% retrievability, we should expect to miss 10% of items even if their underlying stabilities are identical. Most algorithms handle that with an ad-hoc solution (e.g. FSRS's low-pass back to default stability)

I'm not sure what you're trying to say. That just because a user pressed Again doesn't necessarily mean that stability should be decreased? Our findings suggest that stability can drop very significantly in the case of a memory lapse. Very crudely, post-lapse stability (PLS) as a function of previous stability S looks like this:

[figure: post-lapse stability (PLS) plotted against previous stability S]

Of course, the actual formula is more nuanced; this version omits the retrievability and difficulty dimensions as well as the constant. I simplified it as much as I could to focus purely on the relationship between S and PLS. Once LMSherlock is less busy, we will perform an interesting analysis to try to find weaknesses in our formulas for S and PLS, but I don't expect to find any flaws with PLS = f(S).
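For reference, here is a hedged transcription of the post-lapse stability formula as published in the FSRS-4.5 algorithm description; the exact version and weights used in this benchmark may differ, and the simplification in the final comment is only my reading of the paragraph above:

```python
import math

def post_lapse_stability(S: float, D: float, R: float,
                         w11: float, w12: float, w13: float, w14: float) -> float:
    """Post-lapse stability as I understand the published FSRS-4.5 description
    (weight indices follow the w_11..w_14 convention):

        S_f(D, S, R) = w11 * D^(-w12) * ((S + 1)^w13 - 1) * exp(w14 * (1 - R))
    """
    return w11 * D ** (-w12) * ((S + 1) ** w13 - 1) * math.exp(w14 * (1 - R))

# Dropping difficulty D, retrievability R, and the leading constant leaves the
# crude one-variable relationship discussed above, roughly PLS ~ (S + 1)^w13 - 1.
```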

L-M-Sherlock commented 7 months ago

The PR is here: #11. It doesn't perform well, so I haven't merged it. The dataset has also been updated since then, so the PR is outdated. If you're interested in the result, I'll rerun the benchmark when I'm available, but first I need to check whether my implementation is correct, which requires help from @fasiha.

Expertium commented 3 months ago

@fasiha we're working on FSRS-5, and I will make another Reddit post about benchmarking, so if you are still interested, you can come back to implementing Ebisu in the benchmark.