open-spaced-repetition / srs-benchmark

A benchmark for spaced repetition schedulers/algorithms
https://github.com/open-spaced-repetition/fsrs4anki/wiki

[TODO] Improve Ebisu #124

Open L-M-Sherlock opened 2 months ago

L-M-Sherlock commented 2 months ago

A few points!

1. Using fractional time of quiz

I agree with @user1823:

> That's just how Anki works. You can't access the exact intervals of same-day reviews

If that's the case, I recommend you publish two sets of tables: in addition to the benchmarks you already have, consider a second set, giving all algorithms the fractional time between reviews. Given this repo is called "SRS Benchmark" and not "Anki's unfortunate design decisions Benchmark" 😝 I think that would be interesting and useful for quiz authors to understand various algorithms' behavior without being limited by Anki's silliness.

Because, I've been experimenting with Ebisu v2 and v3 (split algorithm) models with fractional times in the FSRS dataset and there's a lot of really interesting behavior from an Ebisu perspective. As a small example, here's the ROC for four models for the first million flashcards in the FSRS dataset picking 10% of users:

[Figure: ROC curves for the four models in the table below]

| model | AUC (higher is better) | focal loss (higher is better) |
| --- | --- | --- |
| a nice Ebisu 3-split model | 0.74 | -1e4 |
| a nice Ebisu v2 model | 0.73 | -5e4 |
| Your recently-proposed "best" v2 model | 0.70 | -2e4 |
| Your initial proposed v2 model | 0.71 | -2e4 |

The performance improvements we see here due to better data (AUC of 0.7+ vs the 0.6 in your table) are almost certainly going to be shared by all the other algorithms.

2. Data cleanup

Now. If you choose to consider training/benchmarking with this higher-quality data, you'll run into a problem. Very often you have data where a user will fail a quiz, and then fail it again right away. Random example is https://huggingface.co/datasets/open-spaced-repetition/FSRS-Anki-20k/blob/main/dataset/1/30.csv (line 6886):

| card_id | review_th | delta_t | rating | state | duration | delta_t_sec |
| --- | --- | --- | --- | --- | --- | --- |
| 462 | 2298 | 1 | 1 | 2 | 5514 | 84658 |
| 462 | 2305 | 0 | 1 | 3 | 7546 | 794 |
| 462 | 2330 | 0 | 1 | 3 | 5161 | 294 |
| 462 | 2341 | 0 | 2 | 3 | 3835 | 88 |

This is normal; my Anki data (going back to 2013) has the same behavior: sometimes you just hit "fail" accidentally, or you didn't want to re-review your failure right then, so you came back to it a few minutes later in the review session. Anki doesn't have an easy way to "undo" an accidental fail review once it has scheduled future quizzes, so there's no incentive to keep the quiz history clean.

But this hurts Ebisu a lot because the algorithm is very surprised to see failures so quickly after a review. In fact, giving fractional elapsed data can yield worse Ebisu performance (for the same models!) because, instead of you mapping "0 days" to 0.01 days = 14 minutes in your code, Ebisu gets the "real" elapsed time of, for example, 88 seconds 😖, and aggressively updates the model, thereby ruining it for future quizzes.
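To make that concrete, here's a small sketch of that effect, assuming Ebisu v2's Python API (`defaultModel`, `updateRecall`, `modelToPercentileDecay`) and an illustrative 100-hour prior halflife:

```python
import ebisu  # Ebisu v2

# prior: a card whose halflife we currently believe is about 100 hours
model = ebisu.defaultModel(100.0)
print(ebisu.modelToPercentileDecay(model))  # ~100 hours

# now report a failure a mere 88 seconds (~0.024 hours) after the review
updated = ebisu.updateRecall(model, 0, 1, 88 / 3600)

# the posterior halflife collapses, hurting predictions for future quizzes
print(ebisu.modelToPercentileDecay(updated))
```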

I've always handled this by:

  1. grouping together quizzes that happen in the same four-hour block,
  2. keeping only the first failure in the block (ignoring all successes before the first failure),
  3. and keeping any successes after the last failure in that block.

So suppose you saw a sequence of quiz results A through F within one four-hour block, where A is a success, B is the first failure, and E and F are the successes after the last failure.

You'd collapse this to just B, E, and F (keep only the first failure and all successes after the last failure in the four-hour window). Intuitively we ignore A because this chunk is a failure: the algorithm will be poisoned if you tell it there was a success 50 hours after the last review and then a failure 5 minutes later. (Of course you convert all the data to absolute time and recalculate the deltas after you decide which quizzes to keep.)
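For concreteness, here's a minimal sketch of that rule (hypothetical `Quiz` container and helper names, and one particular way of forming the four-hour blocks; the actual implementations are linked below):

```python
from dataclasses import dataclass


@dataclass
class Quiz:
    timestamp_hours: float  # absolute time of the review, in hours
    passed: bool            # True for a success, False for a failure


def collapse_block(block: list[Quiz]) -> list[Quiz]:
    """Collapse one block of closely-spaced quizzes: keep the first failure
    (dropping any successes before it) plus the successes that come after
    the last failure. Blocks with no failures are kept as-is."""
    fail_idxs = [i for i, q in enumerate(block) if not q.passed]
    if not fail_idxs:
        return block
    first_fail, last_fail = fail_idxs[0], fail_idxs[-1]
    return [block[first_fail]] + [q for q in block[last_fail + 1:] if q.passed]


def clean(quizzes: list[Quiz], block_hours: float = 4.0) -> list[Quiz]:
    """Group chronologically-sorted quizzes into blocks no wider than
    `block_hours`, collapse each block, and return the survivors; the caller
    then recomputes deltas from the surviving absolute timestamps."""
    cleaned: list[Quiz] = []
    block: list[Quiz] = []
    for q in sorted(quizzes, key=lambda q: q.timestamp_hours):
        if block and q.timestamp_hours - block[0].timestamp_hours > block_hours:
            cleaned += collapse_block(block)
            block = []
        block.append(q)
    if block:
        cleaned += collapse_block(block)
    return cleaned
```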

I picked this approach because of Anki's day-based approach to quizzes: grouping together quizzes that happened within four hours ensures that we're well within the one-day quantization that Anki uses. This is ad hoc, but I didn't tailor this behavior to work with this benchmark, I promise! Here's the same four-hour interval code from 2021.

The above ROC curve is with this "cleanup" of the data applied. Here's the fsrs_anki_20k_reader.py helper I wrote to apply this idea to the FSRS dataset with fractional intervals.

You can argue that it's up to algorithms to handle this kind of behavior, that users of Anki and Anki-like apps shouldn't need to modify their behavior to prioritize the integrity of their quiz history. That may be a reasonable stance. But I feel low-parameter models like Ebisu shouldn't be punished for being unable to deal with messy data. A 300-parameter neural network has plenty of free parameters to learn to deal with Anki-specific complications: such a neural network could easily learn to group closely-spaced failures, and hedge its predictions for quizzes very close to failures, and other elaborate coping mechanisms to deal with integer-day intervals. Ebisu has zero excess parameters to learn such behavior. It relies on the data being clean from the start.

If you agree to run the benchmarks with fractional interval data, I hope you'll also consider some kind of cleanup step, to avoid telling algorithms that 2+ independent failures happened in rapid succession, and somehow "group" these together. My ad hoc approach outlined above is unlikely to be the best!

If you decide you don't want to consider fractional interval data and/or you don't want to do any cleanup, then maybe there should be a footnote, "The data includes back-to-back failures that we report to Ebisu as failures with 0.01 day intervals" or something sad like this 😢

3. Unbalanced data and log loss

Totally unrelated question here. When you optimize other models, do you run stochastic gradient descent to minimize log loss (cross entropy)? We've talked about this before in other issues but I think this is wrong because optimizing log-loss produces results that are skewed by unbalanced data (far more successes than failures). That's why I prefer to use focal loss, a very simple tweak on cross entropy, introduced by Facebook researchers here.
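For reference, the focal-loss formula from that paper, where $p_t$ is the model's predicted probability of the outcome that actually happened and $\alpha_t$ is the class weight ($\alpha$ for successes, $1-\alpha$ for failures):

$$\mathrm{FL}(p_t) = -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t)$$

Setting $\gamma = 0$ recovers ($\alpha$-weighted) cross entropy; increasing $\gamma$ down-weights confidently-classified examples, which in this data are overwhelmingly the easy successes.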

Specifically, in your example of searching for better Ebisu initial parameters, you found that a very long initial halflife results in best log-loss. However, this only happens because the sum of Bernoulli log likelihoods (cross entropy) is improved by making the model overconfident on the larger set of quiz successes at the expense of ruining performance on the smaller set of quiz failures. This is absolutely not the behavior we want: if anything we want a model that prioritizes getting failures correct and is willing to sacrifice performance on successes.

If you instead optimize focal loss (or some other loss that handles unbalanced training data), you'll get different optima. When I run a grid search of Ebisu v2 on focal loss for just my Anki quiz data (370 flashcards), I see that the best-performing Ebisu v2 model occurs at a halflife much smaller than 400 days (around 150 hours), because the metric isn't impressed by easily-classified successes.

[Figure: focal loss over a grid of Ebisu v2 initial parameters]

(The AUC for the above grid of Ebisu v2 parameters also peaks far away from your "400 days optimum". And I'm starting to think this is what I want to optimize rather than focal loss, since I really really don't care about how accurate predictRecall is; I care a lot more about the relative ranking of predictRecalls called on different cards, and AUC is "the probability that a randomly-chosen successful quiz got a higher predictRecall than a randomly-chosen failed quiz".)

[Figure: AUC over the same grid of Ebisu v2 parameters]

Since AUC and focal loss both peak at close-ish parameters, I think that emphasizes the importance of dealing with training imbalance.

4. AUC vs "true positive rate at 20% false positive rate"

Overall, I'm not happy with AUC in the 70% range. In the ROC curves I shared above, the best-performing models achieved only 50% true positive rate (TPR) at a 20% false positive rate (FPR). That is, if we threshold the predicted recall probability such that 20% of the actually-failed quizzes are incorrectly predicted to succeed, then only 50% of the successful quizzes are predicted to have succeeded. This is a miserable classifier 😭 I would prefer to see 80% or 90% TPR at 20% FPR 😕.

I think it'd be really interesting to compare all the algorithms not just on AUC (area under the entire ROC curve) but also on TPR at 20% FPR. If two algorithms have the same AUC, their TPR at 20% FPR would be a great way to further compare their ranking capacity: the one with the better TPR at 20% FPR would be better in the sense that, assuming we were OK with a 20% FPR, it predicted more of the successful quizzes as successful.

(And I'm not saying 20% FPR is tolerable. Depending on the distribution of failed quizzes, maybe TPR at 10% FPR is what we should optimize. 10% FPR means 10% of the failed quizzes we incorrectly predicted to be successful 😫 that's still so high it hurts my soul.)
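To make that metric concrete, here's a minimal sketch with scikit-learn (hypothetical `tpr_at_fpr` helper; labels are 1 for a successful quiz, 0 for a failure, and the score is the predicted recall probability):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve


def tpr_at_fpr(labels: np.ndarray, predicted_recall: np.ndarray, target_fpr: float = 0.2) -> float:
    """True positive rate of the "predicts success" classifier at a fixed
    false positive rate, interpolated from the ROC curve."""
    fpr, tpr, _ = roc_curve(labels, predicted_recall)
    return float(np.interp(target_fpr, fpr, tpr))


# example usage, alongside AUC:
# auc = roc_auc_score(labels, predicted_recall)
# tpr20 = tpr_at_fpr(labels, predicted_recall, target_fpr=0.2)
```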

Thanks for reading, sorry for so many questions and ideas.

Originally posted by @fasiha in https://github.com/open-spaced-repetition/srs-benchmark/issues/112#issuecomment-2367018931


> If that's the case, I recommend you publish two sets of tables: in addition to the benchmarks you already have, consider a second set, giving all algorithms the fractional time between reviews.

I'm considering it. In fact, I have benchmarked two models with fractional time.

> But this hurts Ebisu a lot because the algorithm is very surprised to see failures so quickly after a review

I think it's a weakness of Ebisu, instead of a problem of the data. And you cannot prevent users from generating unclean data in practice. So it should be handled by the algorithm.

> That's why I prefer to use focal loss, a very simple tweak on cross entropy, introduced by Facebook researchers here.

I'm aware of it, but it will skew the algorithm towards underestimating stability, because focal loss tends to give more weight to failed reviews, which induces the algorithm to decrease the stability. Then the user will end up with higher true retention than their desired retention.
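A small numeric illustration of that weighting (assumed α = 0.5; `focal` is an ad-hoc helper implementing the focal-loss formula given earlier):

```python
from math import log


def focal(p_t: float, gamma: float, alpha: float = 0.5) -> float:
    # focal loss for one observation whose predicted probability of the
    # observed outcome is p_t
    return -alpha * (1 - p_t) ** gamma * log(p_t)


p = 0.9  # model's predicted recall probability
for gamma in [0.0, 2.0]:
    success = focal(p, gamma)      # the quiz was a success: p_t = p
    failure = focal(1 - p, gamma)  # the quiz was a failure: p_t = 1 - p
    print(f"gamma={gamma}: failure/success loss ratio ~ {failure / success:.0f}")
# gamma=0.0 (plain weighted cross entropy): ratio ~ 22
# gamma=2.0: ratio ~ 1770, i.e. failures dominate the total loss far more
```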

Originally posted by @L-M-Sherlock in https://github.com/open-spaced-repetition/srs-benchmark/issues/112#issuecomment-2367109125

Expertium commented 2 months ago

> In fact, I have benchmarked two models with fractional time.

You haven't published the results yet, right?

L-M-Sherlock commented 2 months ago

Yeah, I just stored the results in the repo; I haven't added them to the table yet, and they'll need a new table.

Expertium commented 2 months ago

@fasiha we need a short 1-2 sentence summary of Ebisu, and a link to the repo or something similar, to use as a reference to your work.

user1823 commented 2 months ago

> I think it's a weakness of Ebisu, instead of a problem of the data. And you cannot prevent users from generating unclean data in practice. So it should be handled by the algorithm.

Apps can filter the data before they provide it to the algorithm. In practice, there is no real difference whether the data is filtered by the app itself or by the algorithm.

L-M-Sherlock commented 2 months ago

> Apps can filter the data before they provide it to the algorithm.

You're right. But apps may use very different methods to filter the data. In the benchmark, I hope we can make sure we are benchmarking the best practice. If Ebisu could integrate the filter into itself, it would be much more convenient.

user1823 commented 2 months ago

> the sum of Bernoulli log likelihoods (cross entropy) is improved by making the model overconfident on the larger set of quiz successes at the expense of ruining performance on the smaller set of quiz failures. This is absolutely not the behavior we want: if anything we want a model that prioritizes getting failures correct and is willing to sacrifice performance on successes.

It seems that I am experiencing this with FSRS. Every time I re-optimize my parameters, the 4 initial stabilities increase.

L-M-Sherlock commented 2 months ago

> Every time I re-optimize my parameters, the 4 initial stabilities increase.

Did you draw the first forgetting curves? I guess they also become flat.

user1823 commented 2 months ago

Yes, the forgetting curve seems to be flat, based on this data:

| first_rating | delta_t | y_mean | y_count |
| --- | --- | --- | --- |
| 1 | 1 | 0.9365 | 4473 |
| 3 | 1 | 0.9832 | 3631 |
| 3 | 2 | 0.9886 | 614 |
| 3 | 3 | 0.9789 | 285 |
| 3 | 4 | 0.9799 | 298 |
| 3 | 5 | 0.9939 | 163 |
| 3 | 8 | 0.9794 | 194 |
| 3 | 9 | 0.9774 | 310 |
| 3 | 10 | 0.9786 | 513 |
| 3 | 11 | 0.9631 | 379 |
| 3 | 12 | 0.9703 | 438 |
| 3 | 13 | 0.9551 | 401 |
| 3 | 14 | 0.9712 | 555 |
| 3 | 17 | 0.9760 | 167 |
| 4 | 19 | 0.9565 | 23 |
| 4 | 20 | 1 | 28 |
| 4 | 21 | 1 | 50 |
| 4 | 22 | 1 | 43 |
| 4 | 23 | 1 | 45 |
| 4 | 24 | 0.9688 | 64 |
| 4 | 25 | 0.9259 | 27 |
| 4 | 27 | 1 | 12 |
| 4 | 28 | 1 | 10 |
| 4 | 32 | 1 | 7 |
| 4 | 37 | 0.8333 | 6 |
| 4 | 38 | 0.9091 | 11 |
| 4 | 40 | 0.8571 | 7 |
| 4 | 41 | 1 | 6 |
| 4 | 45 | 1 | 6 |

By the way, I think that there should be a new issue if you want to discuss this further.

Expertium commented 1 month ago

@L-M-Sherlock @fasiha I benchmarked FSRS with focal loss with different combinations of alpha and gamma. None of them improved RMSE or AUC.

[Image: table of RMSE and AUC for the tested alpha/gamma combinations]

Note that alpha=0.5 and gamma=0 is equivalent to logloss/2
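For reference, plugging $\gamma = 0$ and $\alpha = 0.5$ into the focal-loss formula quoted earlier confirms this:

$$\mathrm{FL}(p_t)\big|_{\gamma=0,\ \alpha=0.5} = -\tfrac{1}{2}\log(p_t) = \tfrac{1}{2}\,\mathrm{CE}(p_t)$$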

fasiha commented 1 month ago

Thanks for digging into this! A few comments.

§1 In cross-entropy, α (the balance/imbalance ratio) can have a big impact. But in focal loss, in both my experiments and according to the Facebook paper, α barely changes the result of the optimization, so you can set α=0.5.

§2 Therefore, the fact that varying α appears to modify AUC is very very surprising to me. Seeing that the AUC for the γ=0.25 column varies between 0.7 and 0.66 as α changes, and that this is more variability than you get by varying γ from 0 to 3 (AUC 0.71 to 0.68), makes me extremely confused. I wonder if there's some kind of bug?

§3 More generally, note that varying γ changes the function being optimized. I would be pretty surprised if optimizing with γ=0 results in the same parameters as optimizing with γ=2. And therefore I would be similarly surprised if AUC and other metrics stay the same as the optimal parameters change.

For example, Ebisu makes it easy: there are just two parameters, so we can visualize the cost surface being optimized. The following three plots show this cost surface for γ=0.1, 1, and 2 as we vary the two parameters. The maximum moves towards the lower-right as γ increases, resulting in significantly different behavior when predicting recall probabilities (and different AUC).

In general, if the parameters that minimize focal loss at γ=0.1 and γ=1 and γ=2 all result in very similar AUC… mayyybe this is plausible 🤷: maybe the AUC is not that sensitive to parameters? But I would expect some change in AUC as you vary γ, and moreover I'd expect that change to be much larger than the change in AUC you get by simply varying α.

[Figures: focal-loss cost surfaces over the two Ebisu v2 parameters for γ = 0.1, 1, and 2]

Let me know if the above makes sense.

fasiha commented 1 month ago

> I think it's a weakness of Ebisu, instead of a problem of the data. And you cannot prevent users from generating unclean data in practice. So it should be handled by the algorithm.

I can sort of see what you mean here. Ideally, we prefer algorithms that are robust to noise: of course we prefer algorithms that can handle low-SNR CT scans. But having said that, leaving incorrect/duplicated failures in our training data seems less like noise and more like errors. It feels like duplicating copies of the same healthy CT scans and labeling them with "cancer".

Shifting the burden of dealing with this kind of pathological data to algorithms, instead of cleaning up the data, seems questionable? I know mine is not an unbiased opinion, since I'm the author of a low-parameter algorithm, but I think a lot of reasonable people will look at this data and agree that every algorithm is hurt by including back-to-back failures and by quantizing elapsed time to integer days. Doing this only seems reasonable because Anki has normalized these unfortunate design choices.

> In the benchmark, I hope we can make sure we are benchmarking the best practice.

I'm glad we want to establish some best practices. The data cleanup I sketched is a totally ad hoc approach I've used for years to clean up my old Anki data. I'm sure that you can improve on it, given your greater familiarity with Anki.

> If Ebisu could integrate the filter into itself, it would be much more convenient

As I said earlier, this affects all algorithms. Some (many?) parameters of your 300-parameter neural network are being wasted when trained on such questionable data (duplicated failures, integer days). In order to promote best practices, I think it'd make a lot more sense to standardize the data going into all algorithms, rather than encouraging each algorithm to figure out its own unique approach to dealing with these problems?

(And just to emphasize: I'm proposing these changes (data cleanup, data enhancement with fractional days, focal loss) not because I think they'll help Ebisu—it could very well be that Ebisu performance remains totally lackluster after all this! I've just been trying to understand how this benchmark works and how to make it make more sense to me.)

Expertium commented 1 month ago

> I wonder if there's some kind of bug?

Here's my implementation:

```python
import torch
from torch import Tensor, nn


class FocalLoss(nn.Module):
    def __init__(self):
        super(FocalLoss, self).__init__()

    def forward(self, predictions: Tensor, labels: Tensor) -> Tensor:
        # hyper-parameters
        alpha = 0.5
        gamma = 0.25

        # p_t: predicted probability of the observed outcome
        p_t = torch.where(labels == 1, predictions, 1 - predictions)
        p_t = p_t.clamp(0.0001, 0.9999)
        alpha_t = torch.where(labels == 1, alpha, 1 - alpha)
        focal_loss = -alpha_t * torch.pow((1 - p_t), gamma) * torch.log(p_t)
        return focal_loss.clamp(0, 100)
```

L-M-Sherlock commented 1 month ago

PyTorch has an official implementation:

https://pytorch.org/vision/main/generated/torchvision.ops.sigmoid_focal_loss.html?highlight=focal_loss#torchvision.ops.sigmoid_focal_loss

fasiha commented 1 month ago

Yah you need a second term that adds the 1 - p_t case for failures. Here's my implementation:

```python
from math import log

def bernoulliLogProbabilityFocal(result: bool, p: float, gamma: float = 2, alpha=0.5) -> float:
  assert 0 <= p <= 1
  assert 0 <= gamma
  assert 0 < alpha < 1
  focalP = p**((1 - p)**gamma)
  focalQ = (1 - p)**(p**gamma)
  return log(alpha * focalP if result else (1 - alpha) * focalQ)
```

but definitely use PyTorch's operator.
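One caveat with that operator (a hedged sketch, not a definitive fix): `sigmoid_focal_loss` applies a sigmoid to its first argument internally, so it expects raw logits and float-valued labels. If the model already outputs probabilities (like a predicted recall), something along these lines may be needed (hypothetical `FocalLossFromProbs` wrapper):

```python
import torch
from torch import Tensor, nn
from torchvision.ops import sigmoid_focal_loss


class FocalLossFromProbs(nn.Module):
    """Wrap torchvision's sigmoid_focal_loss for a model that outputs
    probabilities rather than logits."""

    def __init__(self, alpha: float = 0.5, gamma: float = 0.5):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, predictions: Tensor, labels: Tensor) -> Tensor:
        # sigmoid_focal_loss sigmoids its first argument, so convert the
        # probabilities back into logits first
        logits = torch.logit(predictions.clamp(1e-4, 1 - 1e-4))
        return sigmoid_focal_loss(logits, labels.float(), self.alpha, self.gamma)
```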

Expertium commented 1 month ago

I get an error when I try to run this:

```python
from torch import Tensor, nn
from torchvision.ops import sigmoid_focal_loss


class FocalLoss(nn.Module):
    def __init__(self):
        super(FocalLoss, self).__init__()

    def forward(self, predictions: Tensor, labels: Tensor) -> Tensor:
        # hyper-parameters
        alpha = 0.75
        gamma = 0.5

        return sigmoid_focal_loss(predictions, labels, alpha, gamma)
```

Expertium commented 1 month ago

@fasiha in case you missed my comment here: https://github.com/open-spaced-repetition/srs-benchmark/issues/124#issuecomment-2367901327

> we need a short 1-2 sentence summary of Ebisu, and a link to the repo or something similar, to use as a reference to your work.