open-spaced-repetition / fsrs4anki

A modern Anki custom scheduling based on the Free Spaced Repetition Scheduler algorithm
https://github.com/open-spaced-repetition/fsrs4anki/wiki
MIT License

[Feature Request] FSRS4Anki Optimizer 4.0 Beta #342

Closed L-M-Sherlock closed 1 year ago

L-M-Sherlock commented 1 year ago

Background

Baseline

https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/candidate/baseline-3.26.1.ipynb

Candidate

| Idea | Effect | Comment |
| --- | --- | --- |
| power forgetting curve | Positive | more accurate for heterogeneous reviews |
| S0 curve fit | Positive | more accurate initial stability |
| post-lapse stability offset | Null | it could prevent stability increasing after lapse |
| power difficulty | Negative | |
| adaptive grade for difficulty | Null | |
| grade-derived R | | |

Link: https://github.com/open-spaced-repetition/fsrs4anki/tree/Expt/new-baseline/candidate
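
For readers who haven't followed the Expt branch, here is a minimal sketch of the two forgetting-curve shapes being compared; the exact power form below, (1 + t / (9 * S)) ** -1, is an assumption for illustration, not necessarily the formula used in the candidate notebooks.

    import numpy as np

    def exponential_curve(t, s):
        # FSRS v3-style exponential curve: R = 0.9 when t == S
        return 0.9 ** (t / s)

    def power_curve(t, s):
        # assumed power form, also anchored so that R = 0.9 when t == S
        return (1 + t / (9 * s)) ** -1

    t = np.linspace(0, 100, 101)
    # retention after 50 days with S = 10: the power curve decays more slowly at long intervals
    print(exponential_curve(t, 10.0)[50], power_curve(t, 10.0)[50])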

Note

I plan to re-evaluate these candidate ideas one by one before we integrate them into the Beta version.

Expertium commented 1 year ago

I have tested the power vs exp curve with 4.0 Alpha before, but I will re-test it with the mini-batch. I will also test power difficulty. Adaptive grades have been tested already and at this point we all know they work, so I don't think re-testing them is necessary. Grade-derived R has not been implemented yet, so I will wait until it is. I don't have any other good ideas. I've tested a lot more, but they weren't good.

L-M-Sherlock commented 1 year ago

Grade-derived R has not been implemented yet, so I will wait until it is.

Please check my latest comment: https://github.com/open-spaced-repetition/fsrs4anki/issues/339#issuecomment-1625128974

L-M-Sherlock commented 1 year ago

Grade-derived R has not been implemented yet, so I will wait until it is.

I updated the code:

https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/candidate/grade-derived_R.ipynb

Expertium commented 1 year ago

1) Please add a graph (average R on the y axis, average grade on the x axis) with raw data and the approximation to the part of the code that estimates grade-derived R.

2) Add clampings for weights:

            w[13] = w[13].clamp(0, 1)
            w[14] = w[14].clamp(0, 1)
            w[15] = w[15].clamp(0, 1)
            w[16] = w[16].clamp(0, 1)

3) Add this:

            r_w = weights * r_grade + (1 - weights) * r_theoretical
            r_w = r_w.clamp(0.0001, 0.9999)
            s_w = X[:,0] * torch.log(0.9)/torch.log(r_theoretical)
            new_s = torch.where(condition, self.stability_after_success(s_w, new_d, r_w), self.stability_after_failure(s_w, new_d, r_w))

Just fix the error that appears if you pass a float to torch.log.
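
(For reference, a minimal way around that error, since torch.log only accepts tensors; either line below works:)

    import math
    import torch

    # torch.log(0.9) raises a TypeError because torch.log expects a Tensor
    log_09 = torch.log(torch.tensor(0.9))  # wrap the constant in a tensor
    log_09_alt = math.log(0.9)             # or compute the scalar with math.log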

L-M-Sherlock commented 1 year ago

s_w = X[:,0] * torch.log(0.9)/torch.log(r_theoretical)

Should it be r_theoretical? I think it is state[:,0] = state[:,0] * torch.log(r_theoretical) / torch.log(r_w).

Expertium commented 1 year ago

My bad, you're right.
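
For later readers, a minimal sketch of the corrected rescaling, assuming state, r_theoretical, and r_w are tensors with matching shapes and the r values are already clamped away from 0 and 1:

    import torch

    state = torch.tensor([[10.0, 5.0]])    # [stability, difficulty], illustrative values
    r_theoretical = torch.tensor([0.90])   # retrievability implied by the elapsed interval
    r_w = torch.tensor([0.85])             # grade-weighted retrievability
    # rescale stability so that the same interval corresponds to r_w instead of r_theoretical
    state[:, 0] = state[:, 0] * torch.log(r_theoretical) / torch.log(r_w)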

L-M-Sherlock commented 1 year ago

Here is the graph:

[image: average R vs. average grade, raw data with the fitted approximation]

For more details, please check the link I pasted before.
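
To make the fitting step concrete: the idea is to fit average retrievability against average grade per bin and then evaluate the fitted function at grades 1-4. The quadratic fit and the sample numbers below are stand-ins for illustration, not the notebook's actual fitting function.

    import numpy as np

    avg_grade = np.array([1.2, 2.1, 2.9, 3.6])    # hypothetical per-bin mean grades
    avg_r = np.array([0.55, 0.80, 0.92, 0.97])    # hypothetical per-bin mean retention
    coeffs = np.polyfit(avg_grade, avg_r, deg=2)  # toy stand-in for the notebook's curve fit
    grades = np.array([1, 2, 3, 4])
    grade_derived_r = dict(zip(grades.tolist(), np.clip(np.polyval(coeffs, grades), 0, 1)))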

Expertium commented 1 year ago

Wait, was filter_out_suspended_cards True or False in the previous versions of the algorithm? I need it to be consistent with versions that didn't have this option.

user1823 commented 1 year ago

Wait, was filter_out_suspended_cards True or False in the previous versions of the algorithm? I need it to be consistent with versions that didn't have this option.

It was False.

Expertium commented 1 year ago

Here are the results.

1) The power curve definitely improves results; I even added a new column, p<0.001, because of it. S0 curve fit is statistically significant at that level too; in other words, observing such a difference in performance would be extremely unlikely if S0 curve_fit wasn't affecting accuracy.

2) Grade-derived R, on the other hand, was pretty underwhelming. The results are still statistically significant at p<0.05 (which obviously isn't as impressive as p<0.001), but the magnitude of the effect isn't very big. Perhaps there is a better way of doing it, but I can't think of one. I'll leave it to Sherlock to decide whether we should implement grade-derived R in the final version. It provides a small benefit at the cost of adding 8 new parameters.

3) Power difficulty, surprisingly, made performance worse, even though it has 1 extra parameter and the way SInc decreases as D increases feels more natural. A linear decrease with increasing D just doesn't feel like how things actually work; human memory rarely does things linearly. And yet the old formula somehow was better.

[image]
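
(For context on where such p-values could come from: a paired test over per-collection RMSEs is one option. The Wilcoxon signed-rank test and the numbers below are my illustration, not necessarily the exact procedure used here.)

    import numpy as np
    from scipy.stats import wilcoxon

    # hypothetical per-collection RMSEs for the baseline and a candidate
    rmse_baseline = np.array([0.041, 0.055, 0.037, 0.049, 0.062, 0.045])
    rmse_candidate = np.array([0.035, 0.050, 0.033, 0.044, 0.058, 0.041])
    stat, p = wilcoxon(rmse_baseline, rmse_candidate)
    print(f"Wilcoxon p-value: {p:.4f}")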

L-M-Sherlock commented 1 year ago

Link: https://github.com/open-spaced-repetition/fsrs4anki/tree/Expt/new-baseline/candidate

Could you help me test all candidates?

Expertium commented 1 year ago

Yep, I'll do it today.

Expertium commented 1 year ago

Your implementation of adaptive grades needs some changes. Add clampings (0, 10) and make sure that you subtract (Hard+Good)/2 in the formulas for D. I mentioned this in another issue; it's more intuitive that way, since it makes it so that Again and Hard increase D while Good and Easy decrease it. It will have no effect on performance; it's just to make things more intuitive. And in the formula for initial S, subtract the value for Again.
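
A minimal sketch of what I mean for the D update; w_hard, w_good, and w_step are placeholders for the corresponding learnable weights, not the actual indices in the optimizer:

    import torch

    def next_difficulty(d, rating, w_hard, w_good, w_step):
        # subtract the midpoint of the learned Hard and Good grades instead of the hard-coded 3,
        # so Again and Hard push D up while Good and Easy pull it down
        mean_grade = (w_hard + w_good) / 2
        new_d = d - w_step * (rating - mean_grade)
        return new_d.clamp(0, 10)  # clamp D to (0, 10)

    d = torch.tensor([5.0])
    print(next_difficulty(d, torch.tensor([3.0]), torch.tensor(2.0), torch.tensor(3.0), torch.tensor(1.0)))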

L-M-Sherlock commented 1 year ago

Your implementation of adaptive grades needs some changes. Add clampings (0, 10) and make sure that you subtract (Hard+Good)/2 in formulas for D. I mentioned this in another issue, it's more intuitive that way since it makes it so that Again and Hard increase D, Good and Easy decrease it. It will have no effect on performance, it's just to make things more intuitive.

If Good decreases the difficulty, it is likely to counteract or even override the stability decay.

And what if the user never uses Hard?

Expertium commented 1 year ago

When I was testing another user's idea, I found that subtracting 2.5 (which is (Hard+Good)/2) instead of 3 had literally no effect. EDIT: and you should still change the formulas for initial S and D; right now they are using 1 and 3 instead of parameters.

L-M-Sherlock commented 1 year ago

EDIT: and you still should change formulas for initial S and D, right now they are using 1 and 3 instead of parameters.

But if we combine it with the S0 curve fit, the initial S is independent of the adaptive grades.

Expertium commented 1 year ago

Yes, of course in the final version we will use S0 curve_fit. But this is not the final version. EDIT: maybe we can let initial S use 1, 2, 3, 4, but initial D should definitely use parameters.

Expertium commented 1 year ago

Ok, so here's what I want you to do:

In both formulas for D replace 3 with (Hard+Good)/2, but leave the formula for initial S unchanged.

L-M-Sherlock commented 1 year ago

EDIT: maybe we can let initial S use 1, 2, 3, 4 but initial D definitely should use parameters

I tested it, but it caused a weird problem: the initial difficulty for Again is less than the initial difficulty for Hard.

Expertium commented 1 year ago

Was it on your collection?

L-M-Sherlock commented 1 year ago

Yep.

Expertium commented 1 year ago

Hmm. So you think we should use 1, 2, 3, 4 for initial D, and use opt. grades only in the second formula for D?

L-M-Sherlock commented 1 year ago

Hmm. So you think we should use 1, 2, 3, 4 for initial D, and use opt. grades only in the second formula for D?

Right. You can check the notebook for details about the optimization on my collection.

Expertium commented 1 year ago

In that case you should still change (X_rating - 3) in the second formula to (X_rating - self.w[...]). Just to prove a point, I will test both (X_rating - Good) and (X_rating - (Hard+Good)/2).

L-M-Sherlock commented 1 year ago

In that case you should still change (X_rating - 3) in the second formula to (X_rating - self.w[...])

OK, I have updated it just now.

Expertium commented 1 year ago

So I've only tested optimizable grades, but the results are very surprising. Optimizable grades have NO effect on RMSE unless they are in the formula for initial S. Having optimizable grades in D but not in the initial S does nothing. Having them only in the initial S improves the results a little bit. Having them both in the initial S and D produces about the same results as having them only in the initial S. But on their own, optimizable grades don't do anything.

[image]

I vaguely remember that we encountered this problem all the way back when I first introduced my idea of optimizable grades, but we brushed it off and added opt. grades to initial S without thinking about it too much.

[image]

Back then I actually tested having 2 separate sets of parameters, for initial S and for D (8 in total), and yet those results didn't make me suspicious. That means that no changes to D improve RMSE: optimizable grades do not improve RMSE, power D instead of linear doesn't improve RMSE (it makes it worse), and adding R to D doesn't improve RMSE either. So either the optimizer can't find good values for some reason, or our definition of D is completely terrible. Or both.

There are two other ways (that I can think of) to define D, but neither is great. 1) I've experimented with B-W based difficulty, and it's worse no matter how I tweak it. 2) "Best fit" difficulty obtained via a mini-optimization process would be great, but it would slow down the "outer" optimization significantly, since after every single review an "inner" optimization would have to be performed to find the optimal value of D. And even if we manage to implement that in the optimizer, it's likely impossible (and very costly if possible) to do it in the scheduler. Even if we simplified the procedure greatly (I have an idea how), it would still be difficult, and most likely not doable in the scheduler, since it would involve more than the most recent repetition.

@L-M-Sherlock, just to confirm, the scheduler only has access to the most recent repetition, not the card's entire history? Only the helper add-on can read the entire history? If the answer is "Yes", then my idea of "best fit" D cannot be implemented. If there is a way for the scheduler to "see" more than one repetition, then with a lot of simplification, it's possible to pull my idea off. I will test other candidates later.
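
For the record, here is roughly what I mean by "best fit" D, as a heavily simplified sketch: after each review, search for the D that minimizes log loss over the card's history so far. predict_r is a placeholder for running FSRS forward with a fixed difficulty; none of this is actual optimizer code.

    import numpy as np

    def best_fit_d(history, predict_r, grid=np.linspace(1, 10, 91)):
        # history: list of dicts with a boolean "recalled" field, one per past review
        # predict_r(d, history): placeholder returning predicted R for each review given difficulty d
        y = np.array([h["recalled"] for h in history], dtype=float)
        losses = []
        for d in grid:
            r = np.clip(predict_r(d, history), 1e-4, 1 - 1e-4)
            losses.append(-np.mean(y * np.log(r) + (1 - y) * np.log(1 - r)))
        return grid[int(np.argmin(losses))]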

Expertium commented 1 year ago

Also, Sherlock, for testing purposes I want you to make a file where the parameters are optimized first while the grades are fixed at (1, 2, 3, 4), and then a second optimization runs where only the grades are optimized. I wonder if optimizing the grades after all other parameters have been optimized will produce better results. Maybe. Somehow.
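
In PyTorch terms, the two-stage run could look something like this; model, train_one_epoch, main_params, and grade_params are placeholders, not names from the actual optimizer:

    import torch

    def two_stage_optimize(model, train_one_epoch, main_params, grade_params):
        # stage 1: optimize the main weights while the grades stay fixed at (1, 2, 3, 4)
        for p in grade_params:
            p.requires_grad_(False)
        opt1 = torch.optim.Adam(main_params, lr=4e-2)
        for _ in range(5):
            train_one_epoch(model, opt1)

        # stage 2: freeze the main weights and optimize only the grades
        for p in main_params:
            p.requires_grad_(False)
        for p in grade_params:
            p.requires_grad_(True)
        opt2 = torch.optim.Adam(grade_params, lr=4e-2)
        for _ in range(5):
            train_one_epoch(model, opt2)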

L-M-Sherlock commented 1 year ago

just to confirm, the scheduler only has access to the most recent repetition, not the card's entire history? Only the helper add-on can read the entire history?

I'm very sure that the scheduler only has access to the memory states and the info about the current repetition.

L-M-Sherlock commented 1 year ago

Having optimizable grades in D but not in the initial S does nothing. Having them only in the initial S improves the results a little bit. Having them both in the initial S and D produces about the same results as having them only in the initial S. But on their own, optimizable grades don't do anything.

Fine. I've decided to remove it from the Beta. Here are some ideas that will be integrated into the Beta:

  1. S0 curve fit
  2. Power forgetting curve
  3. post-lapse stability offset
  4. 5 splits & 5 epochs

Here is a preliminary implementation: https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/candidate/fsrs4anki_optimizer_beta.ipynb
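
For readers unfamiliar with the S0 curve fit: the idea is to aggregate first reviews by first rating and fit the forgetting curve to (delta_t, retention) pairs, reading off the initial stability. The power form and the numbers below are assumptions for illustration; see the notebook for the actual implementation.

    import numpy as np
    from scipy.optimize import curve_fit

    def power_forgetting_curve(t, s):
        return (1 + t / (9 * s)) ** -1

    def fit_initial_stability(delta_t, retention, count):
        # weight each (delta_t, retention) bin by its review count via sigma
        params, _ = curve_fit(power_forgetting_curve, delta_t, retention,
                              sigma=1 / np.sqrt(count), bounds=(0.1, 365))
        return params[0]

    # hypothetical aggregated data for one first rating
    print(fit_initial_stability(np.array([1, 2, 3, 5, 8]),
                                np.array([0.95, 0.90, 0.86, 0.79, 0.71]),
                                np.array([500, 400, 300, 200, 100])))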

Expertium commented 1 year ago

1) The power forgetting curve outperformed the hybrid curve for every single deck and every single user. What surprised me was the fact that they both reached very high levels of statistical significance.

2) PLS had no impact on performance.

[image]

I would like to test grade-derived R, but before that I want @L-M-Sherlock to give the R-matrix one last try. If it works, then grade-derived R is not necessary.

L-M-Sherlock commented 1 year ago

I would like to test grade-derived R, but before that I want @L-M-Sherlock to give the R-matrix one last try. If it works, then grade-derived R is not necessary.

https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/candidate/R-Matrix.ipynb

I attempted to integrate the R-Matrix into the training process, but it still has some problems.

Edit: And it is very slow, because in each step FSRS needs to query the R-Matrix.

Expertium commented 1 year ago

Change weight to this: weight = 0.9 * torch.pow(torch.tensor(count / 300), 0.5). And remove w[13] and w[14]; I want to see how it works with fixed parameters. Also, add something like count = count.map(lambda x: min(x, 300)).to_numpy() or count = min(count, 300) or something like that. I tried that, but I just keep getting frustrated, because even the simplest changes like that are incredibly unintuitive in PyTorch.
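
Putting those pieces together, the change I have in mind looks roughly like this (assuming count, r_matrix, and r_model are tensors of the same shape; the numbers are illustrative):

    import torch

    count = torch.tensor([10.0, 150.0, 800.0])    # R-Matrix bin sizes
    r_matrix = torch.tensor([0.70, 0.85, 0.93])   # retrievability from the R-Matrix
    r_model = torch.tensor([0.80, 0.88, 0.90])    # retrievability predicted by FSRS

    count = count.clamp(0, 300)                   # cap the bin size at 300
    weight = 0.9 * torch.pow(count / 300, 0.5)    # fixed weighting instead of w[13]/w[14]
    r_blend = weight * r_matrix + (1 - weight) * r_model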

user1823 commented 1 year ago

https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/candidate/R-Matrix.ipynb

It was the worst modification I have seen yet: not only was it very slow, it also produced the worst RMSE I have seen yet.

RMSE with R-matrix.ipynb: 0.0536
RMSE with baseline: 0.0184

Note: I have not tested the latest update to R-matrix.ipynb.

Expertium commented 1 year ago

Ok, I figured out how to fix the issue with count: count = count.clamp(0, 300). I used fixed w[13] and w[14] and tested it on one deck. I also tested the most recent version of R-Matrix.ipynb (with the fix for the count, as above), with optimizable weights. As user1823 said, it's slow, but it is more accurate for me, at least based on a sample size of one, lol. But it's not that much more accurate for user1823, although it's also not as bad as he said; I don't know how he got RMSE>0.05. By the way, ideally RMSE should be calculated using the weighted R, since that's what will be used in all other calculations.

[image]

Expertium commented 1 year ago

I suggest discussing the R-Matrix in that issue; let's keep the current issue for other things, at least for now.

Expertium commented 1 year ago

Sherlock, before I test grade-derived_R.ipynb, I want you to add one little change here: grade_derived_r = dict(zip([1, 2, 3, 4], func(np.array([1, 2, 3, 4]), *params))). The issue with my fitting function is that it might output negative values. It's unlikely, but possible. Add something like max(func(1), 0).
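
Something like this clamping would do, with func and params as toy stand-ins for the notebook's actual fit:

    import numpy as np

    def func(g, a, b):          # toy stand-in for the fitted function
        return a + b * g

    params = (-0.3, 0.25)       # chosen so func(1) comes out negative
    grades = np.array([1, 2, 3, 4])
    # clip the fitted values so the grade -> R lookup can never go negative (or above 1)
    grade_derived_r = dict(zip(grades.tolist(), np.clip(func(grades, *params), 0, 1).tolist()))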

L-M-Sherlock commented 1 year ago

Sherlock, before I test grade-derived_R.ipynb, I want you to add one little change here: grade_derived_r = dict(zip([1, 2, 3, 4], func(np.array([1, 2, 3, 4]), *params))). The issue with my fitting function is that it might output negative values. It's unlikely, but possible. Add something like max(func(1), 0).

OK. Please check the latest version.

Expertium commented 1 year ago

Also, this part is suspicious:

grade_means = grade_r_matrix['r']['mean'].sort_values()
r_means = grade_r_matrix['y']['mean'].sort_values()
count = grade_r_matrix['y']['count'].sort_values()

Doesn't it break the correspondence between them? See the example below:

[image]
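
A minimal sketch of the fix I have in mind: sort the whole frame once by a single key so that each row's mean grade, mean R, and count stay aligned (the column names follow the snippet above; the data is illustrative):

    import pandas as pd

    grade_r_matrix = pd.DataFrame({("r", "mean"): [2.8, 1.4, 3.6],
                                   ("y", "mean"): [0.90, 0.60, 0.97],
                                   ("y", "count"): [120, 40, 300]})
    grade_r_matrix = grade_r_matrix.sort_values(("r", "mean"))  # one sort keeps the rows intact
    grade_means = grade_r_matrix[("r", "mean")]
    r_means = grade_r_matrix[("y", "mean")]
    count = grade_r_matrix[("y", "count")]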

Expertium commented 1 year ago

Here are the results of testing the candidates, excluding grade-derived R, as I still believe it has a bug.

[image]

1) Adaptive grades make no difference. Replacing (rating-3) with (rating-2.5) makes no difference either, so I suggest we use it instead simply because it's more intuitive.

Currently, when we use (rating - 3) in the formulas for D:

Again = D increases
Hard = D increases
Good = D doesn't change
Easy = D decreases

If we use (rating - 2.5):

Again = D increases
Hard = D increases
Good = D decreases
Easy = D decreases

As you can see, it's more intuitive. And since it doesn't affect performance, we lose nothing by implementing it.

2) S0 curve_fit showed a very consistent improvement. The difference wasn't so big for (subjectively) hard decks, but it was huge for (subjectively) easy decks. Sherlock was the only one who didn't benefit from it for some reason; it made RMSE very slightly worse for his collection. It worked really well on all my decks + user8123's collection + nb9618's collection + JeffersonCorreia's collection.

3) Power curve outperformed the hybrid curve for all decks and all users.

4) PLS did not affect performance.

5) Power curve for D made it worse.

So far we only have 2 good ideas: S0 curve_fit and the power forgetting curve. Minus weighted loss aka "best epoch", minus 5 splits & 5 epochs, because those two changes are related to the optimizer, not the algorithm itself. Assuming that the benefits (decrease relative to baseline) are multiplicative, S0 curve_fit + power forgetting curve combined should result in an RMSE roughly 0.81 * 0.86 ≈ 0.70 that of the baseline. So if the baseline had RMSE = 0.1, the new version would have RMSE = 0.07, for example. This is not enough to warrant releasing a new version; we need more.

L-M-Sherlock commented 1 year ago

Doesn't it break the correspondence between them? See example below:

Yep, it is incorrect. I will fix it today.

AkiraChisaka commented 1 year ago

I just want to leave a comment here to thank everyone pouring so much time and attention into this project to try to figure stuff like those out!

I wanted to help too, but I feel like I couldn't understand a single line in this post lol.

Expertium commented 1 year ago

Interestingly, grade-derived R actually performed better when it was bugged. In any case, it's just not worth it. Even the bugged version that performed better wasn't that much better, and it introduces 8 new parameters. It's very inefficient in the improvement-per-parameter sense. We could tweak it a bit more (for example, the grouping of reviews into categories could be done better), but I doubt it would be worth it. So yeah, only 2 good ideas so far.

[image]

user1823 commented 1 year ago

Assuming that the benefits (decrease relative to baseline) are multiplicative, S0 curve_fit + power forgetting curve combined should result in RMSE roughly 0.81*0.86=0.7 that of the baseline. This is not enough to warrant releasing a new version, we need more.

If you ask me, a 30% reduction in RMSE is a lot. I think that we should not further delay the release of this version unless any one of the following holds:

  • You think that we will very soon discover some way to further improve the algorithm (such as improving the formula for D).

Further, if we maintain backward compatibility with FSRS v3, the upgrade would not be a major issue for the users. This is because it would allow the users to upgrade when they want rather than forcing them to upgrade at the moment we release the update.

Expertium commented 1 year ago
  • You think that we will very soon discover some way to further improve the algorithm (such as improving the formula for D).

Perhaps. Sherlock will be working on my idea with best fit D, which we will then try to approximate, and I'm still trying to think of ways to improve the algorithm; I want to decrease RMSE by at least 2x in the new version.

L-M-Sherlock commented 1 year ago

I have found a potential bug in the Beta version. When the user doesn't have enough reviews for all first ratings, fit_curve will raise an error.

Expertium commented 1 year ago

You mean when the counts for all ratings are below 100? Well, that's just something that has to be accepted. Users need at least a thousand or two reviews; otherwise the results won't be accurate. If anything, it's better than letting the user run the optimizer with very few reviews and getting inaccurate results.

user1823 commented 1 year ago

I have found a potential bug in the Beta version. When the user doesn't have enough reviews for all first ratings, fit_curve will raise an error.

In this case, just raise a ValueError like the following:

https://github.com/open-spaced-repetition/fsrs4anki/blob/6066c7174dfefa4ac4cb930dcb70063b906cb6c8/package/fsrs4anki_optimizer/fsrs4anki_optimizer.py#L112

The reason is the same as that pointed out by Expertium. We should ensure that the reviews are sufficient to produce accurate results. Otherwise, there is no point in running the optimizer.
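
A minimal sketch of such a guard; the 100-review threshold and the dictionary of first-rating counts are assumptions for illustration:

    def check_first_rating_counts(rating_counts: dict, minimum: int = 100) -> None:
        # refuse to run the curve fit if any first rating has too few reviews
        for rating, count in rating_counts.items():
            if count < minimum:
                raise ValueError(f"Only {count} reviews with first rating {rating}; "
                                 "collect more reviews before running the optimizer.")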

Expertium commented 1 year ago

I suppose we could add Easy Bonus to the Beta version, though I'm still trying to think of a way to turn it into a function rather than a constant.

    def stability_after_success(self, state: Tensor, new_d: Tensor, r: Tensor, X: Tensor) -> Tensor:
        easy_bonus_condition = X[:,1] == 4
        easy_bonus = torch.where(easy_bonus_condition, self.w[13], 1)
        new_s = state[:,0] * (1 + torch.exp(self.w[6]) *
                        (11 - new_d) *
                        torch.pow(state[:,0], self.w[7]) *
                        (torch.exp((1 - r) * self.w[8]) - 1) * easy_bonus)
        return new_s

w[13] = w[13].clamp(1, 10)

Set the initial value to 2.

Expertium commented 1 year ago

I tried a very similar idea, but with Hard.

    def stability_after_success(self, state: Tensor, new_d: Tensor, r: Tensor, X: Tensor) -> Tensor:
        hard_punishment_condition = X[:,1] == 2
        hard_punishment = torch.where(hard_punishment_condition, self.w[13], 1)
        new_s = state[:,0] * (1 + torch.exp(self.w[6]) *
                        (11 - new_d) *
                        torch.pow(state[:,0], self.w[7]) *
                        (torch.exp((1 - r) * self.w[8]) - 1) * hard_punishment)
        return new_s

w[13] = w[13].clamp(0.1, 1)

Set the initial value to 0.5.

Sorry for the awkward way of presenting the results; I couldn't think of a way to fit both of them neatly into a table. While I'm not against implementing both Easy Bonus and Hard Punishment, I hope that in the future we will find more general and flexible solutions. Also, while the constants do improve RMSE for their respective grades, the overall RMSE isn't improved much. Together they will improve the overall RMSE by only 4-5%.

[image]

Expertium commented 1 year ago

S0 curve_fit + power forgetting curve + Easy Bonus + Hard Punishment + 5 epochs, 5 splits should result in RMSE decreasing to around 0.63-0.64 of the baseline. If we can come up with one more big change, we could maybe drop it to 0.5 (a 2x decrease).