I have tested the power vs. exp curve with 4.0 Alpha before, but I will re-test it with the mini-batch. I will also test power difficulty. Adaptive grades have been tested already, and at this point we all know they work, so I don't think re-testing them is necessary. Grade-derived R has not been implemented yet, so I will wait until it is. I don't have any other good ideas; I've tested a lot more, but they weren't good.
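For context, a minimal sketch of the two forgetting-curve shapes being compared; the power-law constants here are illustrative assumptions, not necessarily the exact ones used in the experiments:

```python
import torch

def exp_forgetting_curve(t: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    # exponential: R = 0.9^(t/S), so R = 0.9 when t == S
    return torch.pow(torch.tensor(0.9), t / s)

def power_forgetting_curve(t: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    # one possible power-law form, also normalized so that R = 0.9 when t == S
    return torch.pow(1 + t / (9 * s), -1)
```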
> Grade-derived R has not been implemented yet, so I will wait until it is.
Please check my latest comment: https://github.com/open-spaced-repetition/fsrs4anki/issues/339#issuecomment-1625128974
> Grade-derived R has not been implemented yet, so I will wait until it is.

I updated the code:
1) Please add a graph (average R on the y axis, average grade on the x axis) with the raw data and the approximation to the part of the code that estimates grade-derived R.
2) Add clampings for the weights:
```python
w[13] = w[13].clamp(0, 1)
w[14] = w[14].clamp(0, 1)
w[15] = w[15].clamp(0, 1)
w[16] = w[16].clamp(0, 1)
```
3) Add this:
```python
r_w = weights * r_grade + (1 - weights) * r_theoretical
r_w = r_w.clamp(0.0001, 0.9999)
s_w = X[:,0] * torch.log(0.9) / torch.log(r_theoretical)
new_s = torch.where(condition, self.stability_after_success(s_w, new_d, r_w), self.stability_after_failure(s_w, new_d, r_w))
```
Just fix the error that appears if you pass a float to torch.log.
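For concreteness, the fix is presumably just wrapping the constant in a tensor (reusing `X` and `r_theoretical` from the snippet above):

```python
import torch

# torch.log expects a Tensor, not a Python float; wrapping the constant fixes the error
s_w = X[:, 0] * torch.log(torch.tensor(0.9)) / torch.log(r_theoretical)
```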
> s_w = X[:,0] * torch.log(0.9) / torch.log(r_theoretical)

Should it be `r_theoretical`? I think it should be `state[:,0] = state[:,0] * torch.log(r_theoretical) / torch.log(r_w)`.
My bad, you're right.
Here is the graph:
For more details, please check the link I pasted before.
Wait, was `filter_out_suspended_cards` True or False in the previous versions of the algorithm? I need it to be consistent with versions that didn't have this option.
> Wait, was `filter_out_suspended_cards` True or False in the previous versions of the algorithm? I need it to be consistent with versions that didn't have this option.
It was `False`.
Here are the results.
1) The power curve definitely improves results; I even added a new column, p<0.001, because of it. S0 curve fit is statistically significant at that level too; in other words, observing such a difference in performance would be extremely unlikely if S0 curve_fit wasn't affecting accuracy.
2) Grade-derived R, on the other hand, was pretty underwhelming. The results are still statistically significant at p<0.05 (which obviously isn't as impressive as p<0.001), but the magnitude of the effect isn't very big. Perhaps there is a better way of doing it, but I can't think of one. I'll leave it to Sherlock to decide whether we should implement grade-derived R in the final version; it provides a small benefit at the cost of adding 8 new parameters.
3) Power difficulty, surprisingly, made performance worse, even though it has 1 extra parameter and the way SInc decreases as D increases feels more natural. A linear decrease with increasing D doesn't feel like how things actually work; human memory rarely does anything linearly. And yet the old formula was somehow better.
Link: https://github.com/open-spaced-repetition/fsrs4anki/tree/Expt/new-baseline/candidate
Could you help me test all candidates?
Yep, I'll do it today.
Your implementation of adaptive grades needs some changes. Add clampings (0, 10) and make sure that you subtract (Hard+Good)/2 in the formulas for D. I mentioned this in another issue; it's more intuitive that way, since it makes Again and Hard increase D and Good and Easy decrease it. It will have no effect on performance, it's just to make things more intuitive. And in the formula for initial S, subtract the value for Again.
> Your implementation of adaptive grades needs some changes. Add clampings (0, 10) and make sure that you subtract (Hard+Good)/2 in the formulas for D. I mentioned this in another issue; it's more intuitive that way, since it makes Again and Hard increase D and Good and Easy decrease it. It will have no effect on performance, it's just to make things more intuitive.
If Good decreases the difficulty, it is likely to counteract or even override the stability decay.
And what if the user never uses Hard?
When I was testing another user's idea, I found that subtracting 2.5 (which is (Hard+Good)/2) instead of 3 had literally no effect. EDIT: and you should still change the formulas for initial S and D; right now they use 1 and 3 instead of parameters.
> EDIT: and you should still change the formulas for initial S and D; right now they use 1 and 3 instead of parameters.
But if we combine it with S0 curve fit, the initial S is independent of the adaptive grades.
Yes, of course in the final version we will use S0 curve_fit. But this is not the final version. EDIT: maybe we can let initial S use 1, 2, 3, 4, but initial D should definitely use parameters.
Ok, so here's what I want you to do: in both formulas for D, replace 3 with (Hard+Good)/2, but leave the formula for initial S unchanged.
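A minimal sketch of the intended change, assuming an FSRS-style difficulty update of the form new_d = d - w_step * (rating - 3); the names here are illustrative, not the repo's:

```python
import torch

def next_difficulty(d: torch.Tensor, rating: torch.Tensor, w_step: torch.Tensor,
                    hard: torch.Tensor, good: torch.Tensor) -> torch.Tensor:
    # "hard" and "good" stand for the optimizable grade parameters
    anchor = (hard + good) / 2  # replaces the fixed constant 3
    return (d - w_step * (rating - anchor)).clamp(1, 10)
```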
> EDIT: maybe we can let initial S use 1, 2, 3, 4, but initial D should definitely use parameters.
I tested it, but it caused a weird problem: the initial difficulty for Again came out lower than the initial difficulty for Hard.
Was it on your collection?
Yep.
Hmm. So you think we should use 1, 2, 3, 4 for initial D, and use opt. grades only in the second formula for D?
> Hmm. So you think we should use 1, 2, 3, 4 for initial D, and use opt. grades only in the second formula for D?
Right. You can check the notebook for details about the optimization on my collection.
In that case you should still change (X_rating - 3) in the second formula to (X_rating - self.w[...]). Just to prove a point, I will test both (X_rating - Good) and (X_rating - (Hard+Good)/2).
> In that case you should still change (X_rating - 3) in the second formula to (X_rating - self.w[...]).
OK, I have updated it just now.
So far I've only tested optimizable grades, but the results are very surprising. Optimizable grades have NO effect on RMSE unless they are in the formula for initial S. Having optimizable grades in D but not in the initial S does nothing. Having them only in the initial S improves the results a little bit. Having them both in the initial S and in D produces about the same results as having them only in the initial S. On their own, optimizable grades don't do anything.
I vaguely remember that we encountered this problem all the way back when I first introduced my idea of optimizable grades, but we brushed it off and added opt. grades to initial S without thinking about it too much. Back then I actually tested having 2 separate sets of parameters, for initial S and for D (8 in total), and yet those results didn't make me suspicious. That means that no changes to D improve RMSE: optimizable grades don't improve RMSE, power D instead of linear doesn't improve RMSE (it makes it worse), and adding R to D doesn't improve RMSE either. So either the optimizer can't find good values for some reason, or our definition of D is completely terrible. Or both. There are two other ways (that I can think of) to define D, but neither is great.
1) I've experimented with B-W based difficulty, and it's worse no matter how I tweak it.
2) "Best fit" difficulty obtained via a mini-optimization process would be great, but it would slow down the "outer" optimization significantly, since after every single review an "inner" optimization would have to be performed to find the optimal value of D. And even if we manage to implement that in the optimizer, it's likely impossible (and very costly if possible) to do it in the scheduler. Even if we simplified the procedure greatly (I have an idea how), it would still be difficult, and most likely not doable in the scheduler, since it would involve more than the most recent repetition.
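To make the idea concrete, here is a purely illustrative sketch of what the "inner" optimization could look like; `predict_r` and the grid search are my assumptions, not anything that exists in the repo:

```python
import torch

def best_fit_difficulty(predict_r, outcomes: torch.Tensor) -> torch.Tensor:
    # "Best fit" D: pick the D in [1, 10] that best explains the review
    # outcomes seen so far. predict_r(d) returns predicted retrievability for
    # each past review; outcomes is a float tensor of 0s and 1s.
    candidates = torch.linspace(1, 10, 91)  # brute-force grid search over D
    losses = torch.stack([
        torch.nn.functional.binary_cross_entropy(
            predict_r(d).clamp(1e-4, 1 - 1e-4), outcomes)
        for d in candidates
    ])
    return candidates[losses.argmin()]
```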
@L-M-Sherlock, just to confirm, the scheduler only has access to the most recent repetition, not the card's entire history? Only the helper add-on can read the entire history? If the answer is "Yes", then my idea of "best fit" D cannot be implemented. If there is a way for the scheduler to "see" more than one repetition, then with a lot of simplification, it's possible to pull my idea off. I will test other candidates later.
Also, Sherlock, for testing purposes I want you to make a file where first the parameters are optimized with the grades fixed (1, 2, 3, 4), and then a second optimization runs where only the grades are optimized. I wonder if optimizing grades after all other parameters have been optimized will produce better results. Maybe. Somehow.
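Something like this hypothetical two-stage setup (the names, sizes, and learning rate are illustrative):

```python
import torch

weights = torch.nn.Parameter(torch.ones(13))                  # main parameters (placeholder init)
grades = torch.nn.Parameter(torch.tensor([1.0, 2.0, 3.0, 4.0]))

# Stage 1: grades stay fixed at (1, 2, 3, 4)
grades.requires_grad_(False)
stage1_optimizer = torch.optim.Adam([weights], lr=5e-4)
# ... run the usual training loop with stage1_optimizer ...

# Stage 2: everything else frozen, only grades trainable
weights.requires_grad_(False)
grades.requires_grad_(True)
stage2_optimizer = torch.optim.Adam([grades], lr=5e-4)
# ... run a second training loop with stage2_optimizer ...
```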
> just to confirm, the scheduler only has access to the most recent repetition, not the card's entire history? Only the helper add-on can read the entire history?
I'm quite sure that the scheduler only has access to the memory states and the info about the current repetition.
> Having optimizable grades in D but not in the initial S does nothing. Having them only in the initial S improves the results a little bit. Having them both in the initial S and in D produces about the same results as having them only in the initial S. On their own, optimizable grades don't do anything.
Fine. I've decided to remove it from the Beta. Here are some ideas that will be integrated into the Beta:
Here is a preliminary implementation: https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/candidate/fsrs4anki_optimizer_beta.ipynb
1) The power forgetting curve outperformed the hybrid curve for every single deck and every single user. What surprised me was the fact that they both reached very high levels of statistical significance.
2) PLS had no impact on performance.
I would like to test grade-derived R, but before that I want @L-M-Sherlock to give the R-matrix one last try. If it works, then grade-derived R is not necessary.
> I would like to test grade-derived R, but before that I want @L-M-Sherlock to give the R-matrix one last try. If it works, then grade-derived R is not necessary.
https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/candidate/R-Matrix.ipynb
I attempted to integrate the R-Matrix into the training process, but it still has some problems.
Edit: And it is very slow, because at each step FSRS needs to query the R-Matrix.
Change weight to this:
```python
weight = 0.9 * torch.pow(torch.tensor(count / 300), 0.5)
```
And remove w[13] and w[14]; I want to see how it works with fixed parameters. Also, add something like
```python
count = count.map(lambda x: min(x, 300)).to_numpy()
```
or
```python
count = min(count, 300)
```
or something like that. I tried, but I just keep getting frustrated because even the simplest changes like that are incredibly unintuitive in PyTorch.
> https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/candidate/R-Matrix.ipynb
This was the worst modification I have tested yet: not only was it very slow, it also produced the worst RMSE I have ever seen.
RMSE with R-Matrix.ipynb: 0.0536
RMSE with baseline: 0.0184
Note: I have not tested the latest update to R-matrix.ipynb.
Ok, I figured out how to fix the issue with count:
```python
count = count.clamp(0, 300)
```
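For illustration, with made-up counts, the clamp caps the R-Matrix weight at 0.9:

```python
import torch

# with count clamped to [0, 300], the weight saturates at 0.9
count = torch.tensor([10.0, 150.0, 300.0, 5000.0]).clamp(0, 300)
weight = 0.9 * torch.pow(count / 300, 0.5)
print(weight)  # tensor([0.1643, 0.6364, 0.9000, 0.9000])
```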
I used fixed w[13] and w[14] and tested it on one deck. I also tested the most recent version of R-Matrix.ipynb (with the fix for count, as above), with optimizable weights. As user1823 said, it's slow, but it is more accurate for me, at least based on a sample size of one, lol. It's not that much more accurate for user1823, though, and also not as bad as he said; I don't know how he got RMSE > 0.05. By the way, ideally RMSE should be calculated using the weighted R, since that's what will be used in all other calculations.
I suggest discussing the R-matrix in that issue, let's keep the current issue for other things, at least for now.
Sherlock, before I test grade-derived_R.ipynb, I want you to add one little change here:
```python
grade_derived_r = dict(zip([1, 2, 3, 4], func(np.array([1, 2, 3, 4]), *params)))
```
The issue with my fitting function is that it might output negative values. It's unlikely, but possible. Add something like max(func(1), 0).
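For example, a minimal sketch of that clamp, reusing `func` and `params` from the snippet above:

```python
import numpy as np

# clamp the fitted values so grade-derived R can never go negative
grades = np.array([1, 2, 3, 4])
values = np.maximum(func(grades, *params), 0)
grade_derived_r = dict(zip(grades.tolist(), values))
```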
> Sherlock, before I test grade-derived_R.ipynb, I want you to add one little change here:
> `grade_derived_r = dict(zip([1, 2, 3, 4], func(np.array([1, 2, 3, 4]), *params)))`
> The issue with my fitting function is that it might output negative values. It's unlikely, but possible. Add something like max(func(1), 0).
OK. Please check the latest version.
Also, this part is suspicious:
```python
grade_means = grade_r_matrix['r']['mean'].sort_values()
r_means = grade_r_matrix['y']['mean'].sort_values()
count = grade_r_matrix['y']['count'].sort_values()
```
Doesn't it break the correspondence between them? See example below:
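A toy pandas illustration with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"grade_mean": [3.0, 1.0, 2.0], "r_mean": [0.5, 0.9, 0.7]})
# sorting each column independently reorders the rows differently:
sorted_grades = df["grade_mean"].sort_values()  # original row order: 1, 2, 0
sorted_r = df["r_mean"].sort_values()           # original row order: 0, 2, 1
# pairing them positionally now matches grade 1.0 with r 0.5,
# even though grade 1.0's actual r_mean was 0.9
```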
Here are the results of testing the candidates, excluding grade-derived R, as I still believe it has a bug.
1) Adaptive grades make no difference. Replacing (rating-3) with (rating-2.5) makes no difference either, so I suggest we use it instead simply because it's more intuitive.
Currently, when we use (rating - 3) in the formulas for D:
- Again = D increases
- Hard = D increases
- Good = D doesn't change
- Easy = D decreases

If we use (rating - 2.5):
- Again = D increases
- Hard = D increases
- Good = D decreases
- Easy = D decreases

As you can see, it's more intuitive. And since it doesn't affect performance, we lose nothing by implementing it.
2) S0 curve_fit showed a very consistent improvement. The difference wasn't so big for (subjectively) hard decks, but it was huge for (subjectively) easy decks. Sherlock was the only one who didn't benefit from it; for some reason it made RMSE very slightly worse on his collection. It worked really well on all my decks plus user8123's, nb9618's, and JeffersonCorreia's collections.
3) Power curve outperformed the hybrid curve for all decks and all users.
4) PLS did not affect performance.
5) Power curve for D made it worse.
So far we only have 2 good ideas: S0 curve_fit and the power forgetting curve. That's minus weighted loss aka "best epoch" and minus 5 splits 5 epochs, because those two changes are related to the optimizer, not the algorithm itself. Assuming that the benefits (decrease relative to baseline) are multiplicative, S0 curve_fit + power forgetting curve combined should result in an RMSE roughly 0.81 * 0.86 ≈ 0.70 of the baseline. So if the baseline had RMSE = 0.1, the new version would have RMSE = 0.07, for example. This is not enough to warrant releasing a new version; we need more.
> Doesn't it break the correspondence between them? See example below:
Yep, it is incorrect. I will fix it today.
I just want to leave a comment here to thank everyone for pouring so much time and attention into this project to figure stuff like this out!
I wanted to help too, but I feel like I couldn't understand a single line in this post lol.
Interestingly, grade-derived R actually performed better when it was bugged. In any case, it's just not worth it: even the bugged version that performed better wasn't that much better, and it introduces 8 new parameters. It's very inefficient in the improvement-per-parameter sense. We could tweak it a bit more (for example, grouping reviews into categories could be done better), but I doubt it would be worth it. So yeah, only 2 good ideas so far.
> Assuming that the benefits (decrease relative to baseline) are multiplicative, S0 curve_fit + power forgetting curve combined should result in an RMSE roughly 0.81 * 0.86 ≈ 0.70 of the baseline. This is not enough to warrant releasing a new version; we need more.
If you ask me, a 30% reduction in RMSE is a lot. I think that we should not further delay the release of this version unless the following holds:
- You think that we will very soon discover some way to further improve the algorithm (such as improving the formula for D).

Further, if we maintain backward compatibility with FSRS v3, the upgrade would not be a major issue for the users, because it would allow them to upgrade when they want rather than forcing them to upgrade the moment we release the update.
Perhaps. Sherlock will be working on my idea of best-fit D, which we will then try to approximate, and I'm still trying to think of ways to improve the algorithm; I want to decrease RMSE by at least 2x in the new version.
I have found a potential bug in the Beta version: when the user doesn't have enough reviews for all first ratings, fit_curve will raise an error.
You mean when the counts for all ratings are below 100? Well, that's just something that has to be accepted. Users need at least a thousand or two reviews, otherwise the results won't be accurate. If anything, it's better than letting the user run the optimizer with very few reviews and get inaccurate results.
> I have found a potential bug in the Beta version: when the user doesn't have enough reviews for all first ratings, fit_curve will raise an error.
In this case, just raise a value error like the following:
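A hypothetical sketch of such a check; the variable names and the threshold of 100 reviews per first rating are illustrative assumptions:

```python
# first_rating_counts: e.g. a dict of {rating: number of cards with that first rating}
if len(first_rating_counts) < 4 or min(first_rating_counts.values()) < 100:
    raise ValueError(
        "Not enough reviews to fit the initial stability for every first rating. "
        "Please collect more reviews and try again."
    )
```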
The reason is the same as that pointed out by Expertium. We should ensure that the reviews are sufficient to produce accurate results. Otherwise, there is no point in running the optimizer.
I suppose we could add Easy Bonus to the Beta version, though I'm still trying to think of a way to turn it into a function rather than a constant.
```python
def stability_after_success(self, state: Tensor, new_d: Tensor, r: Tensor, X: Tensor) -> Tensor:
    easy_bonus_condition = X[:,1] == 4
    easy_bonus = torch.where(easy_bonus_condition, self.w[13], 1)
    new_s = state[:,0] * (1 + torch.exp(self.w[6]) *
                          (11 - new_d) *
                          torch.pow(state[:,0], self.w[7]) *
                          (torch.exp((1 - r) * self.w[8]) - 1) * easy_bonus)
    return new_s
```
```python
w[13] = w[13].clamp(1, 10)
```
Set the initial value to 2.
I tried a very similar idea, but with Hard.
```python
def stability_after_success(self, state: Tensor, new_d: Tensor, r: Tensor, X: Tensor) -> Tensor:
    hard_punishment_condition = X[:,1] == 2
    hard_punishment = torch.where(hard_punishment_condition, self.w[13], 1)
    new_s = state[:,0] * (1 + torch.exp(self.w[6]) *
                          (11 - new_d) *
                          torch.pow(state[:,0], self.w[7]) *
                          (torch.exp((1 - r) * self.w[8]) - 1) * hard_punishment)
    return new_s
```
```python
w[13] = w[13].clamp(0.1, 1)
```
Set the initial value to 0.5.
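The two snippets above are both special cases of a per-grade multiplier on the stability increase; a hypothetical combined version (assigning w[13] to the Easy bonus and w[14] to the Hard punishment is my assumption, not the repo's parameter layout):

```python
import torch

def sinc_modifier(grade: torch.Tensor, w_easy: torch.Tensor, w_hard: torch.Tensor) -> torch.Tensor:
    # w_easy ~ w[13] clamped to (1, 10); w_hard ~ w[14] clamped to (0.1, 1)
    mod = torch.ones_like(grade, dtype=torch.float32)
    mod = torch.where(grade == 4, w_easy, mod)  # Easy bonus
    mod = torch.where(grade == 2, w_hard, mod)  # Hard punishment
    return mod  # multiply the (exp((1 - r) * w[8]) - 1) term by this, as above
```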
Sorry for the awkward way of presenting the results; I couldn't think of a way to fit both of them neatly into one table. While I'm not against implementing both Easy Bonus and Hard Punishment, I hope that in the future we will find more general and flexible solutions. Also, while the constants do improve RMSE for their respective grades, the overall RMSE isn't improved much; together they improve the overall RMSE by only 4-5%.
S0 curve_fit + power forgetting curve + Easy Bonus + Hard Punishment + 5 epochs, 5 splits should result in RMSE decreasing to around 0.63-0.64 of the baseline. If we can come up with one more big change, we could maybe drop it to 0.5 (a 2x decrease).
Background: #215, #239, #248, #262, #282

Baseline: https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/candidate/baseline-3.26.1.ipynb

Candidate: https://github.com/open-spaced-repetition/fsrs4anki/tree/Expt/new-baseline/candidate

Note: I plan to re-evaluate these candidate ideas one by one before we integrate them into the Beta version.