While I'm not against implementing both Easy Bonus and Hard Punishment, I hope that in the future we will find more general and flexible solutions. Also, while the constants do improve RMSE for their respective grades, they don't help the overall RMSE much; together they improve it by only 4-5%.
I have integrated them into one notebook:
But my implementation is different from yours:
new_s = torch.where(condition, self.stability_after_success(state, new_d, r), self.stability_after_failure(state, new_d, r))
# apply the hard punishment / easy bonus outside the stability formula,
# i.e. scale the already-computed new stability by the factor
new_s = torch.where(X[:,1] == 2, new_s * self.w[13], new_s)  # 2 = Hard
new_s = torch.where(X[:,1] == 4, new_s * self.w[14], new_s)  # 4 = Easy
I apply them outside the formula for S.
The results of baseline:
The results of easy & hard factor:
For Hard, RMSE dropped from 0.0959 to 0.0334. For Easy, RMSE dropped from 0.0379 to 0.0249.
> But my implementation is different from yours: I apply them outside the formula for S.
I am not very sure, but I think that Expertium's implementation is better, especially for Hard.
Expertium's implementation ensures that the new stability is always greater than the previous stability.
In Sherlock's implementation, new stability can become smaller than the previous stability if 1/w[13] > SInc. But, this might be desirable because the card was difficult.
But, to know for sure, we will have to test which formula gives better results.
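To make the difference concrete, here is a rough sketch with placeholder numbers (`s`, `sinc_term` and `w13` are made up for illustration, not taken from either notebook):

```python
import torch

s = torch.tensor(10.0)          # previous stability (placeholder)
sinc_term = torch.tensor(0.5)   # exp(w6)*(11-D)*S^w7*(exp((1-r)*w8)-1), placeholder
w13 = torch.tensor(0.3)         # hard factor < 1 (placeholder)

# Sherlock's placement: the whole new stability is multiplied by w13,
# so the result can fall below the previous stability
outside = s * (1 + sinc_term) * w13    # 10 * 1.5 * 0.3 = 4.5 < 10

# Expertium's placement: only the increase term is multiplied by w13,
# so the result is always at least the previous stability
inside = s * (1 + sinc_term * w13)     # 10 * 1.15 = 11.5 > 10

print(outside.item(), inside.item())
```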
> I am not very sure, but I think that Expertium's implementation is better, especially for Hard. Expertium's implementation ensures that the new stability is always greater than the previous stability.
Anki's default setting also allows users to set a factor lower than 1.
I have integrated the easy & hard factor into Beta. The improvement for the intervals after Easy and Hard is significant:
But some ridiculous things happened:
The RMSE for all last ratings decreased. But the total RMSE increased:
It would mean that the RMSE is too sensitive.
I agree with user1823, we shouldn't allow new_s to be smaller than the previous S. It doesn't make sense for memory to become less stable if it's not a lapse.
> I agree with user1823, we shouldn't allow new_s to be smaller than the previous S. It doesn't make sense for memory to become less stable if it's not a lapse.
OK. You can check this version:
def stability_after_success(self, state: Tensor, new_d: Tensor, r: Tensor, rating: Tensor) -> Tensor:
    # the hard punishment / easy bonus scale only the increase term, not the whole stability
    hard_bonus = torch.where(rating == 2, self.w[13], 1)
    easy_bonus = torch.where(rating == 4, self.w[14], 1)
    new_s = state[:,0] * (1 + torch.exp(self.w[6]) *
                          (11 - new_d) *
                          torch.pow(state[:,0], self.w[7]) *
                          (torch.exp((1 - r) * self.w[8]) - 1) * hard_bonus * easy_bonus)
    return new_s
w[13] = w[13].clamp(0.01, 1)  # hard punishment: at most 1
w[14] = w[14].clamp(1, 2.5)   # easy bonus: at least 1
In my collection, hard_bonus is 0.01 (the lower limit).
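For reference, with the bonus applied to the increase term as in the code above, a hard_bonus of 0.01 means stability after Hard barely grows at all; a rough example with made-up numbers:

```python
s, sinc_term, hard_bonus = 10.0, 0.5, 0.01   # placeholder values
new_s = s * (1 + sinc_term * hard_bonus)     # 10 * 1.005 = 10.05, almost unchanged
```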
I was just typing this when you commented.
I tested easy-hard-factor.ipynb, then modified it like this:
def stability_after_success(self, state: Tensor, new_d: Tensor, r: Tensor, X: Tensor) -> Tensor:
    # plain stability increase term
    Sinc = torch.exp(self.w[6]) * (11 - new_d) * torch.pow(state[:,0], self.w[7]) * (torch.exp((1 - r) * self.w[8]) - 1)
    # scale the increase term by the hard/easy factor depending on the rating
    new_Sinc = torch.where(X[:,1] == 2, Sinc * self.w[13], torch.where(X[:,1] == 4, Sinc * self.w[14], Sinc))
    new_s = state[:,0] * (1 + new_Sinc)
    return new_s
Both versions performed about the same in terms of overall RMSE and RMSE(Easy). Sherlock's version was better for "Hard", on average, but this effect wasn't consistent enough (sometimes my implementation would perform a little better), so it's not statistically significant.
Maybe I often pressed Hard when I should have pressed Again, which made the stability after Hard decrease.
> I am not very sure, but I think that Expertium's implementation is better, especially for Hard.
> I agree with user1823, we shouldn't allow new_s to be smaller than the previous S. It doesn't make sense for memory to become less stable if it's not a lapse.
I have updated the code just now. The current Beta employs the bonus version.
Sherlock, are you sure you want to use PLS offset? In my testing it didn't affect RMSE at all.
Also, w[0] and w[1] are unused in fsrs4anki_optimizer_beta.ipynb, now that we have a different way of estimating initial S. You should remove them.
> Sherlock, are you sure you want to use PLS offset? In my testing it didn't affect RMSE at all.
I add the PLS offset because x^a > x when 0 < x < 1 and 0 < a < 1 (for example, 0.5^0.5 ≈ 0.71 > 0.5).
> Also, w[0] and w[1] are unused in fsrs4anki_optimizer_beta.ipynb, now that we have a different way of estimating initial S. You should remove them.
I will remove them when I release the stable version.
I noticed that if there is no value for "Again" (not enough reviews) and an extrapolated value is used, it can end up being greater than the value for "Hard".
I suggest that in that case the value for "Again" should be replaced with the value for "Hard". So the algorithm is like this:
1) If there is enough data to calculate the value for "Again" exactly, calculate it exactly
2) If there isn't enough data, replace the missing value with an extrapolation using S0_rating_curve
3) If it breaks monotonicity (Again > Hard), then replace the value for "Again" with the exact value for "Hard"
4) If both exact values for "Again" and "Hard" are not available, then just use the extrapolation from S0_rating_curve
There are probably other ways in which it could get weird, so it's possible that we will need to cover more edge cases.
S0 curve_fit + power forgetting curve + Easy Bonus + Hard Punishment + 5 epochs, 5 splits should result in RMSE decreasing to around 0.63-0.64 of the baseline.
I was pretty close: I just tested fsrs4anki_optimizer_beta.ipynb, and the RMSE is reduced to about 68% of 3.26.1.ipynb's.
I plan to integrate this into Beta:
It could filter out reviews that were delayed by more than 4 times their original intervals.
Its overdue_rate is 542...
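For illustration, a filter along those lines might look like the sketch below; the dataframe and column names (`delta_t`, `scheduled_interval`) are assumptions, not the notebook's actual ones:

```python
import pandas as pd

# hypothetical revlog dataframe
revlogs = pd.DataFrame({
    "delta_t": [3, 30, 1626],           # days actually elapsed before the review
    "scheduled_interval": [3, 6, 3],    # days the scheduler originally assigned
})

revlogs["overdue_rate"] = revlogs["delta_t"] / revlogs["scheduled_interval"]
# drop reviews delayed by more than 4 times their original interval
revlogs = revlogs[revlogs["overdue_rate"] <= 4]
```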
But, what if AnkiDroid gave an interval of 1 day to a card and then the helper add-on gave it an interval of 6 days? Which interval would be considered — 1 day or 6 days?
If it considers 1 day as the interval, a large number of cards in my collection would be filtered out.
> But, what if AnkiDroid gave an interval of 1 day to a card and then the helper gave it an interval of 6 days? Which interval would be considered, 1 day or 6 days?
Oops. It is really a problem. In this case, the revlog will record the interval scheduled by the default algorithm instead of the rescheduled interval.
So apparently S can decrease after a review in some cases. But as Woz said, this could introduce a feedback loop, so I'm not sure if we should allow Sinc to be <1.
Unrelated, but Sherlock, please change the layout of the charts. It's very hard to tell which chart corresponds to which grade when all the text is above them. I can tell which one corresponds to "Easy" only because it looks awful for all my decks even with the bonus. But other than that, I might as well roll a die to determine which of the other three is which.
The plot starts in the upper left with 1 (again), with 2 (hard) to its right. 3 (good) is in the lower left corner. I don't think the subplots are hard to tell apart. I will add titles to these subplots instead of reframing the layout.
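Something along these lines, assuming the calibration plots are drawn with matplotlib (a sketch, not the notebook's actual plotting code):

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2)
# title each subplot with its grade so the panels are easy to tell apart
for ax, title in zip(axes.flat, ["1 (again)", "2 (hard)", "3 (good)", "4 (easy)"]):
    ax.set_title(title)
```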
> 3 (good) is in the lower right corner
Did you mean lower left?
And I insist that we change the formula for D to make it more intuitive. It doesn't affect performance (RMSE changes by a fraction of a percent) and makes the meaning of grades more intuitive.
I have implemented a fix for this specific edge case. @L-M-Sherlock, change the code for calculating S0:
for rating in (1, 2, 3, 4):
    again_extrap = max(min(S0_rating_curve(1, *params), 3650), 0.1)
    # if there isn't enough data to calculate the value for "Again" exactly
    if 1 not in rating_stability:
        # then check if there exists an exact value for "Hard"
        if 2 in rating_stability:
            # if it exists, then check whether the extrapolation breaks monotonicity
            # Again > Hard is possible, but we should allow it only for exact values, otherwise we should assume monotonicity
            if again_extrap > rating_stability[2]:
                # if it does, then replace the missing "Again" value with the exact "Hard" value
                rating_stability[1] = rating_stability[2]
            else:
                # if it doesn't break monotonicity, then use the extrapolated value
                rating_stability[1] = again_extrap
        # if an exact value for "Hard" doesn't exist, then just use the extrapolation, there's nothing else we can do
        else:
            rating_stability[1] = again_extrap
    elif rating not in rating_stability:
        rating_stability[rating] = max(min(S0_rating_curve(rating, *params), 3650), 0.1)
We may need to add more fixes in the future; this method of estimating S0 leaves a lot of room for weird edge cases.
> And I insist that we change the formula for D to make it more intuitive. It doesn't affect performance (RMSE changes by a fraction of a percent) and makes the meaning of grades more intuitive.
I have replied to this here: https://github.com/open-spaced-repetition/fsrs4anki/issues/342#issuecomment-1625731626
I think the 4.0.0 Beta version has already taken shape. Any matrix-related ideas will be deferred to 5.0.0. My consideration is:
Before we launch officially, there are a few matters to address:
If there's anything I missed, feel free to contribute.
@L-M-Sherlock, do you think that the minimum value of w[5] should again be decreased to a lower value (such as 0.0005)?
I have observed that using 0.05 not only makes RMSE worse, but also increases the workload (the number of cards to do per day). See the number of due cards with the two parameters of 31st May in this comment: https://github.com/open-spaced-repetition/fsrs4anki/issues/187#issuecomment-1605446604
The motive behind increasing the minimum value of w[5] was to decrease the workload (by preventing ease hell). But, it is increasing the workload instead.
> I have observed that using 0.05 not only makes RMSE worse, but also increases the workload (the number of cards to do per day). See the number of due cards with the two parameters of 31st May in this comment: #187 (comment)
> The motive behind increasing the minimum value of w[5] was to decrease the workload (by preventing ease hell). But, it is increasing the workload instead.
One rational explanation for your case is that you don't have ease hell, but w[5] assumes you do. So w[5] will decrease the difficulty in the long term. Then w[4] would increase to counteract or even override it, which increases the workload.
So, how do you plan to solve this issue?
> So, how do you plan to solve this issue?
Loosening the lower limit of w[5] is OK. But we need to convince people whose w[5] is very low that Ease Hell doesn't affect their collections.
But, make sure that the lower limit is not zero. It should be a small but non-zero value because it doesn't make sense to have no mean reversion at all.
In my case, I have found 0.0003 to be good enough. It only slightly increases the RMSE (as compared to setting the minimum value to 0).
> But, make sure that the lower limit is not zero. It should be a small but non-zero value because it doesn't make sense to have no mean reversion at all.
I don't think so. The difference between 0.0003 and 0 is pretty small. Suppose the initial difficulty is 5 and the current difficulty is 10. If you always press Good, here is the subsequent difficulty:
5 * 0.0003 + 10 * (1 - 0.0003) = 9.9985
5 * 0.0003 + 9.9985 * (1 - 0.0003) = 9.9970
5 * 0.0003 + 9.9970 * (1 - 0.0003) = 9.9955
...
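A minimal sketch of that calculation, just to show how slowly the difficulty drifts back with w[5] = 0.0003 (the variable names are placeholders, not the optimizer's):

```python
w5 = 0.0003      # mean reversion weight
init_d = 5.0     # initial difficulty (the reversion target)
d = 10.0         # current difficulty

# pressing Good repeatedly: difficulty drifts toward init_d very slowly
for review in range(1, 4):
    d = w5 * init_d + (1 - w5) * d
    print(review, round(d, 4))   # 9.9985, 9.997, 9.9955, ...
```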
Ok, then I will do some testing tomorrow evening because I am slightly busy today and tomorrow. Till then, you can work on updating the code of the scheduler and the helper add-on for v4.
I have made a pre-release for 4.0.0. The beta testing is here:
> I have replied to this here: #342 (comment)
But that problem doesn't show up in testing.
> I think the 4.0.0 Beta version has already taken shape. Any matrix-related ideas will be deferred to 5.0.0.
Well, that's something I disagree with. I understand that it would be difficult to implement matrices in the scheduler/helper add-on, but it could, if done right, improve the accuracy a lot. And searching for a perfect formula may lead nowhere. If even Woz uses matrices to combine theoretical predictions with measurements from raw data, after 30 years of research, then it's safe to assume that relying purely on theoretical formulas isn't enough.
If you're going to release v4, then there are 2 more things that need to be done:
1) Re-writing the wiki. I suggest opening a new issue for that, since there is a lot that needs to be changed. @user1823 and I will help you. The page about formulas obviously has to be changed, and the page comparing FSRS to Anki should be about RMSE and the universal metric, which leads me to my second point.
2) @L-M-Sherlock, please make a special file for benchmarking purposes only. It should contain the FSRS v4, v3, LSTM, SM-2 and Memrise algorithms. The first 3 must be optimized, all within the same file, and then the RMSE and universal metrics (against FSRS v4) should be displayed. I'm planning to write 2 Reddit posts about FSRS, and I need benchmarks for that. You will need them too, to re-write the page on the wiki that currently just compares interval lengths of FSRS and Anki, which is useless.
> Well, that's something I disagree with. I understand that it would be difficult to implement matrices in the scheduler/helper add-on, but it could, if done right, improve the accuracy a lot. And searching for a perfect formula may lead nowhere. If even Woz uses matrices to combine theoretical predictions with measurements from raw data, after 30 years of research, then it's safe to assume that relying purely on theoretical formulas isn't enough.
I agree that matrices could potentially enhance accuracy, but the return on investment so far doesn't seem high. We've been working hard on the v4 version for over two months and it already shows a 20-30% decrease in RMSE. This feels like a significant step forward worthy of a standalone release. While Woz uses matrices in SuperMemo's algorithm, the specifics aren't open-source so it's hard to measure their impact. I plan to continue experimenting with matrices, but I'm approaching it cautiously. Thanks again for your insights.
> Re-writing the wiki. I suggest opening a new issue for that, since there is a lot that needs to be changed. @user1823 and I will help you. The page about formulas obviously has to be changed, and the page comparing FSRS to Anki should be about RMSE and the universal metric, which leads me to my second point.
Yeah. I'm writing documents. I will share my drafts later.
> Please make a special file for benchmarking purposes only. It should contain the FSRS v4, v3, LSTM, SM-2 and Memrise algorithms. The first 3 must be optimized, all within the same file, and then the RMSE and universal metrics (against FSRS v4) should be displayed. I'm planning to write 2 Reddit posts about FSRS, and I need benchmarks for that. You will need them too, to re-write the page on the wiki that currently just compares interval lengths of FSRS and Anki, which is useless.
What about the datasets? Do you have any preference? I have only compared algorithms on my collection, yours, and @user1823's.
> I agree that matrices could potentially enhance accuracy, but the return on investment so far doesn't seem high.
I assume you won't try my idea with new grouping, at least not in the near future. If that idea didn't work either, I would be more willing to give up.
> What about the datasets? Do you have any preference?
For my Reddit posts I will use 2 collections, mine and yours. For the wiki, you should probably use your own collection. Also, I highly recommend running the new optimizer on all the collections that people have submitted via the Google form to find good average parameters. I'll explain the details in the other issue.
Background
Related issues: #215, #239, #248, #262, #282
Baseline
https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/candidate/baseline-3.26.1.ipynb
Candidate
Link: https://github.com/open-spaced-repetition/fsrs4anki/tree/Expt/new-baseline/candidate
Note
I plan to re-evaluate these candidate ideas one by one before we integrate them into the Beta version.