open-spaced-repetition / fsrs4anki

A modern Anki custom scheduling algorithm based on the Free Spaced Repetition Scheduler.
https://github.com/open-spaced-repetition/fsrs4anki/wiki
MIT License

[Feature Request] FSRS4Anki Optimizer 4.0 Beta #342

Closed: L-M-Sherlock closed this issue 1 year ago

L-M-Sherlock commented 1 year ago

Background

Baseline

https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/candidate/baseline-3.26.1.ipynb

Candidate

| Idea | Effect | Comment |
| --- | --- | --- |
| power forgetting curve | Positive | more accurate for heterogeneous reviews (sketched below) |
| S0 curve fit | Positive | more accurate initial stability |
| post-lapse stability offset | Null | could prevent stability from increasing after a lapse |
| power difficulty | Negative | |
| adaptive grade for difficulty | Null | |
| grade-derived R | | |

Link: https://github.com/open-spaced-repetition/fsrs4anki/tree/Expt/new-baseline/candidate
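
For reference, here is a minimal sketch contrasting an exponential forgetting curve with a power one of the form R(t) = (1 + t / (9S))^(-1); the exact form and constants are illustrative, not necessarily the candidate's:

    def exponential_r(t: float, s: float) -> float:
        # exponential forgetting curve, calibrated so that R = 0.9 when t = s
        return 0.9 ** (t / s)

    def power_r(t: float, s: float) -> float:
        # power forgetting curve with the same calibration: R = 0.9 when t = s;
        # it decays faster early and slower late, which can fit a heterogeneous
        # mix of reviews better than a single exponential
        return (1 + t / (9 * s)) ** -1

    for t in (1, 10, 30, 100):
        print(t, round(exponential_r(t, 10), 3), round(power_r(t, 10), 3))

With S = 10, both curves give R = 0.9 at t = 10, but at t = 100 the exponential predicts R ≈ 0.35 while the power curve predicts R ≈ 0.47.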

Note

I plan to re-evaluate these candidate ideas one by one before we integrate them into the Beta version.

L-M-Sherlock commented 1 year ago

While I'm not against implementing both the Easy Bonus and the Hard Punishment, I hope that in the future we will find more general and flexible solutions. Also, while the constants do improve RMSE for their respective grades, the overall RMSE isn't improved much: together, they improve it by only 4-5%.

I integrated them into one notebook:

https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/candidate/easy-hard-factor.ipynb

But my implementation is different from yours:

            # apply the hard/easy factors after computing the new stability,
            # i.e., outside the formula for S
            new_s = torch.where(condition, self.stability_after_success(state, new_d, r), self.stability_after_failure(state, new_d, r))
            new_s = torch.where(X[:,1] == 2, new_s * self.w[13], new_s)  # hard factor
            new_s = torch.where(X[:,1] == 4, new_s * self.w[14], new_s)  # easy factor

I apply them outside the formula for S.

The results of baseline:

[image]

The results of easy & hard factor:

[image]

For hard, RMSE dropped from 0.0959 to 0.0334. For easy, RMSE dropped from 0.0379 to 0.0249.

user1823 commented 1 year ago

But my implementation is different from yours. I apply them outside the formula for S.

I am not very sure, but I think that Expertium's implementation is better, especially for Hard.

Expertium's implementation ensures that the new stability is always greater than the previous stability.

In Sherlock's implementation, the new stability can become smaller than the previous stability if 1/w[13] > SInc. But this might be desirable, because the card was difficult.

But, to know for sure, we will have to test which formula gives better results.
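
For illustration, a small numeric check of the two placements (the numbers are made up; sinc here is the additive increase term, so the multiplicative increase factor is 1 + sinc):

    import torch

    s = torch.tensor([10.0])    # previous stability
    sinc = torch.tensor([0.5])  # increase term
    w13 = 0.5                   # a strong hard punishment, < 1

    # Sherlock's placement: the factor scales the whole new stability
    outside = s * (1 + sinc) * w13  # 7.5, below the previous stability
    # Expertium's placement: the factor scales only the increase
    inside = s * (1 + sinc * w13)   # 12.5, never below the previous stability
    print(outside.item(), inside.item())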

L-M-Sherlock commented 1 year ago

I am not very sure, but I think that Expertium's implementation is better, especially for Hard.

Expertium's implementation ensures that the new stability is always greater than the previous stability.

[image]

Anki's default setting also allows users to set a factor lower than 1.

L-M-Sherlock commented 1 year ago

I integrated the easy & hard factors into Beta. The improvement for intervals after Easy and Hard is significant:

[image]

But some ridiculous things happened:

[image]

The RMSE for each last rating decreased, but the total RMSE increased:

[image]

This suggests that RMSE is too sensitive.

Expertium commented 1 year ago

I agree with user1823: we shouldn't allow new_s to be smaller than the previous S. It doesn't make sense for memory to become less stable if it's not a lapse.
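
For illustration, one direct way to enforce that constraint (a sketch against the earlier snippet, where condition is the success mask; this is not the notebook's actual code):

    # on successful reviews, never let stability drop below its previous value
    new_s = torch.where(condition, torch.maximum(new_s, state[:, 0]), new_s)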

L-M-Sherlock commented 1 year ago

I agree with user1823: we shouldn't allow new_s to be smaller than the previous S. It doesn't make sense for memory to become less stable if it's not a lapse.

OK. You can check this version:

https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/candidate/easy-hard-bonus.ipynb

    def stability_after_success(self, state: Tensor, new_d: Tensor, r: Tensor, rating: Tensor) -> Tensor:
        # the bonuses scale only the increase term, so new_s can never drop below state[:,0]
        hard_bonus = torch.where(rating == 2, self.w[13], 1)
        easy_bonus = torch.where(rating == 4, self.w[14], 1)
        new_s = state[:,0] * (1 + torch.exp(self.w[6]) *
                        (11 - new_d) *
                        torch.pow(state[:,0], self.w[7]) *
                        (torch.exp((1 - r) * self.w[8]) - 1) * hard_bonus * easy_bonus)
        return new_s

The bonus weights themselves are clamped elsewhere in the optimizer:

            w[13] = w[13].clamp(0.01, 1)
            w[14] = w[14].clamp(1, 2.5)

In my collection, hard_bonus is 0.01 (the lower limit).

Expertium commented 1 year ago

I was just typing this when you commented.

I tested easy-hard-factor.ipynb, then modified it like this:

    def stability_after_success(self, state: Tensor, new_d: Tensor, r: Tensor, X: Tensor) -> Tensor:
        # compute the increase term first, then scale only that term by the hard/easy factor
        Sinc = torch.exp(self.w[6]) * (11 - new_d) * torch.pow(state[:,0], self.w[7]) * (torch.exp((1 - r) * self.w[8]) - 1)
        new_Sinc = torch.where(X[:,1] == 2, Sinc * self.w[13], torch.where(X[:,1] == 4, Sinc * self.w[14], Sinc))
        new_s = state[:,0] * (1 + new_Sinc)
        return new_s

Both versions performed about the same in terms of overall RMSE and RMSE (Easy). Sherlock's version was better for "Hard" on average, but this effect wasn't consistent (sometimes my implementation would perform a little better), so it's not statistically significant.

[image]

L-M-Sherlock commented 1 year ago

Maybe I often pressed Hard when I should have pressed Again, which made the stability after Hard decrease.

L-M-Sherlock commented 1 year ago

I am not very sure, but I think that Expertium's implementation is better, especially for Hard.

I agree with user1823: we shouldn't allow new_s to be smaller than the previous S. It doesn't make sense for memory to become less stable if it's not a lapse.

I have updated the code just now. The current Beta employs the bonus version.

https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/candidate/fsrs4anki_optimizer_beta.ipynb

Expertium commented 1 year ago

Sherlock, are you sure you want to use the PLS offset? In my testing, it didn't affect RMSE at all.

[image]

Expertium commented 1 year ago

Also, w[0] and w[1] are unused in fsrs4anki_optimizer_beta.ipynb, now that we have a different way of estimating initial S. You should remove them.

L-M-Sherlock commented 1 year ago

Sherlock, are you sure you want to use the PLS offset? In my testing, it didn't affect RMSE at all.

I added the PLS offset because x^a > x when 0 < x < 1 and 0 < a < 1 (for example, 0.5^0.5 ≈ 0.71 > 0.5), so without an offset, post-lapse stability could exceed the previous stability.

[image]

Also, w[0] and w[1] are unused in fsrs4anki_optimizer_beta.ipynb, now that we have a different way of estimating initial S. You should remove them.

I will remove them when I release the stable version.

Expertium commented 1 year ago

[image]

I noticed that if there is no value for "Again" (not enough reviews) and an extrapolated value is used, it can end up being greater than the value for "Hard". I suggest that in that case the value for "Again" should be replaced with the value for "Hard". So the algorithm is like this:

1. If there is enough data to calculate the value for "Again" exactly, calculate it exactly.
2. If there isn't enough data, replace the missing value with an extrapolation using S0_rating_curve.
3. If that breaks monotonicity (Again > Hard), replace the value for "Again" with the exact value for "Hard".
4. If both exact values for "Again" and "Hard" are unavailable, just use the extrapolation from S0_rating_curve.

There are probably other ways in which it could get weird, so it's possible that we will need to cover more edge cases.

Expertium commented 1 year ago

S0 curve_fit + power forgetting curve + Easy Bonus + Hard Punishment + 5 epochs, 5 splits should result in RMSE decreasing to around 0.63-0.64 of the baseline.

I was pretty close; RMSE is reduced to about 68% compared to 3.26.1.ipynb. I just tested fsrs4anki_optimizer_beta.ipynb.

L-M-Sherlock commented 1 year ago

I plan to integrate this into Beta:

https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/candidate/filter-out-overdue-reviews.ipynb

It filters out reviews that were delayed beyond 4 times their original intervals.
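
As a sketch of that rule (assuming a revlog dataframe with hypothetical columns delta_t, the actual elapsed days, and ivl, the scheduled interval; the notebook's column names may differ):

    import pandas as pd

    def filter_overdue(revlog: pd.DataFrame, max_ratio: float = 4.0) -> pd.DataFrame:
        # drop reviews whose actual delay exceeded max_ratio times the scheduled interval
        return revlog[revlog["delta_t"] <= max_ratio * revlog["ivl"]]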

L-M-Sherlock commented 1 year ago
[image]

Its overdue_rate is 542...

user1823 commented 1 year ago

But, what if AnkiDroid gave an interval of 1 day to a card and then the helper add-on gave it an interval of 6 days? Which interval would be considered — 1 day or 6 days?

If it considers 1 day as the interval, a large number of cards in my collection would be filtered out.

L-M-Sherlock commented 1 year ago

But, what if AnkiDroid gave an interval of 1 day to a card and then the helper add-on gave it an interval of 6 days? Which interval would be considered — 1 day or 6 days?

Oops. It is really a problem. In this case, the revlog will record the interval scheduled by the default algorithm instead of the rescheduled interval.

Expertium commented 1 year ago

[image]

So apparently S can decrease after a review in some cases. But as Woz said, this could introduce a feedback loop, so I'm not sure if we should allow Sinc to be < 1.

Unrelated, but Sherlock, please change the layout of the charts. It's very hard to tell which chart corresponds to which grade when all the text is above them. I can tell which one corresponds to "Easy" only because it looks awful for all my decks even with the bonus, but other than that, I might as well roll a die to determine which is which.

[image]

L-M-Sherlock commented 1 year ago

The plots start from the upper left with 1 (again), then to the right with 2 (hard). 3 (good) is in the lower right corner. I don't think the subplots are hard to tell apart. I will add titles to these subplots instead of reframing the layout.
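
A minimal sketch of adding per-grade titles (assuming a plain 2×2 matplotlib grid; grade_titles is hypothetical and must follow the grid's row-major order):

    import matplotlib.pyplot as plt

    grade_titles = ("1 (again)", "2 (hard)", "3 (good)", "4 (easy)")
    fig, axes = plt.subplots(2, 2, figsize=(8, 6))
    # axes.flat yields the subplots row by row: upper left, upper right, lower left, lower right
    for ax, title in zip(axes.flat, grade_titles):
        ax.set_title(title)
    fig.tight_layout()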

Expertium commented 1 year ago

3 (good) is in the lower right corner

Did you mean lower left?

Expertium commented 1 year ago

And I insist that we change the formula for D to make the meaning of grades more intuitive. It barely affects performance: RMSE changes by a fraction of a percent.

Expertium commented 1 year ago

I have implemented a fix for this specific edge case. @L-M-Sherlock, change the code for calculating S0 to:

again_extrap = max(min(S0_rating_curve(1, *params), 3650), 0.1)
for rating in (1, 2, 3, 4):
    # if there isn't enough data to calculate the value for "Again" exactly
    if 1 not in rating_stability:
        # then check whether there exists an exact value for "Hard"
        if 2 in rating_stability:
            # if it exists, check whether the extrapolation breaks monotonicity;
            # Again > Hard is possible, but we should allow it only for exact values,
            # otherwise we should assume monotonicity
            if again_extrap > rating_stability[2]:
                # if it does, replace the missing "Again" value with the exact "Hard" value
                rating_stability[1] = rating_stability[2]
            else:
                # if it doesn't break monotonicity, use the extrapolated value
                rating_stability[1] = again_extrap
        # if an exact value for "Hard" doesn't exist, just use the extrapolation;
        # there's nothing else we can do
        else:
            rating_stability[1] = again_extrap
    elif rating not in rating_stability:
        rating_stability[rating] = max(min(S0_rating_curve(rating, *params), 3650), 0.1)

[image]

We may need to add more fixes in the future; this method of estimating S0 has a lot of room for weird edge cases.

L-M-Sherlock commented 1 year ago

And I insist that we change the formula for D to make the meaning of grades more intuitive. It barely affects performance: RMSE changes by a fraction of a percent.

I have replied to this here: https://github.com/open-spaced-repetition/fsrs4anki/issues/342#issuecomment-1625731626

L-M-Sherlock commented 1 year ago

I think the 4.0.0 Beta version has already taken shape. Any matrix-related ideas will be deferred to 5.0.0. My considerations are:

  1. The scheduler can't use the matrix.
  2. The size of the matrix is several orders of magnitude larger than the current set of parameters.
  3. Our ultimate goal is to find the perfect memory formula. The matrix merely uses statistical data to mask the model's poor predictive power.

Before we launch officially, there are a few matters to address:

  1. Remove unused parameters.
  2. Update the Python package.
  3. Update the scheduler code.
  4. Ensure the helper add-on is compatible.
  5. Write an introduction to the major update.

If there's anything I missed, feel free to contribute.

user1823 commented 1 year ago

@L-M-Sherlock, do you think that the minimum value of w[5] should again be decreased to a lower value (such as 0.0005)?

I have observed that using 0.05 not only makes the RMSE worse but also increases the workload (the number of cards to review per day). See the number of due cards with the two sets of parameters in this comment from 31st May: https://github.com/open-spaced-repetition/fsrs4anki/issues/187#issuecomment-1605446604

The motive behind increasing the minimum value of w[5] was to decrease the workload (by preventing ease hell). But it is increasing the workload instead.

L-M-Sherlock commented 1 year ago

I have observed that using 0.05 not only makes the RMSE worse but also increases the workload (the number of cards to review per day). See the number of due cards with the two sets of parameters in this comment from 31st May: #187 (comment)

The motive behind increasing the minimum value of w[5] was to decrease the workload (by preventing ease hell). But it is increasing the workload instead.

One rational explanation for your case is that you don't have ease hell, but w[5] assumes you do. So w[5] decreases the difficulty in the long term. Then w[4] increases to counteract or even override it, which inflates the workload.

user1823 commented 1 year ago

So, how do you plan to solve this issue?

L-M-Sherlock commented 1 year ago

So, how do you plan to solve this issue?

Loosening the lower limit of w[5] is OK. But we need to convince people whose w[5] is very low that ease hell doesn't affect their collections.

user1823 commented 1 year ago

But make sure that the lower limit is not zero. It should be a small but non-zero value, because it doesn't make sense to have no mean reversion at all.

In my case, I have found 0.0003 to be good enough. It only slightly increases the RMSE (as compared to setting the minimum value to 0).

L-M-Sherlock commented 1 year ago

But make sure that the lower limit is not zero. It should be a small but non-zero value, because it doesn't make sense to have no mean reversion at all.

I don't think so. The difference between 0.0003 and 0 is pretty small. Suppose the initial difficulty is 5 and the current difficulty is 10. If you always press Good, here is the subsequent difficulty:

5 × 0.0003 + 10 × (1 - 0.0003) = 9.9985
5 × 0.0003 + 9.9985 × (1 - 0.0003) = 9.9970
5 × 0.0003 + 9.9970 × (1 - 0.0003) = 9.9955
...
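
A quick sketch of that iteration (assuming the mean-reversion update is new_d = w5 * init_d + (1 - w5) * d, as in the arithmetic above):

    def mean_reversion(init_d: float, d: float, w5: float) -> float:
        # pull the current difficulty toward the initial difficulty by a factor of w5
        return w5 * init_d + (1 - w5) * d

    d = 10.0
    for _ in range(3):
        d = mean_reversion(5.0, d, 0.0003)
        print(round(d, 4))  # 9.9985, 9.997, 9.9955: convergence is glacial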

user1823 commented 1 year ago

Ok, then I will do some testing tomorrow evening because I am slightly busy today and tomorrow. Till then, you can work on updating the code of the scheduler and the helper add-on for v4.

L-M-Sherlock commented 1 year ago

I have made a pre-release for 4.0.0. The beta testing is here:

Expertium commented 1 year ago

I have replied to this here: #342 (comment)

But that problem doesn't show up in testing.

I think the 4.0.0 Beta version has already taken shape. Any matrix-related ideas will be deferred to 5.0.0.

Well, that's something I disagree with. I understand that it would be difficult to implement matrices in the scheduler/helper add-on, but done right, they could improve accuracy a lot. And searching for a perfect formula may lead nowhere. If even Woz, after 30 years of research, uses matrices to combine theoretical predictions with measurements from raw data, then it's safe to assume that relying purely on theoretical formulas isn't enough.

Expertium commented 1 year ago

If you're going to release v4, then there are two more things that need to be done:

1. Re-writing the wiki. I suggest opening a new issue for that, since there is a lot that needs to be changed. @user1823 and I will help you. The page about formulas obviously has to be changed, and the page comparing FSRS to Anki should be about RMSE and the universal metric, which leads me to my second point.
2. @L-M-Sherlock, please make a special file for benchmarking purposes only. It should contain the FSRS v4, FSRS v3, LSTM, SM-2, and Memrise algorithms. The first three must be optimized, all within the same file, and then the RMSE and universal metric (against FSRS v4) should be displayed. I'm planning to write two reddit posts about FSRS, and I need benchmarks for that. You will need them too, to re-write the wiki page that currently just compares interval lengths of FSRS and Anki, which is useless.

L-M-Sherlock commented 1 year ago

Well, that's something I disagree with. I understand that it would be difficult to implement matrices in the scheduler/helper add-on, but done right, they could improve accuracy a lot. And searching for a perfect formula may lead nowhere. If even Woz, after 30 years of research, uses matrices to combine theoretical predictions with measurements from raw data, then it's safe to assume that relying purely on theoretical formulas isn't enough.

I agree that matrices could potentially enhance accuracy, but the return on investment so far doesn't seem high. We've been working hard on v4 for over two months, and it already shows a 20-30% decrease in RMSE. That feels like a significant step forward, worthy of a standalone release. While Woz uses matrices in SuperMemo's algorithm, the specifics aren't open-source, so it's hard to measure their impact. I plan to continue experimenting with matrices, but I'm approaching it cautiously. Thanks again for your insights.

Re-writing the wiki. I suggest opening a new issue for that, since there is a lot that needs to be changed. @user1823 and I will help you. The page about formulas obviously has to be changed, and the page comparing FSRS to Anki should be about RMSE and the universal metric, which leads me to my second point.

Yeah, I'm writing the documents. I will share my drafts later.

please make a special file for benchmarking purposes only. It should contain the FSRS v4, FSRS v3, LSTM, SM-2, and Memrise algorithms. The first three must be optimized, all within the same file, and then the RMSE and universal metric (against FSRS v4) should be displayed. I'm planning to write two reddit posts about FSRS, and I need benchmarks for that. You will need them too, to re-write the wiki page that currently just compares interval lengths of FSRS and Anki, which is useless.

What about the datasets? Do you have any preference? I have only compared algorithms on my collection, yours, and @user1823's.

Expertium commented 1 year ago

I agree that matrices could potentially enhance accuracy, but the return on investment so far doesn't seem high.

I assume you won't try my idea with new grouping, at least not in the near future. If that idea didn't work either, I would be more willing to give up.

What about the datasets? Do you have any preference?

For my reddit posts I will use two collections: mine and yours. For the wiki, you should probably use your collection. Also, I highly recommend running the new optimizer on all the collections that people have submitted via the Google form, to find good average parameters. I'll explain the details in the other issue.