open-spaced-repetition / fsrs4anki

A modern Anki custom scheduling system based on the Free Spaced Repetition Scheduler algorithm
https://github.com/open-spaced-repetition/fsrs4anki/wiki
MIT License

[Feature Request] Sharing ideas for further improvement of the algorithm #239

Closed. Expertium closed this issue 1 year ago.

Expertium commented 1 year ago

Which module is related to your feature request? Scheduler/Optimizer

Is your feature request related to a problem? Please describe. Here I will post all my ideas (new formulas) for improving the algorithm, i.e., decreasing log-loss and RMSE.

Describe the solution you'd like

  1. Make grade values optimizable parameters. Currently Again=1, Hard=2, Good=3, Easy=4. I propose that we make them optimizable as well, and clamp them (although I'm not sure about the range; should it be [1, 10] or something different?). This introduces 4 new parameters. EDIT: the range should be [0, 10]; according to my testing, grades can go below 1, and that improves the loss. Formulas that involve grades will have to be slightly changed: replace (G-1) and (G-3) with (G-g_again) and (G-g_good), where g_again and g_good are the numerical values assigned to "Again" and "Good", respectively.
  2. ~~Change this: S_remembered = S * (1 + e^w_6 * (11 - D) * S^w_7 * (e^((1-R)*w_8) - 1)) To this: S_remembered = [S * (1 + e^w_6 * (11 - D) * S^w_7 * (e^((1-R)*w_8) - 1))]^w_14 Basically, as a final step in computing S, we raise it to some power w_14. Remember how well using a power function with a trainable parameter for the forgetting curve worked out? I believe this way we can mimic that effect. This introduces 1 new parameter.~~
  3. Change the way S is calculated after a lapse ("Again") to include the number of lapses. So from this: S_forgot = w_9 * D^w_10 * S^w_11 * e^((1-R)*w_12) To this: S_forgot = w_9 * D^w_10 * S^w_11 * e^((1-R)*w_12) * (L+1)^w_13, where L is the number of lapses (see the sketch after this list). The +1 is needed to avoid multiplication by 0, which would result in S=0. w_13 should be negative, otherwise more lapses will lead to higher S, which is nonsense. This introduces 1 new parameter.
  4. https://github.com/open-spaced-repetition/fsrs4anki/issues/239#issuecomment-1532173351 Forget about it; anything that requires R_requested won't work. R_requested can be different if the user decided to change it at some point, and more importantly, it's just not a thing in native Anki: there is no such thing as setting a desired probability of recall.
  5. If at all possible, replace learning steps and re-learning steps with functions/parameters to make them adaptive and ensure that users themselves don't have to worry about setting steps.
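
Idea 3 could look like the following minimal sketch (plain Python; the function name is hypothetical, and the w_* arguments mirror the parameter indices used later in the thread):

import math

def next_s_after_lapse(s, d, r, lapses, w9, w10, w11, w12, w13):
    """Sketch of the proposed post-lapse stability with an (L + 1)^w13 term.

    w13 is expected to be negative, so more past lapses -> lower stability.
    The + 1 avoids multiplying by zero on the first lapse.
    """
    return w9 * d**w10 * s**w11 * math.exp((1 - r) * w12) * (lapses + 1)**w13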

Also, I suggest replacing w_9 with e^w_9. It probably won't matter, but it's just more consistent since in the other formula you use e^w_6 rather than just w_6.

None of these changes affect the forgetting curve (you can use either 0.9^(t/S) or (1 + t/(9S))^-1), so the meaning of S - the number of days it takes for the user's retention to drop from 100% to 90% - will remain unchanged.
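
A quick numerical check of that claim (standalone Python, not code from the optimizer):

S = 5.0  # stability in days

for t in [0.0, S, 3 * S]:
    r_exp = 0.9 ** (t / S)           # exponential curve: R = 0.9^(t/S)
    r_pow = (1 + t / (9 * S)) ** -1  # power curve: R = (1 + t/(9S))^-1
    print(f"t = {t:4.1f}  exp: {r_exp:.3f}  pow: {r_pow:.3f}")

# Both curves pass through R = 0.9 at t = S, so S keeps its meaning:
# the number of days for retention to drop from 100% to 90%.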

EDIT: purely for the sake of clarity and keeping things easy to understand, I think all formulas should be rewritten in such a way that parameters are always positive. For example, if you have D^w (w is a parameter) and w is always positive - good, don't change anything. If w is always negative - rewrite the formula as D^-w and keep w positive. This should make it easier to read the wiki.

user1823 commented 1 year ago

Make grade values optimizable parameters. Currently Again=1, Hard=2, Good=3, Easy=4. I propose that we make them optimizable as well, and clamp them (although I'm not sure about the range, should it be [1, 4] or something different?). This introduces 4 new parameters.

I second this. It would allow us to account for the fact that different users use the grades differently.

Edit: If we implement this, we can also do away with the easy bonus setting because the optimizer would automatically calculate how much the stability/difficulty/interval should be changed with the press of the Easy button.

Expertium commented 1 year ago

Since Sherlock seems to be absent today, I decided to use my meager Python knowledge to implement number 2 myself. Thankfully, it's way easier than number 1 or number 3; I don't think I will be able to do those on my own, since both would require major changes. [screenshot of the implementation]

Unfortunately, it doesn't seem like this worked. Both loss and RMSE are about the same, and the new parameter (the power for S) itself barely changes and stays very close to 1 (usually around 0.95-1.05), which means that any large deviation from 1 is not optimal. Increasing the learning rate seems to produce better results, but it's possible that a higher learning rate would help even without the new parameter. I did the usual procedure (7 decks + the whole collection, then the Wilcoxon signed-rank test) with lr=5e-4 (the same value I used when testing without any new changes, to ensure a fair comparison): log-loss is 0.1% better on average (which is practically 0), RMSE is 0.2% better, and the p-values are much greater than 0.05, so there is no statistically significant difference in performance between the old and new algorithms.
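
For reference, the significance test described above can be run with scipy; the paired per-dataset losses below are made-up placeholders, not real results:

from scipy.stats import wilcoxon

# One log-loss per dataset (7 decks + whole collection) for each variant.
baseline = [0.221, 0.198, 0.305, 0.187, 0.244, 0.263, 0.210, 0.229]
modified = [0.219, 0.199, 0.301, 0.186, 0.245, 0.258, 0.209, 0.227]

stat, p = wilcoxon(baseline, modified)
print(f"statistic = {stat}, p = {p:.3f}")
# If p > 0.05, there is no statistically significant difference.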

But there is some good news: lr=5e-3 generally produces better results than lr=5e-4 for many different versions of the algorithm, so the default value should be 5e-3.

user1823 commented 1 year ago

lr=5e-3 generally produces better results than lr=5e-4 for many different versions of the algorithm, so the default value should be 5e-3 for better results.

Unfortunately, this is not always the case. I tried using 5e-3 with the power forgetting curve (with optimizable power) and I got the following results:

| Learning Rate | Log Loss | RMSE | R-squared |
| --- | --- | --- | --- |
| 5e-4 | 0.2231 | 0.0147 | 0.9234 |
| 5e-3 | 0.2225 | 0.0162 | 0.9129 |

Also, the intervals rose too quickly with 5e-3, as shown below:

[screenshots of scheduled intervals for lr = 5e-4 and lr = 5e-3]

Expertium commented 1 year ago

Interesting - in my case it actually helped with the power forgetting curve. We could run two optimizations with different learning rates and then just choose whichever set of parameters minimizes the log-loss, but that would almost double the total time needed.
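
A sketch of that selection scheme, assuming a hypothetical optimize(lr) routine that returns a (parameters, log_loss) pair:

def pick_best_run(optimize, learning_rates=(5e-4, 5e-3)):
    # Run the full optimization once per learning rate (roughly doubling
    # total runtime) and keep the parameter set with the lower log-loss.
    runs = [optimize(lr) for lr in learning_rates]
    return min(runs, key=lambda run: run[1])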

Expertium commented 1 year ago

After some trial and error, I figured out how to implement adaptive grades. Hopefully, I didn't mess anything up. [screenshot of the implementation]

I did the standard procedure (lr=5e-4, again, for the sake of consistency with previous tests, even though 5e-3 is better), and the results are decent. Log-loss went down by 2% on average, RMSE went down by around 8% on average, and the difference in performance between this and the old algorithm is statistically significant with p<0.01. While this isn't quite as impressive as the adaptive power forgetting curve, it's progress.

It's probably too early to worry, but I'm concerned about how little log-loss changes. Even the adaptive power forgetting curve only improved log-loss by 4% on average, and this change improved it by only 2%. I feel like we won't be able to significantly decrease it unless we do something extremely clever and extremely complicated. Interestingly, RMSE improved more than log-loss in both cases (adaptive power forgetting curve and adaptive grades).

Also, I have an idea for a non-linear transformation of difficulty; I'll test it and see if it's better than a power function for difficulty. EDIT: my transformation is weird. For some reason the optimizer barely changes the new parameters that I added. Whatever initial values I set, the "optimal" values end up very close to them, and the loss isn't improving. I guess this idea didn't work out.

Also, at some point I should make a table comparing different changes.

Expertium commented 1 year ago

[screenshot: comparison table] Here's a table with different modifications. The old FSRS algorithm is the baseline. A value less than 100% means that the modification performs better than the baseline; a value greater than 100% means it performs worse. A p-value < 0.01 means that it's unlikely to be a coincidence and there really is a difference between a particular improved version and the baseline. So far, the only modification worth implementing is adaptive (aka optimizable) grades. The power forgetting curve changes the meaning of S, so we're not going to do that, I suppose.

Expertium commented 1 year ago

@L-M-Sherlock here's my spaghetti-code implementation of optimizable values for Again-Hard-Good-Easy. I tested it, and it worked well (see the table in my previous comment). It would be great if you implemented this (there's probably a better way to do it; I'll leave that to you) so we could all test it.

2.1 Define the model
FSRS is a time-series model for predicting memory states.

import numpy as np
import torch
import torch.nn as nn

init_w = [1, 1, 5, -0.5, -0.5, 0.2, 1.4, -0.12, 0.8, 2, -0.2, 0.2, 1, 1, 2, 3, 4]
'''
w[0]: initial_stability_for_again_answer
w[1]: initial_stability_step_per_rating
w[2]: initial_difficulty_for_good_answer
w[3]: initial_difficulty_step_per_rating
w[4]: next_difficulty_step_per_rating
w[5]: next_difficulty_reversion_to_mean_speed (used to avoid ease hell)
w[6]: next_stability_factor_after_success
w[7]: next_stability_stabilization_decay_after_success
w[8]: next_stability_retrievability_gain_after_success
w[9]: next_stability_factor_after_failure
w[10]: next_stability_difficulty_decay_after_success
w[11]: next_stability_stability_gain_after_failure
w[12]: next_stability_retrievability_gain_after_failure
w[13]: again_grade_value
w[14]: hard_grade_value
w[15]: good_grade_value
w[16]: easy_grade_value
For more details about the parameters, please see: 
https://github.com/open-spaced-repetition/fsrs4anki/wiki/Free-Spaced-Repetition-Scheduler
'''

class FSRS(nn.Module):
    def __init__(self, w):
        super(FSRS, self).__init__()
        self.w = nn.Parameter(torch.FloatTensor(w))
        self.zero = torch.FloatTensor([0.0])

    def forward(self, x, s, d):
        '''
        :param x: [review interval, review response]
        :param s: stability
        :param d: difficulty
        :return:
        '''

        x_new = x.clone()
        if x_new[1] == 1:
            x_new[1] = self.w[13]
        elif x_new[1] == 2:
            x_new[1] = self.w[14]
        elif x_new[1] == 3:
            x_new[1] = self.w[15]
        elif x_new[1] == 4:
            x_new[1] = self.w[16]

        if torch.equal(s, self.zero):
            # first learn, init memory states
            new_s = self.w[0] + self.w[1] * (x_new[1] - self.w[13])
            new_d = self.w[2] + self.w[3] * (x_new[1] - self.w[15])
            new_d = new_d.clamp(1, 10)
        else:
            r = torch.exp(np.log(0.9) * x_new[0] / s)
            new_d = d + self.w[4] * (x_new[1] - self.w[15])
            new_d = self.mean_reversion(self.w[2], new_d)
            new_d = new_d.clamp(1, 10)

            # recall
            if x[1] > 1:
                new_s = s * (1 + torch.exp(self.w[6]) *
                             (11 - new_d) *
                             torch.pow(s, self.w[7]) *
                             (torch.exp((1 - r) * self.w[8]) - 1))
            # forget
            else:
                new_s = self.w[9] * torch.pow(new_d, self.w[10]) * torch.pow(
                    s, self.w[11]) * torch.exp((1 - r) * self.w[12])
        return new_s, new_d

    def loss(self, s, t, r):
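        # Binary cross-entropy under the exponential forgetting curve:
        # predicted R = 0.9^(t/s); contributes log(R) if the card was
        # recalled (r = 1) and log(1 - R) if it was forgotten (r = 0).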
        return - (r * np.log(0.9) * t / s + (1 - r) * torch.log(1 - torch.exp(np.log(0.9) * t / s)))

    def mean_reversion(self, init, current):
        return self.w[5] * init + (1-self.w[5]) * current

class WeightClipper(object):
    def __init__(self, frequency=1):
        self.frequency = frequency

    def __call__(self, module):
        if hasattr(module, 'w'):
            w = module.w.data
            w[0] = w[0].clamp(0.1, 10)
            w[1] = w[1].clamp(0.1, 5)
            w[2] = w[2].clamp(1, 10)
            w[3] = w[3].clamp(-5, -0.1)
            w[4] = w[4].clamp(-5, -0.1)
            w[5] = w[5].clamp(0, 0.5)
            w[6] = w[6].clamp(0, 2)
            w[7] = w[7].clamp(-0.2, -0.01)
            w[8] = w[8].clamp(0.01, 1.5)
            w[9] = w[9].clamp(0.5, 5)
            w[10] = w[10].clamp(-2, -0.01)
            w[11] = w[11].clamp(0.01, 0.9)
            w[12] = w[12].clamp(0.01, 2)
            w[13] = w[13].clamp(1, 10)
            w[14] = w[14].clamp(1, 10)
            w[15] = w[15].clamp(1, 10)
            w[16] = w[16].clamp(1, 10)
            module.w.data = w

def lineToTensor(line):
    ivl = line[0].split(',')
    response = line[1].split(',')
    tensor = torch.zeros(len(response), 2)
    for li, response in enumerate(response):
        tensor[li][0] = int(ivl[li])
        tensor[li][1] = int(response)
    return tensor

user1823 commented 1 year ago

I tested this and got very impressive results:

| Variant | Log Loss | RMSE | R-squared |
| --- | --- | --- | --- |
| Original Optimizer | 0.2241 | 0.0191 | 0.8728 |
| Power forgetting curve | 0.2231 | 0.0147 | 0.9234 |
| Power difficulty | 0.2234 | 0.0170 | 0.9026 |
| Optimized grades | 0.2220 | 0.0105 | 0.9632 |

As you can see, the optimized grades yielded the best results so far.

user1823 commented 1 year ago

Based on all the experiments we have conducted so far, I can conclude that it is best to replace as many fixed values with optimized values as possible.

While doing this, though, we will have to ensure that all the parameters remain interpretable.

Expertium commented 1 year ago

There aren't any fixed values left, except for e in exponents. Everything else is either a variable or a parameter. Replacing e^(w_1*x) with w_2^(w_1*x) isn't going to do anything: since w_2^(w_1*x) = e^(ln(w_2)*w_1*x), the product ln(w_2)*w_1 acts as a single parameter, so you can achieve the same result with just w_1. So aside from making grades optimizable, there's not much left to do; we have to think of something completely new, like my idea to add lapses to S_forget.

user1823 commented 1 year ago

@Expertium, In your implementation of adaptive grades, the calculation of the stability and the difficulty is linked because both use the same value of the grade. Though the step per rating is different for both, I think that the results can be further improved if we optimize the grades for the stability and the difficulty separately.

Here is my implementation:

import numpy as np
import torch
import torch.nn as nn

init_w = [1, 1, 5, -0.5, -0.5, 0.2, 1.4, -0.12, 0.8, 2, -0.2, 0.2, 1, 1, 2, 3, 4, 1, 2, 3, 4]
'''
w[0]: initial_stability_for_again_answer
w[1]: initial_stability_step_per_rating
w[2]: initial_difficulty_for_good_answer
w[3]: initial_difficulty_step_per_rating
w[4]: next_difficulty_step_per_rating
w[5]: next_difficulty_reversion_to_mean_speed (used to avoid ease hell)
w[6]: next_stability_factor_after_success
w[7]: next_stability_stabilization_decay_after_success
w[8]: next_stability_retrievability_gain_after_success
w[9]: next_stability_factor_after_failure
w[10]: next_stability_difficulty_decay_after_success
w[11]: next_stability_stability_gain_after_failure
w[12]: next_stability_retrievability_gain_after_failure
w[13]: stability_again_grade_value
w[14]: stability_hard_grade_value
w[15]: stability_good_grade_value
w[16]: stability_easy_grade_value
w[17]: difficulty_again_grade_value
w[18]: difficulty_hard_grade_value
w[19]: difficulty_good_grade_value
w[20]: difficulty_easy_grade_value
For more details about the parameters, please see: 
https://github.com/open-spaced-repetition/fsrs4anki/wiki/Free-Spaced-Repetition-Scheduler
'''

class FSRS(nn.Module):
    def __init__(self, w):
        super(FSRS, self).__init__()
        self.w = nn.Parameter(torch.FloatTensor(w))
        self.zero = torch.FloatTensor([0.0])

    def forward(self, x, s, d):
        '''
        :param x: [review interval, review response]
        :param s: stability
        :param d: difficulty
        :return:
        '''

        x_stab = x.clone()
        x_diff = x.clone()
        if x[1] == 1:
            x_stab[1] = self.w[13]
            x_diff[1] = self.w[17]
        elif x[1] == 2:
            x_stab[1] = self.w[14]
            x_diff[1] = self.w[18]
        elif x[1] == 3:
            x_stab[1] = self.w[15]
            x_diff[1] = self.w[19]
        elif x[1] == 4:
            x_stab[1] = self.w[16]
            x_diff[1] = self.w[20]

        if torch.equal(s, self.zero):
            # first learn, init memory states
            new_s = self.w[0] + self.w[1] * (x_stab[1] - self.w[13])
            new_d = self.w[2] + self.w[3] * (x_diff[1] - self.w[19])
            new_d = new_d.clamp(1, 10)
        else:
            r = torch.exp(np.log(0.9) * x[0] / s)
            new_d = d + self.w[4] * (x_diff[1] - self.w[19])
            new_d = self.mean_reversion(self.w[2], new_d)
            new_d = new_d.clamp(1, 10)

            # recall
            if x[1] > 1:
                new_s = s * (1 + torch.exp(self.w[6]) *
                             (11 - new_d) *
                             torch.pow(s, self.w[7]) *
                             (torch.exp((1 - r) * self.w[8]) - 1))
            # forget
            else:
                new_s = self.w[9] * torch.pow(new_d, self.w[10]) * torch.pow(
                    s, self.w[11]) * torch.exp((1 - r) * self.w[12])
        return new_s, new_d

    def loss(self, s, t, r):
        return - (r * np.log(0.9) * t / s + (1 - r) * torch.log(1 - torch.exp(np.log(0.9) * t / s)))

    def mean_reversion(self, init, current):
        return self.w[5] * init + (1-self.w[5]) * current

class WeightClipper(object):
    def __init__(self, frequency=1):
        self.frequency = frequency

    def __call__(self, module):
        if hasattr(module, 'w'):
            w = module.w.data
            w[0] = w[0].clamp(0.1, 10)
            w[1] = w[1].clamp(0.1, 5)
            w[2] = w[2].clamp(1, 10)
            w[3] = w[3].clamp(-5, -0.1)
            w[4] = w[4].clamp(-5, -0.1)
            w[5] = w[5].clamp(0, 0.5)
            w[6] = w[6].clamp(0, 2)
            w[7] = w[7].clamp(-0.2, -0.01)
            w[8] = w[8].clamp(0.01, 1.5)
            w[9] = w[9].clamp(0.5, 5)
            w[10] = w[10].clamp(-2, -0.01)
            w[11] = w[11].clamp(0.01, 0.9)
            w[12] = w[12].clamp(0.01, 2)
            w[13] = w[13].clamp(1, 10)
            w[14] = w[14].clamp(1, 10)
            w[15] = w[15].clamp(1, 10)
            w[16] = w[16].clamp(1, 10)
            w[17] = w[17].clamp(1, 10)
            w[18] = w[18].clamp(1, 10)
            w[19] = w[19].clamp(1, 10)
            w[20] = w[20].clamp(1, 10)
            module.w.data = w

def lineToTensor(line):
    ivl = line[0].split(',')
    response = line[1].split(',')
    tensor = torch.zeros(len(response), 2)
    for li, response in enumerate(response):
        tensor[li][0] = int(ivl[li])
        tensor[li][1] = int(response)
    return tensor

Edit: Corrected the three errors found in the code.

Expertium commented 1 year ago

This seems a bit excessive, but ok, I'll test it.

user1823 commented 1 year ago

I tested it. But, the results were worse than earlier. Did I make some mistake in the code?

My results:

| Variant | Log Loss | RMSE | R-squared |
| --- | --- | --- | --- |
| Original Optimizer | 0.2241 | 0.0191 | 0.8728 |
| Power forgetting curve | 0.2231 | 0.0147 | 0.9234 |
| Power difficulty | 0.2234 | 0.0170 | 0.9026 |
| Optimized grades (unified) | 0.2220 | 0.0105 | 0.9632 |
| Optimized grades (separated) | 0.2222 | 0.0159 | 0.9182 |

Expertium commented 1 year ago

I'm also getting worse results. I think you made a mistake here:

        x_new = x.clone()
        x_stab = x.clone()
        x_diff = x.clone()
        if x_new[1] == 1:
            x_stab[1] = self.w[13]
            x_diff[1] = self.w[17]
        elif x_new[1] == 2:
            x_stab[1] = self.w[14]
            x_diff[1] = self.w[18]
        elif x_stab[1] == 3:
            x_stab[1] = self.w[15]
            x_diff[1] = self.w[19]
        elif x_new[1] == 4:
            x_stab[1] = self.w[16]
            x_diff[1] = self.w[20]

elif x_stab[1] == 3: should be elif x_new[1] == 3:

user1823 commented 1 year ago

I'm also getting worse results. I think you made a mistake here: elif x_stab[1] == 3: should be elif x_new[1] == 3:

Oh! You are right.

Expertium commented 1 year ago

I'm still getting worse results, no idea why

user1823 commented 1 year ago

elif x_stab[1] == 3: should be elif x_new[1] == 3:

Even after making the change, the result is the same. In retrospect, this makes sense because the initial value of x_new and x_stab is the same. So, the above error wouldn't have impacted the results.

I think that there is another undiscovered mistake in my code. Otherwise, I wouldn't expect the results to be worse than before.

Edit: I found the error.

            # recall
            if x_new[1] != self.w[13]:
                new_s = s * (1 + torch.exp(self.w[6]) *

Here, x_new[1] should be replaced with x_stab[1].

Edit 2: Now, I think that this line should be replaced with if x[1] > 1: (the original code). This way, we can also exclude the manual rating.

user1823 commented 1 year ago

I ran the optimizer again after making the correction. I got the following results:

| Variant | Log Loss | RMSE | R-squared |
| --- | --- | --- | --- |
| Original Optimizer | 0.2241 | 0.0191 | 0.8728 |
| Power forgetting curve | 0.2231 | 0.0147 | 0.9234 |
| Power difficulty | 0.2234 | 0.0170 | 0.9026 |
| Optimized grades (unified) | 0.2220 | 0.0105 | 0.9632 |
| Optimized grades (separated) | 0.2214 | 0.0114 | 0.9572 |

Note: I haven't yet made the last correction (replacing if x_new[1] != self.w[13]: with if x[1] > 1:).

Edit: I made this change, but the results were the same.

Expertium commented 1 year ago

Note: I haven't yet made the last correction (replacing if x_new[1] != self.w[13]: with if x[1] > 1:).

Good catch, I'll correct it in my original code (a few comments above) as well.

user1823 commented 1 year ago

Change the way S is calculated after a lapse ("Again") to include the number of lapses.

@Expertium, why do you think that this should produce better results? The lapses are already considered during the calculation of difficulty, which affects the calculation of the stability.

Expertium commented 1 year ago

The lapses are already considered during the calculation of difficulty, which affects the calculation of the stability.

I don't see that in the formulas on wiki, or in the code. Can you point to a specific formula/line of code?

user1823 commented 1 year ago

According to the following code, the difficulty increases with every again and hard rating (increases more with again rating).

https://github.com/open-spaced-repetition/fsrs4anki-helper/blob/1b4782d0db126235941d353739467fa67721df6c/reschedule.py#L49-L50

According to the following code, the greater the difficulty, the smaller is the increase in the stability.

https://github.com/open-spaced-repetition/fsrs4anki-helper/blob/1b4782d0db126235941d353739467fa67721df6c/reschedule.py#L56-L57
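
In other words, a toy walk-through of that difficulty update (new_d = d + w4 * (grade - 3); the w4 value here is made up):

w4, d = -0.5, 5.0        # made-up w4; Again = 1, Good = 3
for grade in [1, 1, 3]:  # two lapses, then a Good
    d = min(max(d + w4 * (grade - 3), 1), 10)
    print(d)             # 6.0, 7.0, 7.0

# Each lapse pushes difficulty up, and the (11 - new_d) factor in the
# stability formula then produces smaller stability increases.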

Expertium commented 1 year ago

According to the following code, the difficulty increases with every again and hard rating (increases more with again rating). https://github.com/open-spaced-repetition/fsrs4anki-helper/blob/1b4782d0db126235941d353739467fa67721df6c/reschedule.py#LL49-L50

Ah, that's what you mean. This isn't the same as adding up the total number of lapses and using the resulting number. Yes, difficulty depends on how many times you press "Again", but in a much less explicit way. In any case, in order to find out whether my idea works or not, we need Sherlock to implement it so we can test it.

user1823 commented 1 year ago

@Expertium, would you like to do the Wilcoxon signed-rank test to compare the results of the separated adaptive grades and the unified adaptive grades? The result of that test will make it easier to choose the one we would want to use.

The latest version of the code for separated adaptive grades is here: https://github.com/open-spaced-repetition/fsrs4anki/issues/239#issuecomment-1529710497

user1823 commented 1 year ago

@L-M-Sherlock, can you implement the following so that we can test it?

  1. Change the way S is calculated after a lapse ("Again") to include the number of lapses. So from this: S_forgot = w_9 * D^w_10 * S^w_11 * e^((1-R)*w_12) To this: S_forgot = w_9 * D^w_10 * S^w_11 * e^((1-R)*w_12) * (L+1)^w_13, where L is the number of lapses. The +1 is needed to avoid multiplication by 0, which would result in S=0. w_13 should be negative, otherwise more lapses will lead to higher S, which is nonsense. This introduces 1 new parameter.

L-M-Sherlock commented 1 year ago

It requires extracting the lapse feature from the revlogs. And in my view, it could cause severe Ease Hell.

Expertium commented 1 year ago

Well, then implement adaptive grades as described here (or in a better way, if you know how):

2.1 Define the model
FSRS is a time-series model for predicting memory states.

import numpy as np
import torch
import torch.nn as nn

init_w = [1, 1, 5, -0.5, -0.5, 0.2, 1.4, -0.12, 0.8, 2, -0.2, 0.2, 1, 1, 2, 3, 4]
'''
w[0]: initial_stability_for_again_answer
w[1]: initial_stability_step_per_rating
w[2]: initial_difficulty_for_good_answer
w[3]: initial_difficulty_step_per_rating
w[4]: next_difficulty_step_per_rating
w[5]: next_difficulty_reversion_to_mean_speed (used to avoid ease hell)
w[6]: next_stability_factor_after_success
w[7]: next_stability_stabilization_decay_after_success
w[8]: next_stability_retrievability_gain_after_success
w[9]: next_stability_factor_after_failure
w[10]: next_stability_difficulty_decay_after_success
w[11]: next_stability_stability_gain_after_failure
w[12]: next_stability_retrievability_gain_after_failure
w[13]: again_grade_value
w[14]: hard_grade_value
w[15]: good_grade_value
w[16]: easy_grade_value
For more details about the parameters, please see: 
https://github.com/open-spaced-repetition/fsrs4anki/wiki/Free-Spaced-Repetition-Scheduler
'''

class FSRS(nn.Module):
    def __init__(self, w):
        super(FSRS, self).__init__()
        self.w = nn.Parameter(torch.FloatTensor(w))
        self.zero = torch.FloatTensor([0.0])

    def forward(self, x, s, d):
        '''
        :param x: [review interval, review response]
        :param s: stability
        :param d: difficulty
        :return:
        '''

        x_new = x.clone()
        if x_new[1] == 1:
            x_new[1] = self.w[13]
        elif x_new[1] == 2:
            x_new[1] = self.w[14]
        elif x_new[1] == 3:
            x_new[1] = self.w[15]
        elif x_new[1] == 4:
            x_new[1] = self.w[16]

        if torch.equal(s, self.zero):
            # first learn, init memory states
            new_s = self.w[0] + self.w[1] * (x_new[1] - self.w[13])
            new_d = self.w[2] + self.w[3] * (x_new[1] - self.w[15])
            new_d = new_d.clamp(1, 10)
        else:
            r = torch.exp(np.log(0.9) * x_new[0] / s)
            new_d = d + self.w[4] * (x_new[1] - self.w[15])
            new_d = self.mean_reversion(self.w[2], new_d)
            new_d = new_d.clamp(1, 10)

            # recall
            if x[1] > 1:
                new_s = s * (1 + torch.exp(self.w[6]) *
                             (11 - new_d) *
                             torch.pow(s, self.w[7]) *
                             (torch.exp((1 - r) * self.w[8]) - 1))
            # forget
            else:
                new_s = self.w[9] * torch.pow(new_d, self.w[10]) * torch.pow(
                    s, self.w[11]) * torch.exp((1 - r) * self.w[12])
        return new_s, new_d

    def loss(self, s, t, r):
        return - (r * np.log(0.9) * t / s + (1 - r) * torch.log(1 - torch.exp(np.log(0.9) * t / s)))

    def mean_reversion(self, init, current):
        return self.w[5] * init + (1-self.w[5]) * current

class WeightClipper(object):
    def __init__(self, frequency=1):
        self.frequency = frequency

    def __call__(self, module):
        if hasattr(module, 'w'):
            w = module.w.data
            w[0] = w[0].clamp(0.1, 10)
            w[1] = w[1].clamp(0.1, 5)
            w[2] = w[2].clamp(1, 10)
            w[3] = w[3].clamp(-5, -0.1)
            w[4] = w[4].clamp(-5, -0.1)
            w[5] = w[5].clamp(0, 0.5)
            w[6] = w[6].clamp(0, 2)
            w[7] = w[7].clamp(-0.2, -0.01)
            w[8] = w[8].clamp(0.01, 1.5)
            w[9] = w[9].clamp(0.5, 5)
            w[10] = w[10].clamp(-2, -0.01)
            w[11] = w[11].clamp(0.01, 0.9)
            w[12] = w[12].clamp(0.01, 2)
            w[13] = w[13].clamp(1, 10)
            w[14] = w[14].clamp(1, 10)
            w[15] = w[15].clamp(1, 10)
            w[16] = w[16].clamp(1, 10)
            module.w.data = w

def lineToTensor(line):
    ivl = line[0].split(',')
    response = line[1].split(',')
    tensor = torch.zeros(len(response), 2)
    for li, response in enumerate(response):
        tensor[li][0] = int(ivl[li])
        tensor[li][1] = int(response)
    return tensor

Expertium commented 1 year ago

Here's the updated table: [screenshot]

Making 2 different sets of optimizable grades, one for S and one for D, improved performance compared to the old algorithm, but not compared to just having a single set of optimizable grades. The p-value for that comparison is not shown in the table (in the table I only compare changes to the baseline), but it was much greater than 0.01, suggesting that separate grades for S and D don't improve performance.

user1823 commented 1 year ago

Here, I have tried to extract from the above comments what needs to be done now so that we don't miss anything.

Expertium commented 1 year ago

I have another idea, but I can't implement it. Basically, I want to add something like a self-correcting property to the algorithm so that it can correct its own bad results.

  1. Calculate a moving average using the following formula: R_avg_new = (1 - w_13) * R_avg_previous + w_13 * response. If the user pressed Again, then response=0, else response=1. w_13 is a new parameter that takes values between 0.01 and 0.99. If this is the first repetition, then instead of R_avg_previous we use w_14, which also takes values between 0.01 and 0.99. In other words, w_13 is the weight assigned to new responses, and w_14 is the initial value. Although w_14 may end up being unnecessary - we can just use the requested retention instead.
  2. Calculate the correction factor C=ln(R_avg)/ln(R_requested). In the optimizer, we can just set R_requested to 0.9. If this is the first repetition, then set C to 1.
  3. Multiply S by C (both S_r and S_f): S_r_new = S_r * C and S_f_new = S_f * C (these are the formulas from the attached "Correction based on average R (fixed)" screenshot).

The idea is that if R_avg is close to R_requested, then the ratio of their logarithms will be close to 1, so it won't affect the calculation of S very much. In other words, a good algorithm doesn't need to correct itself. If R_avg is significantly higher than R_requested, then the algorithm is overestimating S and S needs to be decreased (the ratio of logs in this case is <1). If R_avg is significantly lower than R_requested, then the algorithm is underestimating S and it needs to be increased (the ratio of logs in this case is >1). This is especially helpful for very easy and very hard cards, for which the algorithm works poorly. Also, R_avg can bounce up very quickly after a lapse (though it depends on w_13, of course), so the user won't be stuck in Ease Hell. If this ends up performing poorly due to C taking very large or very small values, I already have an idea for how to deal with that. And in case someone is wondering why logarithms: that's because ln(R_1)/ln(R_2) = S_2/S_1, where R_1 = 0.9^(t/S_1) and R_2 = 0.9^(t/S_2).
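
A minimal sketch of this self-correction idea (standalone Python; the helper names and the w_13 value are made up):

import math

def update_r_avg(r_avg_prev, response, w13):
    # Exponential smoothing: response is 1 for a success, 0 for "Again".
    return (1 - w13) * r_avg_prev + w13 * response

def correction_factor(r_avg, r_requested=0.9):
    # C = ln(R_avg) / ln(R_requested); C < 1 shrinks S, C > 1 grows it.
    return math.log(r_avg) / math.log(r_requested)

r_avg, w13 = 0.9, 0.3
for response in [1, 1, 1, 0, 1]:
    r_avg = update_r_avg(r_avg, response, w13)
    print(f"R_avg = {r_avg:.3f}, C = {correction_factor(r_avg):.3f}")
# Successes push R_avg above 0.9 (C < 1, S is scaled down); the lapse
# pulls R_avg below 0.9, so C jumps above 1 and S is scaled back up.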

On a somewhat unrelated note, @L-M-Sherlock, is it possible to have FSRS enabled for every single repetition? No built-in Anki learning steps or re-learning steps; those would become optimizable parameters (or functions) as well.

user1823 commented 1 year ago

@Expertium, I like the idea. But can you explain how you obtained the following equation?

R_avg_new = (1 - w_13) R_avg_previous + w_13 response

Edit: I got it.

The above equation can be re-written as

R_avg_new = R_avg_previous + w_13 * (response - R_avg_previous)

So, the new average retention is calculated by adding an optimized fraction of the difference between the response and the previous average retention to the previous average retention.

Although w_14 may end up being unnecessary - we can just use requested retention instead.

Comparing the equation for the first review with that for any subsequent review, w_14 is equivalent to the average retention before the first review. I don't think that it can be taken as the requested retention.

Edit 2: Doesn't this idea (the addition of the C factor) seem like a heuristic approach? I mean that the original FSRS algorithm is based on a lot of research conducted by @L-M-Sherlock, while with this approach it seems like we are just tweaking the equations and trying to get the desired results by trial and error.

Edit 3: It dawned on me that this is basically what research is. You first make a hypothesis based on certain considerations. You then test it through various means and modify the hypothesis based on the outcomes of the experiments.

But for this approach to give good results, we will need to test it with data from a large number of users. However, initial testing with a few users may suffice to eliminate most of the bad ideas.

Expertium commented 1 year ago

But, can you explain how did you obtain the following equation?

It's just exponential smoothing. You can read more here: https://en.wikipedia.org/wiki/Exponential_smoothing

The above equation can be re-written as R_avg_new = R_avg_previous + w_13 * (response - R_avg_previous)

I think R_avg_new = (1 - w_13) * R_avg_previous + w_13 * response is clearer, but whatever.

Edit 2: Doesn't this idea (the addition of the C factor) seem like a heuristic approach? I mean that the original FSRS algorithm is based on a lot of research conducted by @L-M-Sherlock, while with this approach it seems like we are just tweaking the equations and trying to get the desired results by trial and error.

Yes, but I'm not as smart as Woz, and I'm also not going to spend 35 years collecting data and improving algorithms just to produce a brilliant idea.

Edit 3: It dawned on me that this is basically what research is. You first make a hypothesis based on certain considerations. You then test it through various means and modify the hypothesis based on the outcomes of the experiments.

Yeah, although you can do it in different ways. As I mentioned in another issue, we can adopt either the "Anything that decreases the loss is good, no matter how ad hoc" philosophy, or the "The algorithm must have a solid theoretical foundation and all values must be easily interpretable, no 'black boxes'" philosophy. If you take the first philosophy to its logical extreme, you will end up with a neural network - those things work wonders, but trying to interpret how exactly they work is almost futile. If you take the second philosophy to its logical extreme, you will end up with SM-18... after 35 years of collecting data and doing big-brain thinking.

L-M-Sherlock commented 1 year ago

Based on the recent discussion, I implemented these variants of the FSRS optimizer and recorded their performance on my collection:

| Variant | Colab Link | Log Loss | RMSE | R-squared | Number of extra parameters |
| --- | --- | --- | --- | --- | --- |
| Original Optimizer | v3.17.1 | 0.3163 | 0.0209 | 0.8916 | 0 |
| Power forgetting curve (fixed power) | Expt/power-forgetting-curve | 0.3147 | 0.0200 | 0.9057 | 0 |
| Power difficulty (adaptive power) | Expt/power-function-for-difficulty | 0.3151 | 0.0210 | 0.8932 | 1 |
| Power stabilization curve | Expt/power-stabilization-curve | 0.3160 | 0.0216 | 0.8862 | 0 |
| Trainable ratings (only for difficulty) | Expt/trainable-ratings | 0.3159 | 0.0209 | 0.8917 | 4 |

When we discuss further improvements of FSRS, we shouldn't only consider the loss. The number of extra parameters is also important: as the number of parameters increases, the risk of overfitting grows.

Expertium commented 1 year ago

Hmm, that is strange - your loss and RMSE barely changed when you implemented trainable grades.

Also, shouldn't this formula be different? In new_s = self.w[0] + self.w[1] * (x[1] - 1), the x[1] - 1 should be x[1] - self.w[13].

L-M-Sherlock commented 1 year ago

In new_s = self.w[0] + self.w[1] * (x[1] - 1), the x[1] - 1 should be x[1] - self.w[13].

In the pre-training stage, the only trainable parameters should be the initial stabilities. Introducing trainable ratings there would result in inaccurate initial stability.

Expertium commented 1 year ago

Ok, I'll test this later today and compare the results to what I had with my own implementation of trainable grades.

Expertium commented 1 year ago

@L-M-Sherlock Regarding overfitting - why not split the data into 2 datasets, a training set and a test set? Optimize the parameters on the training set, then see how big the loss is on the test set. If the test-set loss is much higher than the training-set loss, that means the model is overfitted. That's standard practice in ML; in fact, I'm surprised you haven't done it yet.
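
A minimal sketch of that check; reviews, optimize, and evaluate are hypothetical stand-ins for the optimizer's actual dataset and entry points:

from sklearn.model_selection import train_test_split

def check_overfitting(reviews, optimize, evaluate):
    # Hold out 20% of the review histories as a test set.
    train_set, test_set = train_test_split(reviews, test_size=0.2, random_state=42)
    params = optimize(train_set)              # fit on the training set only
    train_loss = evaluate(params, train_set)
    test_loss = evaluate(params, test_set)
    # A test loss much higher than the training loss signals overfitting.
    return train_loss, test_loss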

On an unrelated note, I don't see clamping for new parameters (for grades) in the code.

L-M-Sherlock commented 1 year ago

@L-M-Sherlock Regarding overfitting - why not split the data into 2 datasets, a training set and a test set? Optimize the parameters on the training set, then see how big the loss is on the test set. If the test-set loss is much higher than the training-set loss, that means the model is overfitted. That's standard practice in ML; in fact, I'm surprised you haven't done it yet.

Because it would make optimization take more time. At the start of this project, I fixed the structure of the model, so there was no need for such testing.

L-M-Sherlock commented 1 year ago

On an unrelated note, I don't see clamping for new parameters (for grades) in the code.

Because it is experimental, and today is a workday. I don't have much time right now.

user1823 commented 1 year ago

I tried all these versions of the optimizer. My observations:

Here is a table comparing the results:

| Variant | Log Loss | RMSE | R-squared |
| --- | --- | --- | --- |
| Original Optimizer | 0.2241 | 0.0191 | 0.8728 |
| Power forgetting curve (fixed power) | 0.2240 | 0.0173 | 0.8949 |
| Power forgetting curve (adaptive power) | 0.2231 | 0.0147 | 0.9234 |
| Power difficulty (adaptive power) | 0.2234 | 0.0170 | 0.9026 |
| Adaptive grades (by @Expertium) | 0.2220 | 0.0105 | 0.9632 |
| sm-stabilization-curve (found as a branch on GitHub) | 0.2228 | 0.0153 | 0.9179 |
| Power stabilization curve | 0.2222 | 0.0177 | 0.8946 |
| Trainable ratings (only for difficulty) | 0.2233 | 0.0184 | 0.8842 |

Expertium commented 1 year ago

I would appreciate it if you tried Sherlock's version of adaptive grades to see if it produces the same or different results compared to mine.

Also, what is sm stabilization curve?

user1823 commented 1 year ago

I would appreciate it if you tried Sherlock's version of adaptive grades to see if it produces the same or different results compared to mine.

I have already tried. See the last row in the table. The results with Sherlock's version are worse than with your version.

Expertium commented 1 year ago

Oh, my bad, I see now.

user1823 commented 1 year ago

Also, what is sm stabilization curve?

I found it as a branch in this GitHub repository. It is probably an incomplete implementation of something @L-M-Sherlock is trying to implement. While I was testing the other versions, I thought, why not test this one too, even if it is incomplete. (All tests run in parallel in separate tabs.)

Link: https://github.com/open-spaced-repetition/fsrs4anki/tree/Expt/sm-stabilization-curve

Colab Link: https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/Expt%2Fsm-stabilization-curve/fsrs4anki_optimizer.ipynb

Expertium commented 1 year ago

I haven't fully tested Sherlock's version yet, but I think I have identified the problem. In my own version, I replaced this line: new_s = self.w[0] + self.w[1] * (x_new[1] - self.w[13]) with this line: new_s = self.w[0] + self.w[1] * (x[1] - 1) and after that I got almost the same results as with the original algorithm. I changed it back, and the loss and RMSE went down again. In other words, for some strange reason, not using trainable grades in this formula completely negates any benefit that could be gained otherwise.

L-M-Sherlock commented 1 year ago

Also, what is sm stabilization curve?

https://supermemo.guru/wiki/Stabilization_curve

user1823 commented 1 year ago

For some strange reason not using trainable grades in this formula completely negates any benefit that could be gained otherwise.

This probably means that trainable grades are more important for the calculation of stability than for the calculation of difficulty.

This is also probably the reason I didn't get good results when I tried to train the grades for the stability and the difficulty separately.

Expertium commented 1 year ago

I'm going to try it with Sherlock's version as well, to see if it makes a difference. EDIT: yes, Sherlock's version has the same problem. new_s = self.w[0] + self.w[1] * (x[1] - 1) doesn't work as well as new_s = self.w[0] + self.w[1] * (self.w[12 + x[1].long()] - self.w[13]). I'm going to try something interesting: I will replace difficulty with a constant. If that doesn't perform worse than the baseline, then it means that the current definition of difficulty is useless.

L-M-Sherlock commented 1 year ago

This probably means that trainable grades are more important for the calculation of stability than for the calculation of difficulty.

However, except for the initial stability, the ratings don't affect the stability directly.

user1823 commented 1 year ago

This probably means that trainable grades are more important for the calculation of stability than for the calculation of difficulty.

However, except for the initial stability, the ratings don't affect the stability directly.

Then, it probably means that all (or most) of the benefit that could be gained by using adaptive grades is because of how they affect the initial stability.

In other words, it probably means that the differences between the initial stabilities for cards marked Again, Hard, Good and Easy are not all equal.

For example, if the difference between the initial stabilities of cards marked again and hard is 1, then the difference between the initial stabilities of cards marked hard and good is not necessarily 1 but can also be 1.5 or 2.

(Currently, FSRS considers the difference between adjacent initial stabilities to be the same, namely w[1].)
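
To illustrate (a toy example; the "optimized" grade values are invented, not real optimizer output):

w0, w1 = 1.0, 1.0

# Current FSRS: initial stability is linear in the grade,
# so the step between adjacent ratings is always w1.
for grade in [1, 2, 3, 4]:
    print(grade, w0 + w1 * (grade - 1))  # 1.0, 2.0, 3.0, 4.0

# With trainable grade values the steps can differ:
grade_values = {1: 1.0, 2: 1.6, 3: 3.2, 4: 4.1}
for grade, g in grade_values.items():
    print(grade, w0 + w1 * (g - grade_values[1]))  # unequal steps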