Make grade values optimizable parameters. Currently Again=1, Hard=2, Good=3, Easy=4. I propose that we make them optimizable as well, and clamp them (although I'm not sure about the range, should it be [1, 4] or something different?). This introduces 4 new parameters.
I second this. It would allow us to account for the fact that different users use the grades differently. Some users only use the `Good` and `Again` buttons, while others mainly use the `Good` and `Again` buttons but also use the `Hard` and the `Easy` buttons intermittently.

Edit: If we implement this, we can also do away with the easy bonus setting, because the optimizer would automatically calculate how much the stability/difficulty/interval should be changed with the press of the `Easy` button.
Since Sherlock seems to be absent today, I decided to use my meager Python knowledge to implement number 2 myself. Thankfully, it's way easier than number 1 or number 3; I don't think I will be able to do those on my own, as both would require major changes.
Unfortunately, it doesn't seem like this worked. Both loss and RMSE are about the same, and the new parameter (power for S) itself barely changes and is very close to 1 (usually around 0.95-1.05), which means that any large deviation from 1 is not optimal. Although increasing the learning rate seems to produce better results, it's possible that increasing the learning rate would help even without the new parameter. I did the usual procedure, 7 decks + collection, then Wilcoxon signed-rank test, with lr=5e-4 (same as the one I used when testing without any new changes, to ensure a fair comparison): log-loss is 0.1% better (which is practically 0) on average, RMSE is 0.2% better, and the p-values are much greater than 0.05, so there is no statistically significant difference in performance between the old and new algorithms.
But there is some good news: lr=5e-3 generally produces better results than lr=5e-4 for many different versions of the algorithm, so the default value should be 5e-3 for better results.
> lr=5e-3 generally produces better results than lr=5e-4 for many different versions of the algorithm, so the default value should be 5e-3 for better results.
Unfortunately, this is not always the case. I tried using 5e-3 with the power forgetting curve (with optimizable power) and I got the following results:
| Learning Rate | Log Loss | RMSE | R-squared |
| --- | --- | --- | --- |
| 5e-4 | 0.2231 | 0.0147 | 0.9234 |
| 5e-3 | 0.2225 | 0.0162 | 0.9129 |
Also, the intervals rose too quickly with lr=5e-3, as shown below:

lr = 5e-4: (screenshot of intervals)

lr = 5e-3: (screenshot of intervals)
Interesting, in my case it actually helped with the power forgetting curve.
We could run 2 optimizations with different lr and then just choose whichever set of parameters minimizes the log-loss better, but that would almost double the total time needed.
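A minimal sketch of what that could look like (`train` here is a hypothetical wrapper around the existing optimization loop that returns the fitted parameters and the final log-loss; it is not a function from the notebook):

```python
# Hypothetical sketch: fit the model twice with different learning rates
# and keep whichever parameter set ends up with the lower log-loss.
candidate_lrs = [5e-4, 5e-3]

results = []
for lr in candidate_lrs:
    w, log_loss = train(lr)  # assumed helper wrapping the existing training loop
    results.append((log_loss, lr, w))

best_loss, best_lr, best_w = min(results, key=lambda item: item[0])
print(f"picked lr={best_lr} with log-loss={best_loss:.4f}")
```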
After some trial and error, I figured out how to implement adaptive grades. Hopefully, I didn't mess anything up.
I did the standard procedure (lr=5e-4, again, for the sake of consistency with previous tests, even though 5e-3 is better), and the results are decent. Log-loss went down by 2% on average, RMSE went down by around 8% on average, and the difference in performance between this and the old algorithm is statistically significant with p<0.01. While this isn't quite as impressive as the adaptive power forgetting curve, it's progress.
It's probably too early to worry, but I'm concerned about how little log-loss changes. Even the adaptive power forgetting curve only improved log-loss by 4% on average, and this one improved it by 2%. I feel like we won't be able to significantly decrease it unless we do something extremely clever and extremely complicated. Interestingly, RMSE improved more than log-loss in both cases (adaptive power forgetting curve and adaptive grades).

Also, I have an idea for a non-linear transformation of difficulty; I'll test it and see if it's better than the power function for difficulty. EDIT: my transformation is weird. For some reason, the optimizer barely changes the new parameters that I added. Whatever initial values I set, the "optimal" values are very close to them, and the loss isn't improving. I guess this idea didn't work out.
Also, at some point I should make a table comparing different changes.
Here's a table with different modifications. The old FSRS algorithm is the baseline. A value less than 100% means that this modification performs better than the baseline, a value greater than 100% means that this modification performs worse than the baseline. p-value < 0.01 means that it's unlikely to be a coincidence and there really is a difference between a particular improved version and the baseline. So far, the only modification worth implementing is adaptive (aka optimizable) grades. Power forgetting curve changes the meaning of S, so we're not going to do that, I suppose.
@L-M-Sherlock here's my spaghetti code implementation of optimizable values for Again-Hard-Good-Easy. I tested it, and it worked well (see the table in my previous comment). It would be great if you implemented this (and there's probably a better way to implement it, I'll leave that to you) so we could all test it.
2.1 Define the model
FSRS is a time-series model for predicting memory states.
```python
import numpy as np
import torch
import torch.nn as nn

init_w = [1, 1, 5, -0.5, -0.5, 0.2, 1.4, -0.12, 0.8, 2, -0.2, 0.2, 1, 1, 2, 3, 4]

'''
w[0]: initial_stability_for_again_answer
w[1]: initial_stability_step_per_rating
w[2]: initial_difficulty_for_good_answer
w[3]: initial_difficulty_step_per_rating
w[4]: next_difficulty_step_per_rating
w[5]: next_difficulty_reversion_to_mean_speed (used to avoid ease hell)
w[6]: next_stability_factor_after_success
w[7]: next_stability_stabilization_decay_after_success
w[8]: next_stability_retrievability_gain_after_success
w[9]: next_stability_factor_after_failure
w[10]: next_stability_difficulty_decay_after_failure
w[11]: next_stability_stability_gain_after_failure
w[12]: next_stability_retrievability_gain_after_failure
w[13]: again_grade_value
w[14]: hard_grade_value
w[15]: good_grade_value
w[16]: easy_grade_value

For more details about the parameters, please see:
https://github.com/open-spaced-repetition/fsrs4anki/wiki/Free-Spaced-Repetition-Scheduler
'''

class FSRS(nn.Module):
    def __init__(self, w):
        super(FSRS, self).__init__()
        self.w = nn.Parameter(torch.FloatTensor(w))
        self.zero = torch.FloatTensor([0.0])

    def forward(self, x, s, d):
        '''
        :param x: [review interval, review response]
        :param s: stability
        :param d: difficulty
        :return:
        '''
        x_new = x.clone()
        # replace the fixed grade values 1-4 with the trainable grade values w[13]-w[16]
        if x_new[1] == 1:
            x_new[1] = self.w[13]
        elif x_new[1] == 2:
            x_new[1] = self.w[14]
        elif x_new[1] == 3:
            x_new[1] = self.w[15]
        elif x_new[1] == 4:
            x_new[1] = self.w[16]
        if torch.equal(s, self.zero):
            # first learn, init memory states
            new_s = self.w[0] + self.w[1] * (x_new[1] - self.w[13])
            new_d = self.w[2] + self.w[3] * (x_new[1] - self.w[15])
            new_d = new_d.clamp(1, 10)
        else:
            r = torch.exp(np.log(0.9) * x_new[0] / s)
            new_d = d + self.w[4] * (x_new[1] - self.w[15])
            new_d = self.mean_reversion(self.w[2], new_d)
            new_d = new_d.clamp(1, 10)
            # recall
            if x[1] > 1:
                new_s = s * (1 + torch.exp(self.w[6]) *
                             (11 - new_d) *
                             torch.pow(s, self.w[7]) *
                             (torch.exp((1 - r) * self.w[8]) - 1))
            # forget
            else:
                new_s = self.w[9] * torch.pow(new_d, self.w[10]) * torch.pow(
                    s, self.w[11]) * torch.exp((1 - r) * self.w[12])
        return new_s, new_d

    def loss(self, s, t, r):
        # negative log-likelihood of the exponential forgetting curve R = 0.9^(t/S)
        return - (r * np.log(0.9) * t / s + (1 - r) * torch.log(1 - torch.exp(np.log(0.9) * t / s)))

    def mean_reversion(self, init, current):
        return self.w[5] * init + (1 - self.w[5]) * current


class WeightClipper(object):
    def __init__(self, frequency=1):
        self.frequency = frequency

    def __call__(self, module):
        if hasattr(module, 'w'):
            w = module.w.data
            w[0] = w[0].clamp(0.1, 10)
            w[1] = w[1].clamp(0.1, 5)
            w[2] = w[2].clamp(1, 10)
            w[3] = w[3].clamp(-5, -0.1)
            w[4] = w[4].clamp(-5, -0.1)
            w[5] = w[5].clamp(0, 0.5)
            w[6] = w[6].clamp(0, 2)
            w[7] = w[7].clamp(-0.2, -0.01)
            w[8] = w[8].clamp(0.01, 1.5)
            w[9] = w[9].clamp(0.5, 5)
            w[10] = w[10].clamp(-2, -0.01)
            w[11] = w[11].clamp(0.01, 0.9)
            w[12] = w[12].clamp(0.01, 2)
            # clamp the new trainable grade values
            w[13] = w[13].clamp(1, 10)
            w[14] = w[14].clamp(1, 10)
            w[15] = w[15].clamp(1, 10)
            w[16] = w[16].clamp(1, 10)
            module.w.data = w


def lineToTensor(line):
    ivl = line[0].split(',')
    response = line[1].split(',')
    tensor = torch.zeros(len(response), 2)
    for li, response in enumerate(response):
        tensor[li][0] = int(ivl[li])
        tensor[li][1] = int(response)
    return tensor
```
I tested this and got very impressive results:
| | Log Loss | RMSE | R-squared |
| --- | --- | --- | --- |
| Original Optimizer | 0.2241 | 0.0191 | 0.8728 |
| Power forgetting curve | 0.2231 | 0.0147 | 0.9234 |
| Power difficulty | 0.2234 | 0.0170 | 0.9026 |
| Optimized grades | 0.2220 | 0.0105 | 0.9632 |
As you can see, the optimized grades yielded the best results so far.
Based on all the experiments we have conducted so far, I can conclude that it is best to replace as many fixed values as possible with optimized values. However, while doing this, we will have to ensure that all the parameters remain interpretable.
There aren't any fixed values left, except for e in exponents. Everything else is either a variable or a parameter. Replacing e^(w_1*x) with w_2^(w_1*x) isn't going to do anything; you can still achieve the same result with just w_1.
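One way to see why:

$$w_2^{\,w_1 x} = e^{(w_1 \ln w_2)\,x}$$

so any effect of a trainable base w_2 can be absorbed into a rescaled w_1.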
So aside from making grades optimizable, there's not much to do; we have to think of something completely new, like my idea to add lapses to S_forget.
@Expertium, In your implementation of adaptive grades, the calculation of the stability and the difficulty is linked because both use the same value of the grade. Though the step per rating is different for both, I think that the results can be further improved if we optimize the grades for the stability and the difficulty separately.
Here is my implementation:
```python
import numpy as np
import torch
import torch.nn as nn

init_w = [1, 1, 5, -0.5, -0.5, 0.2, 1.4, -0.12, 0.8, 2, -0.2, 0.2, 1, 1, 2, 3, 4, 1, 2, 3, 4]

'''
w[0]: initial_stability_for_again_answer
w[1]: initial_stability_step_per_rating
w[2]: initial_difficulty_for_good_answer
w[3]: initial_difficulty_step_per_rating
w[4]: next_difficulty_step_per_rating
w[5]: next_difficulty_reversion_to_mean_speed (used to avoid ease hell)
w[6]: next_stability_factor_after_success
w[7]: next_stability_stabilization_decay_after_success
w[8]: next_stability_retrievability_gain_after_success
w[9]: next_stability_factor_after_failure
w[10]: next_stability_difficulty_decay_after_failure
w[11]: next_stability_stability_gain_after_failure
w[12]: next_stability_retrievability_gain_after_failure
w[13]: stability_again_grade_value
w[14]: stability_hard_grade_value
w[15]: stability_good_grade_value
w[16]: stability_easy_grade_value
w[17]: difficulty_again_grade_value
w[18]: difficulty_hard_grade_value
w[19]: difficulty_good_grade_value
w[20]: difficulty_easy_grade_value

For more details about the parameters, please see:
https://github.com/open-spaced-repetition/fsrs4anki/wiki/Free-Spaced-Repetition-Scheduler
'''

class FSRS(nn.Module):
    def __init__(self, w):
        super(FSRS, self).__init__()
        self.w = nn.Parameter(torch.FloatTensor(w))
        self.zero = torch.FloatTensor([0.0])

    def forward(self, x, s, d):
        '''
        :param x: [review interval, review response]
        :param s: stability
        :param d: difficulty
        :return:
        '''
        x_stab = x.clone()
        x_diff = x.clone()
        # separate trainable grade values: w[13]-w[16] for stability, w[17]-w[20] for difficulty
        if x[1] == 1:
            x_stab[1] = self.w[13]
            x_diff[1] = self.w[17]
        elif x[1] == 2:
            x_stab[1] = self.w[14]
            x_diff[1] = self.w[18]
        elif x[1] == 3:
            x_stab[1] = self.w[15]
            x_diff[1] = self.w[19]
        elif x[1] == 4:
            x_stab[1] = self.w[16]
            x_diff[1] = self.w[20]
        if torch.equal(s, self.zero):
            # first learn, init memory states
            new_s = self.w[0] + self.w[1] * (x_stab[1] - self.w[13])
            new_d = self.w[2] + self.w[3] * (x_diff[1] - self.w[19])
            new_d = new_d.clamp(1, 10)
        else:
            r = torch.exp(np.log(0.9) * x[0] / s)
            new_d = d + self.w[4] * (x_diff[1] - self.w[19])
            new_d = self.mean_reversion(self.w[2], new_d)
            new_d = new_d.clamp(1, 10)
            # recall
            if x[1] > 1:
                new_s = s * (1 + torch.exp(self.w[6]) *
                             (11 - new_d) *
                             torch.pow(s, self.w[7]) *
                             (torch.exp((1 - r) * self.w[8]) - 1))
            # forget
            else:
                new_s = self.w[9] * torch.pow(new_d, self.w[10]) * torch.pow(
                    s, self.w[11]) * torch.exp((1 - r) * self.w[12])
        return new_s, new_d

    def loss(self, s, t, r):
        # negative log-likelihood of the exponential forgetting curve R = 0.9^(t/S)
        return - (r * np.log(0.9) * t / s + (1 - r) * torch.log(1 - torch.exp(np.log(0.9) * t / s)))

    def mean_reversion(self, init, current):
        return self.w[5] * init + (1 - self.w[5]) * current


class WeightClipper(object):
    def __init__(self, frequency=1):
        self.frequency = frequency

    def __call__(self, module):
        if hasattr(module, 'w'):
            w = module.w.data
            w[0] = w[0].clamp(0.1, 10)
            w[1] = w[1].clamp(0.1, 5)
            w[2] = w[2].clamp(1, 10)
            w[3] = w[3].clamp(-5, -0.1)
            w[4] = w[4].clamp(-5, -0.1)
            w[5] = w[5].clamp(0, 0.5)
            w[6] = w[6].clamp(0, 2)
            w[7] = w[7].clamp(-0.2, -0.01)
            w[8] = w[8].clamp(0.01, 1.5)
            w[9] = w[9].clamp(0.5, 5)
            w[10] = w[10].clamp(-2, -0.01)
            w[11] = w[11].clamp(0.01, 0.9)
            w[12] = w[12].clamp(0.01, 2)
            # clamp the trainable grade values (stability and difficulty sets)
            w[13] = w[13].clamp(1, 10)
            w[14] = w[14].clamp(1, 10)
            w[15] = w[15].clamp(1, 10)
            w[16] = w[16].clamp(1, 10)
            w[17] = w[17].clamp(1, 10)
            w[18] = w[18].clamp(1, 10)
            w[19] = w[19].clamp(1, 10)
            w[20] = w[20].clamp(1, 10)
            module.w.data = w


def lineToTensor(line):
    ivl = line[0].split(',')
    response = line[1].split(',')
    tensor = torch.zeros(len(response), 2)
    for li, response in enumerate(response):
        tensor[li][0] = int(ivl[li])
        tensor[li][1] = int(response)
    return tensor
```
Edit: Corrected the three errors found in the code.
This seems a bit excessive, but ok, I'll test it.
I tested it. But, the results were worse than earlier. Did I make some mistake in the code?
My results:
| | Log Loss | RMSE | R-squared |
| --- | --- | --- | --- |
| Original Optimizer | 0.2241 | 0.0191 | 0.8728 |
| Power forgetting curve | 0.2231 | 0.0147 | 0.9234 |
| Power difficulty | 0.2234 | 0.0170 | 0.9026 |
| Optimized grades (unified) | 0.2220 | 0.0105 | 0.9632 |
| Optimized grades (separated) | 0.2222 | 0.0159 | 0.9182 |
I'm also getting worse results. I think you made a mistake here:

```python
x_new = x.clone()
x_stab = x.clone()
x_diff = x.clone()
if x_new[1] == 1:
    x_stab[1] = self.w[13]
    x_diff[1] = self.w[17]
elif x_new[1] == 2:
    x_stab[1] = self.w[14]
    x_diff[1] = self.w[18]
elif x_stab[1] == 3:
    x_stab[1] = self.w[15]
    x_diff[1] = self.w[19]
elif x_new[1] == 4:
    x_stab[1] = self.w[16]
    x_diff[1] = self.w[20]
```

`elif x_stab[1] == 3:` should be `elif x_new[1] == 3:`
> I'm also getting worse results. I think you made a mistake here: `elif x_stab[1] == 3:` should be `elif x_new[1] == 3:`
Oh! You are right.
I'm still getting worse results, no idea why.
> `elif x_stab[1] == 3:` should be `elif x_new[1] == 3:`
Even after making the change, the result is the same. In retrospect, this makes sense because the initial value of x_new and x_stab is the same. So, the above error wouldn't have impacted the results.
I think that there is another undiscovered mistake in my code. Otherwise, I won't expect the results to be worse than before.
Edit: I found the error.
```python
# recall
if x_new[1] != self.w[13]:
    new_s = s * (1 + torch.exp(self.w[6]) *
```
Here, `x_new[1]` should be replaced with `x_stab[1]`.

Edit 2: Now, I think that this line should be replaced with `if x[1] > 1:` (the original code). This way, we can also exclude the manual rating.
I ran the optimizer again after making the correction. I got the following results:
| | Log Loss | RMSE | R-squared |
| --- | --- | --- | --- |
| Original Optimizer | 0.2241 | 0.0191 | 0.8728 |
| Power forgetting curve | 0.2231 | 0.0147 | 0.9234 |
| Power difficulty | 0.2234 | 0.0170 | 0.9026 |
| Optimized grades (unified) | 0.2220 | 0.0105 | 0.9632 |
| Optimized grades (separated) | 0.2214 | 0.0114 | 0.9572 |
Note: I haven't yet made the last correction (replacing `if x_new[1] != self.w[13]:` with `if x[1] > 1:`).
Edit: I made this change, but the results were the same.
> Note: I haven't yet made the last correction (replacing `if x_new[1] != self.w[13]:` with `if x[1] > 1:`).
Good catch, I'll correct it in my original code (a few comments above) as well.
> Change the way S is calculated after a lapse ("Again") to include the number of lapses.
@Expertium, why do you think that this should produce better results? The lapses are already considered during the calculation of difficulty, which affects the calculation of the stability.
> The lapses are already considered during the calculation of difficulty, which affects the calculation of the stability.
I don't see that in the formulas on wiki, or in the code. Can you point to a specific formula/line of code?
According to the following code, the difficulty increases with every again and hard rating (increases more with again rating).
According to the following code, the greater the difficulty, the smaller is the increase in the stability.
> According to the following code, the difficulty increases with every again and hard rating (increases more with again rating). https://github.com/open-spaced-repetition/fsrs4anki-helper/blob/1b4782d0db126235941d353739467fa67721df6c/reschedule.py#LL49-L50
Ah, that's what you mean. This isn't the same as adding up the total number of lapses and using the resulting number. Yes, difficulty depends on how many times you press "Again", but in a much less explicit way. In any case, in order to find out whether my idea works or not, we need Sherlock to implement it so we can test it.
@Expertium, would you like to do the Wilcoxon signed-rank test to compare the results of the separated adaptive grades and the unified adaptive grades? The result of that test will make it easier to choose the one we would want to use.
The latest version of the code for separated adaptive grades is here: https://github.com/open-spaced-repetition/fsrs4anki/issues/239#issuecomment-1529710497
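For reference, a minimal sketch of how that comparison could be run with `scipy.stats.wilcoxon`, assuming we already have the per-deck RMSE (or log-loss) values for the two variants; the deck-level numbers below are placeholders, not real results:

```python
from scipy.stats import wilcoxon

# Placeholder per-deck RMSE values for the two variants (one entry per deck/collection).
rmse_unified = [0.0105, 0.0132, 0.0098, 0.0121, 0.0143, 0.0110, 0.0127, 0.0115]
rmse_separated = [0.0114, 0.0129, 0.0101, 0.0125, 0.0140, 0.0113, 0.0131, 0.0118]

# Paired, two-sided test: is the median difference between the two variants zero?
stat, p_value = wilcoxon(rmse_unified, rmse_separated)
print(f"W={stat}, p={p_value:.3f}")  # p < 0.01 would indicate a real difference
```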
@L-M-Sherlock, can you implement the following so that we can test it?
- Change the way S is calculated after a lapse ("Again") to include the number of lapses. So from this: [formula image] To this: [formula image] Where L is the number of lapses. The +1 is needed to avoid multiplication by 0, which would result in S=0. w_13 should be negative, otherwise more lapses will lead to higher S, which is nonsense. This introduces 1 new parameter.
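For reference, here is a plausible reading of that proposal, based on the description and on the S-after-failure formula in the code above; the exact formulas were posted as images, so the multiplicative $(L+1)^{w_{13}}$ form below is my assumption:

$$S_{\text{forget}} = w_9 \cdot D^{w_{10}} \cdot S^{w_{11}} \cdot e^{(1-R)\,w_{12}} \;\rightarrow\; S_{\text{forget}} = w_9 \cdot D^{w_{10}} \cdot S^{w_{11}} \cdot e^{(1-R)\,w_{12}} \cdot (L+1)^{w_{13}}$$

with L the number of lapses and w_13 < 0, so that more lapses shrink the post-lapse stability.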
It requires extracting the `lapse` feature from the revlogs. And in my view, it could cause severe Ease Hell.
Well, then implement adaptive grades as described in my comment above (or in a better way, if you know how).
Here's the updated table:
Making 2 different sets of optimizable grades for S and D improved performance when compared to the old algorithm, but not when compared to a single set of optimizable grades. The p-value for that comparison is not shown in the table (in the table I only compare changes to the baseline), but it was well above 0.01, suggesting that separate grades for S and D don't improve performance.
Here, I have tried to extract from the above comments what needs to be done now so that we don't miss anything.
- Do away with the `EasyBonus` setting, because the optimizer would automatically calculate how much the stability/difficulty/interval should be changed with the press of the Easy button.

I have another idea, but I can't implement it. Basically, I want to add something like a self-correcting property to the algorithm so that it can correct its own bad results.
The idea is that if R_avg is close to R_requested, then the ratio of their logarithms will be close to 1, so it won't affect the calculation of S very much. In other words, a good algorithm doesn't need to correct itself. If R_avg is significantly higher than R_requested, then that means the algorithm is overestimating S and S needs to be decreased (and the ratio of logs in this case is <1). If R_avg is significantly lower than R_requested, then the algorithm is underestimating S and it needs to be increased (and the ratio of logs in this case is >1). This is especially helpful for very easy and very hard cards, for which the algorithm works poorly. Also, R_avg can bounce back up very quickly after a lapse (though it depends on w_13, of course), so the user won't be stuck in Ease Hell. If this ends up performing poorly due to C taking very large or very small values, I already have an idea for how to deal with that. And in case someone is wondering why logarithms: because ln(R_1)/ln(R_2) = S_2/S_1, where R_1 = 0.9^(t/S_1) and R_2 = 0.9^(t/S_2).
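The formulas themselves were posted as images, but based on the description, the correction factor C would look roughly like this (my reconstruction; the exponential-smoothing update for R_avg is taken from the reply below):

$$C = \frac{\ln(R_{avg})}{\ln(R_{requested})}, \qquad S_{corrected} = C \cdot S$$

$$R_{avg,\,new} = (1 - w_{13}) \cdot R_{avg,\,previous} + w_{13} \cdot \text{response}$$

When R_avg is above R_requested, C < 1 and S is pulled down; when it is below, C > 1 and S is pushed up.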
On a somewhat unrelated note, @L-M-Sherlock, is it possible to have FSRS enabled for every single repetition? No built-in Anki learning steps or re-learning steps, those will become optimizable parameters (or functions) as well.
@Expertium, I like the idea. But can you explain how you obtained the following equation?

`R_avg_new = (1 - w_13) * R_avg_previous + w_13 * response`

Edit: I got it. The above equation can be re-written as

`R_avg_new = R_avg_previous + w_13 * (response - R_avg_previous)`

So, the new average retention is calculated by adding a certain (optimized) fraction of the difference between the response and the previous average retention to the previous average retention.
> Although w_14 may end up being unnecessary - we can just use requested retention instead.
Comparing the equation for the first review to that for any subsequent review, the w_14 is equivalent to the average retention before the first review. I don't think that it can be taken as the requested retention.
Edit 2: Doesn't this idea (the addition of the C factor) seem to be a heuristic approach? I mean that the original FSRS algorithm is based on a lot of research conducted by @L-M-Sherlock, while this approach seems like we are just tweaking the equations and trying to get the desired results by trial and error.

Edit 3: It dawned on me that this is basically what research is. You first make a hypothesis based on certain considerations. You then test it through various means, and then modify the hypothesis based on the outcomes of the experiments.

But for this approach to give good results, we will need to test it with data from a large number of users. However, initial testing with some users may suffice to eliminate most of the bad ideas.
> But can you explain how you obtained the following equation?
It's just exponential smoothing. You can read more here: https://en.wikipedia.org/wiki/Exponential_smoothing
> The above equation can be re-written as `R_avg_new = R_avg_previous + w_13 * (response - R_avg_previous)`
I think `R_avg_new = (1 - w_13) * R_avg_previous + w_13 * response` is more clear, but whatever.
> Edit 2: Doesn't this idea (the addition of the C factor) seem to be a heuristic approach? I mean that the original FSRS algorithm is based on a lot of research conducted by @L-M-Sherlock, while this approach seems like we are just tweaking the equations and trying to get the desired results by trial and error.
Yes, but I'm not as smart as Woz, and I'm also not going to spend 35 years collecting data and improving algorithms just to produce a brilliant idea.
> Edit 3: It dawned on me that this is basically what research is. You first make a hypothesis based on certain considerations. You then test it through various means, and then modify the hypothesis based on the outcomes of the experiments.
Yeah, although you can do it in different ways. As I mentioned in another issue, we can adopt either the "Anything that decreases the loss is good, no matter how ad hoc" philosophy, or the "The algorithm must have a solid theoretical foundation and all values must be easily interpretable, no 'black boxes'" philosophy. If you take the first philosophy to its logical extreme, you will end up with a neural network - those things work wonders, but trying to interpret how exactly they work is almost futile. If you take the second philosophy to its logical extreme, you will end up with SM-18...after 35 years of collecting data and doing big brain thinking.
Based on the recent discussion, I implemented these variants of the FSRS optimizer and recorded their performance on my collection:
| Variant | Colab Link | Log Loss | RMSE | R-squared | Number of extra parameters |
| --- | --- | --- | --- | --- | --- |
| Original Optimizer | v3.17.1 | 0.3163 | 0.0209 | 0.8916 | 0 |
| Power forgetting curve (fixed power) | Expt/power-forgetting-curve | 0.3147 | 0.0200 | 0.9057 | 0 |
| Power difficulty (adaptive power) | Expt/power-function-for-difficulty | 0.3151 | 0.0210 | 0.8932 | 1 |
| Power stabilization curve | Expt/power-stabilization-curve | 0.3160 | 0.0216 | 0.8862 | 0 |
| Trainable ratings (only for difficulty) | Expt/trainable-ratings | 0.3159 | 0.0209 | 0.8917 | 4 |
When we discuss further improvements to FSRS, we shouldn't consider only the loss. The number of extra parameters is also important. As the number of parameters increases, the risk of overfitting becomes larger.
Hmm, that is strange - your loss and RMSE barely changed when you implemented trainable grades.
Also, shouldn't this formula be different? `new_s = self.w[0] + self.w[1] * (x[1] - 1)` Here, `x[1] - 1` should be `x[1] - self.w[13]`.
> `new_s = self.w[0] + self.w[1] * (x[1] - 1)` Here, `x[1] - 1` should be `x[1] - self.w[13]`.
In the pre-training stage, the only trainable parameters should be the initial stability values. Introducing trainable ratings there would result in inaccurate initial stability.
Ok, I'll test this later today and compare the results to what I had with my own implementation of trainable grades.
@L-M-Sherlock Regarding overfitting - why not split data into 2 datasets, "training" and "test" set? Optimize parameters on training set, then see how big the loss is on the test set. If test set loss is much higher than training set loss, that means the model is overfitted. That's a standard practice in ML, in fact, I'm surprised you haven't done it yet.
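A rough sketch of what that split could look like on top of the current optimizer; `card_histories`, `train`, and `evaluate` are assumed placeholders here, not functions that already exist in the notebook:

```python
import random

# Rough sketch: hold out 20% of the cards as a test set to detect overfitting.
# card_histories: assumed list of per-card review sequences (placeholder).
# train / evaluate: assumed wrappers around the existing training and loss code.
random.seed(42)
random.shuffle(card_histories)

split = int(0.8 * len(card_histories))
train_set, test_set = card_histories[:split], card_histories[split:]

w = train(train_set)                # optimize parameters on the training set only
train_loss = evaluate(w, train_set)
test_loss = evaluate(w, test_set)   # a much higher test loss suggests overfitting
print(train_loss, test_loss)
```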
On an unrelated note, I don't see clamping for new parameters (for grades) in the code.
> @L-M-Sherlock Regarding overfitting - why not split data into 2 datasets, "training" and "test" set? Optimize parameters on training set, then see how big the loss is on the test set. If test set loss is much higher than training set loss, that means the model is overfitted. That's a standard practice in ML, in fact, I'm surprised you haven't done it yet.
Because it would cost more optimization time. At the start of this project, I fixed the structure of the model, so there was no need to test it this way.
> On an unrelated note, I don't see clamping for new parameters (for grades) in the code.
Because it is experimental, and today is a workday. I don't have much time right now.
I tried all these versions of the optimizer. My observations:
Here is a table comparing the results:

| | Log Loss | RMSE | R-squared |
| --- | --- | --- | --- |
| Original Optimizer | 0.2241 | 0.0191 | 0.8728 |
| Power forgetting curve (fixed power) | 0.2240 | 0.0173 | 0.8949 |
| Power forgetting curve (adaptive power) | 0.2231 | 0.0147 | 0.9234 |
| Power difficulty (adaptive power) | 0.2234 | 0.0170 | 0.9026 |
| Adaptive grades (by @Expertium) | 0.2220 | 0.0105 | 0.9632 |
| sm-stabilization-curve (found this as a branch on GitHub) | 0.2228 | 0.0153 | 0.9179 |
| Power stabilization curve | 0.2222 | 0.0177 | 0.8946 |
| Trainable ratings (only for difficulty) | 0.2233 | 0.0184 | 0.8842 |
I would appreciate it if you tried Sherlock's version of adaptive grades, to see if it produces the same or different results compared to mine.
Also, what is sm stabilization curve?
> I would appreciate it if you tried Sherlock's version of adaptive grades, to see if it produces the same or different results compared to mine.
I have already tried. See the last row in the table. The results with Sherlock's version are worse than with your version.
Oh, my bad, I see now.
> Also, what is sm stabilization curve?
I found it as a branch on this GitHub repository. It is probably an incomplete implementation of something @L-M-Sherlock is trying to implement. While I was testing the other versions, I thought why not test this one too even if it is incomplete. (All tests run in parallel in separate tabs.)
Link: https://github.com/open-spaced-repetition/fsrs4anki/tree/Expt/sm-stabilization-curve
I haven't fully tested Sherlock's version yet, but I think I have identified the problem. I replaced this line:

`new_s = self.w[0] + self.w[1] * (x_new[1] - self.w[13])`

in my own version with this line:

`new_s = self.w[0] + self.w[1] * (x[1] - 1)`

and after that I got almost the same results as with the original algorithm. I changed it back, and loss and RMSE went down again. In other words, for some strange reason, not using trainable grades in this formula completely negates any benefit that could be gained otherwise.
> For some strange reason, not using trainable grades in this formula completely negates any benefit that could be gained otherwise.
This probably means that trainable grades are more important for the calculation of stability than for the calculation of difficulty.
This is also probably the reason I didn't get good results when I tried to train the grades for the stability and the difficulty separately.
I'm going to try it with Sherlock's version as well, to see if it makes a difference.
EDIT: yes, Sherlock's version has the same problem. `new_s = self.w[0] + self.w[1] * (x[1] - 1)` doesn't work as well as `new_s = self.w[0] + self.w[1] * (self.w[12+x[1].long()] - self.w[13])`.
I'm going to try something interesting - I will replace difficulty with a constant. If that doesn't perform worse than baseline, then it means that the current definition of difficulty is useless.
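For reference, that ablation could be as small as one change inside `forward()`; a sketch, with 5.0 being an arbitrary constant from the middle of the allowed [1, 10] difficulty range:

```python
# Ablation sketch: ignore the learned difficulty entirely and use a constant.
# 5.0 is an arbitrary value in the middle of the allowed [1, 10] range.
new_d = torch.full_like(d, 5.0)
```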
> This probably means that trainable grades are more important for the calculation of stability than for the calculation of difficulty.
However, except for the initial stability, the ratings don't affect the stability directly.
> This probably means that trainable grades are more important for the calculation of stability than for the calculation of difficulty.
>
> However, except for the initial stability, the ratings don't affect the stability directly.
Then it probably means that all (or most) of the benefit that can be gained by using adaptive grades comes from how they affect the initial stability.

In other words, it probably means that the differences between the initial stabilities for cards marked `again`, `hard`, `good` and `easy` are not equal. For example, if the difference between the initial stabilities of cards marked `again` and `hard` is 1, then the difference between the initial stabilities of cards marked `hard` and `good` is not necessarily 1 but can also be 1.5 or 2. (Currently, FSRS considers the difference between adjacent initial stabilities to be the same, namely w[1].)
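To make the contrast concrete, here are the two versions of the first-review stability formula that were already quoted earlier in the thread:

```python
# Original: adjacent grades are always exactly w[1] apart in initial stability.
new_s = self.w[0] + self.w[1] * (x[1] - 1)

# Adaptive grades: the spacing between grades is itself learned (w[13]..w[16]).
new_s = self.w[0] + self.w[1] * (x_new[1] - self.w[13])
```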
Which module is related to your feature request? Scheduler/Optimizer
Is your feature request related to a problem? Please describe. Here I will post all my ideas (new formulas) to improve the algorithm - decrease log-loss and RMSE.
Describe the solution you'd like
- Change the way S is calculated after a lapse ("Again") to include the number of lapses. So from this: [formula image] To this: [formula image] Where L is the number of lapses. The +1 is needed to avoid multiplication by 0, which would result in S=0. w_13 should be negative, otherwise more lapses will lead to higher S, which is nonsense. This introduces 1 new parameter.
- https://github.com/open-spaced-repetition/fsrs4anki/issues/239#issuecomment-1532173351 - Forget about it; anything that requires R_requested won't work. R_requested can be different if the user decided to change it at some point, and more importantly, it's just not a thing in native Anki: there is no such thing as setting a desired probability of recall.
- Also, I suggest replacing w_9 with e^w_9. It probably won't matter, but it's just more consistent, since in the other formula you use e^w_6 rather than just w_6.
None of these changes affect the forgetting curve (you can use either 0.9^(t/S) or (1 + t/(9S))^-1), so the meaning of S - the number of days it takes for the user's retention to drop from 100% to 90% - will remain unchanged.
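Quick sanity check of that claim: both curves give R = 0.9 at t = S, which is exactly the stated meaning of S.

$$0.9^{\,S/S} = 0.9, \qquad \left(1 + \frac{S}{9S}\right)^{-1} = \left(\frac{10}{9}\right)^{-1} = 0.9$$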
EDIT: purely for the sake of clarity and keeping things easy to understand, I think all formulas should be rewritten in such a way that the parameters are always positive. For example, if you have `D^w` (`w` is a parameter) and `w` is always positive - good, don't change anything. If `w` is always negative - rewrite the formula as `D^-w` and keep `w` positive. This should make it easier to read the wiki.