Also, regarding S and siblings.
It's quite easy to add a new parameter such that whenever a card is reviewed, all siblings get a bonus to their S: S_new = w*S. The issue is that it would require rewriting the entire optimizer code, since after that change a card's own history is no longer enough to determine its S; the optimizer would also need to go through the history of all of that card's siblings.
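For illustration, the scheduler-side change would look roughly like this (a sketch only; `Card`, `apply_sibling_bonus`, and the new parameter `w` are made-up names, not actual FSRS code):

```python
# Placeholder sketch, not actual FSRS code.
class Card:
    def __init__(self, stability):
        self.stability = stability

def apply_sibling_bonus(siblings, w):
    # S_new = w * S for every sibling of the card that was just reviewed
    for sibling in siblings:
        sibling.stability *= w

siblings = [Card(5.0), Card(12.0)]
apply_sibling_bonus(siblings, w=1.1)
print([c.stability for c in siblings])  # roughly [5.5, 13.2]
```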
Idea 2: A new power function R=f(S, t).
```python
def power_curve(delta_t, S):
    return (1 + delta_t / (4.26316 * S)) ** -0.5

next_t = max(1, round(float(4.26316 * (states[0] - (requestRetention ** 2) * states[0]) / (requestRetention ** 2))))
```
Here 4.26316 is just a constant that ensures that R=90% when S=t. I have already tested it in v4.0.0 Beta, it decreases RMSE by roughly 22% for my decks.
Pros: very easy to implement and benchmark. Cons: this deviates from theory even further, as in theory forgetting should be exponential.
@L-M-Sherlock @user1823 thoughts? I would like to see Sherlock benchmark this.
Currently, I have just one suggestion: Use the accurate value of the constant.
```python
def power_curve(delta_t, S):
    return (1 + 19 * delta_t / (81 * S)) ** -0.5
```
If you are wondering how to calculate this, refer to the formula I shared here: https://github.com/open-spaced-repetition/fsrs4anki/issues/215#issuecomment-1527456460 (obtained using Wolfram Alpha)
Thank you, I was wondering how to derive the value of that constant. Also, this comment of mine was wrong, as the shape does change. Though we still need to figure out the inverse function (that gives t when R and S are known) in its general form. If you can do that, that would be great!
> Though we still need to figure out the inverse function (that gives t when R and S are known) in its general form. If you can do that, that would be great!
Well, this can be obtained by using basic algebra on the formula I shared previously.
But, I still calculated it for you: $$t = S \times \frac{R^{1/f} - 1}{0.9^{1/f} - 1}$$
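A quick numerical check that the forward and inverse formulas agree (here f is the exponent of the power curve itself, e.g. f = -0.5 in the code above):

```python
def power_curve(t, S, f=-0.5):
    k = 0.9 ** (1 / f) - 1   # ensures R = 0.9 when t = S; equals 19/81 for f = -0.5
    return (1 + k * t / S) ** f

def inverse_power_curve(R, S, f=-0.5):
    return S * (R ** (1 / f) - 1) / (0.9 ** (1 / f) - 1)

S = 10
for R in (0.95, 0.9, 0.8):
    t = inverse_power_curve(R, S)
    print(R, t, power_curve(t, S))  # round-trips back to R; t = S when R = 0.9
```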
Thank you. @L-M-Sherlock, what's your opinion on the following questions? (user1823, you too)
1) Should this be implemented at all?
2) If yes, then should this be implemented now or in some future version, months from now?
3) Should we make it a new optimizable parameter that is different for different users, or use an average value of f, the same value for everyone?
EDIT: it has to be fixed, otherwise it's impossible to accurately estimate initial S before the optimization starts.
In my opinion, the answers would depend upon the results of benchmark testing. Make f an optimizable parameter. Then, look for answers to:
- How much did the RMSE decrease?
- What is the optimized value of f for different collections? Do they have similar values of f, or does it differ significantly among different collections?

If FSRS wasn't being integrated into Anki, I would suggest implementing this later, because it would introduce more issues with compatibility between different versions and all of that. But since FSRS is being integrated into Anki, we might as well make some changes. It's kinda like using a new building material: you're not going to demolish an entire building just to use a new type of concrete to rebuild it, but if you have to build a new building anyway, you might as well use the new type of concrete.
> How much did the RMSE decrease?
About 20-25% for my decks (fixed f), but of course we need a proper benchmark.
> What is the optimized value of f for different collections? Do they have similar values of f or does it differ significantly among different collections?
Good point.
Actually, I just realized that making f optimizable rather than fixed creates a major problem: initial S is estimated before the optimization starts, so it's impossible to accurately estimate initial S if f will be changed later.
While this rules out making f an optimizable parameter, it doesn't rule out using some fixed value of f. It would take a lot of time, but Sherlock could run the benchmark with several different values of f to find which one decreases RMSE the most.
```python
def power_curve(delta_t, S):
    return (1 + 19 * delta_t / (81 * S)) ** -0.5
```
I will test it in the benchmark soon.
Idea 3: don't remove same-day reviews. Don't change S, but only change D. Currently, the optimizer doesn't take same-day reviews into account because it's unclear how they would affect S; we don't have a formula for short-term memory. However, what if they only affect D? This could help with differentiating cards based on difficulty. In other words, when a same-day review occurs, only the formula for D should be used to modify the value of D; the value of S should remain unchanged. This would also mean adding a few more parameters, since the optimal parameters for D for a same-day review could be different from the parameters for a review after >1d. Pros: could improve the estimation of difficulty. Cons: would require major changes to all modules (optimizer, scheduler, helper).
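To make the proposal concrete, here is a rough sketch (not actual optimizer code; the three helpers are dummies standing in for the real FSRS formulas, and `next_d_same_day` would get its own new parameters):

```python
# Dummy stand-ins for the real FSRS formulas, just to show the control flow.
def next_s(S, D, R, rating): return S * 2.0
def next_d(D, rating): return D - 0.5 * (rating - 3)
def next_d_same_day(D, rating): return D - 0.7 * (rating - 3)

def power_forgetting_curve(t, S): return (1 + t / (9 * S)) ** -1

def update_state(S, D, rating, elapsed_days):
    if elapsed_days < 1:
        # same-day review: only difficulty is updated, stability stays the same
        return S, next_d_same_day(D, rating)
    R = power_forgetting_curve(elapsed_days, S)
    return next_s(S, D, R, rating), next_d(D, rating)

print(update_state(10.0, 5.0, rating=3, elapsed_days=0))   # S unchanged
print(update_state(10.0, 5.0, rating=3, elapsed_days=10))  # both S and D updated
```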
I will test it in the benchmark soon.
Algorithm | Log Loss | RMSE | RMSE(bins) |
---|---|---|---|
FSRS v4 (old) | 0.3830 | 0.3311 | 0.0543 |
FSRS v4 (new) | 0.3788 | 0.3298 | 0.0476 |
p = 0.1 (No significant difference)
That's very strange. Are you sure you changed everything correctly?
1) The power function in the section where initial S is estimated
2) The power function in the main optimization section
3) The inverse function (next_t)
There is only one function for calculating the retention in fsrs-optimizer:
```python
def power_forgetting_curve(t, s):
    return (1 + t / (9 * s)) ** -1
```
So I only modified that function.
Ah, ok. When I was testing it in an older version, I had to modify it in two places. Assuming you didn't forget to modify the inverse function too, I'm surprised that the difference wasn't statistically significant. I would recommend running the benchmark with a few more values of the power (see user1823's formulas); I suggest trying 0.4 and 0.3.
I think the effect of idea 2 depends on the degree of heterogeneity of the user's data. It really improved the performance on some users' collections, but it also made some users' performance worse. Besides, if you are learning a language, immersion study outside Anki would also flatten the forgetting curve.
If the power could become an optimizable parameter, it would be much better and would adapt to each user. The issue is that I don't know how to combine that with our method of estimating initial S.
Actually, how about this: after the 4 initial values of S are estimated, we allow the optimizer to change them? Though choosing the clamping range is an issue.
@L-M-Sherlock how about this: the initial S is estimated using f=0.5 (or 1), then the 4 values are passed to the optimizer as parameters, clamped to (0.1, 365), and then the optimizer can change them, as well as f? This way the optimizer can use those estimates as initial guesses and update them as f changes.
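Roughly what I have in mind, as a sketch only (the numbers and names are illustrative, not fsrs-optimizer code):

```python
import torch

# The 4 initial stabilities start from the pre-training estimates (computed with a
# fixed curve) and are then trained together with the exponent f, clamped after
# every optimizer step.
init_s = torch.nn.Parameter(torch.tensor([0.4, 1.0, 2.4, 5.8]))  # example estimates
f = torch.nn.Parameter(torch.tensor(-0.5))                        # curve exponent

optimizer = torch.optim.Adam([init_s, f], lr=4e-2)

def clamp_after_step():
    # call this right after optimizer.step()
    with torch.no_grad():
        init_s.clamp_(0.1, 365.0)
        f.clamp_(-1.0, -0.1)   # assumed range for the exponent
```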
Related to dispersing siblings: https://github.com/thiswillbeyourgithub/AnnA_Anki_neuronal_Appendix. I don't know how computationally expensive this is, aka how much it will slow Anki down and introduce lag, and it also doesn't seem like it will be easy to integrate into FSRS. Still, it's worth trying in the future. For now, I would still like Sherlock to try what I suggested in the comment above regarding the power function.
@L-M-Sherlock, I have a question. I have 2 ideas with a lot of potential. 1) https://github.com/open-spaced-repetition/fsrs4anki/issues/461#issuecomment-1723526030 2) https://github.com/open-spaced-repetition/fsrs4anki/issues/461#issuecomment-1722475003 The first one is easier to implement, and also more likely to bear fruit. You could start working on them now and, as a result, make FSRS v5 and integrate it into Anki, so that when a stable version of Anki is released, it will have FSRS v5; or you could wait and work on these ideas later, release FSRS v5 later, and the new stable Anki version will ship with FSRS v4 for some time. Do you want to start working on this sooner or later?
I plan to add fsrs-rs into fsrs-benchmark. It is a high priority, because I want to know the difference between these two implementations.
Then I have another research project to develop, where I need to collaborate with a PhD from the University of Minnesota. It would also take a lot of my time. So work on FSRS v5 wouldn't start for weeks.
@L-M-Sherlock do you think there is a need for V5? V4 seems to be more than good enough.
I'm not Sherlock, obviously, but I believe that if there is a way to make the algorithm more accurate, then why not?
he seems to be working really hard :(
Well, it's not the end of the world if he starts working on what I suggested above a few weeks or a few months later, just as long as he doesn't completely abandon improving accuracy. Ideally, I would love it if FSRS got so accurate that RMSE would be ~1% for all users (or at least the vast majority of users). If most users had RMSE around 1%, I would say that at that point, the pursuit of accuracy is no longer important.
Also, I've mentioned this before, but the correlation coefficient between log-loss and RMSE(bins) is only around 0.4-0.5 (according to my testing), and it's not uncommon to see a huge improvement in RMSE while log-loss barely changes. So ideally, we should find a way to optimize directly for RMSE rather than for log-loss. But I have no idea how to do that while still using the mini-batch approach.
@L-M-Sherlock I apologize for the inconvenience, but can you help me? I'm trying to make a custom loss function that mimics RMSE, but for some reason both train and test losses don't change during optimization.
```python
import torch
import torch.nn as nn

class CustomLoss(nn.Module):
    def __init__(self):
        super(CustomLoss, self).__init__()

    def forward(self, predictions, labels):
        bins = 5
        zip_ab = torch.stack((predictions, labels), dim=1)
        counts = torch.zeros(bins)
        correct = torch.zeros(bins)
        prediction = torch.zeros(bins)
        for p, r in zip_ab:
            bin = min(int(p * bins), bins - 1)
            counts[bin] += 1
            correct[bin] += r
            prediction[bin] += p
        prediction_means = torch.tensor(prediction) / torch.tensor(counts)
        prediction_means = torch.nan_to_num(prediction_means, 0)
        correct_means = correct / counts
        correct_means = torch.nan_to_num(correct_means, 0)
        numerator = torch.sum(counts * (correct_means - prediction_means) ** 2)
        denominator = torch.sum(counts)
        RMSE = torch.sqrt(numerator / denominator)
        return RMSE.clamp(0.00001, 100)
```
And then replace `self.loss_fn = nn.BCELoss(reduction='none')` with `self.loss_fn = CustomLoss()`. I'm doing this in an older version, 4.0.0 Beta.
This is what happens (screenshot omitted): the train and test losses stay completely flat.
EDIT: maybe it has something to do with the fact that this loss function is non-differentiable, but if that's the case, I would expect pytorch to raise an error and stop the optimization.
EDIT 2: I think the for loop is the problem, so now I'm trying to figure out how to get rid of it and use something else instead. It seems that due to problems with differentiation, it's impossible to use RMSE itself, but it may be possible to use something similar.
EDIT 3: I could use a different approach, the one Woz used, where instead of bins he just takes the raw 1's and 0's and smooths them using a moving average. But for that to work, the values in the tensor must be sorted by predicted R, kinda like this:
```
([[0.0718, 0.0000],
  [0.0719, 1.0000],
  [0.1261, 0.0000],
  [0.6392, 1.0000],
  [0.6617, 1.0000],
  [0.8616, 1.0000]])
```
And the sort() function is obviously not differentiable. So this doesn't work either.
> I'm trying to make a custom loss function that mimics RMSE, but for some reason both train and test losses don't change during optimization.
Because your loss function is not differentiable. Error backpropagation is broken in this loss function, so the weights never update.
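For example, here is a minimal illustration of where the graph breaks: casting the predictions to integer bin indices detaches everything built from them.

```python
import torch

p = torch.tensor([0.3, 0.7], requires_grad=True)
bins = (p * 5).long()        # casting to an integer index is not differentiable
counts = torch.zeros(5)
counts[bins[0]] += 1         # in-place write of a constant at a detached index
print(counts.requires_grad)  # False: there is no gradient path from p to counts
```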
Yeah, I already figured it out. I also tried a different approach, with smoothing labels, but that requires sorting, which would also make the loss function non-differentiable.
There exists this: https://github.com/teddykoker/torchsort
No idea how they made sorting differentiable, but this could work. I'll try to use it for the loss function later.
Btw, are you working on converting the benchmark into Rust?
> Btw, are you working on converting the benchmark into Rust?
It has been completed.
> There exists this: https://github.com/teddykoker/torchsort
For some reason I can't install this on my PC, so I can't play around with it before trying to integrate it into a loss function in a google colab version of FSRS. Maybe in the future, once you're ready to work on FSRS v5, I could describe my idea to you and maybe you could implement it using this differentiable sort.
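For reference, this is the kind of minimal check I'd start with once it installs, assuming the `soft_sort` API described in the torchsort README:

```python
import torch
import torchsort  # https://github.com/teddykoker/torchsort

x = torch.tensor([[0.64, 0.07, 0.86, 0.13]], requires_grad=True)  # 2-D: (batch, n)
y = torchsort.soft_sort(x, regularization_strength=0.1)
y.sum().backward()
print(y)       # approximately sorted values
print(x.grad)  # non-zero gradients, so the sort can sit inside a loss function
```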
@L-M-Sherlock I have developed a smooth version of binned RMSE, but it has some other problem. Here's the code; if you don't understand what exactly is going on there, I can explain it.
```python
import torch
import torch.nn as nn

class CustomLoss(nn.Module):
    def __init__(self):
        super(CustomLoss, self).__init__()

    def forward(self, predictions, labels):
        # everything below is hardcoded to work with 5 bins
        a = 200
        # a new, smooth alternative to discrete bins
        smooth_bins = (
            1 / (1 + torch.exp(-a * (predictions - (1 / 5))))
            + 1 / (1 + torch.exp(-a * (predictions - (2 / 5))))
            + 1 / (1 + torch.exp(-a * (predictions - (3 / 5))))
            + 1 / (1 + torch.exp(-a * (predictions - (4 / 5))))
        )
        b = 4
        # the values in bin0-bin4 can be interpreted as "how much does the respective
        # value of predicted R belong to this bin?"; it's a smooth alternative to counts
        # close to 0 = predicted R doesn't belong to this bin
        # close to 1 = predicted R belongs to this bin
        # in the discrete version, the values can only be 0 or 1, but in the smooth
        # version, they can be any real number in [0, 1]
        bin0 = torch.exp(-b * ((0 - smooth_bins) ** 2))
        bin1 = torch.exp(-b * ((1 - smooth_bins) ** 2))
        bin2 = torch.exp(-b * ((2 - smooth_bins) ** 2))
        bin3 = torch.exp(-b * ((3 - smooth_bins) ** 2))
        bin4 = torch.exp(-b * ((4 - smooth_bins) ** 2))
        # weighted averages of predicted R in each bin
        bin0_wmean_p = torch.sum(bin0 * predictions) / torch.sum(bin0)
        bin1_wmean_p = torch.sum(bin1 * predictions) / torch.sum(bin1)
        bin2_wmean_p = torch.sum(bin2 * predictions) / torch.sum(bin2)
        bin3_wmean_p = torch.sum(bin3 * predictions) / torch.sum(bin3)
        bin4_wmean_p = torch.sum(bin4 * predictions) / torch.sum(bin4)
        # weighted averages of measured recall
        bin0_wmean_l = torch.sum(bin0 * labels) / torch.sum(bin0)
        bin1_wmean_l = torch.sum(bin1 * labels) / torch.sum(bin1)
        bin2_wmean_l = torch.sum(bin2 * labels) / torch.sum(bin2)
        bin3_wmean_l = torch.sum(bin3 * labels) / torch.sum(bin3)
        bin4_wmean_l = torch.sum(bin4 * labels) / torch.sum(bin4)
        # sums of "smooth counts" in each bin
        # they are pretty close to the sums of real counts in discrete bins
        bin0_sum = bin0.sum()
        bin1_sum = bin1.sum()
        bin2_sum = bin2.sum()
        bin3_sum = bin3.sum()
        bin4_sum = bin4.sum()
        # weighted squared differences for each bin
        w_diff0 = bin0_sum * (bin0_wmean_p - bin0_wmean_l) ** 2
        w_diff1 = bin1_sum * (bin1_wmean_p - bin1_wmean_l) ** 2
        w_diff2 = bin2_sum * (bin2_wmean_p - bin2_wmean_l) ** 2
        w_diff3 = bin3_sum * (bin3_wmean_p - bin3_wmean_l) ** 2
        w_diff4 = bin4_sum * (bin4_wmean_p - bin4_wmean_l) ** 2
        # final result
        smooth_RMSE = torch.sqrt(
            (w_diff0 + w_diff1 + w_diff2 + w_diff3 + w_diff4)
            / (bin0_sum + bin1_sum + bin2_sum + bin3_sum + bin4_sum)
        )
        return smooth_RMSE.clamp(0.00001, 100)
```
Now here's the weird part. Initially it works: all values look good, loss is decreasing, it seems that backpropagation is working fine.
But then all values of predicted R turn into NaN's.
So my guess is that either the gradient explodes somewhere and causes an overflow, or the gradient vanishes to zero. Or something like that, something related to numerical precision. I really hope that you can help me, because this idea has lots of potential. I've already mentioned it multiple times, but log-loss isn't perfectly correlated with binned RMSE, so we could likely improve RMSE a lot by using a different loss. I believe that my implementation is very close to working properly; I just need help with this final problem. In theory, there is nothing non-differentiable here, so if something goes wrong, it likely has something to do with numerical precision and floating point weirdness.
Yep, it was some kind of floating point weirdness. Changing `a` from 200 to 100 made it work.
I've tested it in 4.0.0 Beta, and it decreased the final binned RMSE by 5% on average. Well, this is underwhelming. It seems that for users who already have a fairly low RMSE, it makes it worse. This is surprising; I had very high hopes for this loss function. We could further fine-tune a, b, and the number of bins, but that would be practically impossible without some sort of automated grid search.
As I mentioned in https://github.com/open-spaced-repetition/spaced-repetition-algorithm-metric/issues/1, it's possible to cheat binned RMSE. Log Loss is still the preferred metric.
> Idea 3: don't remove same-day reviews. Don't change S, but only change D.
Work in Progress: https://github.com/open-spaced-repetition/fsrs-optimizer/tree/Feat/short-term-schedule
I don't see the changes
It doesn't need to modify the model. I just allow the model to accept same-day (intra-day) reviews.
Then how do you make it so that S doesn't change? Also, I think that same-day reviews and long-term reviews are sufficiently different to warrant using different parameters for D. In other words, we would need 4 more parameters.
Also, after the idea with D I would like you to try this: https://github.com/open-spaced-repetition/fsrs4anki/issues/461#issuecomment-1723526030
When `delta_t=0`, the r is one, so the SInc is one, too. The stability will not increase.
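A simplified sketch of why (the stability-increase term vanishes when R = 1; `w8`, `w9`, `w10` stand for the corresponding FSRS parameters, and the exact form may differ slightly between versions):

```python
import math

def stability_after_success(S, D, R, w8=1.0, w9=0.1, w10=1.0):
    sinc = 1 + math.exp(w8) * (11 - D) * S ** -w9 * (math.exp(w10 * (1 - R)) - 1)
    return S * sinc

print(stability_after_success(10.0, 5.0, R=1.0))  # R = 1 (delta_t = 0): S stays 10.0
print(stability_after_success(10.0, 5.0, R=0.9))  # R < 1: S increases
```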
> When delta_t=0, the r is one, so the SInc is one, too.
I see. As for the parameters, you could try both options (the same parameters for same-day and long-term reviews, and separate parameters) and see if it makes a difference.
Here is another loss function, not based on RMSE this time:
```python
import torch
import torch.nn as nn

class CustomLoss(nn.Module):
    def __init__(self):
        super(CustomLoss, self).__init__()

    def forward(self, predictions, labels):
        x = torch.abs(predictions - labels)
        alpha = 1
        if alpha == 1:
            loss = -torch.log(1 - x)
        else:
            loss = (alpha / (alpha - 1)) * (1 - torch.pow(1 - x, 1 - (1 / alpha)))
        return loss.clamp(0.00001, 100)
```
Relevant paper: https://arxiv.org/abs/1902.04639. Basically, log-loss can be seen as a special case of this function. Alpha can be fine-tuned on large bodies of data from many users; I would suggest going in increments of 0.05, from alpha=0.75 to alpha=1.50. However, I tested it a little bit, and I'm not sure there exists a value of alpha that makes the final RMSE lower for all users (or at least the vast majority); it seems that different values are better for different users. This, of course, requires testing on many collections, far more than I have. But even if no value is universally better for everyone, it may still be useful: run the optimizer multiple times with different values of alpha, and then select the final optimal parameters that correspond to the lowest RMSE. Though that would make optimization much slower.
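As a quick sanity check that the general formula approaches log-loss as alpha goes to 1:

```python
import torch

x = torch.tensor([0.1, 0.3, 0.6])            # |predictions - labels|
for alpha in (0.75, 0.9, 0.999, 1.0, 1.5):
    if alpha == 1:
        loss = -torch.log(1 - x)             # log-loss special case
    else:
        loss = (alpha / (alpha - 1)) * (1 - torch.pow(1 - x, 1 - (1 / alpha)))
    print(alpha, loss)
# the alpha = 0.999 row is almost identical to the alpha = 1.0 (log-loss) row
```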
Algorithm | Log Loss | RMSE | RMSE(bins) |
---|---|---|---|
FSRS v4 (short-term) | 0.3763 | 0.3284 | 0.0527 |
FSRS v4 | 0.3820 | 0.3311 | 0.0547 |
It reduces RMSE(bins) by ~4%.
Please also test it with 4 new parameters for D.
I need to deal with GitHub's storage limit before testing that. GitHub only provides 1 GB of large-file storage for my organization, and I have run out of it. I plan to migrate the dataset to Hugging Face.
Algorithm | Log Loss | RMSE | RMSE(bins) |
---|---|---|---|
FSRS v4 (short-term + an extra param for D) | 0.3762 | 0.3283 | 0.0530 |
FSRS v4 (short-term) | 0.3763 | 0.3284 | 0.0527 |
FSRS v4 | 0.3820 | 0.3311 | 0.0547 |
No significant improvement.
Ok. Btw, you should report p-values as well. Even if a change is small, it might be statistically significant. Right now, I can't tell whether either of those two modifications is significantly better than the baseline.
How did you calculate the p-value in your above comparison https://github.com/open-spaced-repetition/fsrs4anki/issues/461#issuecomment-1749699706?
```python
wilcoxon = scipy.stats.wilcoxon(list1, list2).pvalue
```
I've already shared this code before, so I thought you had it. The requirements are:
1) `list1` and `list2` must have the same length.
2) The values (RMSE) in them must be in the same order, in the sense that if the first value in `list1` corresponds to collection number 1, then the first value in `list2` must also correspond to collection number 1, and so on.
This requires storing each individual value of RMSE for each collection, so if your tool doesn't do that, well, you'll have to modify it and run it again.
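For example (with made-up numbers, just to show the shape of the data):

```python
import scipy.stats

# Per-collection RMSE(bins); both lists cover the same collections in the same order
# (the numbers here are made up for illustration).
rmse_baseline = [0.041, 0.027, 0.081, 0.031, 0.026, 0.098, 0.106]
rmse_modified = [0.042, 0.039, 0.072, 0.030, 0.023, 0.078, 0.102]

print(scipy.stats.wilcoxon(rmse_baseline, rmse_modified).pvalue)
```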
FSRS v4 (short-term + an extra param for D) metric: RMSE(bins)
```
[0.0436727 0.03891526 0.07077876 0.03023096 0.02303037 0.07943025
0.09796523 0.03881957 0.02618989 0.09467553 0.04605238 0.05018691
0.02107598 0.12429119 0.0308913 0.09286572 0.0154809 0.06795415
0.02577697 0.02818856 0.03159889 0.02604321 0.04552785 0.07640666
0.20995825 0.04814789 0.068789 0.02821913 0.04983103 0.02761347
0.0180587 0.06334363 0.08405417 0.1074614 0.06001543 0.06521105
0.06416036 0.12901405 0.02841174 0.03227344 0.03627659 0.08149775
0.03504154 0.04307414 0.08599463 0.02454969 0.11007824 0.07768198
0.03376889 0.06603339 0.02351576 0.04367831 0.13177285 0.03325768
0.04630706 0.0212299 0.14071096 0.02075332 0.03158164 0.07796272
0.04789457 0.03470105 0.12901405 0.07169742 0.08426874 0.03366842
0.07680579 0.02431026 0.02556063 0.02089902 0.02007906]
```
FSRS v4 (short-term) metric: RMSE(bins)
```
[0.04241678 0.03867456 0.07157832 0.03027802 0.02339608 0.0780774
0.10159418 0.03689583 0.02712977 0.09436468 0.04646398 0.05040396
0.02089216 0.12462688 0.02986748 0.09177812 0.01631185 0.06779186
0.02520268 0.0281984 0.03170996 0.02657709 0.04676382 0.08053423
0.20991729 0.04848806 0.07195603 0.03070107 0.04937659 0.02887608
0.01703606 0.06388293 0.08496899 0.1114117 0.06068563 0.06510311
0.06434896 0.12919515 0.02859861 0.03256329 0.03609464 0.08242436
0.03507658 0.04314541 0.08506663 0.02500591 0.11015774 0.07800003
0.03201884 0.06572443 0.02357949 0.04464861 0.13089531 0.03218396
0.04554602 0.02308676 0.14072921 0.0193745 0.0313562 0.07719207
0.04770614 0.03493355 0.12919515 0.07190192 0.08407294 0.0338681
0.07656596 0.02513343 0.02603192 0.02089604 0.02020214]
```
FSRS v4 metric: RMSE(bins)
```
[0.04075794 0.0268294 0.08121647 0.03093209 0.02593867 0.09839182
0.10583658 0.03891485 0.04406325 0.07954123 0.04679042 0.04804213
0.01725823 0.12281713 0.03161628 0.09923111 0.01536706 0.07257397
0.0261901 0.02569173 0.03196444 0.0261565 0.05705469 0.08994775
0.1575826 0.04797631 0.0785234 0.03610742 0.04819663 0.02947626
0.01658424 0.06502142 0.0858977 0.10590823 0.06729937 0.08071882
0.06540472 0.12620505 0.02711572 0.03664367 0.0365849 0.08201617
0.04044704 0.04468506 0.08523012 0.02453584 0.13007018 0.07525011
0.03627887 0.06687165 0.02289361 0.04512974 0.0952118 0.03163233
0.0387739 0.02291338 0.15349764 0.02354482 0.04563545 0.07794301
0.05052105 0.03235588 0.12620505 0.08722311 0.08706212 0.04426921
0.0847449 0.02430047 0.02147822 0.02105417 0.01377378]
```
I understand that FSRS v5 won't be released for a while, but I still decided to open this issue to collect good ideas. There is already an issue for D, so I won't be sharing ideas related to D; granted, I don't have any anyway.
Idea 1: R-matrix. I have already talked about it here, but I want to talk more about the details. Last time Sherlock tried it, there was an important flaw with grouping, and unfortunately Sherlock still hasn't tried it with improved grouping, so here's how I think it should be done:
- For t, a formula similar to this should be used: `round(math.pow(1.4, math.floor(math.log(x, 1.4))), 2)`. This formula is used in the B-W Heatmap, and it should work well here too. Although, I suggest a slight modification: `round(1.2*math.pow(1.4, math.floor(math.log(x, 1.4))), 2)`. This makes the rounded values closer to the original values on average. The problem with the first formula is that the rounded values are always less than or equal to the original values, so on average they are smaller than the original values.
- For n, I came up with this: `math.ceil(math.floor((x+0.75)**0.75)**(1/0.75))`. Again, I ensured that on average the rounded values are close to the original ones, so there is no over- or underestimation.
- For lapse, just use `min(x, 8)`. This way all cards with >=8 lapses will be put into the same category, as they are all leeches.

Pros: adding a self-correction mechanism to FSRS will allow it to correct its own bad predictions based on real data. Cons: hard to implement (especially on mobile), possible non-monotonicity of the final estimate of R, which relies both on the theoretical value and the R-matrix value.
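To make the grouping concrete, a rough sketch (the R-matrix itself is just a table from (t_bin, n_bin, lapse_bin) to the empirically measured retention; the names here are illustrative):

```python
import math
from collections import defaultdict

def t_bin(x):      # interval in days, using the modified B-W Heatmap rounding
    return round(1.2 * math.pow(1.4, math.floor(math.log(x, 1.4))), 2)

def n_bin(x):      # number of reviews
    return math.ceil(math.floor((x + 0.75) ** 0.75) ** (1 / 0.75))

def lapse_bin(x):  # all cards with >= 8 lapses share one category (leeches)
    return min(x, 8)

# Accumulate (recalled, total) per group; the R-matrix estimate of R for a group
# is then recalled / total.
r_matrix = defaultdict(lambda: [0, 0])

def record_review(t, n, lapses, recalled):
    key = (t_bin(t), n_bin(n), lapse_bin(lapses))
    r_matrix[key][0] += int(recalled)
    r_matrix[key][1] += 1
```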
Idea 2: A new power function R=f(S, t).
Here 4.26316 is just a constant that ensures that R=90% when S=t. I have already tested it in v4.0.0 Beta, it decreases RMSE by roughly 22% for my decks.
Pros: very easy to implement and benchmark. Cons: this deviates from theory even further, as in theory forgetting should be exponential.