I'll benchmark it
This change will be useful for sorting.
However, I don't think that it would improve the algorithm because there won't be any significant change in the value of 11 - D.
I would expect a greater improvement if we simultaneously update the formula that uses D for calculating S. (But I don't know what changes should be made in that formula.)
I interpreted his suggestion as "Add this to the main formula". And surprisingly, it (kind of) worked!
I am shocked. I have tried so many changes to the D formula, so the fact that this helps even by 1% is surprising. @L-M-Sherlock we can improve RMSE by 1% (relatively) by doing this:
```python
def next_d(self, state: Tensor, rating: Tensor) -> Tensor:
    delta_d = -self.w[6] * (rating - 3)
    new_d = state[:, 1] + delta_d * (1 - (state[:, 1] / 10))  # linear damping
    new_d = self.mean_reversion(self.init_d(4), new_d)
    return new_d
```
It's not much, but it's "free" - no new parameters are needed
OK. I will benchmark it in 20k collections tomorrow. If the result is stable, I will add it to FSRS-5.
One problem: when D = 10, 1 - D / 10 = 0, so rating Easy will not decrease D.
Dang. I'll modify it and benchmark again
```python
def linear_damping_one_way(self, delta_d: Tensor, old_d: Tensor) -> Tensor:
    return torch.where(delta_d <= 0, delta_d, delta_d * (10 - old_d) / 9)

def next_d(self, state: Tensor, rating: Tensor) -> Tensor:
    delta_d = -self.w[6] * (rating - 3)
    new_d = state[:, 1] + self.linear_damping_one_way(delta_d, state[:, 1])
    new_d = self.mean_reversion(self.init_d(4), new_d)
    return new_d
```
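To make the one-way asymmetry concrete, here is a minimal standalone check (the value of w[6] and the tensors are made up for illustration, not taken from any benchmarked parameter set):

```python
import torch

w6 = 1.0                               # hypothetical w[6], illustration only
old_d = torch.tensor([9.0, 9.0])
rating = torch.tensor([1.0, 4.0])      # Again, Easy

delta_d = -w6 * (rating - 3)           # tensor([ 2., -1.])
damped = torch.where(delta_d <= 0, delta_d, delta_d * (10 - old_d) / 9)
print(damped)                          # tensor([ 0.2222, -1.0000])
# The increase from Again is damped near D = 10; the decrease from Easy is applied in full.
```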
I will benchmark this one.
Ok, looks good. Why are you dividing by 9, though? Oh, ok, I get it. Because 10 - D is never 10; since D is at least 1, it's at most 9.
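Spelled out (this is just a restatement of the point above, given that FSRS clamps D to [1, 10]):

$$\frac{10 - D}{9} \in [0, 1] \quad \text{for } D \in [1, 10],$$

so the damping factor is 1 (no damping) at the minimum difficulty $D = 1$ and 0 (full damping) at $D = 10$.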
Weird. The variant is worse than FSRS-5 in my initial test with the top 352 collections:
Model: FSRS-5-dev
Total number of users: 352
Total number of reviews: 10744330
Weighted average by reviews:
FSRS-5-dev-1 LogLoss (mean±std): 0.2939±0.1591
FSRS-5-dev-1 RMSE(bins) (mean±std): 0.0495±0.0335
FSRS-5-dev-1 AUC (mean±std): 0.7001±0.0771
Weighted average by log(reviews):
FSRS-5-dev LogLoss (mean±std): 0.3348±0.1636
FSRS-5-dev RMSE(bins) (mean±std): 0.0681±0.0416
FSRS-5-dev AUC (mean±std): 0.6950±0.0892
Weighted average by users:
FSRS-5-dev LogLoss (mean±std): 0.3380±0.1646
FSRS-5-dev RMSE(bins) (mean±std): 0.0706±0.0426
FSRS-5-dev AUC (mean±std): 0.6942±0.0914
Model: FSRS-5
Total number of users: 352
Total number of reviews: 10744330
Weighted average by reviews:
FSRS-5 LogLoss (mean±std): 0.2938±0.1587
FSRS-5 RMSE(bins) (mean±std): 0.0493±0.0332
FSRS-5 AUC (mean±std): 0.6989±0.0813
Weighted average by log(reviews):
FSRS-5 LogLoss (mean±std): 0.3344±0.1629
FSRS-5 RMSE(bins) (mean±std): 0.0677±0.0411
FSRS-5 AUC (mean±std): 0.6949±0.0905
Weighted average by users:
FSRS-5 LogLoss (mean±std): 0.3375±0.1639
FSRS-5 RMSE(bins) (mean±std): 0.0701±0.0421
FSRS-5 AUC (mean±std): 0.6941±0.0927
I'm benchmarking the two-way method. And it performs better:
Model: FSRS-5-dev
Total number of users: 118
Total number of reviews: 3151255
Weighted average by reviews:
FSRS-5-dev LogLoss (mean±std): 0.3465±0.1428
FSRS-5-dev RMSE(bins) (mean±std): 0.0557±0.0322
FSRS-5-dev AUC (mean±std): 0.6978±0.0639
Weighted average by log(reviews):
FSRS-5-dev LogLoss (mean±std): 0.3624±0.1551
FSRS-5-dev RMSE(bins) (mean±std): 0.0726±0.0395
FSRS-5-dev AUC (mean±std): 0.6995±0.0924
Weighted average by users:
FSRS-5-dev LogLoss (mean±std): 0.3631±0.1570
FSRS-5-dev RMSE(bins) (mean±std): 0.0747±0.0398
FSRS-5-dev AUC (mean±std): 0.6987±0.0955
Model: FSRS-5
Total number of users: 118
Total number of reviews: 3151255
Weighted average by reviews:
FSRS-5 LogLoss (mean±std): 0.3474±0.1433
FSRS-5 RMSE(bins) (mean±std): 0.0563±0.0322
FSRS-5 AUC (mean±std): 0.6952±0.0648
Weighted average by log(reviews):
FSRS-5 LogLoss (mean±std): 0.3629±0.1555
FSRS-5 RMSE(bins) (mean±std): 0.0728±0.0394
FSRS-5 AUC (mean±std): 0.6977±0.0942
Weighted average by users:
FSRS-5 LogLoss (mean±std): 0.3637±0.1573
FSRS-5 RMSE(bins) (mean±std): 0.0749±0.0397
FSRS-5 AUC (mean±std): 0.6969±0.0975
So... should we implement it?
The two-way is the one where Easy (or any other grade) doesn't change difficulty if it's already at 10? No, let's not. Users will definitely complain.
Yeah. But it improves RMSE by 1% (relatively) across 8493 collections. It seems to say that ease hell is a myth, similar to #695.
I still think it would increase the number of complaints from users. @user1823 thoughts?
Not decreasing D when the user presses Easy seems wrong. But the RMSE is decreasing. There can be two scenarios that explain this:
In either case, I think that implementing this would be worth it.
However, it would be great if we could think of some modification that prevents this problem while still producing a similar decrease in RMSE.
I'm afraid that "I pressed Easy and D is still at 100%" will outweigh any algorithmic benefits
If @richard-lacasse has an idea to solve this problem, then it would be great.
Otherwise, I think that we should implement this for the reason I mentioned here.
I'm afraid that "I pressed Easy and D is still at 100%" will outweigh any algorithmic benefits
To those users, say "just trust the algorithm".
Moreover, this issue is really not that important because with the proposed change, having a card with D = 10 is difficult.
Even currently, I have only 171 cards out of 15000 cards that have D exactly equal to 10, even though 30% of my collection has D between 9 and 10.
It's an asymptote; Difficulty should never reach 10 exactly, that's the point. You'd have to hit Again so many times that the gap to 10 gets smaller than machine epsilon, and I'm not sure that's even possible.
Still, suppose that D=9.9. In that case difficulty will barely change if the user presses Easy.
But it will change a lot relative to the other cards in that range. If you want the changes to be visible to users, display Difficulty on a log scale or something. I already have 99% of my cards showing up in the 90%-100% range, so the graph is already pretty useless.
Moreover, this issue is really not that important because with the proposed change, having a card with D = 10 is difficult.
Even currently, I have only 171 cards out of 15000 cards that have D exactly equal to 10, even though 30% of my collection has D between 9 and 10.
Maybe I'm just weird and this isn't a problem for other people.
This is what my D distribution looks like. When I do a search, 2650/8888 cards fall into prop:d=1, which is maxed out.
Still, suppose that D=9.9. In that case difficulty will barely change if the user presses Easy.
One issue I can see is that even though it will be fine for sorting purposes, D still affects S, so stability won't be affected as much by the Easy button once D gets close to 10. But if it's performing better in the simulations anyway...
But, it will take much longer to get near 9.9 because of the asymptote. Every change that's closer to 10 is smaller, so there will have to be a lot of Again button presses to even get near that.
Alright, fine. @L-M-Sherlock it's up to you. Implement the two-way version if you want
Are people ever hitting Easy on high difficulty cards anyway? I know I never do. Those are the cards that have given me the most trouble.
But, it will take much longer to get near 9.9 because of the asymptote. Every change that's closer to 10 is smaller, so there will have to be a lot of Again button presses to even get near that.
I did an experiment to confirm that.
With my parameters and rating_sequence 3, 1, 1, 1, 1, 1, 1, 1, I got the following successive values of D.
Current formula: 5.3624, 9.0620, 10, 10, 10, 10, 10, 10
Proposed formula: 5.3624, 7.0779, 8.1582, 8.8385, 9.2669, 9.5366, 9.7065, 9.8134
So, even though the current formula makes the D equal to 10 just after pressing Again twice, the proposed formula takes D to 9.8 even after 7 Again button presses.
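For anyone who wants to reproduce this kind of sequence, a rough sketch of the iteration is below. The parameter values and the mean-reversion form are assumptions on my part (not user1823's actual parameters), so the exact numbers will differ:

```python
import torch

# Hypothetical parameter values, for illustration only.
w6, w7 = 1.0, 0.02
init_d4 = torch.tensor(5.36)   # assumed D after an initial Good rating

def mean_reversion(init, current):
    # Assumed form of the FSRS mean reversion toward init_d(4).
    return w7 * init + (1 - w7) * current

def next_d(d, rating, damped):
    delta_d = -w6 * (rating - 3)
    if damped:
        new_d = d + delta_d * (1 - d / 10)   # proposed linear damping
    else:
        new_d = d + delta_d                  # current formula, no damping
    return mean_reversion(init_d4, new_d).clamp(1, 10)

d_cur, d_new = init_d4, init_d4
for i in range(7):                           # seven consecutive Again presses
    d_cur, d_new = next_d(d_cur, 1, False), next_d(d_new, 1, True)
    print(f"press {i + 1}: current {d_cur.item():.4f}, damped {d_new.item():.4f}")
```

Extending the loop (e.g. `range(2000)`) is also a quick way to check the "how long until it rounds to 10" question raised later in the thread.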
| Model | Parameters | LogLoss | RMSE (bins) | AUC |
| --- | --- | --- | --- | --- |
| FSRS-5-Linear Dampening | 19 | 0.319±0.0053 | 0.049±0.0010 | 0.702±0.0028 |
| FSRS-5 | 19 | 0.320±0.0052 | 0.050±0.0010 | 0.700±0.0028 |
The new median parameters: [0.40255, 1.18385, 3.173, 15.69105, 7.1949, 0.5345, 1.4604, 0.0046, 1.54575, 0.1192, 1.01925, 1.9395, 0.11, 0.29605, 2.2698, 0.2315, 2.9898, 0.51655, 0.6621]
The old median parameters: [0.4043, 1.18225, 3.13915, 15.6764, 7.2285, 0.4904, 1.0644, 0.0239, 1.6253, 0.1423, 1.0983, 1.95, 0.1014, 0.2998, 2.2834, 0.2407, 2.967, 0.5045, 0.66445]
Maybe I'm just weird and this isn't a problem for other people.
@richard-lacasse I have known other users with this issue, so it's not specific to you. I like your idea, and I don't think practically anyone would reach a high D value, press Easy, and then fret about how the DSR values changed.
It is worth noting that w[7] is reduced from 0.0239 to 0.0046. And w[9] is reduced from 0.1423 to 0.1192.
Moreover, this issue is really not that important because with the proposed change, having a card with D = 10 is difficult. Even currently, I have only 171 cards out of 15000 cards that have D exactly equal to 10, even though 30% of my collection has D between 9 and 10.
Maybe I'm just weird and this isn't a problem for other people.
This is what my D distribution looks like. When I do a search, 2650/8888 cards fall into prop:d=1, which is maxed out.
@richard-lacasse Nope, you are pretty normal. With the current version, my new decks clock in at pretty ungodly Difficulty values.
the proposed formula takes D to 9.8 even after 7 Again button presses.
Can you run it to see how long it takes to reach 10 exactly? Mathematically that's impossible, but it will reach machine epsilon at some point and probably round up to 10.
Assuming that I wrote the code correctly, the D didn't reach 9.995 even after 2000 consecutive Again presses.
It is worth noting that w[7] is reduced from 0.0239 to 0.0046.
Will it cause more complaints?
It is worth noting that w[7] is reduced from 0.0239 to 0.0046.
Will it cause more complaints?
I don't think so. If D increases slowly on pressing Again, then it should also decrease slowly on pressing Good.
OK, I will implement it in FSRS-rs and benchmark it.
You need to implement it both in the Rust version and in the Python version, otherwise the comparison won't be fair.
You need to implement it both in the Rust version and in the Python version, otherwise the comparison won't be fair.
I just haven't pushed the updated python version to the repo. The benchmark result above is based on the Python version.
With the Linear Dampening, I have more cards whose difficulty is in the 10th bin...
My previous parameters: [1.0962, 1.543, 7.8692, 12.0038, 8.1849, 0.5031, 0.6852, 0.001, 1.3281, 0.0796, 0.8666, 2.5417, 0.0128, 0.2952, 0.7547, 0.0001, 3.2912, 0.1097, 0.6747]
My current parameters: [1.114, 1.601, 8.0182, 12.2391, 8.0261, 0.5454, 3.0807, 0.0026, 1.255, 0.0208, 0.7996, 2.5398, 0.0122, 0.3219, 0.3941, 0.0001, 3.6492, 0.102, 0.6778]
The w[6] increases significantly from 0.68 to 3.08.
The difficulty increases quickly, too.
However, it improves the metric:
By the way, the model version will be updated to FSRS-5.5 if we introduce this change because I have released FSRS-5 in some packages.
Bad news: the change makes FSRS-rs worse. I don't know why. I need to check the consistency between the Python version and the Rust version again.
Update: OK, I found a serious problem and fixed it:
By the way, the model version will be updated to FSRS-5.5 if we introduce this change because I have released FSRS-5 in some packages.
Does this mean that the benchmark will have FSRS-5 and FSRS-5.5? Also, maybe call it FSRS-5.1 and not FSRS-5.5? It's a pretty small change, after all.
Also, maybe call it FSRS-5.1 and not FSRS-5.5? It's a pretty small change, after all.
It's not a small change because some parameters have been changed significantly. Actually, I guess we need to do something like:
some parameters have been changed significantly.
I agree. For example,
My parameters with current main: 1.0588, 4.7011, 31.5471, 79.2369, 7.561, 0.9454, 3.0717, 0.0, 1.8531, 0.3046, 1.3861, 1.6149, 0.0357, 0.4327, 1.8443, 0.0003, 6.0, 0.9915, 0.6734
My parameters with linear damping: 1.0652, 4.7012, 30.4195, 79.3164, 7.5425, 1.0393, 3.8364, 0.0, 1.8143, 0.292, 1.3124, 1.5111, 0.0276, 0.4369, 1.7987, 0.0003, 6.0, 1.013, 0.6778
So, there is a 25% increase in w6 and a 23% decrease in w12. The other parameters have changed slightly.
- RMSE with old parameters evaluated using old optimizer: 0.0140
- RMSE with new parameters evaluated using new optimizer: 0.0138
- RMSE with old parameters evaluated using new optimizer: 0.0156
This means that if the user doesn't optimize their parameters after updating, their RMSE will increase. (This is true for almost any change to the algorithm, though.)
Actually, I guess we need to do something like:
Unfortunately, finding a mathematical relation between the old and new parameters will be very difficult in this case.
It's not a small change because some parameters have been changed significantly.
But the impact on metrics is small, around 1%. I think FSRS-5.1 makes more sense than FSRS-5.5 in this case
Another problem: what if the initial difficulty is 10? In this case, the difficulty is immutable (if w[7] is 0).
Fine. Now we have the reason to set a non-zero lower limit for w[7].
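A minimal sketch of what such a lower limit could look like in the parameter-clipping step (the bound values below are placeholders I made up, not decided values; the clamp call only illustrates the idea of keeping w[7] away from zero):

```python
import torch

# Hypothetical bounds for w[7]; the actual values would need benchmarking.
W7_MIN, W7_MAX = 0.001, 0.75

def clip_w7(w: torch.Tensor) -> None:
    # Keep some mean reversion toward init_d(4) even if the optimizer
    # pushes w[7] toward (or below) zero.
    with torch.no_grad():
        w[7].clamp_(W7_MIN, W7_MAX)
```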
Just as discussed in the other topic about w7 (https://github.com/open-spaced-repetition/fsrs4anki/issues/695#issuecomment-2426490983), this change has shifted the optimal value of w7 further into the negatives. This means the optimizer wants to push all cards toward D = 10, and with the clipped value being the mode, for most people all cards will tend to sit at max D all of the time.
Then I guess we shouldn't implement damping after all.
The problem is not with the damping itself, but that the optimizer prefers D = 10 with the current setup, so damping D means the optimizer compensates by decreasing w7 even further until it clips to whatever value you set. The best solution would be to investigate why and change the formulas so that the optimizer doesn't go down that path.
I've experimented with dozens of changes to the D formula and nothing worked. At this point I'm just waiting for someone smarter than me to completely re-define D.
A conservative conversion is new w[6] = old w[6] + 0.5.
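A minimal sketch of how such a conversion could be applied to an existing parameter list (the function name is mine; only the +0.5 offset comes from the comment above):

```python
def convert_parameters_conservatively(old_w: list[float]) -> list[float]:
    # Keep every parameter as-is except w[6], which gets the conservative +0.5 bump.
    new_w = list(old_w)
    new_w[6] = old_w[6] + 0.5
    return new_w
```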
I've experimented with dozens of changes to the D formula and nothing worked. At this point I'm just waiting for someone smarter than me to completely re-define D.
@Expertium I know you've tried R as a parameter in the Difficulty equation, but did you also take R out of the parameter list in Stability? D is a parameter in Stability, so if R is a parameter in both, that might explain why adding R to Difficulty didn't help anything.
R affecting the increase in stability is one of the most important principles of the DSR model. Without it, we can't model the spacing effect since R depends on interval length, so removing it from the formula of S would mean that S doesn't depend on the interval length.
S' depends on S, which has the interval length information, and if R is in the formula for Difficulty, and D is in the formula for Stability, then R is indirectly affecting the increase in Stability still.
I'm spitballing here a little bit, but this is the first thing I would toy with.
I've noticed that with my cards, they very quickly reach the maximum Difficulty score of 10. This might be somewhat due to my specific parameters, but I suspect others might experience a similar pattern, perhaps to a lesser extent.
When I mark a card as Again or Hard, it increases the Difficulty score significantly. However, when I mark a card as Good, it barely changes. While this is likely intentional, after getting a card wrong about half a dozen times, its Difficulty score becomes stuck at 10. Even if I subsequently get the card right 100 times, getting it wrong just once pushes it back to the maximum Difficulty.
The issue with this is that sorting cards by Difficulty Ascending becomes less useful because many cards are tied at $D = 10$. As a result, I lose the granularity that this sorting method could provide. I think this could be addressed by using an asymptote instead of a hard cap at 10, while keeping the original equation for Difficulty ($D$) the same.
Currently, when $D$ exceeds 10, I believe the algorithm uses something like:

$$D' = \min(D + \Delta D,\ 10)$$
This means any information beyond 10 is lost each time it's capped. Instead, we could use an asymptote with a damping factor. For example:
1. Linear Dampening Factor
Let $\Delta D$ be the change in difficulty computed by the current formula. Then update $D'$ using:

$$D' = D + \Delta D \left(1 - \frac{D}{10}\right)$$

In this equation, as $D$ approaches 10, the factor $1 - \frac{D}{10}$ decreases, reducing the increment $\Delta D$ and causing $D'$ to approach 10 asymptotically.
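For a quick sense of the asymptote, here is a worked example with an arbitrary $\Delta D = 2$ (an illustrative value of mine, not from the thread):

$$
\begin{aligned}
D = 5:&\quad 5 + 2\left(1 - \tfrac{5}{10}\right) = 6\\
D = 8:&\quad 8 + 2\left(1 - \tfrac{8}{10}\right) = 8.4\\
D = 9.5:&\quad 9.5 + 2\left(1 - \tfrac{9.5}{10}\right) = 9.6
\end{aligned}
$$

The closer $D$ already is to 10, the less of the raw increment actually lands.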
2. Exponential Damping Factor
Alternatively, you could parameterize the damping factor similar to how it's done in the Stability equation:

$$D' = D + \Delta D \left(1 - e^{-w (10 - D)}\right)$$

Here, $w$ would be a new weight, a positive constant that controls the rate of damping. As $D$ approaches 10, the factor $1 - e^{-w (10 - D)}$ goes to zero, causing $D'$ to approach 10 asymptotically.
Implementing an asymptote at 10 could preserve the usefulness of sorting by Difficulty and prevent the loss of information due to capping. It might be worth exploring this adjustment in the next version of FSRS to see if it improves the algorithm.