Closed Expertium closed 10 months ago
For grade-derived R, I have this code (it's mostly leftover from that one time we tried making a weighted average of predicted and grade-derived R):
from scipy.optimize import curve_fit
dataset = pd.read_csv("./revlog_history.tsv", sep='\t', index_col=None, dtype={'r_history': str ,'t_history': str} )
dataset = dataset[(dataset['i'] > 1) & (dataset['delta_t'] > 0) & (dataset['t_history'].str.count(',0') == 0)]
dataset['y'] = dataset['r'].map({1: 0, 2: 1, 3: 1, 4: 1})
dataset['lapse'] = dataset['r_history'].str.count('1')
dataset['delta_t'] = dataset['delta_t'].map(lambda x: round(1.2 * math.pow(1.4, math.floor(math.log(x, 1.4))), 2))
dataset['i'] = dataset['i'].map(lambda x: round(1.115 * math.pow(1.1, math.floor(math.log(x, 1.1))), 2))
dataset['lapse'] = dataset['lapse'].map(lambda x: min(x, 8))
grade_r_matrix = dataset.groupby(by=["delta_t", "i", "lapse"]).agg({'y': ['count', 'mean'], 'r': 'mean'}).reset_index()
grade_r_matrix.sort_values(by=[('r', 'mean')], inplace=True, ignore_index=True)
grade_means = grade_r_matrix['r']['mean']
r_means = (grade_r_matrix['y']['mean'] * grade_r_matrix['y']['count'] + 1 * grade_r_matrix['y']['mean'].mean()) / (grade_r_matrix['y']['count'] + 1)
count = grade_r_matrix['y']['count']
def func(x, a, b):
return 1-(np.power(1-np.power((x-1)/3, a), b))
params, covs = curve_fit(func, grade_means, r_means, bounds=((0.25, 0.25), (4, 4)), sigma=1/np.sqrt(count))
grade_derived_r_values = [func(x, *params) for x in [1, 2, 3, 4]]
grade_derived_r = dict(zip([1, 2, 3, 4], grade_derived_r_values))
The value for Again is always 0, the value for Easy is always 1. These 4 values will then be used in the hybrid loss function for optimization. Also, I tried using minimize
instead of curve_fit
, but for some reason it always sets parameters to their maximum or minimum values that are allowed by bounds.
As for weighing by recency, I suggest something like this:
alpha = 0.9
exp_weights = [alpha ** x for x in range(0, 10)]
Every value of the loss will be weighted based on how recent that review was. The most recent review will always have a weight of 1.0, more distant past reviews will have weight <1. This way D will depend more on recent reviews. Alpha will be an optimizable parameter.
The main problem of Best fit D
is it breaks the Markov property of FSRS. In FSRS, the current memory state only relies on the current input (delta_t, rating) and the last state (stability, difficulty). But Best fit D
requires the entire history.
Is it that important? I mean, from a practical point of view, not theoretical.
It's important. The current implementation of FSRS in Anki doesn't require the full review history when updating the memory state. If we breaks the Markov property, the time complexity of inference will jump from O(1) to O(n). And the time complexity of training will jump from O(n^2) to O(n^3).
Makes sense, though in practice it's possible that the amount of time required for inference and training will still be acceptable. But I don't think Dae will approve of this change, as it would require compeletely changing what data is available to the scheduler and how it's processed. Just to be perfectly clear, do you think that right now there is any reason to work on best fit D, or do you think that there is absolutely no reason whatsoever to work on it? EDIT: I would still like to see how well it performs in the benchmark, just to see the potential.
Wait, why is training O(n^2)? Shouldn't it be O(mn), where m is the number of epochs and n is the number of reviews?
Wait, why is training O(n^2)? Shouldn't it be O(mn), where m is the number of epochs and n is the number of reviews?
If a card has 9 review logs, it will generate 9 training samples. And a sample of length 9 requires the model to update the state 9 times. So it has 9 + 8 + 7 + … + 1 = 45 times of calculation.
If we introduce Best fit D
, a sample of length 9 requires the model update the state 9 times. And each update requires more calculation.
If you want to see the potential, we can use transformer. Its time complexity is O(n^2), too. ChatGPT is powered by transformer. But it is hard to train.
Yeah, going the neural network route kinda defeats the whole point of FSRS. It's not interpretable by humans and it has orders of magnitude more parameters.
I wonder how SuperMemo does it. My definition of best fit D is almost entirely based on the article about SM-17. So it works fine for SM-17.
Out of curiosity, do you know how long it takes to calculate DSR for one review, in Rust version, on your PC? Like, 1 millisecond? 100 microseconds?
I wonder how SuperMemo does it. My definition of best fit D is almost entirely based on the article about SM-17. So it works fine for SM-17.
But it is replaced in SM-18: https://supermemo.guru/wiki/Item_difficulty_in_Algorithm_SM-18
Out of curiosity, do you know how long it takes to calculate DSR for one review, in Rust version, on your PC? Like, 1 millisecond? 100 microseconds?
The benchmark of performance is implemented by @dae in https://github.com/open-spaced-repetition/fsrs-rs/pull/59.
open-spaced-repetition/fsrs-rs#59
I can't find any info about how long it takes to calculate DSR values
I don't know how to run the benchmark, so I recommend asking @dae.
It's just that I was thinking about the worst-case scenario. Suppose it takes 1 millisecond to calculate DSR values. Now suppose a card has been reviewed 101 times, so best fit D would have to go through 100 reviews (minus the first review). That's 100 milliseconds. And it likely won't converge in just one iteration, so let's say 5 iterations are required to guarantee convergence. That's 500 milliseconds between the user pressing "Show answer" and the answer buttons with their intervals being displayed. That's unacceptable. So unless DSR values take <100 microseconds to calculate, this probably won't be practical.
Currently study does not access the revlogs, so this would be adding a bunch of extra I/O and logic as well. The gains from it would have to be pretty big to justify the change.
The main problem of Best fit D is it breaks the Markov property of FSRS.
I thought that the plan was to use the best fit D approach to generate a set of optimal D values, try to build a function that approximates these values well and then use that function in the release version (not the optimization within optimization approach).
Here is a quote from @Expertium:
what if we did that in the optimizer, exported the resulting data (repetition history with best fit D for each review), and tried to come up with a function that approximates it really well? That's my idea. Compute best fit D, no matter how long it takes, and then try to find an approximation for it.
[...]
After this monstrosity is optimized, a file with the entire repetition history (delta_t, ratings, all that stuff) and best fit D should be available for download. Then you and I will download it, look at the values, plot them, analyze them however we want, and try to come up with a function that doesn't require this whole "optimization within optimization" crap, yet behaves similarly.
Originally posted by @Expertium in https://github.com/open-spaced-repetition/fsrs4anki/issues/282#issuecomment-1627781901
I think that it won't work. Approximating best fit D will likely result in something far less accurate, negating the benefit
If a card has 9 review logs, it will generate 9 training samples. And a sample of length 9 requires the model to update the state 9 times. So it has 9 + 8 + 7 + … + 1 = 45 times of calculation.
Maybe I'm misunderstanding something, but doesn't it mean that the first few reviews will affect the weights much more than the following reviews?
If the loss is calculated 9 times for the first review, it effectively means that the first review has a weight of 9, and if the loss is calculated only once for the last review, it effectively has a weight of 1.
So according to this, it takes about 9 microseconds per review. That's actually great! @L-M-Sherlock so going back to my worst case scenario above, if a card has been reviewed 101 times, then best fit D will have to go through 100 reviews, which will be 9•100=900 microseconds, or 0.9 milliseconds. Let's assume that 5 iterations are necessary for convergence, so that's 9•100•5=4500 microseconds, or 4.5 milliseconds. This means that this could be done without making Anki laggy when the user is pressing "Show answer". Though it would make optimization much longer. I would be glad it you started working on best fit D. I want to see a "proof of concept", in other words, I want to see if this idea has any merit if we ignore time constraints. If it improves RMSE greatly, then we can start thinking how to optimize it. If it barely improves RMSE, we can abandon this idea.
OK. I will try to implement the Best fit D
in the Python Optimizer for testing its potential.
@L-M-Sherlock so going back to my worst case scenario above, if a card has been reviewed 101 times, then best fit D will have to go through 100 reviews, which will be 9•100=900 microseconds, or 0.9 milliseconds. Let's assume that 5 iterations are necessary for convergence, so that's 9•100•5=4500 microseconds, or 4.5 milliseconds. This means that this could be done without making Anki laggy when the user is pressing "Show answer". Though it would make optimization much longer.
It's the performance of inference, not the performance of training, which is much slower.
One way of optimizing this that I can think of is saving each memory state (DSR values), and for cards with very long histories calculating best fit D using only 5-10 most recent reviews. This will greatly decrease CPU time, but increase the amount of memory needed.
I thought about the idea above some more. Since only 5-10 most recent reviews matter (the rest will have very small weights and won't contribute much to the loss), we only need to store at most 10 DSR values + 10 delta t values + 10 grade values. We don't actually need to store the entire history. Unless I'm wrong about how FSRS works, it should be possible to initialize it from any arbitrary point in the card's history, as long as D, S, R, delta_t and grade are available at that point. This reduces time complexity back to O(1), since the optimization procedure only has to go through a fixed number of reviews. It's also O(1) in memory complexity since we need to store a fixed amount of datapoints, not the entire history of the card. Initially (in the comment above) I thought that we need to store the entire history so we can use 5-10 most recent reviews. This would have O(n) memory complexity. But why store the entire history if we can initialize FSRS from any arbitrary point as long as D, S, R, delta_t and grade are available? This, of course, relies on the assumption that using <=10 most recent reviews will produce the same accuracy as using the full history, but that seems like a reasonable assumption to me.
I suggest we wait to see how this benchmarks before thinking about such things. Neither requiring revlogs at review time or needing to pack more data into each card is desirable, so this will need to be a significant improvement.
- Go back through the card's history and use D_1 in formulas for S and PLS
If the card's history includes 5 reviews, does the D_1 here remain unchanged?
D_n remains unchanged during each iteration, where n is the number of the iteration. Say, for example, that D_1=5, first iteration. This means that we need to go through the card's history and use this value in formulas for S and PLS. For each review we record the loss, which depends on predicted R and the real outcome. So suppose the average loss is 0.5 for D_1=5. Then the iteration ends and the next iteration begins. Suppose D_2 is now equal to 7. We go through the card's history again, record the values of loss, and suppose now it's 0.4. So the loss decreased. The iteration ends, we adjust difficulty again, use D_3, and so on, until some convergence criteria is fulfilled, for example, until the absolute difference between values of D_n becomes <0.01.
scipy.optimize.minimize
I find that this method doesn't support batch optimization. Our default batch size is 512.
I write a demo script:
import torch
import torch.nn as nn
from torch import Tensor
from typing import List
def power_forgetting_curve(t, s):
return (1 + t / (9 * s)) ** -1
class FSRS(nn.Module):
def __init__(self, w: List[float]):
super(FSRS, self).__init__()
self.w = nn.Parameter(torch.tensor(w, dtype=torch.float32))
def stability_after_success(
self, stabilities: Tensor, difficulties: Tensor, r: Tensor, rating: Tensor
) -> Tensor:
hard_penalty = torch.where(rating == 2, self.w[15], 1)
easy_bonus = torch.where(rating == 4, self.w[16], 1)
new_s = stabilities * (
1
+ torch.exp(self.w[8])
* (11 - difficulties)
* torch.pow(stabilities, -self.w[9])
* (torch.exp((1 - r) * self.w[10]) - 1)
* hard_penalty
* easy_bonus
)
return new_s
def stability_after_failure(
self, stabilities: Tensor, difficulties: Tensor, r: Tensor
) -> Tensor:
new_s = (
self.w[11]
* torch.pow(difficulties, -self.w[12])
* (torch.pow(stabilities + 1, self.w[13]) - 1)
* torch.exp((1 - r) * self.w[14])
)
return new_s
def step(self, X: Tensor, stabilities: Tensor, difficulties: Tensor) -> Tensor:
"""
:param X: shape[batch_size, 2], X[:,0] is elapsed time, X[:,1] is rating
:param state: shape[batch_size, 2], state[:,0] is stability, state[:,1] is difficulty
:return state:
"""
if torch.equal(stabilities, torch.zeros_like(stabilities)):
keys = torch.tensor([1, 2, 3, 4])
keys = keys.view(1, -1).expand(X[:, 1].long().size(0), -1)
index = (X[:, 1].long().unsqueeze(1) ==
keys).nonzero(as_tuple=True)
# first learn, init memory states
new_s = torch.ones_like(stabilities)
new_s[index[0]] = self.w[index[1]]
else:
r = power_forgetting_curve(X[:, 0], stabilities)
condition = X[:, 1] > 1
new_s = torch.where(
condition,
self.stability_after_success(stabilities, difficulties, r, X[:, 1]),
self.stability_after_failure(stabilities, difficulties, r),
)
new_s = new_s.clamp(0.1, 36500)
return new_s
def forward(self, inputs: Tensor, difficulties: Tensor) -> Tensor:
"""
:param inputs: shape[seq_len, batch_size, 2]
:param difficulties: shape[batch_size, 1]
"""
stabilities = torch.zeros_like(difficulties)
outputs = []
for X in inputs:
stabilities = self.step(X, stabilities, difficulties)
outputs.append(stabilities)
return torch.stack(outputs), stabilities
init_w = [
0.4,
0.9,
2.3,
10.9,
4.93,
0.94,
0.86,
0.01,
1.49,
0.14,
0.94,
2.18,
0.05,
0.34,
1.26,
0.29,
2.61,
]
model = FSRS(init_w)
loss_fn = nn.BCELoss(reduction="none")
t_history = "0,1,2,4,8,16"
r_history = "1,3,3,3,3,3"
t_tensor = torch.tensor([int(t) for t in t_history.split(",")[:-1]], dtype=torch.float32)
r_tensor = torch.tensor([int(r) for r in r_history.split(",")[:-1]], dtype=torch.float32)
t_next_tensor = torch.tensor([int(t) for t in t_history.split(",")[1:]], dtype=torch.float32)
y_next_tensor = torch.tensor([0 if r == "1" else 1 for r in r_history.split(",")[1:]], dtype=torch.float32)
def loss(difficulty):
with torch.no_grad():
inputs = torch.stack([t_tensor, r_tensor], dim=1).unsqueeze(1)
difficulties = torch.tensor([difficulty], dtype=torch.float32).unsqueeze(1)
stabilities, _ = model(inputs, difficulties)
retentions = power_forgetting_curve(t_next_tensor, stabilities.squeeze())
return loss_fn(retentions, y_next_tensor).mean()
low = 1
high = 10
best_d = 5
while high - low > 0.01:
mid1 = low + (high - low) / 3
mid2 = high - (high - low) / 3
loss1 = loss(mid1)
loss2 = loss(mid2)
if loss1 < loss2:
high = mid2
else:
low = mid1
best_d = (low + high) / 2
best_d
The interesting thing is, the best_d
output is binary, i.e. 1 or 10.
The interesting thing is, the
best_d
output is binary, i.e. 1 or 10.
Well, time to add grade-derived R then. If you need reasonable default values, I suggest 0.7 for Hard and 0.95 for Good, based on my own data.
Also, remember that BCE loss behaves strangely when labels aren't 1 and 0, so WozLoss is necessary.
I guess it is intended, because Woz said that:
As Algorithm SM-17 would match repetition history to recall predictions, only the timing of failing repetitions would decide item difficulty. This resulted in clustering items in very difficult or very easy quantiles with fewer values in between. Source: https://supermemo.guru/wiki/Item_difficulty_in_Algorithm_SM-18#Item_difficulty_in_Algorithm_SM-18
By the way, I tested using grade-derived R with your code, it makes D non-binary.
grade_derived_r = dict({1: 0, 2: 0.7, 3: 0.95, 4: 1})
y_next_tensor = torch.tensor([grade_derived_r.get(int(r)) for r in r_history.split(",")[1:]], dtype=torch.float32)
For example: t_history = "0,1,2,4,8,16" r_history = "1,3,3,3,3,3" D=3.97
t_history = "0,1,2,4,8,16" r_history = "1,3,3,3,3,4" D=1.20
t_history = "0,1,2,4,8,16" r_history = "1,3,3,3,3,2" D=8.57
I still used BCE loss though, but in the final version WozLoss should be used.
Interestingly, with WozLoss the results are quite different when I change the last grade to 2. The values of D for the first two histories are at least somewhat similar, but the value of D for the last history is completely different.
loss = -torch.log(1-torch.abs(retentions-y_next_tensor))
t_history = "0,1,2,4,8,16" r_history = "1,3,3,3,3,3" D=4.28
t_history = "0,1,2,4,8,16" r_history = "1,3,3,3,3,4" D=2.65
t_history = "0,1,2,4,8,16" r_history = "1,3,3,3,3,2" D=4.54
Slightly unrelated, but I'm really curious to see what will be the median value of a in loss = a * WozLoss(r, binary_recall) + (1-a) * WozLoss(r, grade_derived_r)
, once we run the optimizer on all collections.
Out of curiosity, I decided to try to use a different method of finding best fit D. I used the improved secant method from this paper: https://vixra.org/pdf/1405.0013v1.pdf.
def improved_secant(x0, e, N):
# x0 is the initial guess
# e must be a small positive number, for example, 0.01
# N is the max. number of iterations
iter = 1
while True:
a = abs(loss(x0)) / 2
x1 = max(min(x0 + a - a * (loss(x0 + a) / (loss(x0 + a) - loss(x0))), 10), 1)
if abs(x0 - x1) < e:
return x1
else:
x0 = x1
iter += 1
if iter > N:
print('Failed to converge')
break
But according to my testing, it converges only when the minimum value of loss(D) is 0 for some D. Regular secant method seems to only return 1 or 10, unless I artifically modify the loss to reach 0, and it's slower, so neither of them are usable. I can't think of any way to ensure that condition, since the minimum value of the loss depends on the card's history and on a
in loss = a * WozLoss(r, binary_recall) + (1-a) * WozLoss(r, grade_derived_r)
. Even if only grade-derived loss is used, I still can't think of a way to ensure that the condition is always satisfied. This is unfortunate, because when modified secant method works, it's about 3 times faster than your loop.
I "made" an algorithm that is roughly 3-3.5 times faster than the current loop, and by "made" I mean "copied scipy.optimize.brent and removed a bunch of stuff and modified it to work with our case", and I also have an idea how to greatly reduce the time in certain cases, specifically when the optimal value of D is 1 or 10, so that we have to run the optimization procedure only when the optimal value is somewhere inbetween. It's pretty long and speed is not the main focus right now since we don't even have a proper implementation yet, so I'll post it later. Btw, that algorithm might be useful for finding optimal retention to speed it up a bit. It can be used for any function that has one variable and bounds.
Yeah...mini-batch optimization is definitely required.
@L-M-Sherlock will you keep working on this?
The training costs me 3 days, which are impractical.
Well, if you can't think of any way to speed it up, then I suppose you can close this issue.
This will obviously be very computationally expensive since we are running an optimization procedure after every review, but I still want to try it. There is one more problem. Since Hard, Good and Easy all count as 1 when it comes to recall, this history: 3,3,3 And this history: 4,4,4 Will result in exactly the same value of D. Solution: derive R from grades and use a weighted average of two losses, like this:
loss = a * WozLoss(r, binary_recall) + (1-a) * WozLoss(r, grade_derived_r)
I could try to implement this myself, but there is one major obstacle (well, a lot of obstacles): I have no idea how to make a function that goes through the card's entire history, calculates the value of the loss for each review, an then outputs it, and it has to be withinstep()
, something like this:Originally posted by @Expertium in https://github.com/open-spaced-repetition/fsrs4anki/issues/461#issuecomment-1792359565
Simply put, the key idea is to perform an optimization procedure after each review to find a value of D that provides the best fit to this card's review history.
EDIT: there should be 2 optimizable parameters. The first one is a in
loss = a * WozLoss(r, binary_recall) + (1-a) * WozLoss(r, grade_derived_r)
, 0<a<1. The second one is alpha for exponential weights for each value of the loss, it determines how quickly weights decrease as we get further into the past, 0<alpha<1. See my next comment, I gave a very simple example of what exponential weights would look like.