open-spaced-repetition / fsrs4anki

A modern Anki custom scheduling implementation based on the Free Spaced Repetition Scheduler algorithm
https://github.com/open-spaced-repetition/fsrs4anki/wiki
MIT License

[Feature request] Best fit D #520

Closed: Expertium closed this issue 10 months ago

Expertium commented 11 months ago
@L-M-Sherlock I really want to test my idea of best fit difficulty, but it's too complex for me to implement. I've described it before, but I'll describe it here again:
  1. For the first review, just calculate initial D using a formula
  2. For consecutive reviews, do the following steps
  3. Choose a value of D between 1 and 10, let's call it D_1
  4. Go back through the card's history and use D_1 in the formulas for S and PLS
  5. For each review, record the WozLoss, -ln(1 - |p - r|), between the predicted R and the actual outcome
  6. Weigh each value of the loss by recency, so that more recent reviews and their respective loss values have a greater weight
  7. Calculate the weighted average loss
  8. Return its value
  9. Adjust D_1, making it D_2. Repeat steps 3-8
  10. Repeat this for several iterations until a value of D that minimizes the loss is found

This will obviously be very computationally expensive, since we are running an optimization procedure after every review, but I still want to try it.

There is one more problem. Since Hard, Good and Easy all count as 1 when it comes to recall, this history: 3,3,3 and this history: 4,4,4 will result in exactly the same value of D. Solution: derive R from grades and use a weighted average of two losses, like this: loss = a * WozLoss(r, binary_recall) + (1-a) * WozLoss(r, grade_derived_r)

I could try to implement this myself, but there is one major obstacle (well, a lot of obstacles): I have no idea how to make a function that goes through the card's entire history, calculates the value of the loss for each review, and then outputs it, and it has to be within step(), something like this:

def best_fit_d(self, X: Tensor, state: Tensor, D: Tensor):
    # this goes through the card's entire history, plugs the selected value of D
    # into all formulas, and then returns the value of the loss
    ....

# and then, inside step(), something like:
r = power_forgetting_curve(X[:, 0], state[:, 0])
new_d = scipy.optimize.minimize(best_fit_d, x0=state[:, 1], bounds=((1, 10),))
condition = X[:, 1] > 1
new_s = torch.where(
    condition,
    self.stability_after_success(state, new_d, r, X[:, 1]),
    self.stability_after_failure(state, new_d, r),
)

Originally posted by @Expertium in https://github.com/open-spaced-repetition/fsrs4anki/issues/461#issuecomment-1792359565

Simply put, the key idea is to perform an optimization procedure after each review to find a value of D that provides the best fit to this card's review history.

EDIT: there should be 2 optimizable parameters. The first one is a in loss = a * WozLoss(r, binary_recall) + (1-a) * WozLoss(r, grade_derived_r), with 0 < a < 1. The second one is alpha, the base of the exponential weights for each value of the loss; it determines how quickly weights decrease as we get further into the past, with 0 < alpha < 1. See my next comment, where I gave a very simple example of what exponential weights would look like.
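For concreteness, here is a minimal sketch of that hybrid loss (the function names are mine, and the clipping is only there to avoid log(0); this is an illustration, not FSRS code):

import numpy as np

def wozloss(p, target):
    # WozLoss as defined above: -ln(1 - |p - target|)
    return -np.log(np.clip(1.0 - np.abs(p - target), 1e-8, None))

def hybrid_loss(r_pred, binary_recall, grade_derived_r, a):
    # 0 < a < 1 balances binary recall against grade-derived R
    return a * wozloss(r_pred, binary_recall) + (1 - a) * wozloss(r_pred, grade_derived_r)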

Expertium commented 11 months ago

For grade-derived R, I have this code (it's mostly leftover from that one time we tried making a weighted average of predicted and grade-derived R):

import math

import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

dataset = pd.read_csv("./revlog_history.tsv", sep='\t', index_col=None, dtype={'r_history': str, 't_history': str})
dataset = dataset[(dataset['i'] > 1) & (dataset['delta_t'] > 0) & (dataset['t_history'].str.count(',0') == 0)]
dataset['y'] = dataset['r'].map({1: 0, 2: 1, 3: 1, 4: 1})  # binary recall: Again = 0, Hard/Good/Easy = 1
dataset['lapse'] = dataset['r_history'].str.count('1')  # number of Again ratings in the history
dataset['delta_t'] = dataset['delta_t'].map(lambda x: round(1.2 * math.pow(1.4, math.floor(math.log(x, 1.4))), 2))  # bin delta_t geometrically
dataset['i'] = dataset['i'].map(lambda x: round(1.115 * math.pow(1.1, math.floor(math.log(x, 1.1))), 2))  # bin the review index geometrically
dataset['lapse'] = dataset['lapse'].map(lambda x: min(x, 8))  # cap the lapse count at 8

grade_r_matrix = dataset.groupby(by=["delta_t", "i", "lapse"]).agg({'y': ['count', 'mean'], 'r': 'mean'}).reset_index()
grade_r_matrix.sort_values(by=[('r', 'mean')], inplace=True, ignore_index=True)
grade_means = grade_r_matrix['r']['mean']
r_means = (grade_r_matrix['y']['mean'] * grade_r_matrix['y']['count'] + 1 * grade_r_matrix['y']['mean'].mean()) / (grade_r_matrix['y']['count'] + 1)
count = grade_r_matrix['y']['count']

def func(x, a, b):
    # maps a mean grade in [1, 4] to grade-derived R, with func(1) = 0 and func(4) = 1
    return 1 - (np.power(1 - np.power((x - 1) / 3, a), b))

params, covs = curve_fit(func, grade_means, r_means, bounds=((0.25, 0.25), (4, 4)), sigma=1/np.sqrt(count))
grade_derived_r_values = [func(x, *params) for x in [1, 2, 3, 4]]
grade_derived_r = dict(zip([1, 2, 3, 4], grade_derived_r_values))

The value for Again is always 0, and the value for Easy is always 1. These 4 values will then be used in the hybrid loss function for optimization. Also, I tried using minimize instead of curve_fit, but for some reason it always sets the parameters to the maximum or minimum values allowed by the bounds. As for weighting by recency, I suggest something like this:

alpha = 0.9
exp_weights = [alpha ** x for x in range(0, 10)]  # newest review first: [1.0, 0.9, 0.81, ...]

Every value of the loss will be weighted based on how recent that review was. The most recent review will always have a weight of 1.0, more distant past reviews will have weight <1. This way D will depend more on recent reviews. Alpha will be an optimizable parameter.
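Here is a minimal sketch of how those weights would be applied to the per-review losses (assuming the losses are ordered from oldest to newest; the names are mine):

import numpy as np

def recency_weighted_loss(losses, alpha):
    # the most recent review gets weight 1.0; the k-th review into the past gets alpha**k
    n = len(losses)
    weights = np.array([alpha ** (n - 1 - i) for i in range(n)])
    return float(np.dot(weights, losses) / weights.sum())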

L-M-Sherlock commented 11 months ago

The main problem with Best fit D is that it breaks the Markov property of FSRS. In FSRS, the current memory state relies only on the current input (delta_t, rating) and the last state (stability, difficulty). But Best fit D requires the entire history.
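A toy sketch of the difference (the update rules below are placeholders, not the real FSRS formulas):

def step_markov(state, review):
    # O(1): the next state depends only on the last state and the current input
    stability, difficulty = state
    delta_t, rating = review
    return stability + 0.1 * delta_t, difficulty  # placeholder update rule

def step_best_fit_d(history):
    # O(n): re-fitting D forces a replay of the entire review history
    state = (1.0, 5.0)  # placeholder initial state
    for review in history:  # this inner loop is what breaks O(1) inference
        state = step_markov(state, review)
    return state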

Expertium commented 11 months ago

Is it that important? I mean, from a practical point of view, not theoretical.

L-M-Sherlock commented 11 months ago

It's important. The current implementation of FSRS in Anki doesn't require the full review history when updating the memory state. If we break the Markov property, the time complexity of inference will jump from O(1) to O(n), and the time complexity of training will jump from O(n^2) to O(n^3).

Expertium commented 11 months ago

Makes sense, though in practice it's possible that the amount of time required for inference and training will still be acceptable. But I don't think Dae will approve of this change, as it would require completely changing what data is available to the scheduler and how it's processed. Just to be perfectly clear, do you think that right now there is any reason to work on best fit D, or do you think that there is absolutely no reason whatsoever to work on it? EDIT: I would still like to see how well it performs in the benchmark, just to see the potential.

Expertium commented 11 months ago

Wait, why is training O(n^2)? Shouldn't it be O(mn), where m is the number of epochs and n is the number of reviews?

L-M-Sherlock commented 11 months ago

Wait, why is training O(n^2)? Shouldn't it be O(mn), where m is the number of epochs and n is the number of reviews?

If a card has 9 review logs, it will generate 9 training samples, and a sample of length 9 requires the model to update the state 9 times. So it requires 9 + 8 + 7 + … + 1 = 45 calculations.
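In other words, the per-card training cost grows quadratically with the number of reviews. A quick sanity check of the arithmetic:

def training_updates(n: int) -> int:
    # n samples of lengths 1..n, so 1 + 2 + ... + n = n * (n + 1) / 2 state updates
    return n * (n + 1) // 2

assert training_updates(9) == 45  # matches the example above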


L-M-Sherlock commented 11 months ago

If we introduce Best fit D, a sample of length 9 still requires the model to update the state 9 times, and each update requires more calculation.

L-M-Sherlock commented 11 months ago

If you want to see the potential, we can use a transformer. Its time complexity is O(n^2), too. ChatGPT is powered by a transformer. But it is hard to train.

Expertium commented 11 months ago

Yeah, going the neural network route kinda defeats the whole point of FSRS. It's not interpretable by humans and it has orders of magnitude more parameters.

Expertium commented 11 months ago

I wonder how SuperMemo does it. My definition of best fit D is almost entirely based on the article about SM-17. So it works fine for SM-17.

Expertium commented 11 months ago

Out of curiosity, do you know how long it takes to calculate DSR for one review, in Rust version, on your PC? Like, 1 millisecond? 100 microseconds?

L-M-Sherlock commented 11 months ago

I wonder how SuperMemo does it. My definition of best fit D is almost entirely based on the article about SM-17. So it works fine for SM-17.

But it is replaced in SM-18: https://supermemo.guru/wiki/Item_difficulty_in_Algorithm_SM-18

Out of curiosity, do you know how long it takes to calculate DSR for one review, in Rust version, on your PC? Like, 1 millisecond? 100 microseconds?

The benchmark of performance is implemented by @dae in https://github.com/open-spaced-repetition/fsrs-rs/pull/59.

Expertium commented 11 months ago

open-spaced-repetition/fsrs-rs#59

I can't find any info about how long it takes to calculate DSR values.

L-M-Sherlock commented 11 months ago

I don't know how to run the benchmark, so I recommend asking @dae.

Expertium commented 11 months ago

It's just that I was thinking about the worst-case scenario. Suppose it takes 1 millisecond to calculate DSR values. Now suppose a card has been reviewed 101 times, so best fit D would have to go through 100 reviews (minus the first review). That's 100 milliseconds. And it likely won't converge in just one iteration, so let's say 5 iterations are required to guarantee convergence. That's 500 milliseconds between the user pressing "Show answer" and the answer buttons with their intervals being displayed. That's unacceptable. So unless DSR values take <100 microseconds to calculate, this probably won't be practical.
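The back-of-envelope numbers above, spelled out (every number here is an assumption):

ms_per_dsr = 1.0     # assumed cost of one DSR calculation
past_reviews = 100   # card reviewed 101 times, minus the first review
iterations = 5       # assumed iterations to guarantee convergence
print(ms_per_dsr * past_reviews * iterations)  # 500.0 ms of latency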

dae commented 11 months ago

Currently study does not access the revlogs, so this would be adding a bunch of extra I/O and logic as well. The gains from it would have to be pretty big to justify the change.

user1823 commented 11 months ago

The main problem of Best fit D is it breaks the Markov property of FSRS.

I thought that the plan was to use the best fit D approach to generate a set of optimal D values, try to build a function that approximates these values well and then use that function in the release version (not the optimization within optimization approach).

Here is a quote from @Expertium:

what if we did that in the optimizer, exported the resulting data (repetition history with best fit D for each review), and tried to come up with a function that approximates it really well? That's my idea. Compute best fit D, no matter how long it takes, and then try to find an approximation for it.

[...]

After this monstrosity is optimized, a file with the entire repetition history (delta_t, ratings, all that stuff) and best fit D should be available for download. Then you and I will download it, look at the values, plot them, analyze them however we want, and try to come up with a function that doesn't require this whole "optimization within optimization" crap, yet behaves similarly.

Originally posted by @Expertium in https://github.com/open-spaced-repetition/fsrs4anki/issues/282#issuecomment-1627781901

Expertium commented 11 months ago

I think that it won't work. Approximating best fit D will likely result in something far less accurate, negating the benefit.

Expertium commented 11 months ago

If a card has 9 review logs, it will generate 9 training samples. And a sample of length 9 requires the model to update the state 9 times. So it has 9 + 8 + 7 + … + 1 = 45 times of calculation.

Maybe I'm misunderstanding something, but doesn't it mean that the first few reviews will affect the weights much more than the following reviews?

If the loss is calculated 9 times for the first review, it effectively means that the first review has a weight of 9, and if the loss is calculated only once for the last review, it effectively has a weight of 1.
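A quick illustration: review i (1-indexed) of a card with n reviews is processed in every sample of length >= i, i.e. n - i + 1 times:

n = 9
effective_weights = [n - i + 1 for i in range(1, n + 1)]
print(effective_weights)  # [9, 8, 7, 6, 5, 4, 3, 2, 1]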

Expertium commented 11 months ago

So according to this benchmark, it takes about 9 microseconds per review. That's actually great! @L-M-Sherlock, going back to my worst-case scenario above: if a card has been reviewed 101 times, then best fit D will have to go through 100 reviews, which is 9 × 100 = 900 microseconds, or 0.9 milliseconds. Let's assume that 5 iterations are necessary for convergence, so that's 9 × 100 × 5 = 4500 microseconds, or 4.5 milliseconds. This means it could be done without making Anki laggy when the user presses "Show answer", though it would make optimization much longer.

I would be glad if you started working on best fit D. I want to see a "proof of concept"; in other words, I want to see whether this idea has any merit if we ignore time constraints. If it improves RMSE greatly, then we can start thinking about how to optimize it. If it barely improves RMSE, we can abandon this idea.

L-M-Sherlock commented 11 months ago

OK. I will try to implement the Best fit D in the Python Optimizer for testing its potential.

L-M-Sherlock commented 11 months ago

@L-M-Sherlock, going back to my worst-case scenario above: if a card has been reviewed 101 times, then best fit D will have to go through 100 reviews, which is 9 × 100 = 900 microseconds, or 0.9 milliseconds. Let's assume that 5 iterations are necessary for convergence, so that's 9 × 100 × 5 = 4500 microseconds, or 4.5 milliseconds. This means it could be done without making Anki laggy when the user presses "Show answer", though it would make optimization much longer.

Those numbers are the performance of inference, not of training; training is much slower.

Expertium commented 11 months ago

One way of optimizing this that I can think of is saving each memory state (DSR values), and for cards with very long histories calculating best fit D using only 5-10 most recent reviews. This will greatly decrease CPU time, but increase the amount of memory needed.

Expertium commented 11 months ago

I thought about the idea above some more. Since only the 5-10 most recent reviews matter (the rest will have very small weights and won't contribute much to the loss), we only need to store at most 10 DSR values + 10 delta_t values + 10 grade values. We don't actually need to store the entire history.

Unless I'm wrong about how FSRS works, it should be possible to initialize it from any arbitrary point in the card's history, as long as D, S, R, delta_t and grade are available at that point. This reduces time complexity back to O(1), since the optimization procedure only has to go through a fixed number of reviews. It's also O(1) in memory complexity, since we need to store a fixed number of datapoints, not the entire history of the card.

Initially (in the comment above) I thought that we would need to store the entire history so we can use the 5-10 most recent reviews, which would have O(n) memory complexity. But why store the entire history if we can initialize FSRS from any arbitrary point as long as D, S, R, delta_t and grade are available? This, of course, relies on the assumption that using <=10 most recent reviews will produce the same accuracy as using the full history, but that seems like a reasonable assumption to me.
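Here is a sketch of what that fixed-window bookkeeping could look like (the class and field names are hypothetical, not existing FSRS code):

from collections import deque

K = 10  # assumed window size

class CardWindow:
    def __init__(self):
        # each snapshot: (difficulty, stability, retrievability, delta_t, grade)
        self.snapshots = deque(maxlen=K)  # older snapshots are dropped automatically

    def record(self, d, s, r, delta_t, grade):
        # best fit D would replay only these <= K snapshots, restarting FSRS
        # from the oldest stored state
        self.snapshots.append((d, s, r, delta_t, grade))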

dae commented 11 months ago

I suggest we wait to see how this benchmarks before thinking about such things. Neither requiring revlogs at review time nor needing to pack more data into each card is desirable, so this will need to be a significant improvement.

L-M-Sherlock commented 11 months ago
  • Go back through the card's history and use D_1 in formulas for S and PLS

If the card's history includes 5 reviews, does the D_1 here remain unchanged?

Expertium commented 11 months ago

D_n remains unchanged during each iteration, where n is the number of the iteration. Say, for example, that D_1 = 5 in the first iteration. This means that we go through the card's history and use this value in the formulas for S and PLS. For each review we record the loss, which depends on the predicted R and the real outcome. So suppose the average loss is 0.5 for D_1 = 5. Then the iteration ends and the next iteration begins. Suppose D_2 is now equal to 7. We go through the card's history again, record the values of the loss, and suppose now it's 0.4. So the loss decreased. The iteration ends, we adjust difficulty again, use D_3, and so on, until some convergence criterion is fulfilled, for example, until the absolute difference between consecutive values of D_n becomes <0.01.

L-M-Sherlock commented 11 months ago

scipy.optimize.minimize

I find that this method doesn't support batch optimization. Our default batch size is 512.
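One possible workaround is to vectorize the ternary search itself with torch, so that a whole batch of cards is optimized in lockstep. A sketch, where loss_batch is an assumed function mapping a [batch_size] tensor of candidate D values to per-card losses:

import torch

def batched_ternary_search(loss_batch, batch_size, lo=1.0, hi=10.0, tol=0.01):
    low = torch.full((batch_size,), lo)
    high = torch.full((batch_size,), hi)
    while (high - low).max() > tol:
        mid1 = low + (high - low) / 3
        mid2 = high - (high - low) / 3
        shrink_high = loss_batch(mid1) < loss_batch(mid2)  # per-card comparison
        high = torch.where(shrink_high, mid2, high)
        low = torch.where(shrink_high, low, mid1)
    return (low + high) / 2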

L-M-Sherlock commented 11 months ago

I wrote a demo script:

import torch
import torch.nn as nn
from torch import Tensor
from typing import List

def power_forgetting_curve(t, s):
    return (1 + t / (9 * s)) ** -1

class FSRS(nn.Module):
    def __init__(self, w: List[float]):
        super(FSRS, self).__init__()
        self.w = nn.Parameter(torch.tensor(w, dtype=torch.float32))

    def stability_after_success(
        self, stabilities: Tensor, difficulties: Tensor, r: Tensor, rating: Tensor
    ) -> Tensor:
        hard_penalty = torch.where(rating == 2, self.w[15], 1)
        easy_bonus = torch.where(rating == 4, self.w[16], 1)
        new_s = stabilities * (
            1
            + torch.exp(self.w[8])
            * (11 - difficulties)
            * torch.pow(stabilities, -self.w[9])
            * (torch.exp((1 - r) * self.w[10]) - 1)
            * hard_penalty
            * easy_bonus
        )
        return new_s

    def stability_after_failure(
        self, stabilities: Tensor, difficulties: Tensor, r: Tensor
    ) -> Tensor:
        new_s = (
            self.w[11]
            * torch.pow(difficulties, -self.w[12])
            * (torch.pow(stabilities + 1, self.w[13]) - 1)
            * torch.exp((1 - r) * self.w[14])
        )
        return new_s

    def step(self, X: Tensor, stabilities: Tensor, difficulties: Tensor) -> Tensor:
        """
        :param X: shape[batch_size, 2], X[:,0] is elapsed time, X[:,1] is rating
        :param state: shape[batch_size, 2], state[:,0] is stability, state[:,1] is difficulty
        :return state:
        """
        if torch.equal(stabilities, torch.zeros_like(stabilities)):
            keys = torch.tensor([1, 2, 3, 4])
            keys = keys.view(1, -1).expand(X[:, 1].long().size(0), -1)
            index = (X[:, 1].long().unsqueeze(1) ==
                     keys).nonzero(as_tuple=True)
            # first learn, init memory states
            new_s = torch.ones_like(stabilities)
            new_s[index[0]] = self.w[index[1]]
        else:
            r = power_forgetting_curve(X[:, 0], stabilities)
            condition = X[:, 1] > 1
            new_s = torch.where(
                condition,
                self.stability_after_success(stabilities, difficulties, r, X[:, 1]),
                self.stability_after_failure(stabilities, difficulties, r),
            )
        new_s = new_s.clamp(0.1, 36500)
        return new_s

    def forward(self, inputs: Tensor, difficulties: Tensor) -> Tensor:
        """
        :param inputs: shape[seq_len, batch_size, 2]
        :param difficulties: shape[batch_size, 1]
        """
        stabilities = torch.zeros_like(difficulties)
        outputs = []
        for X in inputs:
            stabilities = self.step(X, stabilities, difficulties)
            outputs.append(stabilities)
        return torch.stack(outputs), stabilities

init_w = [
    0.4,
    0.9,
    2.3,
    10.9,
    4.93,
    0.94,
    0.86,
    0.01,
    1.49,
    0.14,
    0.94,
    2.18,
    0.05,
    0.34,
    1.26,
    0.29,
    2.61,
]

model = FSRS(init_w)
loss_fn = nn.BCELoss(reduction="none")

t_history = "0,1,2,4,8,16"
r_history = "1,3,3,3,3,3"

t_tensor = torch.tensor([int(t) for t in t_history.split(",")[:-1]], dtype=torch.float32)
r_tensor = torch.tensor([int(r) for r in r_history.split(",")[:-1]], dtype=torch.float32)
t_next_tensor = torch.tensor([int(t) for t in t_history.split(",")[1:]], dtype=torch.float32)
y_next_tensor = torch.tensor([0 if r == "1" else 1 for r in r_history.split(",")[1:]], dtype=torch.float32)

def loss(difficulty):
    with torch.no_grad():
        inputs = torch.stack([t_tensor, r_tensor], dim=1).unsqueeze(1)
        difficulties = torch.tensor([difficulty], dtype=torch.float32).unsqueeze(1)
        stabilities, _ = model(inputs, difficulties)
        retentions = power_forgetting_curve(t_next_tensor, stabilities.squeeze())
        return loss_fn(retentions, y_next_tensor).mean()

low = 1
high = 10
best_d = 5
while high - low > 0.01:
    mid1 = low + (high - low) / 3
    mid2 = high - (high - low) / 3

    loss1 = loss(mid1)
    loss2 = loss(mid2)

    if loss1 < loss2:
        high = mid2
    else:
        low = mid1

    best_d = (low + high) / 2

print(best_d)

The interesting thing is, the best_d output is binary, i.e. 1 or 10.

Expertium commented 11 months ago

The interesting thing is, the best_d output is binary, i.e. 1 or 10.

Well, time to add grade-derived R then. If you need reasonable default values, I suggest 0.7 for Hard and 0.95 for Good, based on my own data.

Expertium commented 11 months ago

Also, remember that BCE loss behaves strangely when labels aren't 1 and 0, so WozLoss is necessary.

L-M-Sherlock commented 11 months ago

I guess it is intended, because Woz said that:

As Algorithm SM-17 would match repetition history to recall predictions, only the timing of failing repetitions would decide item difficulty. This resulted in clustering items in very difficult or very easy quantiles with fewer values in between. Source: https://supermemo.guru/wiki/Item_difficulty_in_Algorithm_SM-18#Item_difficulty_in_Algorithm_SM-18

Expertium commented 11 months ago

By the way, I tested using grade-derived R with your code; it makes D non-binary.

grade_derived_r = dict({1: 0, 2: 0.7, 3: 0.95, 4: 1})
y_next_tensor = torch.tensor([grade_derived_r.get(int(r)) for r in r_history.split(",")[1:]], dtype=torch.float32)

For example: t_history = "0,1,2,4,8,16" r_history = "1,3,3,3,3,3" D=3.97

t_history = "0,1,2,4,8,16" r_history = "1,3,3,3,3,4" D=1.20

t_history = "0,1,2,4,8,16" r_history = "1,3,3,3,3,2" D=8.57

I still used BCE loss here, but in the final version WozLoss should be used.

Expertium commented 11 months ago

Interestingly, with WozLoss the results are quite different when I change the last grade to 2. The values of D for the first two histories are at least somewhat similar, but the value of D for the last history is completely different.

loss = -torch.log(1-torch.abs(retentions-y_next_tensor))

t_history = "0,1,2,4,8,16" r_history = "1,3,3,3,3,3" D=4.28

t_history = "0,1,2,4,8,16" r_history = "1,3,3,3,3,4" D=2.65

t_history = "0,1,2,4,8,16" r_history = "1,3,3,3,3,2" D=4.54

Slightly unrelated, but I'm really curious to see what the median value of a in loss = a * WozLoss(r, binary_recall) + (1-a) * WozLoss(r, grade_derived_r) will be once we run the optimizer on all collections.

Expertium commented 11 months ago

Out of curiosity, I decided to try to use a different method of finding best fit D. I used the improved secant method from this paper: https://vixra.org/pdf/1405.0013v1.pdf.

def improved_secant(x0, e, N):
    # x0 is the initial guess
    # e is the convergence tolerance, a small positive number, for example, 0.01
    # N is the max. number of iterations
    iterations = 1
    while True:
        a = abs(loss(x0)) / 2
        # secant-style update, clamped to the valid D range [1, 10]
        x1 = max(min(x0 + a - a * (loss(x0 + a) / (loss(x0 + a) - loss(x0))), 10), 1)
        if abs(x0 - x1) < e:
            return x1
        x0 = x1

        iterations += 1
        if iterations > N:
            print('Failed to converge')
            return x1

But according to my testing, it converges only when the minimum value of loss(D) is 0 for some D. The regular secant method seems to only return 1 or 10 unless I artificially modify the loss to reach 0, and it's slower, so neither of them is usable. I can't think of any way to ensure that condition, since the minimum value of the loss depends on the card's history and on a in loss = a * WozLoss(r, binary_recall) + (1-a) * WozLoss(r, grade_derived_r). Even if only the grade-derived loss is used, I still can't think of a way to ensure that the condition is always satisfied. This is unfortunate, because when the modified secant method works, it's about 3 times faster than your loop.

Expertium commented 11 months ago

I "made" an algorithm that is roughly 3-3.5 times faster than the current loop, and by "made" I mean "copied scipy.optimize.brent and removed a bunch of stuff and modified it to work with our case", and I also have an idea how to greatly reduce the time in certain cases, specifically when the optimal value of D is 1 or 10, so that we have to run the optimization procedure only when the optimal value is somewhere inbetween. It's pretty long and speed is not the main focus right now since we don't even have a proper implementation yet, so I'll post it later. Btw, that algorithm might be useful for finding optimal retention to speed it up a bit. It can be used for any function that has one variable and bounds.

L-M-Sherlock commented 11 months ago

[image]

Expertium commented 11 months ago

Yeah...mini-batch optimization is definitely required.

Expertium commented 10 months ago

@L-M-Sherlock will you keep working on this?

L-M-Sherlock commented 10 months ago

The training took me 3 days, which is impractical.

Expertium commented 10 months ago

Well, if you can't think of any way to speed it up, then I suppose you can close this issue.