open-spaced-repetition / fsrs4anki

A modern Anki custom scheduling based on Free Spaced Repetition Scheduler algorithm
https://github.com/open-spaced-repetition/fsrs4anki/wiki
MIT License

[Enhancement] Improving the function for calculating difficulty #352

Open user1823 opened 1 year ago

user1823 commented 1 year ago

The current formula for D in FSRS is just a heuristic. We have already tried a lot of approaches to improve the calculation of difficulty, but all of them failed.

While developing a better function for D, we should keep in mind the following:

Expertium commented 1 year ago
  • Calculation of difficulty after review should depend upon the retrievability, just like in SuperMemo algorithms.

While that would be great, all my attempts to integrate R into D have failed, and my attempts to change the formula completely and use a formula based on B-W (like in SM-18) have also failed.

  • There shouldn't be clustering of heterogeneous reviews at maximum difficulty.

That depends on the material though. If you are learning something that is very hard, why wouldn't most values of D be close to 10?

  • It should try to incorporate the concept of mean reversion within the main function of D.

Honestly, I'm not convinced that reversion to a fixed value is a good idea. I've experimented with reverting to a value that changes, so that instead of all cards reverting to the same "mean" value, each card would have its own "mean", but that also didn't work very well.

  • If Again is the first rating, it should not have as much impact on D as lapses do. Reason: you might have created the card several weeks ago and have almost no remaining memory of it when you review it for the first time. The extreme case is where the card was created by someone else, and the first review is the first time you encounter the information. So, an Again rating here doesn't mean that the card is difficult.

That's an interesting idea.

  • Variants of the easy and hard factors might also need to be included in the formula for D.

Since optimizable grades didn't work, I'm willing to bet this won't work either, since it's basically the same idea.

Honestly, I'm pretty tired of trying to improve D, there are other things I want to work on, like the matrices.

hydrogs commented 1 year ago

I guess a good way to take $R$ into account is through a surprise function. The less predictable the grade $G$ is, the bigger the change in difficulty should be. This way, when retrievability is low, a lapse should not have as much effect on $D$ as when $R$ is high, since at low $R$ the lapse was expected. We can say that if the algorithm was right, $D$ shouldn't be updated, as it already reflected the intrinsic difficulty of the card, and vice versa.

A surprise function I've come up with is $A(R, G) = e^{-(R-0.5) \cdot (G-2)}$ ($A$ for astonishment, since $S$ is already taken). When the two factors in the exponent have the same sign (low surprise), the function is small; when their signs differ (what happened differed from what was expected), the function is large.

The way I think this new function should be used is as a new weight in the weighted average that the $D$ function already is. Instead of two parts (the reversion to avoid "ease hell", $D' = D_0(3)$, and the computed update, $D' = D - w_6 \cdot (G-3)$), I would add a third part that keeps the current difficulty as is, $D' = D$.

In order to use the surprise function as a weight, it should be normalized: $k e^{-(R-0.5) \cdot (G-2)}$, where $k = e^{-1} = \frac{1}{e^{-(0-0.5)\cdot(4-2)}}$ is the reciprocal of the largest value the function can take. So it finally becomes $A(R, G) = e^{-1-(R-0.5) \cdot (G-2)}$.

Finally, we can connect the three possibilities this way:

$$D'(D, G, R) = w_7 \cdot D_0(3) + (1-w_7) \cdot \left(A \cdot (D - w_6 \cdot (G-3)) + (1-A) \cdot D \right)$$

I have chosen 2 as the midpoint grade of the surprise function because it is usually seen as neither a success nor a lapse. Another value, such as 2.5, could be chosen. It is also a good place to insert a new learned weight to accommodate different users' use of Hard: $A(R, G) = e^{-1-(R-0.5) \cdot (G-2-w_a)}$.
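
Here is a quick numeric sketch of how I read the proposed update (the values used for $w_6$, $w_7$ and $D_0(3)$ are placeholders, not fitted weights):

    import math

    def surprise(r: float, g: int) -> float:
        # normalized so that the maximum over R in [0, 1] and G in {1, 2, 3, 4} is 1
        return math.exp(-1 - (r - 0.5) * (g - 2))

    def next_difficulty(d: float, g: int, r: float,
                        w6: float = 1.0, w7: float = 0.05, d0_3: float = 5.0) -> float:
        a = surprise(r, g)
        updated = a * (d - w6 * (g - 3)) + (1 - a) * d  # surprise-weighted update vs. keeping D
        return w7 * d0_3 + (1 - w7) * updated           # mean reversion towards D_0(3)

    # A lapse (G = 1) at high retrievability is surprising and moves D much more
    # than the same lapse at low retrievability:
    print(next_difficulty(6.0, 1, r=0.95))  # ~7.05
    print(next_difficulty(6.0, 1, r=0.40))  # ~6.58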

Expertium commented 1 year ago

@L-M-Sherlock could you please make a version of the optimizer with visible code so I can test this idea? I need it for other testing as well; I can't test much if I can't change anything about the code.

Expertium commented 1 year ago

I tested it using this file (I made a copy): https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/main/archive/candidate/fsrs4anki_optimizer_beta.ipynb I added 2 more parameters, like this: torch.exp(-(retrievability - self.w[15]) * (grade - self.w[16])). Also, I had to modify how k is calculated; with 2 more parameters, it's not just e^-1. The results weren't statistically significant. The table below shows the average RMSE (normalized so that the RMSE of the original is 100%) and the p-value based on 15 decks/collections. [image: RMSE comparison table]
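
For reference, here is a hypothetical sketch of how the normalization constant k could be recomputed once w[15] and w[16] are learnable (my reading of the change described above, not the actual notebook code). Since the exponent is linear in R for a fixed grade, checking R = 0 and R = 1 against the four grades is enough:

    import torch

    def normalization_k(w15: torch.Tensor, w16: torch.Tensor) -> torch.Tensor:
        # k is the reciprocal of the maximum of exp(-(R - w15) * (G - w16))
        # over R in [0, 1] and G in {1, 2, 3, 4}
        r = torch.tensor([0.0, 1.0]).view(-1, 1)
        g = torch.tensor([1.0, 2.0, 3.0, 4.0]).view(1, -1)
        return 1 / torch.exp(-(r - w15) * (g - w16)).max()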

I'm getting more and more perplexed as to why nothing we do to modify the formula for D improves RMSE. I could make an entire list of ideas related to D, all of which failed. I still like the idea of using a complex approach to calculate D and then, if it works, trying to find a simple approximation for it. @L-M-Sherlock how about this: make a new file, but let a neural network (LSTM) calculate the values of D, see if it improves RMSE. If even a neural network can't find better values of D, then we might as well give up.

L-M-Sherlock commented 1 year ago

Maybe we can shift our reasoning to another approach. Why not use a method similar to fitting the initial stability to determine the initial difficulty? Is it possible to find the initial difficulty from the current data and formula? If we make progress in fitting the initial difficulty, we are more likely to find a formula to update the difficulty.

Expertium commented 1 year ago

Initial D doesn't matter as much as initial S. I would really like to see a hybrid model, with all FSRS formulas, except that D is given by an LSTM. Though I'm not sure if it's possible in pytorch, since you would have to optimize the parameters of FSRS and of the LSTM simultaneously.

hydrogs commented 1 year ago

I have simplified the use of the surprise function and added some more weights.

$$ A(R, G) = e^{- w_{18} \cdot (R - w_{19}) \cdot (G - w_{20})}$$

$$ D'(D, G, R) = D - w_6 \cdot A(R, G) \cdot (G - w_{17}) $$

The surprise function is now just a companion to $w_6$ that determines the rate of change of $D$. The $w_{18}$ is added in order to learn how steep the surprise function should be. It has never been close to $1$ in my optimizations, which may be why previous attempts were not successful. As for the normalization with $k$, it is no longer needed, since it can be absorbed into the learned $w_6$.

I got a decrease in RMSE of 2% with my dataset, and in particular a decrease of 5% with a deck that I stopped studying for some time and then returned to, so the effect of different retrievabilities is bigger there.

I'm not sure if the improvements are big enough. Actually, they should not be big for a dataset in which most reviews are done on time, since there is no variability in retrievability. This change becomes relevant in datasets where there has been a period long enough without studying to produce low retrievabilities.

This is the code I used if you want to test it with your datasets:

 def step(self, X: Tensor, state: Tensor) -> Tensor:
        '''
        :param X: shape[batch_size, 2], X[:,0] is elapsed time, X[:,1] is rating
        :param state: shape[batch_size, 2], state[:,0] is stability, state[:,1] is difficulty
        :return state:
        '''
        if torch.equal(state, torch.zeros_like(state)):
            keys = torch.tensor([1, 2, 3, 4])
            keys = keys.view(1, -1).expand(X[:,1].long().size(0), -1)
            index = (X[:,1].long().unsqueeze(1) == keys).nonzero(as_tuple=True)
            # first learn, init memory states
            new_s = torch.ones_like(state[:,0])
            new_s[index[0]] = self.w[index[1]]
            new_d = self.w[4] - self.w[5] * (X[:,1] - 3)
            new_d = new_d.clamp(1, 10)
        else:
            r = power_forgetting_curve(X[:,0], state[:,0])
            a = torch.exp( - self.w[18] * (r - self.w[19]) * (X[:,1] - self.w[20]))
            new_d = state[:,1] - self.w[6] * a * (X[:,1] - self.w[17])
            new_d = self.mean_reversion(self.w[4], new_d)
            new_d = new_d.clamp(1, 10)
            condition = X[:,1] > 1
            new_s = torch.where(condition, self.stability_after_success(state, new_d, r, X[:,1]), self.stability_after_failure(state, new_d, r))
        new_s = new_s.clamp(0.1, 36500)
        return torch.stack([new_s, new_d], dim=1)
user1823 commented 1 year ago

Actually, they should not be big for a dataset in which most reviews are done on time, since there is no variability in retrievability. This change becomes relevant in datasets where there has been a period long enough without studying to produce low retrievabilities.

Not exactly, because three different algorithms are involved here: Anki SM-2, FSRS v3, and FSRS v4. The reviews were done according to the algorithm in use at that time, but now all the reviews are analysed using FSRS v4. So the R at the time of review would vary.

This is the code I used if you want to test it with your datasets:

Can you pull the latest commits from the main branch on open-spaced-repetition/fsrs-optimizer? It would make the comparisons fairer.

hydrogs commented 1 year ago

Can you pull the latest commits from the main branch on open-spaced-repetition/fsrs-optimizer? It would make the comparisons fairer.

Sorry, I forgot to update the branch. I guess I mixed it up in my mind with another fork I was also updating.

Anyway, the improvement still holds, if you guys want to check with your datasets.

user1823 commented 1 year ago

I am not a dev and I am not very familiar with how GitHub works. But, I think that your changes are in the feat/difficulty branch and it is not up-to-date with the main branch at open-spaced-repetition/fsrs-optimizer.

hydrogs commented 1 year ago

omg, my bad :disappointed_relieved:. I was pulling into the main branch instead of feat/difficulty. Sorry for the mistake, I am not a dev either.

The new commits don't change the result though. So the improvement still holds if you want to check. I don't think my dataset is representative enough.

Expertium commented 1 year ago

I don't know how to test it the "proper" way, so I tested it my way again: I made another copy of https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/main/archive/candidate/fsrs4anki_optimizer_beta.ipynb and implemented the code there. It performed about the same as before; the difference between the RMSE of the modified version and that of the one above was not statistically significant. Again, I'm showing normalized average RMSE; statistical significance tests are done using raw RMSE. I'm mostly using my own decks (I have all kinds of material, both easy and hard) plus a few collections from other users. [image: RMSE comparison table]

@L-M-Sherlock I am once again asking you to make a hybrid LSTM-FSRS version where all formulas (such as S for lapses and S for successful reviews) are the same, except that the value of D is given by a neural network (assuming it's possible to optimize such a hybrid in pytorch). This isn't just to satisfy my curiosity; it will answer an important question: "Can we change how the value of D is calculated to improve the accuracy of FSRS without completely changing the formulas themselves?". If RMSE is the same as before, that would indicate that rather than adding more factors and terms to the already existing formula for D, we should completely redefine D. In other words, if even a neural network cannot provide a value of D (between 1 and 10) that improves RMSE, then our attempts won't either.

L-M-Sherlock commented 1 year ago

Why should the neural network be an LSTM? I don't know what the input and output would be for that LSTM.

Expertium commented 1 year ago

LSTMs are specifically designed for time series. Input: interval lengths and grades; output: a number between 1 and 10, which you then use in the formulas for S.

L-M-Sherlock commented 1 year ago

Do you mean that the LSTM should output a series of difficulty values for a sequence of intervals and grades? Then the sequence of difficulties is used as the input of FSRS?

Expertium commented 1 year ago

Yes. The LSTM should replace new_d.
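
Here is a minimal sketch of what I have in mind (the names and sizes are arbitrary; this is not existing optimizer code):

    import torch
    import torch.nn as nn

    class DifficultyLSTM(nn.Module):
        """Maps a sequence of (elapsed time, grade) pairs to a difficulty in [1, 10]."""
        def __init__(self, hidden_size: int = 8):
            super().__init__()
            self.lstm = nn.LSTM(input_size=2, hidden_size=hidden_size, batch_first=True)
            self.head = nn.Linear(hidden_size, 1)

        def forward(self, X: torch.Tensor) -> torch.Tensor:
            # X: [batch_size, seq_len, 2], X[..., 0] = elapsed time, X[..., 1] = grade
            out, _ = self.lstm(X)
            d = torch.sigmoid(self.head(out[:, -1, :])).squeeze(-1)  # (0, 1)
            return 1 + 9 * d  # rescale to [1, 10]

    # The returned value would replace new_d, while stability_after_success and
    # stability_after_failure stay exactly as they are.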

user1823 commented 1 year ago

I tested @hydrogs' version with my collection. @Expertium, the proper way to test it is to insert the following line of code at the same place as described by Sherlock here: https://github.com/open-spaced-repetition/fsrs-optimizer/issues/16#issuecomment-1676281709

%pip install git+https://github.com/hydrogs/fsrs-optimizer@feat/difficulty

For my collection, this change increased the RMSE.

With @hydrogs' version

w = [1.1076, 1.484, 14.4662, 36.9086, 3.8092, 1.6336, 1.2232, 0.0006, 1.7564, 0.1028, 1.1538, 1.4458, 0.1945, 0.7164, 0.01, 0.0, 6.1513, 3.3328, 0.4446, 0.5588, 2.2258]

R-squared: 0.9398
RMSE: 0.0119
MAE: 0.0059

Last rating: 1
R-squared: 0.8459
RMSE: 0.0210
MAE: 0.0097

Last rating: 2
R-squared: -0.4190
RMSE: 0.1176
MAE: 0.1012

Last rating: 3
R-squared: 0.9439
RMSE: 0.0109
MAE: 0.0068

Last rating: 4
R-squared: -312.8104
RMSE: 0.0507
MAE: 0.0497

Original optimizer:

w= [1.1076, 1.484, 14.4662, 36.9086, 4.3393, 1.8149, 2.2085, 0.0, 1.7653, 0.1439, 1.1822, 1.497, 0.1934, 0.707, 0.01, 0.0, 6.0507]

R-squared: 0.9459
RMSE: 0.0111
MAE: 0.0058

Last rating: 1
R-squared: 0.8283
RMSE: 0.0222
MAE: 0.0102

Last rating: 2
R-squared: -0.4489
RMSE: 0.1185
MAE: 0.1003

Last rating: 3
R-squared: 0.9544
RMSE: 0.0097
MAE: 0.0065

Last rating: 4
R-squared: -199.6734
RMSE: 0.0497
MAE: 0.0489
Expertium commented 1 year ago

@L-M-Sherlock just a reminder to implement my idea with the LSTM. Input: grades and interval lengths; output: a number between 1 and 10. It should replace the formula for new_d. Why I want this: to see if it's possible to achieve lower RMSE by changing the value of new_d without changing the formulas for S. If even an LSTM cannot achieve lower RMSE, that means we need to completely change how difficulty affects S.

L-M-Sherlock commented 1 year ago

OK. But you know, I'm busy developing https://github.com/open-spaced-repetition/fsrs-optimizer-burn

Recently, I asked Woz about the details of the universal metric: https://supermemopedia.com/wiki/What%27s_the_formula_in_Metrics_for_xxx.csv%3F

I plan to rewrite some evaluation codes. I will consider your idea after that.

user1823 commented 1 year ago
  • There shouldn't be clustering of heterogeneous reviews at maximum difficulty.

That depends on the material though. If you are learning something that is very hard, why wouldn't most values of D be close to 10?

In that statement, the word "heterogeneous" is important. With the optimized parameters for my collection, just two Again ratings are sufficient to cause D to rise to 10 (the maximum value). Example:

[image: example review history where D reaches 10 after two Again ratings]

This means that FSRS is not differentiating between cards with 2 Again ratings and, say, 10 Again ratings. Obviously, the intervals for a card with 10 Again ratings should rise much more slowly than for a card with 2 Again ratings. But because they have the same difficulty, the intervals rise at equal rates, which is problematic.
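
To make this concrete, here is a quick arithmetic check using the original-optimizer parameters posted earlier in this thread (w[4] = 4.3393, w[5] = 1.8149, w[6] = 2.2085; w[7] = 0.0, so mean reversion doesn't change the result) and the current formulas for D:

    w4, w5, w6 = 4.3393, 1.8149, 2.2085
    d_first_again = w4 - w5 * (1 - 3)               # initial D after a first Again: ~7.97
    d_second_again = d_first_again - w6 * (1 - 3)   # ~12.39 before clamping
    print(min(d_second_again, 10.0))                # clamped to the maximum of 10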

Note: I don't expect any action based on this comment at this moment. I posted this just to record my thoughts, which would help while improving the function for D in the future.

L-M-Sherlock commented 1 year ago

Power Difficulty, the old candidate feature for FSRS v4, would solve this problem because it doesn't have an upper limit for difficulty. But its performance is not good.

https://github.com/open-spaced-repetition/fsrs4anki/blob/main/archive/candidate/power_difficulty.ipynb

Expertium commented 1 year ago

I still think that we should try what I suggested here: https://github.com/open-spaced-repetition/fsrs4anki/issues/352#issuecomment-1678869322, https://github.com/open-spaced-repetition/fsrs4anki/issues/352#issuecomment-1678889261 If an LSTM cannot generate a value between 1 and 10 that, once plugged into existing formulas, would improve RMSE, then that means that we should completely change our definition of D and the formulas. If an LSTM can generate a value that improves RMSE while still using the same (11-D) term in the main formula for S, then we can try to find a simple function that behaves similarly to an LSTM. But honestly, I've tried so many things that I genuinely don't know what could be done. I've tried incorporating R into D in several different ways; I've tried adaptive grades; I've tried more than 20 different modifications; and nothing worked. I wouldn't be surprised if LSTM failed to find better values of D.

Expertium commented 1 year ago

I've been thinking that maybe D is not the bottleneck, in other words, it's not the biggest obstacle on the path to decreasing RMSE. Perhaps trying to improve D doesn't work because something else needs to be improved first. But if that's the case, I don't know what it might be. The main formula (well, two of them) has 3 terms, DSR. If D is not the bottleneck, that means either the way S is incorporated into the main formula needs improvement, or the way R is incorporated needs improvement. But both of them seem fine and I don't see any way to improve them. Then there is also the possibility that something else entirely is the bottleneck (not the main formula that is used for successful reviews and not the formula for lapses either). If that's the case, then I have no idea what it might be. Estimation of initial S has already been improved. Any food for thought is welcome.

filipenanclarez commented 1 year ago

I've been thinking that maybe D is not the bottleneck, in other words, it's not the biggest obstacle on the path to decreasing RMSE. Perhaps trying to improve D doesn't work because something else needs to be improved first. But if that's the case, I don't know what it might be. The main formula (well, two of them) has 3 terms, DSR. If D is not the bottleneck, that means either the way S is incorporated into the main formula needs improvement, or the way R is incorporated needs improvement. But both of them seem fine and I don't see any way to improve them. Then there is also the possibility that something else entirely is the bottleneck (not the main formula that is used for successful reviews and not the formula for lapses either). If that's the case, then I have no idea what it might be. Estimation of initial S has already been improved. Any food for thought is welcome.

https://github.com/open-spaced-repetition/fsrs4anki/issues/439#issuecomment-1720684595

brubsby commented 1 year ago

Then there is also the possibility that something else entirely is the bottleneck Any food for thought is welcome.

I just want to echo what @filipenanclarez said above, with some more of my own thoughts. A lot of my decks have explicit siblings, which are almost certainly affecting each other's stability when reviewed, so I imagine there's always going to be some loss one can't reduce due to this alone. Also, many of my cards have what I like to call "conceptual siblings": cards which are not explicit siblings but which nevertheless have relationships to other cards, and whose review has a graded effect on the stability of their "conceptual siblings". For instance, a deck full of vocabulary, but also example sentences, would have many of these relationships. Perhaps I should separate these out from the vocab, but really, they are affecting my retention of their "conceptual siblings" no matter what deck I place them in.

@L-M-Sherlock mentioned, in a reddit thread I brought this up in, that one might use graph neural nets to begin to broach this problem (eventually), and that seemed possibly tractable to me. But first, let's try to think it through with good ol' graphs, no neurons required. Imagine you create a DAG where every pair of explicit siblings has two directed edges between them with some edge weights. You could then optimize the usual FSRS weights as normal, but also add some logic so that reviewing a card with an edge to the current card improves the current card's stability (and possibly lowers its difficulty) by an amount parameterized by the weight of the edge, as in the sketch below. This would allow "explicit sibling" review effects to be learned per note and card.
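
A toy sketch of that edge-weight mechanism (the card names, edge weights, and the 0.1 scaling factor are all made up for illustration; this is not part of FSRS):

    # reviewing a card propagates a stability boost to its siblings along weighted edges
    sibling_edges = {
        ("noteA_card1", "noteA_card2"): 0.3,  # hypothetical learned edge weight
        ("noteA_card2", "noteA_card1"): 0.3,
    }
    stability = {"noteA_card1": 12.0, "noteA_card2": 8.0}

    def on_review(reviewed_card: str) -> None:
        for (src, dst), weight in sibling_edges.items():
            if src == reviewed_card:
                stability[dst] *= 1 + weight * 0.1  # small boost to the neighbour

    on_review("noteA_card1")
    print(stability)  # noteA_card2's stability rises slightly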

So far this doesn't seem completely intractable or prone to the curse of dimensionality (famous last words). But I do imagine the "per card" basis might make it more likely to overfit, as training weights per card negates the benefits of the train vs test set (and would be a breaking change to the number of weights, lol). Perhaps it would be required to have an "explicit sibling" weight learned per deck rather than per card relationship, though I can imagine scenarios where different types of siblings in one deck do not all convey the same stability benefits.

However, next, we might think about also learning the edges and weights representing the directed memory-strengthening relationships for the "conceptual siblings". We may even be able to keep the mathematical logic the same as with the explicit siblings, but instead just attempt to add edges and weights that might help the loss. This is, of course, easier said than done, and seems like a mathematically unusual problem that might require novel techniques to solve, especially if done in tandem with optimizing the normal FSRS parameters. Trying to use stochastic optimization here would be very slow, as there are n*(n-1)/2 possible edges for a deck of n cards. I'm attempting to learn more about current methods in "Latent Graph Inference", although I think current techniques might require mapping this problem from a plain ol' graph problem into a graph neural net problem, which doesn't seem entirely impossible now, as graph neural nets seem particularly suited to this. Perhaps it involves thinking of each node in the graph as a card and its history, where every historical day that passes is a time step for the graph, and every review during that historical day "flows" stability/difficulty information along the GNN (with each node also having a weight-1 edge to itself). I'll have to continue to read more to see if this conception makes any sense. Here's a particularly dense paper I'm working my way through: https://arxiv.org/pdf/2211.16199.pdf

All that to say, I think siblings play a big part in any SRS inaccuracy, as does the fact that you can never track all offline "retention events" (spontaneous recall in conversation, teaching someone about what you learned, foreign language practice, target-language media consumption, etc.), which all seem like they will necessarily introduce inaccuracies that can't be mathematically modeled. However, siblings do seem like an interesting problem worth attacking eventually, especially if it requires using or creating some novel and fun techniques. :)

Expertium commented 1 year ago

I think there is actually a neural network that outputs a "similarity score" based on the card's text.

@L-M-Sherlock I can't find it, but I think I saw it when browsing your profile on github.

EDIT: https://github.com/thiswillbeyourgithub/AnnA_Anki_neuronal_Appendix

ghost commented 1 year ago

I think that attempting to consider the effects of reviewing related cards is beyond practical use. People would inevitably interact with material they study outside Anki, which will affect memory states; e.g. reading a book in a foreign language should change the stability of cards with words that were encountered in the reading. I assume that unless someone is deliberately studying something very abstract which has absolutely no practical use in their life, the effects of reviewing sibling cards would be negligible compared to those of applying their skills in practice. These effects are impossible to account for; instead, the best that could be done is to translate the user's subjective perception of how well they know the answer, which comes in the form of review grades, into changes in difficulty and stability.

Expertium commented 1 year ago

These effects are impossible to account for

Of course if something happened outside of Anki, it's impossible to account for. But if something happened within Anki - such as reviewing a conceptually similar card - it should be possible to use the information about that event in scheduling.

brubsby commented 1 year ago

I think there is actually a neural network that outputs a "similarity score" based on the card's text.

@L-M-Sherlock I can't find it, but I think I saw it when browsing your profile on github.

EDIT: https://github.com/thiswillbeyourgithub/AnnA_Anki_neuronal_Appendix

This is awesome! I didn't include in my spiel, but did have the thought that maybe some clever similarity metric might be a good place to start for the graph's edge weights, and then perhaps some optimization of the weights for magnitude of the effect, based off the time series data could be useful.

the effects of reviewing sibling cards would be negligible compared to those of applying their skills in practice.

I would agree that the effect of real-world practice is likely large and unknowable from the optimizer's point of view (but can likely be broadly optimized as deck ease), but the effect of conceptual neighbors is still very important, and I believe a major "cache" of loss that can be "harvested". Imagine you have a deck that has some accidental duplicates. For instance, I have a large deck about sign language that includes slight variations of specific signs. Some are so close to each other that reviewing one is basically like reviewing the other (perhaps only a slight change in facial expression, but still a difference worth noting). These are surely stored in my brain as the same memory, and reviewing one card is surely nearly identical to reviewing the other (with respect to the memory's stability). I would think an optimal system would be able to tease this relationship out and essentially treat reviewing one as reviewing the other.

That is perhaps a situation where the weight between the nodes of the graph is 1 (just as the self-edge might also have a weight of 1 to represent a review's effect on itself), whereas there are situations where the effect of reviewing a conceptual neighbor is not as marked. Maybe a certain example sentence uses a specific sign that is similar to a vocab card, but not quite the same, and it sometimes makes you think of the sign and stabilizes your memory a bit. This would be a situation where the edge weight is quite low, but nonzero.

This type of all-neighbor/conceptual-sibling optimization seems a lot more prone to the curse of dimensionality, especially if one were trying to optimize individual edge weights. But I would imagine that if you ran the aforementioned "text similarity neural net" once, used those edge weights with a few changes to the stability/difficulty mechanism to account for them, and optimized some parameters in that mechanism globally, you would see some improvement in loss for a lot of decks. And if that were the case, I'd think it would be a good sign that optimizing individual edge weights might be beneficial as well. One might just need to have the train and test set be temporally sampled rather than sampled by cards (if this isn't already the case), i.e. the test set is the most recent time slice of the dataset, whereas the training set ends before then, as all cards need to be represented in the training set so their edge weights can be established.
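
A minimal sketch of that temporal split (the column name and cutoff date are hypothetical; the optimizer's actual splitting code may differ):

    import pandas as pd

    def temporal_split(revlog: pd.DataFrame, cutoff: str = "2023-06-01"):
        # train on everything before the cutoff, test on the most recent slice
        revlog = revlog.sort_values("review_date")
        train = revlog[revlog["review_date"] < cutoff]
        test = revlog[revlog["review_date"] >= cutoff]
        return train, test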

Expertium commented 1 year ago

So out of curiosity, I decided to combine the three most promising (on paper) ideas:
  • Optimizable values for grades instead of 1, 2, 3, 4
  • Initial D determined by four optimizable parameters rather than a formula
  • The surprise function

This introduces 11 new parameters: 4 for the grades, 4 for the values of initial D, and 3 for the surprise function. This adds more flexibility than ever, as well as incorporating R into D. Note that I'm still using (11 - new_d) in the main formula, and I didn't change the formula for lapses either. Everything was tested in 4.0.0 Beta. There was no statistically significant difference in performance, and on average, the RMSE of the version with 11 extra parameters was almost exactly the same as the RMSE of 4.0.0 Beta. As I've mentioned before, I've tried over 20 modifications of D, and none of them worked, including those that changed this part (11 - new_d). I don't see any way to improve D without using neural networks or some kind of extremely complicated math, so I would close this issue. I don't think we will see any improvement here; I believe there are more promising areas for development.

Expertium commented 1 year ago

One more thing that, I think, is worth trying: not excluding same-day reviews, but only adjusting difficulty without changing stability. The logic is that if the user pressed "Hard" (or "Again") two or more times when seeing the card on the same day, this card must be more difficult than a card where the user pressed "Hard" only once. @L-M-Sherlock once you have some free time, I would like you to test this. Do not exclude same-day reviews, but do not recalculate S, only D.
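
A rough sketch of what that could look like around a step function like the one above (the helper name and the elapsed_days < 1 test are my assumptions, not existing optimizer code):

    import torch

    def update_memory_state(state: torch.Tensor, new_s: torch.Tensor,
                            new_d: torch.Tensor, elapsed_days: torch.Tensor) -> torch.Tensor:
        # state: [batch, 2] with state[:, 0] = S and state[:, 1] = D, as in step() above
        same_day = elapsed_days < 1
        kept_s = torch.where(same_day, state[:, 0], new_s)  # S frozen on same-day reviews
        return torch.stack([kept_s, new_d], dim=1)          # D is still updated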

Vilhelm-Ian commented 1 year ago

This is definitely true. Cards that, during learning, I have pressed Again on 10 times in the same day instead of once are way harder. I've noticed that FSRS needs a lot of time to catch up to those.

thiswillbeyourgithub commented 1 year ago

Hi, I'm the author of AnnA and was brought here by @brubsby.

I'm a medical student very into data science. I created AnnA to reduce the number of reviews I had to do, but along the way I experimented with other uses of my vectorization engine: tagging the K nearest neighbors of each card, tagging cards by their cluster's semantics, creating semantic plots, etc.

I was not really aware of the potential of FSRS until @brubsby talked me into it, and I'm now very interested! I probably won't have time to dig into the whole thread here, but I'm happy to answer any questions you might have and can add features if that helps.

I don't have too much time, but I figured you might be interested in some details about AnnA.

Quick summary of how AnnA works: it's a Python script that gathers cards either from ankiconnect or from ankipandas (as ankiconnect is a serious speed bottleneck), uses a field_mapping file to know which field of which notetype to take into account, applies a bunch of text processing (basically HTML stripping; at some point I also supported stemming, tokenizing, and probably a few other things), then vectorizes each card. The result is a 2D 'distance matrix' for the selected deck.

Vectorization details: I supported several engines but mainly use either tokenization + TF-IDF (universal, but not as semantic as I'd like it to be; it also supports weighting of the fields) or sbert embeddings (multilingual models have a low maximum text size, so I did a sort of rolling window + pooling).

Perf: it's not actually that intensive to compute the distance matrix, except for very large decks, but then you could use a faster function to only find the n closest cards, and you don't actually have to rerun the script very often.
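
For anyone curious, a minimal sketch of the TF-IDF + distance-matrix step described above (not AnnA's actual code; the example card texts are made up):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_distances

    card_texts = [
        "la casa - the house",
        "el perro - the dog",
        "la casa grande - the big house",
    ]

    vectors = TfidfVectorizer().fit_transform(card_texts)  # one row per card
    distance_matrix = cosine_distances(vectors)            # 2D matrix, 0 = identical
    print(distance_matrix.round(2))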

One year from now I'll have finished my exam and plan to make AnnA very easy to use for everyone (as well as many of my other hidden FOSS AI Anki projects :) ), but I won't have time for now. I know the code is a bit terrible at times, but I had to make a tradeoff, as I'm sure you can understand.

Cheers !

Expertium commented 1 year ago

@L-M-Sherlock I know this is fairly low on your priority list, but we could enhance "Delay siblings" feature with this.

Gilfaro commented 1 month ago

From my simple tests, it looks like the way D is currently used in stability_after_success and stability_after_failure pushes the optimizer to shift most cards, for most users, to D=10.

(11 - state[:, 1])

wants to be as close to 1 as possible and

torch.pow(state[:, 1], -self.w[12])

wants D to be as close to the maximum as possible, to make the power term as small as possible.

I tried a simple change to rescale D to 0.0-1.0, and this seems to free up the difficulty, but the metrics are a bit worse across the board.
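
A minimal sketch of the kind of rescaling I mean (the exact changes I made to the stability formulas aren't shown here):

    import torch

    def rescale_difficulty(d: torch.Tensor) -> torch.Tensor:
        # map D from [1, 10] to [0, 1]: 1 -> 0.0, 10 -> 1.0
        return (d - 1) / 9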