open-spaced-repetition / fsrs4anki

A modern Anki custom scheduler based on the Free Spaced Repetition Scheduler algorithm
https://github.com/open-spaced-repetition/fsrs4anki/wiki
MIT License

[Feature Request] Improving the algorithm, continuation #282

Closed. Expertium closed this issue 1 year ago.

Expertium commented 1 year ago

Which module is related to your feature request? Scheduler and Optimizer.

Is your feature request related to a problem? Please describe. This issue will be used to share ideas regarding further improvement of the algorithm. The previous issue has accumulated too many comments.

Describe the solution you'd like

  1. We still haven't done a proper comparison of 4.0 Alpha with the power curve vs with the exponential curve. I would like to test those 2 side-by-side, I need code for that.
  2. LOF outlier detection can be improved - use 2 dimensions (n reviews and delta t) instead of 1 and use log-transform before feeding the data to LOF.
  3. Not related to accuracy, but Sherlock, please check the code for typos. I'm pretty sure you have "B-M Metric" instead of "B-W Metric" in several places.
  4. While optimizing, select parameters that correspond to the best epoch, not the final epoch. Best = lowest average of test and train loss. On the image below the lowest average was achieved before the final epoch, hence the set of parameters that corresponds to that particular epoch should have been selected. image

For now, the recall matrix is the most promising change that could greatly improve accuracy. However, it treats the symptoms, not the cause of the illness. We still have to figure out why FSRS performs poorly for some users and how to fix that. Removing re-learning steps only helped a little bit.

Additional context: Don't forget to check the other issues. For example, I have an issue related to changing how cards are postponed using the Helper add-on.

Expertium commented 1 year ago

Also, I obtained this when using the parametrized curve, which had RMSE=1.65%, while Anki had RMSE>13%. So the difference in RMSE made it very clear which algorithm is superior. But on this chart it's not clear at all. Even when I compare the best FSRS version with Anki, this graph still makes them look about the same. I think this new graph is just too confusing. image

L-M-Sherlock commented 1 year ago

But on this chart it's not clear at all. Even when I compare the best FSRS version with Anki, this graph still makes them look about the same.

I think it is clear. For example, B-W = 0.1 means the algorithm overestimates retention. If the real retention is 80%, Anki will predict 90%. In other words, the forgetting index is underestimated by Anki: the real forgetting index is 20%, but Anki predicts 10%. The user will forget twice as many cards as predicted.

L-M-Sherlock commented 1 year ago
  • We still haven't done a proper comparison of 4.0 Alpha with the power curve vs with the exponential curve. I would like to test those 2 side-by-side, I need code for that.

OK, I will implement it later.

  • LOF outlier detection can be improved - use 2 dimensions (n reviews and delta t) instead of 1 and use log-transform before feeding the data to LOF.

I'm detecting the weaknesses of FSRS via the R-Matrix. I will consider it later.
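For reference, a 2-D, log-transformed LOF filter could look roughly like this (a sketch, not the optimizer's actual code; the column names n_reviews and delta_t are assumptions):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# log-transform both dimensions, then mark outliers with LOF (-1 = outlier, 1 = inlier)
features = np.log1p(df[['n_reviews', 'delta_t']].to_numpy())
lof = LocalOutlierFactor(n_neighbors=20)
df = df[lof.fit_predict(features) == 1]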

  • Not related to accuracy, but Sherlock, please check the code for typos. I'm pretty sure you have "B-M Metric" instead of "B-W Metric" in several places.

I fixed the typo in the latest optimizer last night.

4. While optimizing, select parameters that correspond to the best epoch, not the final epoch. Best = lowest average of test and train loss. On the image below the lowest average was achieved before the final epoch, hence the set of parameters that corresponds to that particular epoch should have been selected.

I think increasing the number of epochs could solve it?

Expertium commented 1 year ago

I think to increase the num of epoch could solve it?

The same reasoning still applies - the final epoch is not necessarily the best.

L-M-Sherlock commented 1 year ago

The same reasoning still applies - the final epoch is not necessarily the best.

Fine. I will try.

By the way, I find that the current formula for the initial stability is not good. The initial stability should be non-linear in the first rating: the initial stability for Again and Hard should be close, while the initial stability for Easy should be much larger. I plan to wrap self.w[0] + self.w[1] * (X[:,1] - 1) in exp(). What do you think?
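For illustration, with made-up values w[0] = 1.0 and w[1] = 0.8 (not the real parameters), the two forms would give:

# linear:      S0 = w[0] + w[1] * (rating - 1)       ->  1.0, 1.8,  2.6,  3.4
# exp-wrapped: S0 = exp(w[0] + w[1] * (rating - 1))  ->  2.7, 6.0, 13.5, 30.0
# Again and Hard stay relatively close, while Easy becomes much larger.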

Expertium commented 1 year ago

I still think that since there are 4 possible outcomes, 4 parameters are necessary. But if you don't want to do that, then yes, your idea could work.

L-M-Sherlock commented 1 year ago

I still think that since there are 4 possible outcomes, 4 parameters are necessary. But if you don't want to do that, then yes, your idea could work.

I have tried the 4-parameter version, but the splits mess it up. And as you mentioned in https://github.com/open-spaced-repetition/fsrs4anki/issues/275#issuecomment-1571830036, some ratings may never be used during the first learning step. The 4-parameter version can't deal with that either.

user1823 commented 1 year ago

In my opinion, the number of times the user presses Again on the first day is an important indicator of the difficulty of the card. So, it should affect the initial difficulty and the initial stability of the card. Currently, the initial difficulty and stability are based only on the very first rating, which is not good in my opinion.
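A rough sketch of that signal in terms of the review log (all column names here are hypothetical, and revlog is an assumed pandas DataFrame):

# count how many times "Again" (rating == 1) was pressed on the first day, per card
first_day = revlog[revlog['delta_t'] == 0]
again_per_card = (first_day['rating'] == 1).groupby(first_day['card_id']).sum()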

user1823 commented 1 year ago
  1. Output RMSE(Anki) - RMSE(FSRS). A positive value would indicate that Anki has a larger RMSE; a negative value would indicate that Anki has a lower RMSE. This is what Woz does: https://supermemo.guru/wiki/R-Metric
image
from sklearn.metrics import mean_squared_error
import math

R_Metric = math.sqrt(mean_squared_error(cross_comparison['y'], cross_comparison['anki_p'])) - math.sqrt(mean_squared_error(cross_comparison['y'], cross_comparison['p']))
print("R_Metric: ", R_Metric)

Is this what you need?

@L-M-Sherlock, I think that just the difference is not very useful. We should also print the mean_squared_error for Anki and FSRS separately.
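Something like this would print both (a minimal sketch, reusing the imports and variables from the snippet above):

rmse_anki = math.sqrt(mean_squared_error(cross_comparison['y'], cross_comparison['anki_p']))
rmse_fsrs = math.sqrt(mean_squared_error(cross_comparison['y'], cross_comparison['p']))
print("RMSE (Anki): ", rmse_anki)
print("RMSE (FSRS): ", rmse_fsrs)
print("R_Metric:    ", rmse_anki - rmse_fsrs)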

Expertium commented 1 year ago

@user1823, here is the new code without that metric and with a different one

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from sklearn.metrics import mean_squared_error

plt.figure(figsize=(6, 6))

ax = plt.gca()
ax.xaxis.set_major_locator(MultipleLocator(0.1))
ax.yaxis.set_major_locator(MultipleLocator(0.1))

# B-W = predicted R minus measured R (positive = overestimation)
cross_comparison['Anki_B-W'] = cross_comparison['anki_p'] - cross_comparison['y']
cross_comparison['FSRS_B-W'] = cross_comparison['p'] - cross_comparison['y']
# bin predicted R into steps of 0.05
cross_comparison['FSRS_bin'] = cross_comparison['p'].map(lambda x: round(2 * x, 1) / 2)
cross_comparison['Anki_bin'] = cross_comparison['anki_p'].map(lambda x: round(2 * x, 1) / 2)

plt.axhline(y = 0.0, color = 'black', linestyle = '-')

# FSRS's error ("FSRS by Anki"), binned by Anki's predicted R
cross_comparison_group = cross_comparison.groupby(by='Anki_bin').agg({'y': ['mean'], 'FSRS_B-W': ['mean'], 'p': ['mean', 'count']})
print(f"Universal Metric of FSRS: {mean_squared_error(cross_comparison_group['y', 'mean'], cross_comparison_group['p', 'mean'], sample_weight=cross_comparison_group['p', 'count'], squared=False):.4f}")
cross_comparison_group['p', 'percent'] = cross_comparison_group['p', 'count'] / cross_comparison_group['p', 'count'].sum()
plt.scatter(cross_comparison_group.index, cross_comparison_group['FSRS_B-W', 'mean'], s=cross_comparison_group['p', 'percent'] * 1024, alpha=0.5)
plt.plot(cross_comparison_group['FSRS_B-W', 'mean'], label='FSRS by Anki')

# Anki's error ("Anki by FSRS"), binned by FSRS's predicted R
cross_comparison_group = cross_comparison.groupby(by='FSRS_bin').agg({'y': ['mean'], 'Anki_B-W': ['mean'], 'anki_p': ['mean', 'count']})
print(f"Universal Metric of Anki: {mean_squared_error(cross_comparison_group['y', 'mean'], cross_comparison_group['anki_p', 'mean'], sample_weight=cross_comparison_group['anki_p', 'count'], squared=False):.4f}")
cross_comparison_group['anki_p', 'percent'] = cross_comparison_group['anki_p', 'count'] / cross_comparison_group['anki_p', 'count'].sum()
plt.scatter(cross_comparison_group.index, cross_comparison_group['Anki_B-W', 'mean'], s=cross_comparison_group['anki_p', 'percent'] * 1024, alpha=0.5)
plt.plot(cross_comparison_group['Anki_B-W', 'mean'], label='Anki by FSRS')

plt.legend(loc='lower center')
plt.grid(linestyle='--')
plt.title("Anki vs. FSRS")
plt.xlabel('Predicted R')
plt.ylabel('B-W Metric')
plt.xlim(0, 1)
plt.xticks(np.arange(0, 1.1, 0.1))
plt.show()

image

Expertium commented 1 year ago

I tested this:

# initial stability as an exponential function of the first rating, with w[14] as a learnable rating offset
new_s = torch.exp(self.w[0] + self.w[1] * (X_rating - self.w[14]))
w[0] = w[0].clamp(-3, 3)
w[1] = w[1].clamp(0.1, 5)

Overall, it did not improve performance. However, I noticed an interesting thing: the two decks that benefited the most from this change were both the easiest decks in my collection. RMSE went down by 66% (in other words, down to 1/3 of its previous value) for one of them, and by 37% for the other. The effects on other decks were pretty much random. For Sherlock and user1823, RMSE became slightly lower. image

I would love to have more data. @L-M-Sherlock I might be asking for too much, but would you mind sharing all the collections you have received via the Google Form? You can make all fields empty (like user1823 did), I only need the repetition history. If I want to test changes as properly as possible, then I need as much data as I can possibly get.

L-M-Sherlock commented 1 year ago

You can make all fields empty (like user1823 did), I only need the repetition history.

I don't know how he did that. @user1823, could you let me know how to make all fields empty?

L-M-Sherlock commented 1 year ago

4. While optimizing, select parameters that correspond to the best epoch, not the final epoch. Best = lowest average of test and train loss. On the image below the lowest average was achieved before the final epoch, hence the set of parameters that corresponds to that particular epoch should have been selected.

I have implemented it here: https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/experiment/mini-batch.ipynb

L-M-Sherlock commented 1 year ago
  • We still haven't done a proper comparison of 4.0 Alpha with the power curve vs with the exponential curve. I would like to test those 2 side-by-side, I need code for that.

I implemented it here: https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/experiment/fsrs4anki_optimizer_alpha.ipynb

You can control the shape of the forgetting curve via this line of code:

image
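For reference, the two shapes being compared can be written roughly like this (a sketch, not the notebook's exact line; both are normalized so that R = 0.9 when t = S):

def power_curve(t, s):
    return (1 + t / (9 * s)) ** -1   # slower decay at long intervals

def exponential_curve(t, s):
    return 0.9 ** (t / s)            # classic exponential forgetting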

Expertium commented 1 year ago

This isn't quite what I wanted. I wanted to choose the best epoch based on the average of the test and train loss (ideally, a weighted average with the number of datapoints being used as weights). To my surprise, it still performed well, and the result is statistically significant with p<0.01. Note that I copy-pasted the relevant parts of the code to 4.0 Alpha instead of using mini-batch.ipynb. image

I recommend adding this to 4.0 Alpha and further fine-tuning it by using a weighted average of test loss and train loss to find the "best loss".
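Something along these lines would do it (a sketch with hypothetical helpers train_one_epoch and evaluate, not the notebook's actual code):

import copy

best_loss = float('inf')
best_state = None
for epoch in range(n_epochs):
    train_loss = train_one_epoch(model)      # mean loss on the train split
    test_loss = evaluate(model, test_set)    # mean loss on the test split
    # weighted average, with the number of datapoints in each split as weights
    w_train = len(train_set) / (len(train_set) + len(test_set))
    avg_loss = w_train * train_loss + (1 - w_train) * test_loss
    if avg_loss < best_loss:
        best_loss = avg_loss
        best_state = copy.deepcopy(model.state_dict())
model.load_state_dict(best_state)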

Expertium commented 1 year ago

I tested the exponential and power curves; the exponential curve performed much worse. Relying on Woz's experience is generally a good idea, but I think in this case we should just stick to the power curve and close issue 274. Yes, using a power curve deviates from theory for cards that are perfectly segregated by difficulty, but in practice it makes the results better without any downsides, so let's use it. image

While it's possible that using the power curve for the first interval and the exponential curve for subsequent intervals would result in better performance than using only one curve, I find it extremely unlikely.

L-M-Sherlock commented 1 year ago

I wanted to choose the best epoch based on the average of the test and train loss (ideally, a weighted average with the number of datapoints being used as weights).

I have implemented it just now. You can check the mini-batch.ipynb.

Expertium commented 1 year ago

I am surprised once again: the weighted loss version performed only marginally better than the previous version that only used the test loss. In any case, this should become the default for future versions of the optimizer. image

user1823 commented 1 year ago

You can make all fields empty (like user1823 did), I only need the repetition history.

I don't know how he did that. @user1823, could you let me know how to make all fields empty?

Just use the Find and Replace feature in Anki. Use (.|\n)* in the Find field and keep the Replace With field empty. Be sure to check (✓) the "Treat input as regular expression" option.

Expertium commented 1 year ago

Out of curiosity, I decided to run this new optimizer with 50 epochs. This is the most bizarre pattern I could imagine (in the context of machine learning). Train set loss going down or staying the same while test loss goes up is a textbook example of overfitting: the model learns the train data better and better while losing the ability to generalize. But this is the exact opposite: test set loss goes down while train set loss goes up. Either the labels are swapped (I checked, they aren't) or something very strange is going on. image

L-M-Sherlock commented 1 year ago

In any case, this should become the default for future versions of the optimizer.

I added it in the latest release: https://github.com/open-spaced-repetition/fsrs4anki/releases/tag/v3.21.3

But this is the exact opposite: test set loss goes down while train set loss goes up. Either the labels are swapped (I checked, they aren't) or something very strange is going on.

Interesting. Does it happen on all splits?

Expertium commented 1 year ago

Interesting. Does it happen on all splits?

No. If I remember correctly, I've only seen it on one split.

L-M-Sherlock commented 1 year ago

No. If I remember correctly, I've only seen it on one split.

Maybe the splits have very different distributions of data, or there are many outliers in the train set.

L-M-Sherlock commented 1 year ago

I sped up section 1.2 in the mini-batch version. It saves 70% of the time for me. Could you test it?

https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/experiment/mini-batch.ipynb

Expertium commented 1 year ago

4.0 Alpha image

mini-batch image

L-M-Sherlock commented 1 year ago

Did you record the total time taken by section 1.2?

Expertium commented 1 year ago

You mean with a timer? No. I thought all you needed was to sum up the individual segments. image

Expertium commented 1 year ago

Ok, I measured it with a smartphone timer. This took almost exactly 60 seconds. image

This took 9 seconds. image

So yes, new version is faster.

L-M-Sherlock commented 1 year ago

The notebook has its own timer. And I refactored the first and second segments (the main bottlenecks), so they don't have a progress bar.

L-M-Sherlock commented 1 year ago

So yes, new version is faster.

Nice. I will cherry-pick it to the main branch tomorrow.

Expertium commented 1 year ago

The notebook has its own timer

My bad. Here image

Expertium commented 1 year ago

@L-M-Sherlock I think you closed the issue related to postponing/advancing too early. You still haven't implemented deck priority, and even if you don't want to do that, there is still another thing: displaying the number of cards that can be postponed or advanced.

L-M-Sherlock commented 1 year ago

I think you closed the issue related to postponing/advancing too early. You still haven't implemented deck priority, and even if you don't want to do that, there is still another thing: displaying the number of cards that can be postponed or advanced.

It is tracked in https://github.com/open-spaced-repetition/fsrs4anki-helper/issues/119

Expertium commented 1 year ago

Out of curiosity, I combined the four things that have helped me so far:
1) Removing re-learning steps completely: df = df.drop(df[(df['type'] == 2)].index)
2) Weighted R using grade-derived R and theoretical R
3) New "smart" epochs with weighted loss
4) new_s = torch.exp(self.w[0] + self.w[1] * (X_rating - self.w[14])). It helps a lot for a few decks and doesn't really affect performance, or makes it slightly worse, for other decks, so I decided to include it, hoping that it will make a big difference for 1-2 decks.

Currently, this is the second-most accurate version. The most accurate one is the one with the parametrized curve, and it's winning by a wide margin. image

What's interesting is that, on average, log-loss barely changes while RMSE goes down. Perhaps we need to rethink our optimization process. I believe in SuperMemo the squared difference between predicted R and the corresponding recall matrix entry is used as the optimization criterion. But that requires implementing the recall matrix first. And I'm not sure if using that as an optimization criterion is even possible in PyTorch.

Expertium commented 1 year ago

Is it possible to define a custom loss in this version? Previous version had def loss() or something like that, but here it's just a built-in loss, nn.BCELoss().

L-M-Sherlock commented 1 year ago

Is it possible to define a custom loss in this version? Previous version had def loss() or something like that, but here it's just a built-in loss, nn.BCELoss().

Of course. What loss function do you want to define?

Expertium commented 1 year ago

"Woz loss", -log(1-abs(x-y)). It behaves exactly the same as binary cross-entropy loss (aka log-loss) when one of the variables (x or y) is between 0 and 1 and the other one is strictly 0 or 1, no in-between. In other words, if you replace BCELoss with "Woz loss" right now, it will do the same thing. The advantage is that it behaves much more intuitively when both x and y are between 0 and 1. I see two scenarios where this could be helpful: 1) Using "soft" labels (any number between 0 and 1) for Hard and Good instead of "hard" labels (just 1 or 0). I want to try to label Again as 0, Hard and Good as some number between 0 and 1, and Easy as 1, based on my predicted R-grade chart. image Though I tried it in v3.17.1 and it only made perfromance worse. I ran an optimization with "hard" labels, then replaced them with "soft" labels. 2) Once the recall matrix is implemented, we can use the Woz loss to minimize the difference between theoretical R and the corresponding matrix entry. image I would be very surprised if that failed.

Expertium commented 1 year ago

@L-M-Sherlock I have an idea how to make the calibration graph a little nicer using the data from the recall matrix. It's something that can be implemented right now (by the way, when do you plan to start working on implementing the recall matrix as part of the algorithm rather than just a summary statistic?). It's a little cumbersome to explain how exactly it works, so here is an Excel spreadsheet. Calibration.graph.moving.average.xlsx

1) Take all entries of the R-matrix with count > 0.
2) Sort them by R_t (we don't need S_rounded and D_rounded).
3) Use a bi-directional moving average to smooth both R_t and measured R. Check the spreadsheet to see the exact implementation. Notice that the formulas are a little different at the beginning and at the end of the table due to how the bi-directional average is defined. I used a thick border (at the beginning and at the end of the table) to mark where the formulas become different.

There are two reasons why I want this instead of the current version with bins:
1) It makes the graph look less jagged in the low predicted R range. Also, it's better to have the blue line cut off due to insufficient data than to leave it and create a false impression of inaccuracy.
2) It makes it more comparable with Woz's graph for SM-17, since that's the method he uses instead of bins.

Note that it doesn't solve the problem of choosing the number of bins (which is arbitrary), it just replaces it with a similar problem: choosing the length of the moving average (which is also arbitrary).

image: Calibration graph (bins vs. bi-directional trailing average)
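A rough pandas equivalent of the spreadsheet (a sketch; rmatrix is an assumed DataFrame with columns r_t, measured_r and count, and the interpretation of "bi-directional" as a forward pass followed by a backward pass is an assumption):

entries = rmatrix[rmatrix['count'] > 0].sort_values('r_t').reset_index(drop=True)
window = 9  # as arbitrary as the bin count
# forward rolling mean, then the same thing in reverse order
forward = entries[['r_t', 'measured_r']].rolling(window, min_periods=1).mean()
smoothed = forward[::-1].rolling(window, min_periods=1).mean()[::-1]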

L-M-Sherlock commented 1 year ago

"Woz loss", -log(1-abs(x-y)). It behaves exactly the same as binary cross-entropy loss (aka log-loss) when one of the variables (x or y) is between 0 and 1 and the other one is strictly 0 or 1, no in-between.

Actually, BCELoss supports y (targets) between 0 and 1.

For details, please see: https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html

So we don't need to replace it.

by the way, when do you plan to start working on implementing the recall matrix as part of the algorithm rather than just a summary statistic?

I will implement it after the work in auto-dispersing is done.

Expertium commented 1 year ago

Actually, BCELoss supports y (targets) between 0 and 1.

It supports them, but it behaves weirdly. Here is what I mean: image See how, when I replaced the ones with 0.9, BCE went up even though the predictions now match reality more accurately?

L-M-Sherlock commented 1 year ago

See how when I replaced ones with 0.9, BCE went up even though predictions now match reality more accurately?

You can't compare them in this way. You should keep the targets the same and compare the losses between different predictions.

For example, with BCELoss(x, y), BCELoss(0.9, 0.9) = 0.325 < BCELoss(0.8, 0.9) = 0.362.
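(The same numbers checked in code, alongside the Woz loss for the same inputs; a minimal sketch:)

import torch
import torch.nn as nn

bce = nn.BCELoss()
woz = lambda p, y: (-torch.log(1 - torch.abs(p - y))).mean()

y = torch.tensor([0.9])
print(bce(torch.tensor([0.9]), y).item(), woz(torch.tensor([0.9]), y).item())  # ~0.325 vs 0.000
print(bce(torch.tensor([0.8]), y).item(), woz(torch.tensor([0.8]), y).item())  # ~0.362 vs ~0.105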

L-M-Sherlock commented 1 year ago

It makes it more comparable with Woz's graph for SM-17, since that's the method he uses instead of bins.

Could you provide the source?

Expertium commented 1 year ago

Could you provide the source?

https://supermemo.guru/wiki/Universal_metric_for_cross-comparison_of_spaced_repetition_algorithms image By the way, I fixed some stuff in the Excel file above. Also, it doesn't seem like Woz runs his average on R-matrix entries; his curve is too smooth. I don't know, maybe he runs it on raw (1 or 0 for y) data? Well, my idea is still closer to what Woz has than the current implementation with bins. EDIT: never mind, I don't think we should do this. With the new implementation it's very unclear how to calculate RMSE and R^2, specifically, how to weigh the values.

You can't compare them in this way. You should keep targets same and compare the losses between different predictions.

Woz loss is still more intuitive. Here's another example: image Even when both y (real data) and p (predictions) are exactly the same, BCE≠0, while Woz loss=0, just like you would expect. According to Woz, "it is the perfect metric function such that it is minimized when the predicted probability distribution (B1, B2, B3, etc.) matches the actual probability distribution (W1, W2, W3, etc.)."

Expertium commented 1 year ago

Meanwhile I tried replacing torch.pow(s, -self.w[7]) with torch.exp(-self.w[7] * s), out of curiosity. It made RMSE worse. image

Expertium commented 1 year ago

I finally figured out how to implement the Woz loss. Section 2.1, only the important parts:

import torch
import torch.nn as nn

class WozLoss(nn.Module):

    def __init__(self):
        super(WozLoss, self).__init__()

    def forward(self, predictions, labels):
        # loss against the "soft" labels (values between 0 and 1)
        loss_soft = -torch.log(1 - torch.abs(predictions - labels))
        loss_soft = loss_soft.clamp(0, 100)

        # loss against the "hard" labels (soft labels rounded up to 0/1)
        labels_hard = torch.ceil(labels)
        loss_hard = -torch.log(1 - torch.abs(predictions - labels_hard))
        loss_hard = loss_hard.clamp(0, 100)

        # weighted mix of the two
        soft_weight = 0.05
        loss_weighted = soft_weight * loss_soft + (1 - soft_weight) * loss_hard
        return loss_weighted.sum()

self.loss_fn = WozLoss()

Section 4.1:

dataset['log_loss'] = dataset.apply(lambda row: -np.log(1 - np.abs(row['p'] - row['y'])), axis=1)

Section 4.2:

dataset['y'] = dataset['y'].map(lambda x: math.ceil(x))

This is necessary in case you are using "soft" labels for Hard and Good, to convert them back to 1. Also, I made the loss hybrid so that you can use a weighted average of the loss with "soft" and "hard" labels. When soft_weight = 0 you get the exact same behavior as if you used "hard" labels: 0 for Again, 1 for everything else.

Now the big question is, where do we get accurate values of R that correspond to Hard and Good? I decided to use the parametrized curve as a "teacher" because it's the most accurate one. I would optimize it on some deck, record the values of R corresponding to each grade, and then plug them into 4.0 Alpha. Example of using the parametrized curve to find which value of predicted R corresponds to which grade (Again = 1, Hard = 2, Good = 3, Easy = 4): image image

Unfortunately, it only makes RMSE worse for any value of soft_weight. For values of soft_weight around 0.015-0.05 it can make RMSE very, very slightly better, but it doesn't matter. Either the values from the "teacher" aren't accurate enough, or this entire approach is not good. I didn't test it properly because it requires too many tests: I have to train the "teacher" model, then copy-paste the values for Hard and Good into the "student" (no parametrized curve) model, and on top of that, I have to try different values of soft_weight.

One way to solve this problem (assuming this approach has any merit at all) is to use measured R instead of predicted R. Basically, all I need is a recall matrix, but with an extra column: average grade. Then I can more accurately find the average R for different grades. That way, instead of training the "teacher" model and then the "student" model, I can train the student model twice: first with "hard" labels, then with "soft" labels. And even if this approach doesn't work at all, we still need WozLoss for the future, to try to optimize the algorithm based on the deviation between predicted R and measured R in the R-matrix.
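(The extra column could be as simple as this; a sketch assuming a review-level DataFrame df with an r_bin column for the R-matrix cell, all names hypothetical:)

rmatrix = df.groupby('r_bin').agg(
    measured_r=('y', 'mean'),
    avg_grade=('rating', 'mean'),
    count=('y', 'size'),
).reset_index()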

L-M-Sherlock commented 1 year ago

I think, "soft_label" makes the idea of retrievability itself rather strange. Our current calculation of average retrievability is the mean value of 'y'. If we have a set of data where all user ratings are 'good', the average retrievability is only 0.985. This, in turn, shifts the meaning of memory stability. The forgetting curve reflects the proportion of 'again' ratings over time, but it doesn't reflect the change in proportions of 'hard', 'good', and 'easy'.

Expertium commented 1 year ago

"Soft" labels are only used for training the model. For everything else - like plotting the calibration graph - you can convert them back to 0 and 1. dataset['y'] = dataset['y'].map(lambda x: math.ceil(x)) Granted, so far trying to use this idea didn't improve anything.

Expertium commented 1 year ago

I tested whether changing the range of D (by changing d_max) affects performance. It seems that beyond 10, there is no improvement (20 is the baseline). image

Interestingly, setting d_max to 5 did almost nothing for my decks (and collection), but greatly worsened RMSE for Sherlock and user1823. For example, for user1823, RMSE went from 0.0116 to 0.0244. This is a problem with my testing: due to a lack of data, I have to use a lot of my own decks but only two collections from other people, making the results biased. I really need more collections from other users in order to conduct more accurate tests. I think we should keep maximum D capped at 10. Increasing it won't do much, it seems, and it's better to keep it consistent between different versions.

qerne130 commented 1 year ago

Hi, how do I check if FSRS is enabled? It would be quite bad if I spent a lot of time reviewing cards and it turned out that I made some error while installing FSRS and it wasn't enabled at all.

Expertium commented 1 year ago

@N3Q1E2 find this in the code of the scheduler (the one that you copy-paste in a window):

// FSRS supports displaying memory states of cards.
// Enable it for debugging if you encounter something wrong.
const display_memory_state = true;

Make sure that display_memory_state is set to true, then restart Anki. Now you will see something like this: image