open-spaced-repetition / fsrs4anki

A modern Anki custom scheduling script based on the Free Spaced Repetition Scheduler algorithm
https://github.com/open-spaced-repetition/fsrs4anki/wiki
MIT License
2.53k stars · 127 forks

Compare Anki SM-2 vs FSRS for video #486

Closed. AnKingMed closed this issue 9 months ago.

AnKingMed commented 11 months ago

As requested, submitting an issue as a reminder for you. I'd appreciate help with comparing SM-2 to FSRS. I can use the Anki Simulator add-on based on my current stats, and I think it would be helpful to showcase that vs FSRS. The image below shows the stats I was using for the simulator add-on. Essentially, I'm trying to complete the deck (attached below, with scheduling) in 1 year. I think it would be helpful to showcase total reviews per day at 85% vs 90% vs 95% retention. It would also probably be helpful (though less so) to show completing the deck in 6 months vs 1 year vs 2 years, to show how the different algorithms would affect things.

Thank you for your help!!

Attachment

Dermki.apkg.zip

L-M-Sherlock commented 11 months ago

I think it would be helpful to showcase total reviews per day

That's not a good statistic for the comparison: forgetting 100 cards costs more time than remembering 100 cards.

@Expertium @user1823, do you have suggestion for the comparison which will show in an introduction video?

AnKingMed commented 11 months ago

That's not a good statistic for the comparison: forgetting 100 cards costs more time than remembering 100 cards.

True, but most of my audience are medical students. While they probably should care about time, they care about making sure they can finish the deck, and most of them estimate the effectiveness of the algorithm based on how many reviews they're doing per day (probably because they know roughly how long a given number of reviews of that specific deck takes; for me, ~250-300 took 1 hour).

Most of these are suspended now but this is my review data for my first three years of medical school with 30k notes. This is the deck that 70%+ of all medical students are using so perhaps this is better for showcasing?

I think it's reasonable to convert the view on SM-2 to a time-based comparison as well (we could just extrapolate both to # of reviews afterwards for example's sake).

AnKing Overhaul for Step 1 & 2.apkg.zip

L-M-Sherlock commented 11 months ago

they care about making sure they can finish the deck and most of them estimate effectiveness of the algorithm based on how many reviews they're doing a day

But they would not accept a very low retention. If they just learn new cards and ignore reviews, they will finish the deck quickly with fewer reviews per day. So we need to find an uncheatable metric to compare FSRS and SM-2.

I have an idea. We can set the limit for New cards/day for both SM-2 and FSRS. Then we can compare the final retention and the total time cost.

L-M-Sherlock commented 11 months ago

I think it's reasonable to convert the view on SM-2 to a time-based comparison as well (we could just extrapolate both to # of reviews afterwards for example's sake)

Do you optimize the weights of FSRS based on this deck? Could you provide the weights?

L-M-Sherlock commented 11 months ago

How about this comparison:

[image]

[image]

FSRS saves 1 - 1.54/1.93 = 20% time per remembered card.
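The arithmetic behind that figure, taking 1.93 and 1.54 as the simulated time cost per remembered card for SM-2 and FSRS respectively (read off the figures above; variable names here are illustrative):

```python
# Per-remembered-card time costs as quoted in the comment above
# (units are whatever the simulator reports, e.g. seconds per card).
sm2_cost = 1.93
fsrs_cost = 1.54

savings = 1 - fsrs_cost / sm2_cost
print(f"FSRS saves {savings:.0%} time per remembered card")  # → 20%
```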

AnKingMed commented 11 months ago

Do you optimize the weights of FSRS based on this deck? Could you provide the weights?

0.4006, 0.821, 2.4086, 17.6495, 4.7694, 0.7974, 0.9321, 0.0455, 1.7141, 0.1025, 1.1687, 2.1956, 0.0757, 0.3464, 1.3312, 0.2808, 2.8996

AnKingMed commented 11 months ago

FSRS saves 1 - 1.54/1.93 = 20% time per remembered card.

How did you calculate this?

This file Dermki.apkg.zip in my initial post is what I suggested using because we can set new cards per day and select retention. In that example you need to do 24-25 new cards/day in order to complete the deck in 365 days. The retentions I posted in that first pic are pretty close to accurate (~92% mature, 95% young, 78% learning). The optimized weights I just commented are the weights for that deck.

AnKingMed commented 11 months ago

I think it would make sense to run the FSRS algorithm against the stats I already have there and see what comes up, i.e. FSRS at 92% retention, 25 new cards/day. Would that work?

L-M-Sherlock commented 11 months ago

OK. I can run comparison based on that deck. But I need the parameters. Or can you provide your timezone and next_day_start_at? I can run the optimizer for you.

AnKingMed commented 11 months ago

Eastern time zone (New York); next_day_start_at is 2. These are the parameters I got: 0.4006, 0.821, 2.4086, 17.6495, 4.7694, 0.7974, 0.9321, 0.0455, 1.7141, 0.1025, 1.1687, 2.1956, 0.0757, 0.3464, 1.3312, 0.2808, 2.8996

L-M-Sherlock commented 11 months ago

[images: simulation results]

Expertium commented 11 months ago

I don't think this is a fair comparison, since Anki results in higher retention here.

[image]

Ideally, we should compare Anki and FSRS when both produce very similar retention.

AnKingMed commented 11 months ago

I agree we should hold the retention the same as a control and use total reviews/day and time/day as the variables

L-M-Sherlock commented 11 months ago

You can run the comparison here: https://github.com/open-spaced-repetition/fsrs4anki/blob/main/fsrs4anki_simulator.ipynb

L-M-Sherlock commented 11 months ago

By the way, if the retention is at the same level, the burden would not differ significantly. Instead, I recommend setting the retention to the suggested value, because a strength of FSRS is that it can find the optimal retention.

AnKingMed commented 11 months ago

You can run the comparison here: https://github.com/open-spaced-repetition/fsrs4anki/blob/main/fsrs4anki_simulator.ipynb

How do I use this?

I figured the number of reviews would be similar, but set at the same retention, FSRS should theoretically yield fewer reviews overall, because you're able to do slightly fewer cards and also slightly less re-learning, right?

L-M-Sherlock commented 11 months ago

How do I use this?

Just open the notebook in colab, upload your deck file and replace the parameters with yours. Then click Runtime->Run all to execute the simulation.

AnKingMed commented 11 months ago

Ok. Will it let me fix the retention for both and get graphs like the ones you shared? How do you think it would be best to compare these two, to convince people that FSRS is more efficient?

L-M-Sherlock commented 11 months ago

It only allows you to fix the retention of FSRS. To change the retention of SM-2, you need to tune the interval modifier.
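For context, here is a rough sketch of how an SM-2-style scheduler (like Anki's default) computes the next interval after a passing answer, which shows why scaling the interval modifier shifts retention. This is a simplification: it ignores fuzz, overdue days, and the maximum interval cap.

```python
def sm2_next_interval(prev_interval, ease=2.5, interval_modifier=1.0,
                      easy_bonus=1.0):
    """Simplified SM-2/Anki next interval after a passing answer:
    previous interval * ease * interval modifier (* easy bonus for Easy)."""
    return prev_interval * ease * interval_modifier * easy_bonus

# A 10-day interval answered "Good" at 250% ease:
print(sm2_next_interval(10))                         # 25.0 days
# Raising the interval modifier to 2.5 stretches every interval,
# which indirectly lowers retention:
print(sm2_next_interval(10, interval_modifier=2.5))  # 62.5 days
```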

AnKingMed commented 11 months ago

I think this actually illustrates it pretty well. For the same number of reviews and time spent reviewing, the retention is higher with FSRS. Am I interpreting that correctly?

[images]

AnKingMed commented 11 months ago

This doesn't match perfectly, but I tried to match the retention rates, and it shows a pretty decent improvement in time and reviews per day with FSRS.

[images]

Expertium commented 11 months ago

Yeah, I think that's good. It shows that with FSRS you can achieve the same retention while doing 20-30% fewer reviews. Also, you can increase this:

[image]

to make the curves less noisy and see the trend more clearly.

AnKingMed commented 11 months ago

Yeah, the retention one is less ideal. For the same number of reviews it's a ~2-3% increase in retention? I think the second set of graphs is probably better for convincing people FSRS is a better algorithm, don't you think?

L-M-Sherlock commented 11 months ago

Increasing retention is expensive, particularly when your retention is already very high.

AnKingMed commented 11 months ago

That's a good point. I suppose this would be quite a bit more dramatic if I adjusted things. Thanks for the thoughts!

Expertium commented 11 months ago

Yeah, the number of reviews you have to do increases non-linearly with retention. The closer retention is to 1.0, the more a change in it affects your workload.
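A quick way to build intuition for that non-linearity is to use the classic exponential forgetting curve R(t) = 0.9^(t/S) as a stand-in (FSRS v4 itself uses a power-law curve, so the exact numbers differ; the stability value below is hypothetical):

```python
import math

def interval_for_retention(stability, desired_retention):
    """Days until recall probability falls to desired_retention, assuming
    the exponential forgetting curve R(t) = 0.9 ** (t / stability)."""
    return stability * math.log(desired_retention) / math.log(0.9)

stability = 100  # hypothetical memory stability, in days
for r in (0.85, 0.90, 0.95, 0.99):
    t = interval_for_retention(stability, r)
    # Review workload is roughly proportional to 1 / interval, so it
    # explodes as desired retention approaches 1.0.
    print(f"R={r:.2f}: interval {t:6.1f} d, relative workload {stability / t:.2f}x")
```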

AnKingMed commented 11 months ago

If I want to show an example at .85 retention with my deck, should I use my parameters from the optimizer? Or just use the default FSRS parameters? I'm not sure I understand how those adjust things

Expertium commented 11 months ago

Good question. I'd say the default parameters, because they are more representative of an average user's memory.

AnKingMed commented 11 months ago

I keep getting this error with this deck. Any idea why?


```
ValueError                                Traceback (most recent call last)
in <cell line: 106>()
    231 card.iat[idx, field_map["lapses"]] = 0
    232
--> 233 r, t, p, new_states = student.init()
    234 new_stability = float(new_states[0])
    235 new_difficulty = float(new_states[1])

1 frames
in generate_rating(review_type)
     91 def generate_rating(review_type):
     92     if review_type == "new":
---> 93         return np.random.choice([1, 2, 3, 4], p=first_rating_prob)
     94     elif review_type == "recall":
     95         return np.random.choice([2, 3, 4], p=review_rating_prob)

mtrand.pyx in numpy.random.mtrand.RandomState.choice()

ValueError: probabilities do not sum to 1
```

AnKing.apkg.zip
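The error itself comes straight from NumPy: `np.random.choice` rejects any probability vector that doesn't sum to 1. A minimal reproduction (the `first_rating_prob` values here are made up for illustration; in the simulator that vector is estimated from the uploaded collection, and the bug left it mis-normalized):

```python
import numpy as np

first_rating_prob = np.array([0.3, 0.3, 0.3, 0.3])  # sums to 1.2, not 1.0

try:
    np.random.choice([1, 2, 3, 4], p=first_rating_prob)
except ValueError as err:
    print(err)  # "probabilities do not sum to 1"

# Normalizing the vector is the usual remedy:
first_rating_prob = first_rating_prob / first_rating_prob.sum()
sample = np.random.choice([1, 2, 3, 4], p=first_rating_prob)
print(sample)
```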

L-M-Sherlock commented 11 months ago

It's a bug. I have fixed it in this version: https://github.com/open-spaced-repetition/fsrs4anki/blob/v4.8.2/fsrs4anki_simulator.ipynb

AnKingMed commented 11 months ago

Thank you! Is there a way to run the simulator without uploading a file, or does it need data from somewhere? My retention on all my decks is already pretty darn high, so the difference between the two is not a ton.

Expertium commented 11 months ago

It needs your data to estimate these things:

[image]

Actually, now that I think about one of your previous questions, maybe it's better to compare Anki and FSRS using your parameters rather than the default parameters, since it doesn't make much sense to use default parameters while using your values of the stuff above.

AnKingMed commented 11 months ago

It's a bug. I have fixed it in this version: https://github.com/open-spaced-repetition/fsrs4anki/blob/v4.8.2/fsrs4anki_simulator.ipynb

@L-M-Sherlock it looks like the optimizer on hugging face is having the same error just FYI

L-M-Sherlock commented 11 months ago

it looks like the optimizer on hugging face is having the same error just FYI

Thanks for the reminder, I have updated the optimizer on hugging face just now.

AnKingMed commented 11 months ago

Ok, I'm a bit confused now. I did the comparison at 0.95, which is about where I'm at normally on Anki, and it looked decent: it showed that I'd do fewer cards with FSRS most of the time.

Then I switched retention to 0.9 and increased the interval modifier to 2.5 on the Anki settings. I didn't change anything else. Now it shows reviews being about the same and Anki retention actually being higher. I had to increase the interval modifier a huge amount just to decrease my retention that much. Is there something else I should do for this?

[images: reviews, retention]

My settings:

```python
# parameters for FSRS
w = [1.6587, 2.4185, 4.2583, 65.4933, 4.353, 2.3728, 2.8213, 0.0001, 2.654, 0.2041, 1.9349, 2.0306, 0.1588, 0.601, 1.0708, 0.0, 3.3191]
requestRetention = 0.90  # recommended setting: 0.8 ~ 0.9

# parameters for Anki
graduatingInterval = 3
easyInterval = 4
easyBonus = 1.5
hardInterval = 1.2
intervalModifier = 2.5
newInterval = 0.2
minimumInterval = 1
leechThreshold = 4
leechSuspend = False

# common parameters
maximumInterval = 36500
new_cards_limits = 45
review_limits = 9999
max_time_limts = 10000
learn_days = 550
deck_size = 25000

# get the true time from review logs
filename = "AnKing Step 1.apkg"

# smooth curves
moving_average_period = 30

# Set it to True if you don't want the optimizer to use the review logs from suspended cards.
filter_out_suspended_cards = False

# Red: 1, Orange: 2, Green: 3, Blue: 4, Pink: 5, Turquoise: 6, Purple: 7
# Set it to [1, 2] if you don't want the optimizer to use the review logs from cards with red or orange flag.
filter_out_flags = []
```

AnKing Step 1.apkg.zip

Sorry again for all the questions. Just trying to make this really simple for the public so they'll adopt it

AnKingMed commented 11 months ago

Another separate question: there is a reasonable group of people who do not believe in using the 'hard' or 'easy' buttons, but as I'm learning more about FSRS, it seems it's almost necessary if you want the full benefit of the algorithm. Is that true?

I wonder if at some point it'd be possible to learn how quickly someone answers things and then use that to automatically apply which button should be used... not sure how realistic that is

Expertium commented 11 months ago

it seems it's almost necessary if you want the full benefit of the algorithm. Is that true?

In theory, users who don't use all four buttons won't be able to utilize the entire range of values of D. In practice, however, we haven't investigated whether that affects accuracy. @L-M-Sherlock this isn't the first time I've said this. Perhaps you should open a new issue, specifically to investigate the relationship between how often users use certain buttons and RMSE? EDIT: I submitted a new issue for this: https://github.com/open-spaced-repetition/fsrs4anki/issues/498

Expertium commented 11 months ago

Btw, @AnKingMed, I suggest you write down how many reviews, on average, you do per day (for example, write down 7 values for 7 days and then take the average) and your retention, so that you can later compare it to FSRS, to (hopefully) see the benefit with your own eyes.

AnKingMed commented 11 months ago

Btw, @AnKingMed, I suggest you write down how many reviews, on average, you do per day (for example, write down 7 values for 7 days and then take the average) and your retention, so that you can later compare it to FSRS, to (hopefully) see the benefit with your own eyes.

Is 7 days really enough to see a huge difference? Based on the graphs I've shared above, FSRS and SM-2 are almost identical initially

Expertium commented 11 months ago

The more, the better, of course. Also, you've been using Anki a lot, so your number of reviews per day is probably quite stable.

Expertium commented 10 months ago

@AnKingMed once you finish working on your video about FSRS, I would recommend sharing it here (setting the video to "Private" or "Unlisted" on Youtube) to get some feedback before releasing it.

AnKingMed commented 10 months ago

Certainly can do. I won't be able to do a ton of post editing though given my time allowance right now (working 70 hour weeks).

I still don't feel I have a great comparison of the two algorithms. Perhaps you have some decks and could run the comparison?

Expertium commented 10 months ago

Well, it's hard to show this stuff in a way that doesn't feel too abstract. You could show the numbers from the benchmark repo, but people will just be like "oh, this number is lower than the other number... alright". It's too abstract.

One way to make it less abstract is to show the calibration graph (the one you get after running the Google Colab optimizer) for FSRS and for SM-2, to visually demonstrate how much each deviates from theoretical perfection. I will show you two calibration graphs later, tomorrow. But while that is good in my opinion, it may still feel too abstract for the average user: "Oh, so FSRS is better at predicting some probability stuff... alright".

So perhaps the best way to show the superiority of FSRS is with the graphs from the simulator, the ones that show that FSRS gives you fewer reviews for the same level of retention. This is about as "real" as you can get. No abstract math, just "FSRS gives you less work for the same end result".

AnKingMed commented 10 months ago

I agree. My retention is just too high already, so the spread isn't super significant. I bet someone with retention at 0.85-0.90 would get better results on that graph than me.

Expertium commented 10 months ago

So here's how I would make a video, feel free to disagree. I tried to make it short and concise, but it still ended up being pretty long.

"Every honest spaced repetition algorithm must be able to predict the probability of recalling a card (R) at a given point in time, given the card's review history. If an algorithm doesn't do that, it cannot determine what interval is optimal. We can assess the accuracy of a spaced repetition algorithm in the following way: group predictions into "bins" based on predicted R, for example, predictions between 1.0 and 0.9. Then, within each bin, we calculate the average R. Then we calculate the average retention within that bin based on the user's review history. Ideally, they should be the same. In other words, if the algorithm predicts that there is a 95% probability that the user will get each of these cards right, the user should get 95% of them right. If the user got 70% of them right, it means the algorithm predicted the probability poorly." image "On the x axis, we have the predicted probability of recalling a card, on the y axis, we have the measured probability. The orange line represents a theoretically perfect algorithm, and the blue line represents FSRS. It may not look very impressive, but that's just because I have a very high retention in Anki and I do my reviews very diligently, so most datapoints are concentrated within a narrow range. If you zoom in, you will see that within that range, FSRS provides an almost perfect fit." image (note: this means that as long you don't let your cards become overdue and stick to desired retention of 90-97%, FSRS will be extremely accurate for you) "RMSE is a measure of how much the blue line deviates from the orange line, the lower the value, the better. It can be interpreted as "the average difference between predicted and measured R". Now let's look at SM-2, the algorithm that served as the foundation for Anki's default algorithm." image "As you can see, the blue line is not aligned with the orange line at all, and RMSE is much higher. 
Of course, this is just one collection, but thankfully LMSherlock, the creator of FSRS, has benchmarked FSRS on 70+ collections of Anki users. Here are the results:" image "FSRS performs very well, even better than a neural network! (granted, that neural network wasn't fine-tuned for spaced repetition, this is just out-of-the-box performance) And here is a comparison between FSRS and SM-17, one of the latest SuperMemo algorithms. Unlike the previous benchmark - which was conducted using Anki users' data - this one was conducted using SuperMemo users' data." image "These theoretical findings are fascinating, but I'm sure you are wondering what they mean in practice. Let's look at the results from the simulator, which simulates a review history similar to my own:" image "I adjusted Anki settings to roughly match the retention set in FSRS. As you can see, while it's possible to achieve high retention using any of the two algorithms, FSRS will give you 20-30% fewer reviews for the same level of retention! Simply put, you will know just as much while studying less!"

Expertium commented 10 months ago

Here's a link to a version of the optimizer specifically made for testing and analysis: https://colab.research.google.com/drive/1LMtO2iIiVY7iKqOcz-kCuh6spUXLd3t5?usp=sharing

Links to benchmark repos: https://github.com/open-spaced-repetition/fsrs-benchmark https://github.com/open-spaced-repetition/fsrs-vs-sm17

AnKingMed commented 10 months ago

Few questions:

  1. What are LSTM and HLR?
  2. I understand RMSE, but what is RMSE (bins) and log loss? I'm assuming related, but just curious from a technical standpoint so I can explain it better
  3. Out of curiosity, how did @L-M-Sherlock get SM-17 data?
  4. How did you get the graphs with SM-2 RMSE vs FSRS RMSE? Could I do that with my collection? I haven't used FSRS yet.

Expertium commented 10 months ago
  1. LSTM is a type of neural network designed for time-series predictions, and HLR is some Duolingo stuff. I suggest reading the descriptions here https://github.com/open-spaced-repetition/fsrs-benchmark#models.
  2. Ok, this one is kinda confusing. Forget about the RMSE column, RMSE (bins) is what you should actually care about. Here's how it's calculated:

1) Group all predicted values of R into bins. For example, between 1.0 and 0.95, between 0.95 and 0.90, etc.

In the following example, let's group all predictions between 0.8 and 0.9:

Bin 1 (predictions): [0.81, 0.85, 0.87, 0.87, 0.89]

2) For each bin, record the real outcome of a review, either 1 or 0: Again = 0, Hard/Good/Easy = 1. Don't worry, it doesn't mean that whether you pressed Hard, Good, or Easy doesn't affect anything. Grades still matter, just not here.

Bin 1 (real): [0, 1, 1, 1, 1, 1, 1]

3) Calculate the average of all predictions within a bin.

Bin 1 average (predictions) = mean([0.81, 0.85, 0.87, 0.87, 0.89]) = 0.86

4) Calculate the average of all real outcomes.

Bin 1 average (real) = mean([0, 1, 1, 1, 1, 1, 1]) = 0.86

Repeat the above steps for all bins. The choice of the number of bins is arbitrary, I don't remember whether it's 20 or 40 in the benchmark.

5) For each bin, calculate the squared difference between the averages of predicted and measured R, and weigh it by the number of predictions within that bin.

The final formula looks like this:

[image: formula]

The interpretation is "the average distance between predicted R and measured R". Lower = better, since we want algorithmic predictions to match reality. Log loss is... a thing. I cannot give you an intuitive interpretation of it; forget about it. Just don't look at any columns other than RMSE (bins). RMSE and RMSE (bins) sound similar, but they are calculated in entirely different ways.

  3. Sherlock just asked SuperMemo users to contribute to his research.
  4. https://colab.research.google.com/drive/1LMtO2iIiVY7iKqOcz-kCuh6spUXLd3t5?usp=sharing; in section 4.2, the first graph is the calibration graph for FSRS, and in section 4.4, the first graph is the calibration graph for SM-2.
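The RMSE (bins) computation in steps 1-5 above can be sketched in Python. This is an illustrative implementation, not the benchmark's actual code; the bin count and equal-width binning scheme are assumptions:

```python
import math
from collections import defaultdict

def rmse_bins(predictions, outcomes, n_bins=20):
    """Bin-count-weighted calibration RMSE, following steps 1-5 above.
    predictions: predicted recall probabilities in [0, 1].
    outcomes: real review results (Again = 0, Hard/Good/Easy = 1)."""
    bins = defaultdict(lambda: [0.0, 0.0, 0])  # [pred sum, real sum, count]
    for p, y in zip(predictions, outcomes):
        b = min(int(p * n_bins), n_bins - 1)  # equal-width bin index
        bins[b][0] += p
        bins[b][1] += y
        bins[b][2] += 1
    total = len(predictions)
    weighted_sq = sum(
        count * ((pred_sum / count) - (real_sum / count)) ** 2
        for pred_sum, real_sum, count in bins.values()
    )
    return math.sqrt(weighted_sq / total)

# Perfectly calibrated predictions give RMSE (bins) = 0:
print(rmse_bins([0.5, 0.5], [1, 0]))  # → 0.0
```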
AnKingMed commented 10 months ago

Thank you! @L-M-Sherlock what's the short version of the history of how this came to be? (the short version you'd like me to share with everyone). My understanding is you became very interested in it after Anki was very useful for you, you then collected X Anki collections and Y supermemo collections, analyzed them and came up with FSRS?

Expertium commented 10 months ago

No, I believe Sherlock collected Supermemo data much later, when FSRS was already developed. As for Anki collections, I'm curious about that too.