Closed AnKingMed closed 9 months ago
I think it would be helpful to showcase total reviews per day
It's not a good stat for the comparison. Forgetting 100 cards costs more time than remembering 100 cards.
@Expertium @user1823, do you have a suggestion for a comparison to show in the introduction video?
It's not a good stat for the comparison. Forgetting 100 cards costs more time than remembering 100 cards.
True, but most of my audience are medical students. While they probably should care about time, they care about making sure they can finish the deck, and most of them estimate the effectiveness of the algorithm by how many reviews they're doing per day (probably because they know roughly how long a given number of reviews of that specific deck takes; for me, ~250-300 took 1 hour).
Most of these are suspended now, but this is my review data for my first three years of medical school, with 30k notes. This is the deck that 70%+ of all medical students are using, so perhaps this is better for showcasing?
I think it's reasonable to convert the view on SM-2 to a time-based comparison as well (we could just extrapolate both to # of reviews for example's sake after). AnKing Overhaul for Step 1 & 2.apkg.zip
they care about making sure they can finish the deck and most of them estimate effectiveness of the algorithm based on how many reviews they're doing a day
But they would not accept very low retention. If they just learn new cards and ignore reviews, they will finish the deck quickly with fewer reviews per day. So we need to find an uncheatable metric to compare FSRS and SM-2.
I have an idea. We can set the same New cards/day limit for SM-2 and FSRS. Then we can compare the final retention and total time cost.
I think it's reasonable to convert the view on SM-2 to a time-based comparison as well (we could just extrapolate both to # of reviews for example's sake after)
Do you optimize the weights of FSRS based on this deck? Could you provide the weights?
How about this comparison:
FSRS saves 1 - 1.54/1.93 = 20% time per remembered card.
Do you optimize the weights of FSRS based on this deck? Could you provide the weights?
0.4006, 0.821, 2.4086, 17.6495, 4.7694, 0.7974, 0.9321, 0.0455, 1.7141, 0.1025, 1.1687, 2.1956, 0.0757, 0.3464, 1.3312, 0.2808, 2.8996
FSRS saves 1 - 1.54/1.93 = 20% time per remembered card.
How did you calculate this?
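The thread never spells the arithmetic out, but one plausible reading of "time per remembered card" is total study time divided by the number of reviews that ended in successful recall. The helper below is a hypothetical reconstruction; the function name, signature, and units are mine, not the simulator's:

```python
def time_per_remembered_card(total_seconds: float,
                             total_reviews: int,
                             retention: float) -> float:
    """Hypothetical metric: seconds of study spent per successful recall.

    total_reviews * retention approximates the number of reviews the user
    actually remembered; dividing total study time by it rewards algorithms
    that buy the same retention with less work.
    """
    return total_seconds / (total_reviews * retention)

# e.g. 15 minutes of study, 300 reviews at 90% retention
# -> roughly 3.3 seconds per remembered card
print(round(time_per_remembered_card(900, 300, 0.9), 2))
```

Under this reading, "FSRS saves 1 - 1.54/1.93 = 20%" would mean FSRS's cost per remembered card (1.54) is 20% lower than SM-2's (1.93), whatever the exact time unit.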
This file Dermki.apkg.zip in my initial post is what I suggested using because we can set new cards per day and select retention. In that example you need to do 24-25 new cards/day in order to complete the deck in 365 days. The retentions I posted in that first pic are pretty close to accurate (~92% mature, 95% young, 78% learning). The optimized weights I just commented are the weights for that deck.
I think it would make sense to run the FSRS algorithm against the stats I already have there and see what comes up, i.e. FSRS at 92% retention, 25 new cards/day. Would that work?
OK. I can run the comparison based on that deck. But I need the parameters. Or you can provide your timezone and next_day_start_at, and I can run the optimizer for you.
eastern time zone (new york), next day start at is 2. These are the parameters I got: 0.4006, 0.821, 2.4086, 17.6495, 4.7694, 0.7974, 0.9321, 0.0455, 1.7141, 0.1025, 1.1687, 2.1956, 0.0757, 0.3464, 1.3312, 0.2808, 2.8996
I don't think this is a fair comparison, since Anki results in higher retention here. Ideally, we should compare Anki and FSRS when both produce very similar retention.
I agree we should hold the retention the same as a control and use total reviews/day and time/day as the variables
You can run the comparison here: https://github.com/open-spaced-repetition/fsrs4anki/blob/main/fsrs4anki_simulator.ipynb
By the way, if the retention is at the same level, the burden would not differ significantly. Instead, I recommend setting the retention to the suggested value, because a strength of FSRS is that it can find the optimal retention.
You can run the comparison here: https://github.com/open-spaced-repetition/fsrs4anki/blob/main/fsrs4anki_simulator.ipynb
How do I use this?
I figured the number of reviews would be similar, but set at the same retention, FSRS should theoretically yield fewer reviews overall because you're able to do slightly fewer cards and also slightly less re-learning, right?
How do I use this?
Just open the notebook in colab, upload your deck file and replace the parameters with yours. Then click Runtime->Run all to execute the simulation.
Ok. Will it let me fix the retention for both and get the graphs like you shared? What do you think would be the best way to compare these two to convince people that FSRS is more efficient?
It only allows you to fix the retention of FSRS. To change the retention of SM-2, you need to tune the interval modifier.
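As a quick aside on tuning the interval modifier: the Anki manual gives a rule of thumb for picking a modifier that moves you from your currently measured retention to a target retention, log(desired) / log(current). A minimal sketch (the function name is mine, not Anki's, and this assumes a simple exponential forgetting curve):

```python
import math

def interval_modifier(desired_retention: float, current_retention: float) -> float:
    """Interval modifier that approximately shifts SM-2 from the retention
    you currently measure to the retention you want.

    Under an exponential forgetting curve R = exp(k * t), scaling every
    interval by ln(desired) / ln(current) scales forgetting accordingly.
    """
    return math.log(desired_retention) / math.log(current_retention)

# Example: dropping from 95% measured retention to a 90% target
# requires intervals roughly 2x as long.
print(round(interval_modifier(0.90, 0.95), 2))  # -> 2.05
```

This also explains why a surprisingly large modifier is needed later in the thread: going from ~95% down to 90% already calls for roughly doubling every interval.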
I think this actually illustrates it pretty well. For the same number of reviews and time spent reviewing, the retention is higher with FSRS. Am I interpreting that correctly?
This doesn't match perfectly, but I tried to match the retention rates, and it shows a pretty decent reduction in time and reviews per day with FSRS
Yeah, I think that's good. It shows that with FSRS you can achieve the same retention while doing 20-30% fewer reviews. Also, you can increase this setting to make the curves less noisy and see the trend more clearly.
yeah the retention one is less ideal. For the same number of reviews it's a ~2-3% increase in retention? I think the second set of graphs is probably better for convincing people FSRS is a better algorithm, don't you think?
Increasing retention is expensive, particularly when your previous retention is already very high.
That's a good point. I suppose this would be quite a bit more dramatic if I adjusted things. Thanks for the thoughts!
Yeah, the number of reviews you have to do increases non-linearly with retention. The closer retention is to 1.0, the more changing it will affect your workload.
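To make the non-linearity concrete, here is a toy model of my own (not FSRS's actual scheduling). Assuming a simple exponential forgetting curve, the interval that hits a target retention R scales with ln(R), so reviews per day scale with 1 / ln(R). This ignores lapses and re-learning, so it understates the real cost near 100%:

```python
import math

def relative_workload(retention: float, base: float = 0.90) -> float:
    """Toy model: with exponential forgetting R = exp(k * t), the interval
    needed to hit a target retention is proportional to ln(R), and the
    number of reviews per day scales like 1 / interval.

    Normalized so that retention = `base` costs 1.0.
    """
    return math.log(base) / math.log(retention)

for r in (0.85, 0.90, 0.95, 0.99):
    print(f"retention {r:.2f}: ~{relative_workload(r):.2f}x the reviews")
```

Under these assumptions, 95% costs about twice as many reviews as 90%, and 99% costs roughly ten times as many, which is the non-linear blow-up being described.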
If I want to show an example at .85 retention with my deck, should I use my parameters from the optimizer? Or just use the default FSRS parameters? I'm not sure I understand how those adjust things
Good question. I'd say the default parameters, because they are more representative of an average user's memory.
I keep getting this error with this deck. Any idea why?
ValueError                                Traceback (most recent call last)
in <cell line: 106>()
    231     card.iat[idx, field_map["lapses"]] = 0
    232
--> 233     r, t, p, new_states = student.init()
    234     new_stability = float(new_states[0])
    235     new_difficulty = float(new_states[1])

1 frames
in generate_rating(review_type)
     91 def generate_rating(review_type):
     92     if review_type == "new":
---> 93         return np.random.choice([1, 2, 3, 4], p=first_rating_prob)
     94     elif review_type == "recall":
     95         return np.random.choice([2, 3, 4], p=review_rating_prob)

mtrand.pyx in numpy.random.mtrand.RandomState.choice()

ValueError: probabilities do not sum to 1
It's a bug. I have fixed it in this version: https://github.com/open-spaced-repetition/fsrs4anki/blob/v4.8.2/fsrs4anki_simulator.ipynb
thank you! Is there a way to run the simulator without uploading a file? Or does it need to use some data from it? My retention on all my decks is already pretty darn high, so the difference when comparing the two is not a ton
It needs your data to estimate these things. Actually, now that I think about one of your previous questions, maybe it's better to compare Anki and FSRS using your parameters and not the default parameters, since it doesn't make a lot of sense to use default parameters while using your values of the stuff above.
It's a bug. I have fixed it in this version: https://github.com/open-spaced-repetition/fsrs4anki/blob/v4.8.2/fsrs4anki_simulator.ipynb
@L-M-Sherlock it looks like the optimizer on hugging face is having the same error just FYI
it looks like the optimizer on hugging face is having the same error just FYI
Thanks for the reminder, I have updated the optimizer on hugging face just now.
Ok, I'm a bit confused now. I ran the comparison at 0.95, which is about where I'm at normally on Anki, and it looked decent. It showed that I'd do fewer cards with FSRS most of the time.
Then I switched retention to 0.9 and increased the interval modifier to 2.5 on the Anki settings. I didn't change anything else. Now it shows reviews being about the same and Anki retention actually being higher. I had to increase the interval modifier a huge amount just to decrease my retention that much. Is there something else I should do for this?
My settings:
# parameters for FSRS
w = [1.6587, 2.4185, 4.2583, 65.4933, 4.353, 2.3728, 2.8213, 0.0001, 2.654, 0.2041, 1.9349, 2.0306, 0.1588, 0.601, 1.0708, 0.0, 3.3191]
requestRetention = 0.90 # recommended setting: 0.8 ~ 0.9
# parameters for Anki
graduatingInterval = 3
easyInterval = 4
easyBonus = 1.5
hardInterval = 1.2
intervalModifier = 2.5
newInterval = 0.2
minimumInterval = 1
leechThreshold = 4
leechSuspend = False
# common parameters
maximumInterval = 36500
new_cards_limits = 45
review_limits = 9999
max_time_limts = 10000
learn_days = 550
deck_size = 25000
# get the true time from review logs
filename = "AnKing Step 1.apkg"
# smooth curves
moving_average_period = 30
# Set it to True if you don't want the optimizer to use the review logs from suspended cards.
filter_out_suspended_cards = False
# Red: 1, Orange: 2, Green: 3, Blue: 4, Pink: 5, Turquoise: 6, Purple: 7
# Set it to [1, 2] if you don't want the optimizer to use the review logs from cards with red or orange flag.
filter_out_flags = []
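For context on why the modifier change was so dramatic: in Anki's SM-2-style scheduler, the interval modifier multiplies every review interval, so setting it to 2.5 stretches all intervals by 2.5x. A simplified sketch of the "Good" answer case (this omits fuzz, the Hard/Easy branches, and overdue bonuses, so it is an illustration rather than Anki's exact code):

```python
def next_interval(prev_interval: int,
                  ease_factor: float,
                  interval_modifier: float = 1.0,
                  easy_bonus: float = 1.0,
                  maximum_interval: int = 36500) -> int:
    """Simplified SM-2 'Good' scheduling: the new interval is the old one
    times the card's ease, scaled by the global interval modifier
    (and an easy bonus when Easy is pressed), capped at the maximum.
    """
    interval = prev_interval * ease_factor * interval_modifier * easy_bonus
    return min(round(interval), maximum_interval)

# With intervalModifier = 2.5, a 10-day interval at ease 2.5 jumps
# from 25 days to ~62 days, which is why measured retention drops.
print(next_interval(10, 2.5), next_interval(10, 2.5, 2.5))
```

Because the modifier compounds on every review, even modest values shift retention substantially over a card's lifetime.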
Sorry again for all the questions. Just trying to make this really simple for the public so they'll adopt it
Another separate question - there is a reasonable group of people that do not believe in using the 'hard' or 'easy' buttons, but as I'm learning more about FSRS, it seems it's almost necessary if you want the full benefit of the algorithm. Is that true?
I wonder if at some point it'd be possible to learn how quickly someone answers things and then use that to automatically apply which button should be used... not sure how realistic that is
it seems it's almost necessary if you want the full benefit of the algorithm. Is that true?
In theory, users who don't use all four buttons won't be able to utilize the entire range of values of D. In practice, however, we haven't investigated whether that affects accuracy. @L-M-Sherlock this isn't the first time I've said this. Perhaps you should open a new issue, specifically to investigate the relationship between how often users use certain buttons and RMSE? EDIT: I submitted a new issue for this: https://github.com/open-spaced-repetition/fsrs4anki/issues/498
Btw, @AnKingMed, I suggest you write down how many reviews, on average, you do per day (for example, write down 7 values for 7 days and then take the average) and your retention, so that you can later compare it to FSRS, to (hopefully) see the benefit with your own eyes.
Btw, @AnKingMed, I suggest you write down how many reviews, on average, you do per day (for example, write down 7 values for 7 days and then take the average) and your retention, so that you can later compare it to FSRS, to (hopefully) see the benefit with your own eyes.
Is 7 days really enough to see a huge difference? Based on the graphs I've shared above, FSRS and SM-2 are almost identical initially
The more, the better, of course. Also, you've been using Anki a lot, so your number of reviews per day is probably quite stable.
@AnKingMed once you finish working on your video about FSRS, I would recommend sharing it here (setting the video to "Private" or "Unlisted" on Youtube) to get some feedback before releasing it.
Certainly can do. I won't be able to do a ton of post editing though given my time allowance right now (working 70 hour weeks).
I still don't feel I have a great comparison of the two algorithms. Perhaps you have some decks and could run the comparison?
Well, it's hard to show this stuff in a way that doesn't feel too abstract. You could show the numbers from the benchmark repo, but people will just be like "oh, this number is lower than the other number... alright". It's too abstract.

One way to make it less abstract is to show the calibration graph (the one you get after running the Google Colab optimizer) for FSRS and for SM-2, to visually demonstrate how much FSRS deviates from theoretical perfection and how much SM-2 does. I will show you two calibration graphs later, tomorrow. But while that is good in my opinion, it may still feel too abstract for the average user. "Oh, so FSRS is better at predicting some probability stuff... alright".

So perhaps the best way to show the superiority of FSRS is by showing the graphs from the simulator, the ones that show that FSRS gives you fewer reviews for the same level of retention. This is about as "real" as you can get. No abstract math, just "FSRS gives you less work for the same end result".
I agree. My retention is just too high already, so the spread isn't super significant. I bet someone with retention at 0.85-0.90 would get better results on that graph than me
So here's how I would make a video, feel free to disagree. I tried to make it short and concise, but it still ended up being pretty long.
"Every honest spaced repetition algorithm must be able to predict the probability of recalling a card (R) at a given point in time, given the card's review history. If an algorithm doesn't do that, it cannot determine which interval is optimal. We can assess the accuracy of a spaced repetition algorithm in the following way: group predictions into "bins" based on predicted R, for example, all predictions between 1.0 and 0.9. Then, within each bin, we calculate the average predicted R and the average retention measured from the user's review history. Ideally, they should be the same. In other words, if the algorithm predicts that there is a 95% probability that the user will get each of these cards right, the user should get 95% of them right. If the user got 70% of them right, the algorithm predicted the probability poorly."

"On the x axis, we have the predicted probability of recalling a card; on the y axis, we have the measured probability. The orange line represents a theoretically perfect algorithm, and the blue line represents FSRS. It may not look very impressive, but that's just because I have very high retention in Anki and do my reviews very diligently, so most datapoints are concentrated within a narrow range. If you zoom in, you will see that within that range, FSRS provides an almost perfect fit." (note: this means that as long as you don't let your cards become overdue and stick to a desired retention of 90-97%, FSRS will be extremely accurate for you)

"RMSE is a measure of how much the blue line deviates from the orange line; the lower the value, the better. It can be interpreted as "the average difference between predicted and measured R". Now let's look at SM-2, the algorithm that served as the foundation for Anki's default algorithm."

"As you can see, the blue line is not aligned with the orange line at all, and RMSE is much higher. Of course, this is just one collection, but thankfully LMSherlock, the creator of FSRS, has benchmarked FSRS on 70+ collections of Anki users. Here are the results:"

"FSRS performs very well, even better than a neural network! (granted, that neural network wasn't fine-tuned for spaced repetition; this is just out-of-the-box performance) And here is a comparison between FSRS and SM-17, one of the latest SuperMemo algorithms. Unlike the previous benchmark, which was conducted using Anki users' data, this one was conducted using SuperMemo users' data."

"These theoretical findings are fascinating, but I'm sure you are wondering what they mean in practice. Let's look at the results from the simulator, which simulates a review history similar to my own:"

"I adjusted Anki settings to roughly match the retention set in FSRS. As you can see, while it's possible to achieve high retention with either algorithm, FSRS will give you 20-30% fewer reviews for the same level of retention! Simply put, you will know just as much while studying less!"
Here's a link to a version of the optimizer specifically made for testing and analysis: https://colab.research.google.com/drive/1LMtO2iIiVY7iKqOcz-kCuh6spUXLd3t5?usp=sharing
Links to benchmark repos: https://github.com/open-spaced-repetition/fsrs-benchmark https://github.com/open-spaced-repetition/fsrs-vs-sm17
A few questions: what are LSTM and HLR?

1) Group all predicted values of R into bins. For example, between 1.0 and 0.95, between 0.95 and 0.90, etc.
In the following example, let's group all predictions between 0.8 and 0.9:
Bin 1 (predictions): [0.81, 0.85, 0.87, 0.87, 0.89]
2) For each bin, record the real outcome of a review, either 1 or 0. Again = 0. Hard/Good/Easy = 1. Don't worry, it doesn't mean that whether you pressed Hard, Good, or Easy doesn't affect anything. Grades still matter, just not here.
Bin 1 (real): [0, 1, 1, 1, 1, 1, 1]
3) Calculate the average of all predictions within a bin.
Bin 1 average (predictions) = mean([0.81, 0.85, 0.87, 0.87, 0.89]) = 0.86
4) Calculate the average of all real outcomes.
Bin 1 average (real) = mean([0, 1, 1, 1, 1, 1, 1]) = 0.86
Repeat the above steps for all bins. The number of bins is an arbitrary choice; I don't remember whether it's 20 or 40 in the benchmark.
5) For each bin calculate the squared difference between the averages of predicted and measured R and weigh it by the number of predictions within that bin.
The final formula looks like this:

RMSE(bins) = sqrt( Σ_b n_b * (avg_pred_b − avg_real_b)² / Σ_b n_b )

where n_b is the number of predictions in bin b. The interpretation is "average distance between predicted R and measured R". Lower = better, since we want algorithmic predictions to match reality. Log loss is... a thing. I cannot give you an intuitive interpretation of it; forget about it. Just don't look at any columns other than RMSE (bins). RMSE and RMSE (bins) sound similar, but they are calculated in entirely different ways.
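The steps above can be sketched directly in Python. This is my own illustrative implementation, not the benchmark's code; the bin count and the equal-width binning scheme are assumptions:

```python
import math
from collections import defaultdict

def rmse_bins(predictions, outcomes, n_bins=20):
    """RMSE (bins), following the steps above:
    1) bin reviews by predicted R (equal-width bins assumed here),
    2) record the real outcome of each review as 0 or 1,
    3-4) average the predictions and the outcomes within each bin,
    5) take the squared difference of the two averages, weighted
       by bin size, then the square root of the weighted mean.
    """
    bins = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        # clamp so that a prediction of exactly 1.0 lands in the top bin
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    weighted_sq_error = 0.0
    for reviews in bins.values():
        avg_pred = sum(p for p, _ in reviews) / len(reviews)
        avg_real = sum(y for _, y in reviews) / len(reviews)
        weighted_sq_error += len(reviews) * (avg_pred - avg_real) ** 2
    return math.sqrt(weighted_sq_error / len(predictions))

# A perfectly calibrated toy example: predicted 0.9, and the user
# indeed recalled 9 out of 10 -> error is 0.
print(rmse_bins([0.9] * 10, [1] * 9 + [0]))  # -> 0.0
```

Note how, as described above, predicting 0.8 for cards the user always recalls yields an error of 0.2, even though every individual prediction was "close".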
Thank you! @L-M-Sherlock what's the short version of the history of how this came to be? (the short version you'd like me to share with everyone). My understanding is you became very interested in it after Anki was very useful for you, you then collected X Anki collections and Y supermemo collections, analyzed them and came up with FSRS?
No, I believe Sherlock collected Supermemo data much later, when FSRS was already developed. As for Anki collections, I'm curious about that too.
As requested, submitting an issue as a reminder for you. I'd appreciate help with comparing SM-2 to FSRS. I can use the Anki Simulator add-on based on my current stats. I think it would be helpful to attempt to showcase that vs FSRS. Image below with the stats I was using for the simulator add-on. Essentially, I'm trying to complete the deck (attached below with scheduling) in 1 year. I think it would be helpful to showcase total reviews per day at 85% retention vs 90% vs 95%. It would also probably be helpful (but less so) to show trying to complete the deck in 6 months vs 1 year vs 2 years, to show how the different algorithms would affect things.
Thank you for your help!!
Dermki.apkg.zip