Large backlog after update to FSRS v4.5

user1823 commented 11 months ago

I updated to Anki 23.12.1 and reoptimized my FSRS parameters. It gave me the following parameters:

1.2469, 3.8350, 22.0402, 45.3525, 4.7622, 1.4381, 1.5550, 0.0088, 1.7277, 0.1756, 1.0779, 1.8094, 0.1014, 0.5958, 0.5417, 0.0004, 4.0000
Log loss: 0.2120, RMSE(bins): 1.19%.

(The actual value of w[3] was 35 but I increased it to 45 manually. This decreased the log loss and RMSE. Anyway, it doesn't explain the following observations.)

Then, I rescheduled all my cards and got a backlog of 2300+ cards!

Future Due Graph:

Then, I decided to download Anki 23.10.1 and see what parameters it produces. It gave me the following parameters:

1.2452, 1.8249, 19.2423, 61.2210, 4.7892, 1.3961, 1.9008, 0.0193, 1.7643, 0.1371, 1.1227, 1.9802, 0.1377, 0.6087, 0.3046, 0.0093, 6.1882
Log loss: 0.2127, RMSE(bins): 1.31%.

Then, I rescheduled all my cards and only about 830 cards were due (out of which 400 were already due, even before I started doing any of the above).

Future Due Graph:

So, FSRS v4.5 decreased the log loss and RMSE. However, it gave me a huge backlog (FSRS has never given me such a huge backlog till now; even when I switched to FSRS from SM-2, it gave me a backlog of about 900 cards.)

Also, my true retention has not been very different from my desired retention (0.94).

So, is it possible that there is some issue with the new algorithm?

If you need my deck, here it is: Test.zip (change file extension to .apkg)

L-M-Sherlock commented 11 months ago

My guess is that the forgetting curve's shape is changed in FSRS-4.5. The new curve is sharper for R > 90% and flatter for R < 90%. In your case, the desired retention is 94%, which is higher than 90%, so FSRS-4.5 will schedule a shorter interval than before.

user1823 commented 11 months ago

But, if the true retention was close to 0.94 earlier, the new curve should calculate a higher stability than before.

So, the effects should cancel each other and the intervals should be roughly the same.

L-M-Sherlock commented 11 months ago

The actual value of w[3] was 35 but I increased it to 45 manually.

What about tuning it to 60? And did you press easy frequently? The w[16] seems to reach its upper limit.

user1823 commented 11 months ago

And did you press easy frequently?

At first ratings, I have used Easy in about 300 cards.
In the learning phase, I have used it about 500 times.
In young cards, 7 times.
In mature cards, never used Easy.

The total number of reviews is greater than 100k. So, I don't think that this explains the situation.

What about tuning it to 60?

Changing the w[3] to 60 and w[16] to 7 decreased the total number of due cards by only about 10 cards.

user1823 commented 11 months ago

I have a strong feeling that the new algorithm is less accurate for me.

As I said earlier, re-optimizing the parameters with the 23.10.1 also gave me a backlog, though not as big as 23.12.1 gave me. So, I decided to work through the backlog given by reoptimizing with 23.10.1. As I was working through the backlog, I noted that I was getting almost every card correct (with Review Sort Order set to "Relative Overdueness"). So, this meant that the new parameters were suboptimal.

After doing >200 reviews, I optimized again in 23.10.1 and my due count fell from 1000 to 100. However, reoptimizing with 23.12.1 is still giving me 2000+ due cards. So, I think that it is less accurate for me. As for the log loss and RMSE, they change so much between the optimizations that I don't think we should rely on them for my collection.

Metrics (all evaluations done on today's collection):

Using v4 parameters obtained day before yesterday: Log loss: 0.2115, RMSE(bins): 1.10%
Using v4 parameters obtained today: Log loss: 0.2146, RMSE(bins): 1.63%
Using v4.5 parameters obtained today: Log loss: 0.2120, RMSE(bins): 1.23%.

L-M-Sherlock commented 11 months ago

According my analysis of the benchmark results, FSRS-4.5 is more accurate than FSRSv4 for 81.7% collections.

from pathlib import Path
import json
import numpy as np

model = "FSRSv4"
metric = "RMSE(bins)"
m1 = []

result_dir = Path(f"./result/{model}")
result_files = list(result_dir.glob("*.json"))
result_files.sort(key=lambda x: int(x.stem), reverse=False)
for result_file in result_files:
    with open(result_file, "r") as f:
        result = json.load(f)
        m1.append(result[model][metric])

print(np.mean(m1))
model = "FSRS-4.5"
metric = "RMSE(bins)"
m2 = []

result_dir = Path(f"./result/{model}")
result_files = list(result_dir.glob("*.json"))
result_files.sort(key=lambda x: int(x.stem), reverse=False)
for result_file in result_files:
    with open(result_file, "r") as f:
        result = json.load(f)
        m2.append(result[model][metric])

print(np.mean(m2))
better = 0

for (x, y) in zip(m1, m2):
    if y < x:
        better += 1

print(better)
print(len(m1))
print(better / len(m1))

user1823 commented 11 months ago

Then, maybe some other change like https://github.com/open-spaced-repetition/fsrs-rs/commit/a9cc36a207e8861a4b7a383b9d3fae4b9d74c2b8 is the cause.

L-M-Sherlock commented 11 months ago

In the comparison between FSRS-rs and FSRS v4, the percentage is 71.3%.

user1823 commented 11 months ago

I think that you need to analyse things more deeply in order to find the issue. If you don't have enough time to do so now, it's fine. But, it would be great if you perform a proper analysis whenever you have the time.

For now, I am sticking to FSRS v4.

L-M-Sherlock commented 11 months ago

Do you try the python optimizer? Is it only related to the Rust optimizer?

user1823 commented 11 months ago

Py Optimizer v4.19.2	Py Optimizer v4.20.4
w = 1.2187, 1.8588, 18.5804, 65.7669, 4.3881, 1.7984, 2.0913, 0.0, 1.7866, 0.1608, 1.2135, 1.4601, 0.1772, 0.6982, 0.0114, 0.0, 4.0	w = 1.2109, 2.0110, 21.5144, 35.2091, 4.4859, 1.7484, 2.1434, 0.0, 1.8002, 0.1564, 1.1763, 1.3683, 0.1759, 0.7184, 0.0118, 0.0, 4.0
Log loss: 0.2110, RMSE(bins): 0.99%.	Log loss: 0.2110, RMSE(bins): 0.94%
Loss after training: 0.2119, RMSE: 0.0123	Loss after training: 0.2119, RMSE: 0.0115
Due = 1363 cards	Due = 1388 cards

So, they are quite similar to each other, but in between the Rust Optimizer in 23.12.1 and 23.10.1

I am not happy with these results either. The reason: As I mentioned above, when I started completing the backlog given by the optimizer, I was getting almost all cards correct. So, I think that I should not have any backlog (or a very small backlog).

In the above table, the third row contains metrics calculated by Anki and the fourth row contains the metrics calculated by Py optimizer for the same weights.

By the way, I think that the problem with the Python optimizer is that it always produces w[7] = 0 for my collection. In contrast, the Rust optimizer gives a small but non-zero value (e.g. 0.0193, 0.0088, etc.). This means that with the parameters given by the Python optimizer, the difficulty of my cards can NEVER decrease (because I don't use Easy for review cards). So, if this is fixed, I guess that the Python optimizer would work fine for me.

As an experiment, in the parameters given by Py Optimizer v4.19.2, I replaced w[7] by 0.0193 and then rescheduled. By doing this, the due count decreased from 1363 cards to 559 cards.

Note: All the testing that I did with the Python optimizer was in Anki 23.10.1. So, there can be some inaccuracy in the number of due cards with parameters obtained using the Py v4.20.4 optimizer. But the inaccuracy won't be too large that I should reinstall 23.12.1 just to check the number of due cards.

user1823 commented 11 months ago

I guess that I have found the issue.

In some cards, I rated Again in a Filtered deck (with rescheduling) only a few days after I rated them Good. In my opinion, the Good rating was because of interference from other related cards that I reviewed on the same day or adjacent days. This seems to have confused FSRS even though such ratings were present in only 8 cards.

You can find such cards in the deck file shared in the first post of this issue by using the following search query in the Browser:

cid:1700673098704,1691514827314,1689521450583,1684167598943,1672238572726,1661483066910,1661483683057,1664457812204

I used the following in the Anki Debug Console to delete those unexpected Good ratings:

mw.col.db.execute("DELETE from revlog where cid = 1700673098704 and id > 1701801000000 and id < 1702233000000")
mw.col.db.execute("DELETE from revlog where cid = 1691514827314 and id > 1697740200000 and id < 1699122600000")
mw.col.db.execute("DELETE from revlog where cid = 1689521450583 and id > 1701369000000 and id < 1701714600000")
mw.col.db.execute("DELETE from revlog where cid = 1684167598943 and id > 1690223400000 and id < 1690569000000")
mw.col.db.execute("DELETE from revlog where cid = 1672238572726 and id > 1690223400000 and id < 1691173800000")
mw.col.db.execute("DELETE from revlog where cid = 1661483066910 and id > 1692729000000 and id < 1693247400000")
mw.col.db.execute("DELETE from revlog where cid = 1661483683057 and id > 1688149800000 and id < 1688495400000")
mw.col.db.execute("DELETE from revlog where cid = 1664457812204 and id > 1689359400000 and id < 1689791400000")

Then, on re-optimizing and rescheduling, Anki 23.12.1 gave me 597 due cards which is much better than the previous 2300+.

L-M-Sherlock commented 11 months ago

Thanks for the report. Sounds like the optimizer is sensitive to these reviews? Did you forget them completely? Maybe the better solution is just burying them instead of pressing again.

user1823 commented 11 months ago

Did you forget them completely? Maybe the better solution is just burying them instead of pressing again.

Yes, I forgot them. Also, they were not due. I reviewed them in a filtered deck, just to tell Anki that I have forgotten them.

L-M-Sherlock commented 11 months ago

By the way, according my recent analysis, it's not a good idea to re-optimizer frequently.

https://github.com/open-spaced-repetition/fsrs-benchmark/issues/25

Based on the experiments, if we re-optimize every 2000 reviews, the new parameters are better than the old one in 71% cases. But if we do that every 1000 reviews, the percentage drops down to 63%.

user1823 commented 11 months ago

By the way, I have performed the above-mentioned deletion of revlogs in my main profile as well as my test profile.

Optimization on my main profile still gives me 2000+ cards. Parameters: 1.2517, 3.9538, 22.1976, 35.3524, 4.9177, 1.4497, 1.5342, 0.0114, 1.8261, 0.1589, 1.1108, 1.6823, 0.1184, 0.6283, 0.4755, 0.0207, 4.0000

Optimization on my test profile (which contains yesterday's collection + above-mentioned change) gives me 587 cards when applied to my main profile. Parameters: 1.2502, 3.9521, 22.2139, 35.3634, 5.1795, 1.3051, 1.3681, 0.0108, 1.7835, 0.1059, 1.0901, 2.1143, 0.1215, 0.5364, 0.3711, 0.0060, 4.0000

So, it is consistent with your observation that it is not a good idea to re-optimize frequently. But, it also seems to be a serious issue.

L-M-Sherlock commented 11 months ago

If we re-optimize for every 4000 reviews, the new parameters are better in 85% cases.

zesky18 commented 11 months ago

If we re-optimize for every 4000 reviews, the new parameters are better in 85% cases.

If frequent optimization can lead to adverse effect on user's parameter, shouldn't FSRS then prevent user for doing this, similar to how there is a 1000 reviews limit for the initial optimization?

Vilhelm-Ian commented 11 months ago

That's an anki decision. Not an issue with the algorithm

open-spaced-repetition / fsrs4anki

Large backlog after update to FSRS v4.5 #572