open-spaced-repetition / fsrs-optimizer

FSRS Optimizer Package
https://pypi.org/project/FSRS-Optimizer/
BSD 3-Clause "New" or "Revised" License

[BUG] Exaggerated ranges after optimizing #5

Closed JeffersonCorreia32a closed 1 year ago

JeffersonCorreia32a commented 1 year ago


Question

Dear all,

I would like to understand why my deck is displaying such absurd values. The optimizer is providing parameters that deviate significantly from the original values. When I optimize the deck, the "w" values increase to an unreasonable extent, causing the application of these values to result in absurdly long intervals between card reviews. I don't understand the underlying problem.

The parameters I used were as follows:

Next Day Starts at (Step 2): 12
Timezone (Step 3.1): America/Bahia
Filter out suspended cards: checked
Advanced Settings (Step 3.2): 0.9
Revlog Start Date (Optimize review logs after this date): 2006-10-05

Example original "w" values: If I keep the original "w" as follows: [0.4, 0.6, 2.4, 5.8, 4.93, 0.94, 0.86, 0.01, 1.49, 0.14, 0.94, 2.18, 0.05, 0.34, 1.26, 0.29, 2.61], a new card will have a Good interval of 2 days and an Easy interval of 7 days.

Example optimized "w" values: If I use the optimized "w" for my deck as follows: [1.46, 4.9, 8.82, 10.08, 4.8994, 0.8604, 0.8373, 0.007, 1.5252, 0.1, 0.9738, 2.2595, 0.0204, 0.3773, 1.5048, 0.3335, 2.3037], a new card will have a Good interval of 10 days and an Easy interval of 12 days.

A good interval of 10 days for a new card seems clearly absurd...

I would like to know if others are experiencing such significant deviations from the original parameters or if it's just affecting my deck. This might help identify the issue.

This problem has been occurring for me since version 3, but it seems to have worsened recently.

I believe it would be helpful to find a way to maintain these parameters within an acceptable standard deviation for a human. In my opinion, it's unrealistic for a human to fully memorize content from today to 10 days in the future after seeing it only once. Perhaps, if the optimizer could set a maximum good interval of 3 or 4 days within a standard deviation, it would be more reasonable.

These discrepancies persist in older reviews, resulting in significantly increased intervals. Additionally, there is a substantial difference between the "good" and "easy" buttons. For example, one card had a good interval of 29 days while the easy interval was 2.6 months, representing a 250% difference.

I have included my complete deck, including media, for your evaluation. Thank you in advance for any effort made to help solve this case.

0PROJETO CONCURSO 20mil+26-07-2023 com midia.zip

L-M-Sherlock commented 1 year ago

A good interval of 10 days for a new card seems clearly absurd...

It doesn't. Some users have 30+ days for a new card when they press easy.

I checked your collection. The first forgetting curve when you press good is:

image

This graph shows that your retention is near 98% in 5 days.

The first forgetting curve when you press easy is:

image

This graph shows that your retention is higher than 90% even when the interval is 9 days.

JeffersonCorreia32a commented 1 year ago

A good interval of 10 days for a new card seems clearly absurd...

It doesn't. Some users have 30+ days for a new card when they press easy.

I checked your collection. The first forgetting curve when you press good is:

image

This graph shows that your retention is near 98% in 5 days.

The first forgetting curve when you press easy is:

image

This graph shows that your retention is higher than 90% even when the interval is 9 days.

I really don't think a 30-day interval for the first review is reasonable. It has to be one of two things: either this retention parameter is biased or the cards are too easy. Formerly, there was talk of the first review at 24 hours and the second at 7 days. A first review of 30 days is a lot, in my opinion.

My cards are very heterogeneous and there are very difficult and very easy cards.

I'm going to try selecting the cards I find easy and temporarily suspending them, then redo the optimization using only the cards I consider medium or difficult, because I don't want those cards to be harmed by the easy ones. I'd rather spend extra time and actually get the expected retention than have a false feeling that I am retaining part of the content.

I would like the opinion of several users on this issue.

L-M-Sherlock commented 1 year ago

I really don't think a 30-day interval for the first review is reasonable. It has to be one of two things: either this retention parameter is biased or the cards are too easy.

image

This forgetting curve is from my collection.

image

This forgetting curve is from @Expertium. His initial stability for easy is 200+ days.

user1823 commented 1 year ago

@JeffersonCorreia32a, the initial stability for Good (the third value in w) in my collection is 13 days. Initially, I also thought that it was too large.

But, later while analyzing the data, I found that I forget only about 2% of the cards if I review them after 2-4 days. So, it makes sense that the stability would be around 13 days. (Remember that stability is defined as the time when the retrievability falls to 90%.)

Not only that, I could quite comfortably recall most of my cards when they came up for review. So, FSRS is right (at least for my collection) in saying that the first review interval should not be as small as users usually think. Also, it's not that I am learning something that is too easy.

A suggestion: The requestRetention option in the scheduler code is there to control how much fraction of the cards you retain when FSRS shows the cards to you. So, if you think that you are forgetting too much, you might want to increase the requestRetention.
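
For reference, here is a minimal sketch of the relationship described above, assuming the FSRS v4 power forgetting curve R(t, S) = (1 + t / (9 * S)) ** -1 used by the optimizer; the function names below are illustrative, not the package's API:

    # Sketch of the FSRS v4 relation between stability, retrievability and
    # requestRetention (illustrative helper names, not the package's API).
    def retrievability(t: float, stability: float) -> float:
        """Predicted recall probability after t days, given stability in days."""
        return (1 + t / (9 * stability)) ** -1

    def next_interval(stability: float, request_retention: float) -> float:
        """Interval (days) at which retrievability falls to request_retention."""
        return 9 * stability * (1 / request_retention - 1)

    # With the 13-day initial stability for Good mentioned above:
    print(retrievability(3, 13))    # ~0.97, i.e. only a few percent forgotten after 3 days
    print(next_interval(13, 0.90))  # 13.0, at requestRetention 0.9 the interval equals the stability
    print(next_interval(13, 0.95))  # ~6.2, raising requestRetention shortens every interval

So raising requestRetention is the intended lever if you feel you are forgetting too much; it shortens the intervals without touching the fitted parameters.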

Expertium commented 1 year ago

This forgetting curve is from @Expertium. His initial stability for easy is 200+ days.

In my case, that's because I already know some material that I have encountered outside of Anki.

galantra commented 1 year ago

More reasons:

  1. Do not neglect the basics. Memorizing seemingly obvious things is not a waste of time!

  2. A deck was created by someone else (or even by a program). E.g., a comprehensive language deck or Ultimate Geography.
  3. Generous use of cloze deletion (especially now that nested clozes are possible).
  4. Transfer of learning. E.g., you are learning two related languages at the same time.
  5. People use SRS for years and years, such that long initial intervals can be reasonable (for less urgent material).

After all, it's one of the goals of the 20 rules of formulating knowledge to make cards easy.

JeffersonCorreia32a commented 1 year ago

I understand your concerns, and I also want FSRS4 to be as efficient as possible. I don't want to waste time unnecessarily reviewing something. However, I am quite worried about the significant difference between the original Anki algorithm and the FSRS4 configuration, and I'm unsure if the FSRS4 algorithm is suitable for me.

Let me clarify that my issue is not having long intervals after multiple reviews. My main concern is whether, after these large interval jumps since the initial learning of a new card, I will genuinely retain the content in my memory when I need it for an exam.

Currently, my learning step is only 20m. For example, a new card's schedule looks like this: Again button (20m), Hard button (30 min), Good button (10 days), Easy button (11 days). @Expertium, so in your case the first Good interval on a new card would be 200 days? If so, that's impressive! Just like the 30 days for the first click on the Good button, which I also find impressive, @L-M-Sherlock.

I will test continuing with the recommended intervals from the optimized "w" to see what happens. But let me ask you something. What learning steps do you use? I have been using only 20m because the first review scheduled by FSRS is 10 days, and the maximum recommended learning step is 1d. I am considering trying 20m, 1d. Additionally, I feel that the 24-hour review is important. Please share your thoughts and the learning steps you use.

To enrich the topic further, here's an example of a card I use that I believe is not easy to fully retain in memory when having long intervals without reviewing it with a shorter spacing from the beginning: Question: What are the fundamental objectives of the Federative Republic of Brazil, according to Article 3 of the Federal Constitution?

Answer: I - To build a free, just, and solidarity-based society; II - To ensure national development; III - To eradicate poverty and marginalization and reduce social and regional inequalities; IV - To promote the well-being of all, without prejudice to origin, race, gender, color, age, and any other forms of discrimination.

Of course, I have shorter cards with briefer answers since I always try to follow the recommendations for creating a card. However, there are cards where the intention is precisely to learn a block of content together, as it's not possible to separate it like in this example. If I do separate it, the information will be spaced out in my mind, making it harder to connect the parts that are actually interconnected.

Note: I also believe there should be a standard recommendation for learning steps, as FSRS4 already imposes its interval deadlines. Providing recommended learning steps would help people avoid uncertainty in choosing the right steps to use in conjunction with FSRS4. Simply stating that the maximum learning step is one day is too vague and does not provide enough confidence in selecting the appropriate step to use with FSRS4.

Note: Please don't forget to explain how you currently set up your learning steps.

AnyonicFugue commented 1 year ago

I also encountered the problem of exaggerated ranges, but much more severe. Here is the export of my learning data: Anki Data.zip

The V4 optimizer tells me the optimal parameters are [23.47, 30.0, 22.37, 23.47, 4.4425, 0.5031, 0.6348, 0.2575, 1.9887, 0.1041, 1.4385, 2.6968, 0.01, 0.8163, 1.7731, 0.2265, 2.9595] which gives insane intervals like:

rating history: 2,3,3,1,3,1,1,3,3,3,3,3
interval history: 0,30,177,910,822,3457,2445,1840,6584,22653,36500,36500,36500
difficulty history: 0,4.9,4.8,4.7,5.6,5.3,6.0,6.6,6.0,5.6,5.3,5.1,4.9

When I try the V3 optimizer, the parameters seem much more sensible: [1.4617, 1.4644, 4.5713, -0.1261, -0.3351, 0.4935, 1.8344, -0.15, 1.2098, 2.3701, -0.01, 0.5964, 1.3038] with intervals

rating history: 2,3,3,1,3,1,1,3,3,3,3,3
interval history: 0,0,3,16,14,61,31,21,87,312,987,2793,7178
difficulty history: 0,4.7,4.6,4.6,4.9,4.8,5.0,5.1,4.9,4.7,4.6,4.6,4.6

Some other information that might be useful: I set requestedRetention = 0.9. The schedule I currently use in Anki: [image]

L-M-Sherlock commented 1 year ago

I also encountered the problem of exaggerated ranges, but much more severe. Here is the export of my learning data: Anki Data.zip

[images: forgetting curves from the submitted collection]

I found that your forgetting curves are weird. Your retention is very high in the first 10 days.

JeffersonCorreia32a commented 1 year ago

I also encountered the problem of exaggerated ranges, but much more severe. Here is the export of my learning data: Anki Data.zip

[images] I found that your forgetting curves are weird. Your retention is very high in the first 10 days.

This is probably because he has learning steps of 4d 6d 14d. Shouldn't the longest learning step be at most 1d? Then I'm going to ask: what is the ideal learning steps configuration to use with FSRS4?

AnyonicFugue commented 1 year ago

I also encountered the problem of exaggerated ranges, but much more severe. Here is the export of my learning data: Anki Data.zip

I found that your forgetting curves are weird. Your retention is very high in the first 10 days.

I think the strange forgetting curves might come from my previous (unreasonable) schedule, i.e. learning steps of 3d, 12d, 21d, ... (I forgot the exact values, but they are not far off). When I was using that schedule my retention was much lower than it is now. This might explain the low retention rates at certain interval values.

Also, I reviewed very few cards for a week in March when cards were piling up, which led to lots of forgetting. I guess that also contributed to my strange forgetting curves.

AnyonicFugue commented 1 year ago

So I wonder whether there is a way to produce reasonable FSRS parameters. Should I delete the data points with exceptionally low retention and run the optimizer again?

Additional information that might be useful: I'm mainly using Anki to memorize English words to prepare for the GRE exam. The difficulty of different words can vary greatly for me. My memory is, I think, a bit above average, but far below what the V4 optimizer predicts.

Expertium commented 1 year ago

Shouldn't the longest learning step be at most 1d? Then I'm going to ask: what is the ideal learning steps configuration to use with FSRS4?

When you are using FSRS, yes, the longest learning step should be 1d. But I'm assuming @AnyonicFugue was using the default Anki scheduler with his 3d, 12d, 21d steps. The really weird part is the fact that initial stabilities seem to be about the same for all grades:

The V4 optimizer tells me the optimal parameters are [23.47, 30.0, 22.37, 23.47, 4.4425, 0.5031, 0.6348, 0.2575, 1.9887, 0.1041, 1.4385, 2.6968, 0.01, 0.8163, 1.7731, 0.2265, 2.9595]

The first 4 numbers are initial values of S for Again/Hard/Good/Easy. And they all are about the same. I've never seen anything like this, usually the values monotonically increase, like 1, 2, 4, 8, for example. Unless @AnyonicFugue just randomly chooses which button to press with his eyes closed, I don't know how to explain this.
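
As a rough illustration of why near-equal initial stabilities look odd, here is what those first four values imply for first intervals, assuming the FSRS v4 rule interval = 9 * S * (1/r - 1) at requestRetention r = 0.9 (rounding and fuzz ignored):

    # Rough illustration: w[0..3] are the initial stabilities for Again/Hard/Good/Easy,
    # so at requestRetention = 0.9 they map directly onto the first intervals.
    optimized = [23.47, 30.0, 22.37, 23.47]   # AnyonicFugue's fitted values
    original = [0.4, 0.6, 2.4, 5.8]           # the original values from the first post
    r = 0.9

    for label, s_opt, s_orig in zip(["Again", "Hard", "Good", "Easy"], optimized, original):
        print(f"{label}: optimized ~ {9 * s_opt * (1 / r - 1):.1f} d, original ~ {9 * s_orig * (1 / r - 1):.1f} d")
    # Optimized: ~23, 30, ~22, ~23 days (essentially flat across grades),
    # whereas the original values give 0.4, 0.6, 2.4, 5.8 days, increasing monotonically.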

AnyonicFugue commented 1 year ago

Shouldn't the longest learning step be at most 1d? Then I'm going to ask: what is the ideal learning steps configuration to use with FSRS4?

When you are using FSRS, yes, the longest learning step should be 1d. But I'm assuming @AnyonicFugue was using the default Anki scheduler with his 3d, 12d, 21d steps. The really weird part is the fact that initial stabilities seem to be about the same for all grades:

The V4 optimizer tells me the optimal parameters are [23.47, 30.0, 22.37, 23.47, 4.4425, 0.5031, 0.6348, 0.2575, 1.9887, 0.1041, 1.4385, 2.6968, 0.01, 0.8163, 1.7731, 0.2265, 2.9595]

The first 4 numbers are initial values of S for Again/Hard/Good/Easy. And they all are about the same. I've never seen anything like this, usually the values monotonically increase, like 1, 2, 4, 8, for example. Unless @AnyonicFugue just randomly chooses which button to press with his eyes closed, I don't know how to explain this.

Yes, I haven't used FSRS, so all my cards are scheduled by the default Anki scheduler.

Of course I'm not randomly pressing buttons😂 But I think it can be difficult to accurately evaluate how well you remember a word and choose a button, especially for the first review (in my case, usually a day after making the card myself).

JeffersonCorreia32a commented 1 year ago

Shouldn't the longest learning step be at most 1d? Then I'm going to ask: what is the ideal learning steps configuration to use with FSRS4?

When you are using FSRS, yes, the longest learning step should be 1d. But I'm assuming @AnyonicFugue was using the default Anki scheduler with his 3d, 12d, 21d steps. The really weird part is the fact that initial stabilities seem to be about the same for all grades:

The V4 optimizer tells me the optimal parameters are [23.47, 30.0, 22.37, 23.47, 4.4425, 0.5031, 0.6348, 0.2575, 1.9887, 0.1041, 1.4385, 2.6968, 0.01, 0.8163, 1.7731, 0.2265, 2.9595]

The first 4 numbers are initial values of S for Again/Hard/Good/Easy. And they all are about the same. I've never seen anything like this, usually the values monotonically increase, like 1, 2, 4, 8, for example. Unless @AnyonicFugue just randomly chooses which button to press with his eyes closed, I don't know how to explain this.

Again, I'm asking for information on proper learning steps. I'm only using one step at the moment, 20m, and the Good interval is already determined by FSRS4. I don't know if I'm doing it right, or if, for example, there should be more learning steps, such as 5m 20m 120m 1d.

This strikes me as a huge information gap that I need to fill.

What are the most recommended learning steps to use with FSRS4?

Expertium commented 1 year ago

What are the most recommended learning steps to use with FSRS4?

1d or shorter.

L-M-Sherlock commented 1 year ago
[image: forgetting curve]

I think there are two completely different types of materials in your deck.

user1823 commented 1 year ago

Again I beg for information on proper learning steps.

That actually depends upon

So, it is something of a trial-and-error process. Personally, I am using a single learning step of 15 minutes.

Another point: From the w that you posted, I see that your initial stability after Again rating is 1.46 days. So, if you use requestRetention = 0.90, your first interval for the cards that were rated Again would be 1 day (after you clear the learning steps). This is not very long in my opinion.

L-M-Sherlock commented 1 year ago
[images]
JeffersonCorreia32a commented 1 year ago

What are the most recommended learning steps to use with FSRS4?

1d or shorter.

OK, but how many steps, and exactly how long should each one be? That's what I need to know. After all, there is an infinite number of possibilities that can hurt efficiency and even invalidate the entire efficiency of FSRS4. If this choice is left open for each person to make, absurd setups can occur. Absurd example: 1m 5m 10m 15m 30m 60m 120m 240m 480m 1d. The example above would make the person waste a lot of time clicking if they pressed Good until reaching FSRS4 scheduling. That is not an efficient set of learning steps.

Example 2, which I use today: 20m. This is the other extreme, which I currently use, since the 20m step applies to the Again button and Good is already determined by FSRS4. I think there should be a clearer guideline on what is most recommended.

AnyonicFugue commented 1 year ago

I think there are two completely different types of materials in your deck.

[images: forgetting curves]

All cards are English words, but I've adjusted the review schedule multiple times to test which parameters work best, which I believe happens to all Anki enthusiasts :)
For some schedules (for example, 2d-4d-7d-8d-15d in the first image you uploaded) my retention is high, but for others (e.g. 7d-16d in the second image) it is lower.

I ran the optimizer again with Revlog Start Date = 2023-05-15 (I believe I have been using my current schedule since then, which excludes many exceptional data points), and the optimizer returned [14.05, 17.71, 30.0, 30.0, 4.8101, 0.8208, 0.7424, 0.1285, 1.61, 0.1, 1.0598, 2.2977, 0.01, 0.4572, 1.3775, 0.4101, 2.7297], with

rating history: 2,3,1,3,3,1,1,3,3,3,3,3
interval history: 0,18,59,14,41,116,20,8,19,47,116,283,681
difficulty history: 0,5.6,5.5,6.7,6.5,6.3,7.4,8.3,7.9,7.5,7.1,6.8,6.6

This seems much more reasonable.

Expertium commented 1 year ago

@JeffersonCorreia32a I always use 2 steps: 15m 1d. I don't recommend using a lot of small steps, like 5m 10m 15m 30m 60m, as it has a negligible impact on long-term learning.

ghost commented 1 year ago

I might be mistaken, but as I understand it, the optimizer discards all the learning (or relearning) reviews except the first ones in each session. This means that the optimizer does not consider any reviews that have intervals longer than a day, or their impact on stability and difficulty. What I think is that these reviews, while technically being "Learn" type reviews, are no different from ordinary "Review" type reviews in terms of their relation to the memory model. It may be beneficial for the optimizer, instead of throwing away all the reviews in a single "Learn" or "Relearn" session, to split the session by days and keep the first review of each day. I think this would produce better results for users who happen to have long learning intervals in their review histories.

ghost commented 1 year ago

In other words, the optimizer should pick every first review of the day, regardless of its type. [image taken from https://www.reddit.com/r/Anki/comments/jl9w17/cards_i_relearned_yesterday_which_should_have/]

L-M-Sherlock commented 1 year ago

I might be mistaken, but as I understand it, the optimizer discards all the learning (or relearning) reviews except the first ones in each session. This means that the optimizer does not consider any reviews that have intervals longer than a day, or their impact on stability and difficulty. What I think is that these reviews, while technically being "Learn" type reviews, are no different from ordinary "Review" type reviews in terms of their relation to the memory model. It may be beneficial for the optimizer, instead of throwing away all the reviews in a single "Learn" or "Relearn" session, to split the session by days and keep the first review of each day. I think this would produce better results for users who happen to have long learning intervals in their review histories.

The optimizer doesn't discard all the learning (or relearning) reviews except the first ones in each session. It splits each session by days and keeps the first review of each day. Please check the code here:

https://github.com/open-spaced-repetition/fsrs-optimizer/blob/420a10f6c1daad0bdef3c59fe8b184598a6a0761/src/fsrs_optimizer/fsrs_optimizer.py#L383
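
For readers who don't want to dig through the source, here is an illustrative sketch of that preprocessing step. It is not the repository's exact code, and the column names card_id, review_time and rating are assumed purely for illustration:

    import pandas as pd

    # Within each card's review history, group revlog entries by calendar day and
    # keep only the first review of each day, so that day-long "learning steps"
    # still contribute to the fitted memory model instead of being discarded.
    def first_review_per_day(revlog: pd.DataFrame) -> pd.DataFrame:
        revlog = revlog.sort_values("review_time")
        revlog = revlog.assign(
            review_date=pd.to_datetime(revlog["review_time"], unit="ms").dt.date
        )
        return revlog.groupby(["card_id", "review_date"], as_index=False).first()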

ghost commented 1 year ago

Unfortunately, processing the raw data is the part that I struggle to understand. If I was wrong, then I'm sorry for bothering.

AkiraChisaka commented 1 year ago

I want to chime in, and add that I am also having similar experiences.

The v4 algorithm seems a bit too aggressive, giving out huge w3 values.

For me, my w is [0.13, 0.18, 2.15, 30.0, 4.9269, 0.9457, 0.8685, 0.0056, 1.3993, 0.1, 0.8463, 2.2054, 0.0263, 0.3629, 1.2524, 0.3426, 2.6167]. I have run the v4 optimizer quite a few times in the past few days, and it always gives me 30.0, which I believe is the max value for w3?

Considering that I have been using Anki for exactly one month, and FSRS for no more than 3 weeks, isn't 30.0 just way too high? I am very uncertain whether my retention for a card I rated "Easy" will be anything close to 90% after 30 days, since I'm pretty sure retention does not decay linearly.

So yeah, I want to propose that there should be a cap on the length of the first interval after pressing "Easy" on a card. Something like capping it to how long you have been using Anki. This way, at least I know that FSRS's initial "Easy" interval is somewhat based on precedent.

And yeah, I think in general, v4 feels a bit too... sensitive and aggressive. As in, when there isn't much training data for the optimizer, as a user it feels a bit like FSRS is "just winging it" a lot of the time. So I feel FSRS v4 should be a bit more conservative when deciding parameters, mostly for users who are new to Anki in general and don't actually have much review data.

Also, I think my requested retention is 90%, but my actual share of non-Again answers for young cards is only 80.37%. And I admit it's starting to feel a bit demoralizing, like the algorithm keeps giving me cards that I have just forgotten.

Also here is all my data, feel free to check it out if it's relevant: collection-2023-07-29@16-31-40.zip
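
For what it's worth, the cap proposed above is easy to express; here is a hypothetical sketch (not part of the optimizer, and the function name is made up):

    # Hypothetical sketch of the proposed cap: don't let any of the four initial
    # stabilities exceed the number of days the user has actually been reviewing,
    # so the first "Easy" interval is always backed by some precedent.
    def cap_initial_stabilities(w: list, days_since_first_review: float) -> list:
        capped = list(w)
        for i in range(4):  # w[0..3] are the initial stabilities for Again/Hard/Good/Easy
            capped[i] = min(capped[i], days_since_first_review)
        return capped

    print(cap_initial_stabilities([0.13, 0.18, 2.15, 30.0, 4.9269], 21))
    # -> [0.13, 0.18, 2.15, 21, 4.9269] for a collection with three weeks of review history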

L-M-Sherlock commented 1 year ago

I want to chime in, and add that I am also having similar experiences.

Could you set verbose=True in optimizer.pretrain(verbose=False) in section 2.1 when you run the optimizer? Then we can check your forgetting curve for the first Easy rating.

user1823 commented 1 year ago

@L-M-Sherlock and @Expertium, I am seeing that the largest number of support requests we are getting are related to high values of initial stability, especially from users who don't have sufficient review data.

This new approach of estimating the initial stability worked well for us, who have a large amount of data. But I think it is not well suited for users who have insufficient review data.

So, I think that we should set a threshold below which the optimizer is not allowed to use this approach.

For example, we can use a total of 2000 cards as the minimum requirement to use curve_fit. If the total number of cards is less than 2000, we can bring back v3's pre-train approach (but save the 4 parameters for initial stability in the final w instead of 2, to retain compatibility with the v4 code).

The exact thresholds for deciding whether to use the v4 approach, fall back to the v3 approach, or just declare the data insufficient would need to be determined through experiments with data from people with a low number of reviews and also with small samples from our own collections.

Note that there is a significant difference between the two types of datasets that I suggested using:

  • In collections from new Anki users, both the number of cards and the number of reviews per card would be low.
  • In small samples from our own collections, the number of cards would be low, but the number of reviews per card would be high.
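
A sketch of the gating described above, assuming the review data is a pandas DataFrame with a card_id column; the function names are hypothetical and the 2,000-card figure is only the example value from this suggestion, with the fallback itself left abstract:

    CARD_THRESHOLD = 2000  # example value from the suggestion above

    def pretrain_initial_stability(dataset, curve_fit_pretrain, fallback_pretrain):
        """Use the v4 curve-fit pretrain only when there is enough data."""
        if dataset["card_id"].nunique() >= CARD_THRESHOLD:
            return curve_fit_pretrain(dataset)  # current v4 behaviour: fit S0 per first rating
        # otherwise fall back to a more conservative estimate (e.g. a v3-style
        # pretrain expanded to four values), or report that the data is insufficient
        return fallback_pretrain(dataset)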

JeffersonCorreia32a commented 1 year ago

@L-M-Sherlock and @Expertium, I am seeing that the largest number of support requests we are getting are related to high values of initial stability, especially from users who don't have sufficient review data.

This new approach of estimating the initial stability worked well for us, who have a large amount of data. But I think it is not well suited for users who have insufficient review data.

So, I think that we should set a threshold below which the optimizer is not allowed to use this approach.

For example, we can use a total of 2000 cards as the minimum requirement to use curve_fit. If the total number of cards is less than 2000, we can bring back v3's pre-train approach (but save the 4 parameters for initial stability in the final w instead of 2, to retain compatibility with the v4 code).

The exact thresholds for deciding whether to use the v4 approach, fall back to the v3 approach, or just declare the data insufficient would need to be determined through experiments with data from people with a low number of reviews and also with small samples from our own collections.

Note that there is a significant difference between the two types of datasets that I suggested using:

  • In collections from new Anki users, both the number of cards and the number of reviews per card would be low.
  • In small samples from our own collections, the number of cards would be low, but the number of reviews per card would be high.

I suggest something else instead of limiting the full function of the optimizer to a minimum of 2,000 cards, for example. I suggest that, when optimizing the values of w, they be gradually constrained to within one standard deviation, so that the more cards and revlogs there are, the freer the values are to deviate from this expected band, with the bands widening as the data grows. Here is an image to illustrate the idea: [image]

Expertium commented 1 year ago

That's an interesting idea, though I don't know how to choose the values of "bands".

L-M-Sherlock commented 1 year ago

Also here is all my data, feel free to check it out if it's relevant: collection-2023-07-29@16-31-40.zip

I tested your collection. I found that the problem is not caused by

            params, _ = curve_fit(power_forgetting_curve, delta_t, recall, sigma=1/np.sqrt(count), bounds=((0.1), (30 if total_count < 1000 else 365)))

It is caused by

        def S0_rating_curve(rating, a, b, c):
            return np.exp(a + b * rating) + c

        params, covs = curve_fit(S0_rating_curve, list(rating_stability.keys()), list(rating_stability.values()), sigma=1/np.sqrt(list(rating_count.values())), method='dogbox', bounds=((-15, 0.03, -5), (15, 7, 30)))

Without the limit of 100 reviews, the fitting results are:

[images: pretrain fit plots pretrain_0 through pretrain_3]

With the limit of 100 reviews, the fitting results are:

[images: pretrain fit plots pretrain_0 through pretrain_2]

@user1823 @Expertium, maybe removing the limit of reviews is better.

L-M-Sherlock commented 1 year ago

I suggest something else instead of limiting the full function of the optimizer to a minimum of 2,000 cards, for example. I suggest that, when optimizing the values of w, they be gradually constrained to within one standard deviation, so that the more cards and revlogs there are, the freer the values are to deviate from this expected band, with the bands widening as the data grows. Here is an image to illustrate the idea:

I have an idea. We can set p0 (the initial guess) and maxfev (the maximum number of iterations) for the curve_fit method. A small maxfev would decrease the deviation between the fitting result and the initial guess.
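
To make the idea concrete, here is an illustrative version of the fit shown earlier with an explicit p0 and a small maxfev; the 2-day prior and the iteration budget below are placeholders, not the optimizer's actual settings:

    import numpy as np
    from scipy.optimize import curve_fit

    def power_forgetting_curve(t, s):
        return (1 + t / (9 * s)) ** -1

    def fit_initial_stability(delta_t, recall, count, prior_s0=2.0):
        """Fit S0 while keeping the result close to a conservative prior when data is scarce."""
        total_count = int(np.sum(count))
        params, _ = curve_fit(
            power_forgetting_curve, delta_t, recall,
            p0=(prior_s0,),                     # start from the prior instead of scipy's default
            sigma=1 / np.sqrt(count),
            bounds=((0.1,), (30 if total_count < 1000 else 365,)),
            maxfev=max(1, total_count // 100),  # fewer reviews -> smaller iteration budget
        )
        return params[0]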

user1823 commented 1 year ago

I have an idea. We can set p0 (the initial guess) and maxfev (the maximum number of iterations) for the curve_fit method. A small maxfev would decrease the deviation between the fitting result and the initial guess.

By curve_fit, do you mean the following?

params, _ = curve_fit(power_forgetting_curve, delta_t, recall, sigma=1/np.sqrt(count), bounds=((0.1), (30 if total_count < 1000 else 365)))

If yes, I like the idea. But we will have to test it with a wide range of collections (ones with a large number of reviews as well as ones with a small number) to ensure that it doesn't produce unintended results.

L-M-Sherlock commented 1 year ago

I replaced curve_fit with minimize. It is more flexible and customizable.

https://github.com/open-spaced-repetition/fsrs-optimizer/blob/723da1e03a00bfa336410a17ba9f1dc931db070f/src/fsrs_optimizer/fsrs_optimizer.py#L510-L524

In this implementation, I set x0 and maxiter. It is conservative when the number of reviews is small. We could also add a penalty for large stability in the loss function.
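
An illustrative sketch of that setup; the real implementation is in the linked fsrs_optimizer.py, and the penalty form, weights and solver below are choices made for this example, not necessarily the ones used there:

    import numpy as np
    from scipy.optimize import minimize

    def power_forgetting_curve(t, s):
        return (1 + t / (9 * s)) ** -1

    def fit_initial_stability(delta_t, recall, count, x0=2.0, penalty_weight=0.1):
        """Weighted squared error on the forgetting curve, plus an L1-style penalty
        that pulls the fitted stability toward the initial guess when data is thin."""
        delta_t, recall, count = map(np.asarray, (delta_t, recall, count))
        total = np.sum(count)

        def loss(params):
            s = params[0]
            pred = power_forgetting_curve(delta_t, s)
            mse = np.sum(count * (pred - recall) ** 2) / total
            return mse + penalty_weight * abs(s - x0) / total  # penalty fades as data grows

        res = minimize(
            loss, x0=[x0], method="Nelder-Mead", bounds=[(0.1, 365)],
            options={"maxiter": max(1, int(np.sqrt(total)))},  # small budget = conservative fit
        )
        return res.x[0]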

AkiraChisaka commented 1 year ago

@L-M-Sherlock

Could you set verbose=True in optimizer.pretrain(verbose=False) in section 2.1 when you run the optimizer? Then we can check your forgetting curve for the first Easy rating.

Ok, I gave it a go, let me see... I'm not super familiar with Google Colab yet, so I'll just paste the output from section 2.1 here:


optimizer.define_model()
optimizer.pretrain(verbose=True)
optimizer.train(verbose=False)

[image]
{1: 0.128602622444628}
Not enough data for first rating 2. Expected at least 100, got 8.
[image]
{1: 0.128602622444628, 3: 2.2090953146625885}
Not enough data for first rating 4. Expected at least 100, got 83.
Weighted fit parameters: [-10.08928228 3.6075409 0.12707148]
Fit stability: [0.12860262 2.20909531]
RMSE: 0.0000
[image]


Yeah... this might be starting to look a bit buggy, since I think rating 4 means "Easy", and I got "Not enough data for first rating 4. Expected at least 100, got 83."

And the optimizer kinda just decided "83/100? Yeah that's good enough let's give it a solid 30.0 out of 30.0"

user1823 commented 1 year ago

I replaced curve_fit with minimize. It is more flexible and customizable.

I tested this (by copying the relevant portions into fsrs4anki_optimizer_beta.ipynb) and also tried adjusting the values of maxiter. My results are tabulated below.

Setting | Initial Stability
Current optimizer | {1: 1.17, 2: 6.93, 3: 16.65, 4: 30.0}
maxiter = total_count | {1: 1.17, 2: 10.85, 3: 16.64, 4: 160.65}
maxiter = total_count / 10 | {1: 1.17, 2: 1.19, 3: 16.64, 4: 160.65}
maxiter = total_count / 15 | {1: 1.17, 2: 1.04, 3: 16.64, 4: 69.91}
maxiter = total_count / 20 | {1: 1.17, 2: 1.04, 3: 16.64, 4: 39.76}
maxiter = total_count / 50 | {1: 1.17, 2: 1.04, 3: 16.64, 4: 13.84}
maxiter = total_count / 100 | {1: 1.17, 2: 1.04, 3: 16.64, 4: 8.73}
maxiter = int(np.sqrt(total_count)) | {1: 1.17, 2: 1.53, 3: 16.64, 4: 83.19}

For my collection, I find the results of maxiter = total_count / 15 and maxiter = total_count / 20 to be more or less acceptable. But we will have to find a solution that works well for all the users.

By the way, I found the p0 for Hard to be too low. So, I changed it to 1.5 and then I got {1: 1.17, 2: 1.54, 3: 16.64, 4: 69.91} with maxiter = total_count / 15 and {1: 1.17, 2: 2.47, 3: 16.64, 4: 83.19} with maxiter = int(np.sqrt(total_count)).

After making these changes, I think that p0 = 1.5 (for Hard) combined with maxiter = int(np.sqrt(total_count)) works best for my collection.

Expertium commented 1 year ago

Try maxiter = int(np.sqrt(total_count)), I'm curious.

L-M-Sherlock commented 1 year ago

I have added L1 regularization for initial stability. It performs better.

Without Fix/exaggerated-initial-stability-after-optimizing:

R-squared: 0.9605 RMSE: 0.0154 MAE: 0.0052 [0.04796209 0.94900968]

With Fix/exaggerated-initial-stability-after-optimizing:

R-squared: 0.9619 RMSE: 0.0150 MAE: 0.0054 [0.03084845 0.96910888]

The RMSE decreases.

L-M-Sherlock commented 1 year ago
%pip install git+https://github.com/open-spaced-repetition/fsrs-optimizer@Fix/exaggerated-initial-stability-after-optimizing

This line of code installs the dev package in the notebook.

image
Expertium commented 1 year ago

@JeffersonCorreia32a Sherlock changed the way S0 is calculated, try it again (see his comment above).

user1823 commented 1 year ago

I have added L1 regularization for initial stability. It performs better.

L1 regularization increased the RMSE for me.

Loss = MSE

{1: 1.17, 2: 2.47, 3: 16.64, 4: 83.19}

R-squared: 0.9487
RMSE: 0.0113

Last rating: 1
R-squared: 0.7937
RMSE: 0.0267

Last rating: 2
R-squared: -0.2459
RMSE: 0.1157

Last rating: 3
R-squared: 0.9622
RMSE: 0.0091

Last rating: 4
R-squared: -27.4365
RMSE: 0.0397

Loss = MSE + L1

{1: 1.17, 2: 1.74, 3: 8.49, 4: 16.6}

R-squared: 0.9443
RMSE: 0.0114

Last rating: 1
R-squared: 0.8448
RMSE: 0.0227

Last rating: 2
R-squared: -0.3572
RMSE: 0.1213

Last rating: 3
R-squared: 0.9345
RMSE: 0.0118

Last rating: 4
R-squared: -246.6033
RMSE: 0.0923

All the other factors were the same, i.e. S0 for the Hard rating = 1.5 and maxiter = int(np.sqrt(total_count)).

Expertium commented 1 year ago

I would do proper tests of statistical significance with many decks and collections, but it's too much of a pain without a version of the optimizer with visible code.

L-M-Sherlock commented 1 year ago

Could you use the command line version to test statistical significance? It's faster and you can also modify the code.

python ./src/fsrs_optimizer/__main__.py "./"

This command optimizes all decks and collections in the current directory.

Expertium commented 1 year ago

I have a better idea. Here's the code:

import scipy.stats

list1 = []  # baseline RMSE values, one per deck/collection
list2 = []  # RMSE values after the change, same decks, same order
wilcox = scipy.stats.wilcoxon(list1, list2).pvalue

list1 and list2 should: 1) have the same length 2) contain values ordered in the same way. So if list1 has a1, a2, a3...an, then in list2 the values should also be b1, b2, b3...bn. Here a and b are values of RMSE before and after some change.

One of the lists should contain baseline values of RMSE, and the other should contain values of RMSE after making a change. If a1 is the RMSE of some deck, then b1 should be the RMSE of the same deck after making a change to the optimizer. Note that a low p-value doesn't tell you which one has the lower average RMSE; that's something you'll have to calculate separately to find out which one is better. So now all you have to do is choose some version of FSRS as the baseline and fill one of the lists with values. After that, you can use this code to run the optimizer on all collections from all users who have submitted their data and calculate statistical significance after all optimizations are finished.
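
A hypothetical usage of that procedure; the RMSE values below are placeholders, one pair per deck, ordered identically in both lists:

    import scipy.stats

    baseline_rmse = [0.031, 0.047, 0.019, 0.052, 0.024, 0.038]  # before the change
    modified_rmse = [0.029, 0.049, 0.018, 0.047, 0.022, 0.035]  # after the change, same decks, same order

    p_value = scipy.stats.wilcoxon(baseline_rmse, modified_rmse).pvalue
    print(p_value)
    # The p-value only says whether the paired differences are plausibly zero;
    # compare the mean (or median) RMSE of the two lists to see which version wins.
    print(sum(baseline_rmse) / len(baseline_rmse), sum(modified_rmse) / len(modified_rmse))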

L-M-Sherlock commented 1 year ago

I did the tests between 4.5.3 and current branch:

Collection | old RMSE | new RMSE | old init S | new init S
collection-2022-09-18@13-21-58 | 0.0154 | 0.0157 | 1.14, 1.01, 5.44, 14.67 | 1.14, 1.01, 5.43, 14.11
Main_27.04.2023 | 0.0549 | 0.0544 | 0.1, 0.32, 1.14, 196.37 | 0.6, 0.32, 1.14, 147.46
Default | 0.0128 | 0.0128 | 1.0, 2.0, 6.81, 30.0 | 1.0, 1.46, 6.81, 34.62
0PROJETO CONCURSO 20mil+26-07-2023 com midia | 0.0355 | 0.0345 | 1.46, 4.9, 8.82, 10.08 | 1.45, 2.75, 7.84, 9.27
English | 0.0201 | 0.0182 | 23.47, 30.0, 22.37, 23.47 | 9.27, 32.73, 21.92, 25.33

p = 0.14

The improvement is not statistically significant. But it really decreases the deviation when the review count is small.

user1823 commented 1 year ago

I did the tests between 4.5.3 and current branch:

What if you make the following changes?

  • use maxiter = int(np.sqrt(total_count)) and
  • increase the S0 for Hard to 1.5 (0.6 is too small)

L-M-Sherlock commented 1 year ago

I did the tests between 4.5.3 and current branch:

What if you make the following changes?

  • use maxiter = int(np.sqrt(total_count)) and
  • increase the S0 for Hard to 1.5 (0.6 is too small)
Collection | old RMSE | new RMSE | user1823 RMSE | old init S | new init S | user1823 init S
collection-2022-09-18@13-21-58 | 0.0154 | 0.0157 | 0.0157 | 1.14, 1.01, 5.44, 14.67 | 1.14, 1.01, 5.43, 14.11 | 1.14, 1.01, 5.43, 14.11
Main_27.04.2023 | 0.0549 | 0.0544 | 0.0544 | 0.1, 0.32, 1.14, 196.37 | 0.6, 0.32, 1.14, 147.46 | 0.6, 0.32, 1.14, 147.46
Default | 0.0128 | 0.0128 | 0.0128 | 1.0, 2.0, 6.81, 30.0 | 1.0, 1.46, 6.81, 34.62 | 1.0, 1.46, 6.81, 34.62
0PROJETO CONCURSO 20mil+26-07-2023 com midia | 0.0355 | 0.0345 | 0.0345 | 1.46, 4.9, 8.82, 10.08 | 1.45, 2.75, 7.84, 9.27 | 1.45, 2.4, 7.84, 9.27
English | 0.0201 | 0.0182 | 0.0223 | 23.47, 30.0, 22.37, 23.47 | 9.27, 32.73, 21.92, 25.33 | 4.1, 32.77, 21.92, 20.89

p=1.0

It seems to affect the Again and Easy stability of English and the Hard stability of 0PROJETO CONCURSO 20mil+26-07-2023 com midia. Their review counts are all less than 100.

Although it increased the RMSE for English, it achieved our original goal in the current issue.

user1823 commented 1 year ago

So, I guess that the effect which was to be achieved by using sqrt of total_count in maxiter has already been achieved by using L1 regularization. I think so because using sqrt in maxiter significantly affected the results in my testing yesterday.

But, I still suggest using S0 = 1.5 for Hard.

Also, I am curious why the English deck has such high initial stability values.

L-M-Sherlock commented 1 year ago

But, I still suggest using S0 = 1.5 for Hard.

I have used it, but it doesn't seem to affect the result.