open-spaced-repetition / fsrs-optimizer

FSRS Optimizer Package
https://pypi.org/project/FSRS-Optimizer/
BSD 3-Clause "New" or "Revised" License

A better outlier filter for "Compute minimum recommended retention" #112

Closed Expertium closed 3 months ago

Expertium commented 4 months ago

Currently, we filter out reviews where the answer time is 0 or at least 20 minutes. However, if the user left "Maximum answer seconds" at 60 (the default), recorded times are capped at 60 seconds, so the 20-minute cutoff never triggers. So I have an idea:
1) Select all review times (after filtering out t=0 and t>=20 minutes)
2) Find their maximum, max(t)
3) Remove all values that are equal to the maximum

Here's the key idea: we don't know what value the user chose as their "Maximum answer seconds", because we don't have access to that setting. But we can guess it from the maximum value of t. For example, if the maximum is 60 seconds, it's reasonable to assume that 60 is the "Maximum answer seconds" setting. Then we can remove all reviews whose time equals that value.

So if a user has times like this: 7, 8, 9, 10, 12, 15, 20, 60, 60, 60.

After the filter is applied, they will become this: 7, 8, 9, 10, 12, 15, 20
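The three steps above can be sketched roughly as follows. This is only an illustration of the proposal, not the optimizer's actual code: the function name `drop_suspected_cap`, the parameter `upper_s`, and the assumption that times are plain seconds in a list are all hypothetical.

```python
def drop_suspected_cap(times_s, upper_s=20 * 60):
    """Sketch of the proposed filter; names and units are assumptions."""
    # Step 1: keep times that pass the existing filter (t > 0 and t < 20 minutes).
    kept = [t for t in times_s if 0 < t < upper_s]
    if not kept:
        return kept
    # Step 2: treat the largest remaining time as the guessed cap.
    guessed_cap = max(kept)
    # Step 3: drop every review whose time equals the guessed cap.
    return [t for t in kept if t != guessed_cap]


times = [7, 8, 9, 10, 12, 15, 20, 60, 60, 60]
print(drop_suspected_cap(times))  # [7, 8, 9, 10, 12, 15, 20]
```

Note that, as written, this sketch removes the largest value even when it occurs only once and is probably not a cap.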

@user1823 I want to know your opinion as well

user1823 commented 4 months ago

Here is what I wrote about this before:

During calculation of the median, the exact values of the lowest and highest values don't matter. So, I don't think that we need to remove the entries equal to the maximum limit.

Rather, removing those entries would cause the median to become unexpectedly small.

Originally posted by @user1823 in https://github.com/open-spaced-repetition/fsrs-optimizer/issues/107#issuecomment-2067899702

For example, let's say that a user had answer times like this: 40, 50, 60, 60, 60, 60, 60, 60, 60, 60, 60

In this case, I would believe that the true average answer time was 60 seconds (or even more), but the default setting capped most of the answer times at 60 seconds. If you filter out these values, the median becomes unreasonably small.
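To make the numbers concrete, here is a small sketch using Python's standard `statistics.median` on the example times above. The variable names are made up for illustration, and removing every value equal to the maximum is assumed to be how the proposed filter would behave.

```python
from statistics import median

# Example times from above: two short answers and nine capped at 60 seconds.
times = [40, 50] + [60] * 9

# Proposed filter: drop every value equal to the maximum.
filtered = [t for t in times if t != max(times)]

median_before = median(times)     # middle of 11 sorted values
median_after = median(filtered)   # only 40 and 50 remain

print(median_before, median_after)  # 60 45.0
```

With the capped reviews removed, the median drops from 60 to 45 seconds, even though most answers took at least 60 seconds.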

By the way, if you think that the calculation of the average times still requires improvement, I suggest asking Durasba1 for help. Based on their responses in https://forums.ankiweb.net/t/clarify-what-optimal-retention-means/42803/, they seem to be knowledgeable in this field.

L-M-Sherlock commented 4 months ago

I agree with user1823. The median is not sensitive to outliers, so I think the current outlier filter is good enough.
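A quick way to see this insensitivity is the sketch below. The "uncapped" times are invented purely for illustration (we cannot know the real values behind a capped review), and the comparison with the mean is added only for contrast.

```python
from statistics import mean, median

# Times as recorded with a 60-second cap (from the first example in this thread).
capped = [7, 8, 9, 10, 12, 15, 20, 60, 60, 60]
# Hypothetical true times for the same reviews; the last three are made up.
uncapped = [7, 8, 9, 10, 12, 15, 20, 90, 300, 600]

# The median only depends on the middle values, so the cap does not change it.
print(median(capped), median(uncapped))  # 13.5 13.5

# The mean, by contrast, shifts a lot when the largest values change.
print(mean(capped) < mean(uncapped))  # True
```

As long as the capped reviews stay on the upper side of the middle, their exact values do not affect the median.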