open-spaced-repetition / srs-benchmark

A benchmark for spaced repetition schedulers/algorithms
https://github.com/open-spaced-repetition/fsrs4anki/wiki

Inclusion of any of the boosting models #102

Closed Gilfaro closed 2 weeks ago

Gilfaro commented 1 month ago

Would it be possible to include any of the boosting models, like CatBoost, LightGBM, or XGBoost? They are very good with tabular time-series data such as this, and there is already a neural network in the comparison.

L-M-Sherlock commented 1 month ago

PR is welcome.

Gilfaro commented 1 month ago

Will try modifying other.py, as it looks like most of the needed preprocessing is already done for the other models.

Expertium commented 1 month ago

@Gilfaro hey, just wanted to let you know that we're done benchmarking new algos and re-benchmarking old ones, but it's not too late (it's never too late, for that matter) to add more

Gilfaro commented 3 weeks ago

Didn't have much time, but I tested boosting on top of FSRS-5, using its stability and difficulty as the input signals. The results on test data:

Model: FSRS-5-catboost
Total number of users: 10
Total number of reviews: 255400
Weighted average by reviews:
FSRS-5-catboost LogLoss (mean±std): 0.3622±0.1144
FSRS-5-catboost RMSE(bins) (mean±std): 0.0418±0.0117
FSRS-5-catboost AUC (mean±std): 0.7254±0.0765

Weighted average by log(reviews):
FSRS-5-catboost LogLoss (mean±std): 0.3505±0.1557
FSRS-5-catboost RMSE(bins) (mean±std): 0.0528±0.0221
FSRS-5-catboost AUC (mean±std): 0.7213±0.0784

Weighted average by users:
FSRS-5-catboost LogLoss (mean±std): 0.3480±0.1606
FSRS-5-catboost RMSE(bins) (mean±std): 0.0545±0.0233
FSRS-5-catboost AUC (mean±std): 0.7189±0.0757

Model: FSRS-5
Total number of users: 10
Total number of reviews: 255400
Weighted average by reviews:
FSRS-5 LogLoss (mean±std): 0.3737±0.1183
FSRS-5 RMSE(bins) (mean±std): 0.0496±0.0188
FSRS-5 AUC (mean±std): 0.7064±0.0897

Weighted average by log(reviews):
FSRS-5 LogLoss (mean±std): 0.3660±0.1512
FSRS-5 RMSE(bins) (mean±std): 0.0647±0.0245
FSRS-5 AUC (mean±std): 0.6741±0.1350

Weighted average by users:
FSRS-5 LogLoss (mean±std): 0.3641±0.1547
FSRS-5 RMSE(bins) (mean±std): 0.0671±0.0247
FSRS-5 AUC (mean±std): 0.6668±0.1387
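
For reference, a minimal sketch of the kind of stacking described above: FSRS-5's per-review stability and difficulty fed into a CatBoost classifier that predicts the binary recall outcome. The feature names and data-frame layout are assumptions for illustration, not the code from the draft.

    import numpy as np
    import pandas as pd
    from catboost import CatBoostClassifier

    def build_features(df: pd.DataFrame) -> pd.DataFrame:
        # Assumed columns: FSRS-5 outputs plus the elapsed interval and rating.
        return df[["stability", "difficulty", "delta_t", "rating"]]

    def train_stacked_model(train_df: pd.DataFrame) -> CatBoostClassifier:
        model = CatBoostClassifier(
            loss_function="Logloss",  # binary target: recalled (1) or forgotten (0)
            depth=6,                  # symmetric trees of depth 6 (CatBoost's default)
            verbose=False,
        )
        model.fit(build_features(train_df), train_df["y"])
        return model

    def predict_recall(model: CatBoostClassifier, test_df: pd.DataFrame) -> np.ndarray:
        # Probability of the positive class (successful recall).
        return model.predict_proba(build_features(test_df))[:, 1]
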
Expertium commented 3 weeks ago

Interesting! Could you share the code? Funnily enough, less than 2 hours ago I said "There seems to be some kind of curse surrounding people who want to add decision tree models to the benchmark. Both of these guys practically disappeared" [screenshot]

Gilfaro commented 3 weeks ago

@Expertium Added a draft with the code. It would probably also be good to have some lagging indicators that don't piggyback on the already very good FSRS-5, for a better comparison.

Expertium commented 3 weeks ago

Can you make a decision tree based model that doesn't borrow anything from FSRS? To be honest, that's what I was expecting.

Gilfaro commented 2 weeks ago

That would require implementing a forgetting curve and the other lagging features that FSRS already provides, which I unfortunately don't have time for currently. Boosting algos cannot interpret time-series data directly, only through values that describe its features.

Expertium commented 2 weeks ago

Then how about HLR-catboost?

Gilfaro commented 2 weeks ago

I did a very quick hijack. As you can see, it significantly improves the base model, but it's only as good as the indicators you feed it. In this case it shows that HLR is a worse indicator than FSRS stability and difficulty.

Model: HLR-catboost
Total number of users: 10
Total number of reviews: 255400
Weighted average by reviews:
HLR-catboost LogLoss (mean±std): 0.3786±0.1125
HLR-catboost RMSE(bins) (mean±std): 0.0549±0.0211
HLR-catboost AUC (mean±std): 0.6979±0.0775

Weighted average by log(reviews):
HLR-catboost LogLoss (mean±std): 0.3691±0.1608
HLR-catboost RMSE(bins) (mean±std): 0.0650±0.0273
HLR-catboost AUC (mean±std): 0.6882±0.0783

Weighted average by users:
HLR-catboost LogLoss (mean±std): 0.3664±0.1668
HLR-catboost RMSE(bins) (mean±std): 0.0662±0.0277
HLR-catboost AUC (mean±std): 0.6845±0.0750

Model: HLR
Total number of users: 10
Total number of reviews: 255400
Weighted average by reviews:
HLR LogLoss (mean±std): 0.4938±0.2028
HLR RMSE(bins) (mean±std): 0.1144±0.0628
HLR AUC (mean±std): 0.6416±0.1154

Weighted average by log(reviews):
HLR LogLoss (mean±std): 0.4905±0.2182
HLR RMSE(bins) (mean±std): 0.1363±0.0553
HLR AUC (mean±std): 0.6115±0.1279

Weighted average by users:
HLR LogLoss (mean±std): 0.4842±0.2184
HLR RMSE(bins) (mean±std): 0.1378±0.0535
HLR AUC (mean±std): 0.6065±0.1248
Expertium commented 2 weeks ago

That's a huge improvement! @L-M-Sherlock I think we should include HLR-catboost in the benchmark. I have 2 questions: 1) How many parameters does it have? 2) Can you explain how the probability of recall is calculated, or link an article that explains the inner workings of catboost?

Gilfaro commented 2 weeks ago
  1. It doesn't operate on a fixed set of parameters: there is a variable number of symmetric decision trees, each of depth N with 2^N leaf nodes. The default depth used is 6.
  2. The final probability is just the sum of the weights of the terminal leaf nodes selected in each decision tree. If you want to know how boosting works, you can read the papers linked at https://catboost.ai/en/docs/concepts/educational-materials-papers
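
As a toy illustration of that (a simplification, not CatBoost's actual code): with symmetric (oblivious) trees, every node at a given depth applies the same split, so a sample's leaf index is just one bit per level, and the raw score is the sum of the selected leaf values. For binary classification that sum is then squashed into a probability with a sigmoid.

    import numpy as np

    def predict_oblivious_ensemble(x, trees):
        # Each tree is (splits, leaf_values): `splits` holds one
        # (feature_index, threshold) pair per depth level, and
        # `leaf_values` has 2**depth entries.
        raw_score = 0.0
        for splits, leaf_values in trees:
            leaf_index = 0
            for feature_index, threshold in splits:
                # Same comparison at every node of a level -> one bit per level.
                leaf_index = (leaf_index << 1) | int(x[feature_index] > threshold)
            raw_score += leaf_values[leaf_index]
        # For binary classification the summed score becomes a probability.
        return 1.0 / (1.0 + np.exp(-raw_score))

    # Two depth-2 trees over a 2-feature input, purely for illustration.
    trees = [
        ([(0, 0.5), (1, 1.0)], [0.1, -0.2, 0.3, 0.4]),
        ([(1, 0.0), (0, 2.0)], [-0.1, 0.2, 0.0, 0.5]),
    ]
    print(predict_oblivious_ensemble(np.array([0.7, 1.5]), trees))  # ~0.6
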
Expertium commented 2 weeks ago

@Gilfaro how about this?

  1. Feed interval lengths and ratings into catboost.
  2. Make it output a number between 0.01 and 36500: memory stability.
  3. Use this formula to turn that into a probability:

    def forgetting_curve(self, t, s):
        return 0.9 ** (t / s)

Basically, instead of making catboost predict the probability directly, make it predict memory stability. And if possible, feed the previous value of memory stability into it as well, so that stability(N+1) depends on stability(N), where N is the number of reviews.
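
A rough sketch of that proposal (a hypothetical setup; it glosses over where the stability targets for training would come from, and the output range has to be forced after prediction):

    import numpy as np
    from catboost import CatBoostRegressor

    def forgetting_curve(t, s):
        # R = 0.9 when the elapsed time t equals the stability s.
        return 0.9 ** (t / s)

    def train_stability_model(X_train, stability_targets):
        # X_train would hold interval lengths, ratings, and (if available)
        # the previous predicted stability s(N) for each review.
        model = CatBoostRegressor(loss_function="RMSE", depth=6, verbose=False)
        model.fit(X_train, stability_targets)
        return model

    def predict_retention(model, X, delta_t):
        # No built-in way to clamp the output, so force the range afterwards.
        stability = np.clip(model.predict(X), 0.01, 36500)
        return forgetting_curve(delta_t, stability)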

Gilfaro commented 2 weeks ago
  1. The input has to be of constant shape, and NaNs will get converted to either max or min.
  2. There is no way to clamp the output, unless you force it after prediction.
  3. This is why GRU and all the other NN models fail in the ranking. The formula has no direct correlation with the data you are feeding to the NN / other machine learning, and boosting will fail in the same way. However, if you do binary classification, then they shine, because there is a direct correlation. There is also no way to feed proper stability at training time, and just guessing from delta_t lands you around GRU or HLR. Also, the test is flawed in such a way that it only tests how well you do at binary classification, which is why GRU-P smokes the competition.
Expertium commented 2 weeks ago

Predicting the probability directly (as opposed to using a pre-defined formula and predicting memory stability, which will be used in that formula) does seem to be better based on GRU-P, but the issue is that such algorithms cannot be used for scheduling. If you are developing an algorithm that will actually be used in practice, it needs to have the property that I call "invertibility": the calculation of the probability of recall corresponding to a specific interval length and the calculation of the interval length corresponding to a specific probability of recall are equally easy. It's not necessary for the benchmark, but it is necessary if you want your algorithm to be deployed in apps. We already have plenty of non-invertible algorithms, so ideally, I'd like to see more invertible ones.
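
As a concrete example of what "invertibility" means here: with the exponential curve R = 0.9 ** (t / s), going from an interval to a retention and from a desired retention back to an interval are both one-liners (a sketch, not benchmark code):

    import math

    def retention(t, s):
        # Probability of recall after an interval of t days with stability s.
        return 0.9 ** (t / s)

    def interval_for_retention(r, s):
        # Inverse of the curve: the interval at which retention drops to r.
        return s * math.log(r) / math.log(0.9)

    s = 10.0
    print(retention(10.0, s))              # 0.9
    print(interval_for_retention(0.9, s))  # 10.0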

Comparison table: [image 1] [image 2]

Gilfaro commented 2 weeks ago

That is completely ok, but I would guess you could remove all Invertible algos as they cannot be used as scheduling anyway. And, per the previous message, none of the NNs will be able to reach good values in the tests as long as you use forgetting_curve as their optimization target. I will close the issue, as there isn't a point in implementing another Invertible algo.

Expertium commented 2 weeks ago

I would guess you could remove all Invertible algos as they cannot be used as scheduling anyway.

You mean non-invertible? Invertible ones are the ones that can be used for scheduling.

Expertium commented 2 weeks ago

I'm kinda bummed, I was hoping to see how decision trees compare to neural nets, like GRU.

Gilfaro commented 2 weeks ago

Exactly the same tier: just as bad when you try to match the forgetting_curve. The stats are in the vicinity of GRU and HLR.

Gilfaro commented 2 weeks ago

Just switching the forgetting_curve in GRU from the default to the one that FSRS uses makes it rank close to FSRS. This shows how unfair it is to rank them on different forgetting curves.

Total number of users: 10
Total number of reviews: 255405
Weighted average by reviews:
GRU LogLoss (mean±std): 0.5728±0.2561
GRU RMSE(bins) (mean±std): 0.1946±0.1016
GRU AUC (mean±std): 0.5597±0.0801

Weighted average by log(reviews):
GRU LogLoss (mean±std): 0.6825±0.2708
GRU RMSE(bins) (mean±std): 0.2712±0.1038
GRU AUC (mean±std): 0.5092±0.0803

Weighted average by users:
GRU LogLoss (mean±std): 0.6921±0.2713
GRU RMSE(bins) (mean±std): 0.2788±0.1030
GRU AUC (mean±std): 0.5039±0.0770

Model: GRU
Total number of users: 10
Total number of reviews: 255405
Weighted average by reviews:
GRU LogLoss (mean±std): 0.3945±0.1247
GRU RMSE(bins) (mean±std): 0.0659±0.0373
GRU AUC (mean±std): 0.6806±0.0928

Weighted average by log(reviews):
GRU LogLoss (mean±std): 0.4096±0.1798
GRU RMSE(bins) (mean±std): 0.1035±0.0611
GRU AUC (mean±std): 0.6095±0.1319

Weighted average by users:
GRU LogLoss (mean±std): 0.4113±0.1869
GRU RMSE(bins) (mean±std): 0.1095±0.0626
GRU AUC (mean±std): 0.5947±0.1308
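
For illustration, a minimal sketch of what "swapping the forgetting curve" means for a stability-predicting GRU. This is not the benchmark's actual GRU implementation; the architecture, feature layout, and constants are assumptions.

    import torch
    import torch.nn as nn

    FACTOR = 19 / 81
    DECAY = -0.5

    def exponential_curve(t, s):
        return 0.9 ** (t / s)

    def power_curve(t, s):
        return (1 + FACTOR * t / s) ** DECAY

    class StabilityGRU(nn.Module):
        def __init__(self, input_size=2, hidden_size=16, curve=power_curve):
            super().__init__()
            self.gru = nn.GRU(input_size, hidden_size, batch_first=True)
            self.head = nn.Linear(hidden_size, 1)
            self.curve = curve  # swap this to change the stability->probability mapping

        def forward(self, reviews, delta_t):
            # reviews: (batch, seq_len, input_size); delta_t: (batch,)
            _, h = self.gru(reviews)
            stability = torch.nn.functional.softplus(self.head(h[-1])).squeeze(-1)
            return self.curve(delta_t, stability)
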
Expertium commented 2 weeks ago

Interesting. @L-M-Sherlock, do you think we should use the same forgetting curve for all algorithms (excluding DASH and ACT-R since they do their own thing)?

L-M-Sherlock commented 2 weeks ago

do you think we should use the same forgetting curve for all algorithms

I think the formula of the forgetting curve is also a part of the algorithm. FSRSv3, FSRSv4 and FSRS-4.5 use different forgetting curves.

Gilfaro commented 2 weeks ago

Then it sounds fine, though it isn't mentioned anywhere that most of the algos use 0.9 ** (t / s) instead of (1 + FACTOR * t / s) ** DECAY. Switching would improve most of the algos in the benchmark; even HLR scores some points.
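
For reference, a side-by-side of the two curve shapes being discussed, using FACTOR = 19/81 and DECAY = -0.5 so that both give R = 0.9 at t = s; treat the exact constants as an assumption here rather than the benchmark's values.

    FACTOR = 19 / 81
    DECAY = -0.5

    def exponential_curve(t, s):
        return 0.9 ** (t / s)

    def power_curve(t, s):
        return (1 + FACTOR * t / s) ** DECAY

    # Both curves agree at t = s (R = 0.9); the power curve decays more
    # slowly at long intervals.
    for t in (1, 5, 10, 30, 100):
        print(t, round(exponential_curve(t, 10), 3), round(power_curve(t, 10), 3))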

Also, did the whole dataset get used to derive the FSRS forgetting_curve? If so, then there is test-data contamination for FSRS, and it's basically cheating via pretraining on the test data. This is similar to pretraining a NN on the whole dataset. If this is true, then it is obvious why switching the forgetting_curve on any other algo makes such a huge difference: you are already implicitly training on the test data sets. This is unfixable for FSRS, but to make the testing fair, all of them should cheat in the same way.