open-spaced-repetition / fsrs-when-to-separate-presets


Findings about optimizing at the deck and collection level #1

Closed · L-M-Sherlock closed this 7 months ago

L-M-Sherlock commented 10 months ago

@Expertium, I did an experiment in your collection:

https://github.com/open-spaced-repetition/fsrs-when-to-separate-presets/blob/main/split-vs-concat-top-to-bottom.ipynb

It reduces RMSE(bins) by 16% when optimizing at the deck level. I guess you would be interested.

Expertium commented 10 months ago

`next_day_starts_at = 4`: it's 5. It's not super important, but still. Can you explain what this code is doing? I'm having a hard time understanding it. Also, it seems like you've added some new metrics. I'm assuming E50 is the median error (based on bins) and E90 is the 90th percentile (also based on bins). What's ICI?
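For concreteness, a minimal sketch of how E50 and E90 could be computed under that bin-based reading (these definitions are just my guess from the names, not confirmed anywhere in this thread):

```python
import numpy as np

# Hypothetical per-bin calibration data; in practice these would come from
# the same bins used for RMSE(bins).
predicted = np.array([0.70, 0.80, 0.90, 0.95])  # mean predicted retention per bin
observed = np.array([0.65, 0.83, 0.88, 0.97])   # observed retention per bin

abs_error = np.abs(predicted - observed)
e50 = np.percentile(abs_error, 50)  # median absolute error across bins
e90 = np.percentile(abs_error, 90)  # 90th-percentile absolute error
```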

L-M-Sherlock commented 10 months ago

What's ICI?

You can see this issue:

Expertium commented 10 months ago

Ok, but what about the code itself? Does it just run the optimizer on every single deck?

Expertium commented 10 months ago

`ici = np.mean(np.abs(observation - p))`

I don't think that's what the paper suggests. In the paper, the values are weighted by the empirical density function of the predicted probabilities.

Expertium commented 10 months ago

So basically, right now we are using the number of reviews in each bin as weights. For ICI, we should use probability density. I think I could do that with FFTKDE; I'll try to tinker with it later, and maybe I'll submit a PR.

L-M-Sherlock commented 10 months ago

I don't think that's what the paper suggests.

Did you check the appendix of the paper?


So basically, right now we are using the number of reviews in each bin as weights.

ICI doesn't require any bins.

L-M-Sherlock commented 10 months ago

Ok, but what about the code itself? Does it just run the optimizer on every single deck?

It selects the decks containing >= 1000 reviews, generates deck-level parameters for each one, and predicts them one by one. We can evaluate the average error after joining the predictions. Then it optimizes FSRS on the joined dataset and evaluates it with the collection-level parameters.
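A rough sketch of the comparison (`decks`, `optimize_fsrs`, `predict_probs`, and `rmse_bins` are hypothetical stand-ins for the notebook's actual code):

```python
import numpy as np
import pandas as pd

MIN_REVIEWS = 1000

# `decks` maps deck name -> that deck's review log.
big_decks = {name: df for name, df in decks.items() if len(df) >= MIN_REVIEWS}

# Deck level: optimize and predict per deck, then join the predictions.
preds, labels = [], []
for name, df in big_decks.items():
    params = optimize_fsrs(df)               # per-deck parameters
    preds.append(predict_probs(df, params))  # per-deck predictions
    labels.append(df["y"].to_numpy())        # 1 = recalled, 0 = forgotten
split_rmse = rmse_bins(np.concatenate(preds), np.concatenate(labels))

# Collection level: optimize once on the joined data, evaluate on the same reviews.
joined = pd.concat(big_decks.values())
params = optimize_fsrs(joined)
concat_rmse = rmse_bins(predict_probs(joined, params), joined["y"].to_numpy())
```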

Expertium commented 10 months ago

Did you check the appendix of the paper?

That's weird; in the paper they clearly say that it should be weighted.

ICI doesn't require any bins.

I meant RMSE, sorry, my wording wasn't clear. I was trying to say "RMSE uses the number of reviews in each bin as weights, but since ICI is continuous, it should use a continuous counterpart: probability density".

L-M-Sherlock commented 10 months ago

The probability density is already present in the array of `p` values.

Expertium commented 10 months ago

`p` is the predicted probability; `observation` is smoothed using lowess. What I'm saying is that, if I interpreted the paper correctly, then instead of this: `ici = np.mean(np.abs(observation - p))` it should be this: `ici = np.average(np.abs(observation - p), weights=pdf(p))` where `pdf(p)` is an empirical probability density function. Remember, not all values of `p` are equally likely to occur. This is why bins are used for RMSE.
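A sketch of what I mean, using scipy's `gaussian_kde` as a stand-in for the FFTKDE idea above (just an illustration, not the notebook's code):

```python
import numpy as np
from scipy.stats import gaussian_kde

def weighted_ici(p: np.ndarray, observation: np.ndarray) -> float:
    """Weight the absolute calibration error by an empirical density
    estimate of the predicted probabilities."""
    pdf = gaussian_kde(p)  # kernel density estimate fitted on the predictions
    return np.average(np.abs(observation - p), weights=pdf(p))
```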

L-M-Sherlock commented 10 months ago

observation is smoothed using lowess

Here, lowess has already applied the pdf to `observation`, because lowess is locally weighted scatterplot smoothing.

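For reference, a sketch of the unweighted computation being discussed, with statsmodels' lowess producing the smoothed `observation` curve (the notebook's exact call may differ):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def ici(p: np.ndarray, y: np.ndarray) -> float:
    """ICI as in the quoted code: smooth the binary outcomes y against the
    predicted probabilities p, then average the absolute differences.
    (The E50/E90 mentioned earlier would be percentiles of the same
    absolute differences.)"""
    # return_sorted=False keeps the smoothed values aligned with the input order
    observation = lowess(y, p, return_sorted=False)
    return np.mean(np.abs(observation - p))
```
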
Expertium commented 10 months ago

Ok, my bad then.

giacomoran commented 10 months ago

The paper is correct, independently of whether lowess is used for $f$.

Using the notation from the paper, we don't know $\phi$. We can only observe an empirical distribution $\hat{\Phi}_n$ from the predicted probabilities. There is a result that says that

$\mathbb{E}_{\hat{\Phi}_n}[f(X)] = \frac{1}{n}\displaystyle\sum_{i=1}^n f(x_i)$

where $X \sim P$ and $x_1, \dots, x_n$ are observations from $P$.

In our case, the lhs is $\displaystyle\int_0^1 f(x) d\hat{\Phi}_n$ which approximates $\displaystyle\int_0^1 f(x) \phi(x) dx$; the rhs is `np.mean(np.abs(observation - p))`.

See for example https://math.stackexchange.com/q/1267634
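A quick numerical sanity check of that result (my own illustration, with $\phi$ taken to be a Beta(2, 5) density):

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
f = lambda t: np.abs(t - 0.5)     # an arbitrary test function

x = rng.beta(2, 5, size=100_000)  # x_i ~ phi
sample_mean = f(x).mean()         # (1/n) * sum_i f(x_i)

# Riemann approximation of the integral of f(t) * phi(t) over [0, 1]
grid = np.linspace(0.0, 1.0, 100_001)
integral = np.mean(f(grid) * beta.pdf(grid, 2, 5))

print(sample_mean, integral)      # the two values agree to a few decimals
```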

Expertium commented 10 months ago

@L-M-Sherlock there is something I want you to investigate. Try selecting different thresholds, like 1000 reviews, 2000 reviews, 4000 reviews, etc., and seeing how well FSRS performs if all subdecks with <threshold reviews inherit the parent's parameters. The goal is to see whether there is such a thing as an optimal threshold. If the threshold is too low, it may not be a good idea to run FSRS on all decks, since a lot of them will have very few reviews, and we know that RMSE decreases as n(reviews) increases. But if the threshold is too high, we might end up grouping together decks with very different material. So there probably exists an optimal threshold.
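A sketch of the sweep I have in mind (`decks`, `optimize_fsrs`, `predict_probs`, `rmse_bins`, and `parent_params` are hypothetical stand-ins for the actual code):

```python
import numpy as np

for threshold in (1000, 2000, 4000, 8000):
    preds, labels = [], []
    for name, df in decks.items():
        if len(df) >= threshold:
            params = optimize_fsrs(df)    # enough reviews: own parameters
        else:
            params = parent_params(name)  # too few: inherit from the parent
        preds.append(predict_probs(df, params))
        labels.append(df["y"].to_numpy())
    print(threshold, rmse_bins(np.concatenate(preds), np.concatenate(labels)))
```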

L-M-Sherlock commented 10 months ago

Try selecting different thresholds, like 1000 reviews, 2000 reviews, 4000 reviews, etc., and seeing how well FSRS performs if all subdecks with <threshold reviews inherit the parent's parameters.

Assuming the threshold is 1000, the decks and their sizes are:

| deck | size |
|------|------|
| A::1 | 1000 |
| A::2 | 2000 |
| A::3 | 500 |

How should we separate them? Which parameters should A::3 use?

Expertium commented 10 months ago

If A::3 has a parent deck, it should use the parameters of the parent deck. If not, then it should use the global parameters, which can be obtained by running the optimizer on the entire collection.
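As a sketch, that resolution rule could look like this (`deck_params` holds parameters only for decks that met the threshold; deck names use Anki's `::` hierarchy):

```python
def resolve_params(deck: str, deck_params: dict, global_params):
    """Walk up the :: hierarchy: use the nearest ancestor (or the deck itself)
    with optimized parameters, else fall back to the collection-level ones."""
    parts = deck.split("::")
    while parts:
        name = "::".join(parts)
        if name in deck_params:
            return deck_params[name]
        parts.pop()  # try the parent deck next
    return global_params

# resolve_params("A::3", {"A::1": p1, "A::2": p2, "A": pA}, p_global) -> pA
```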

L-M-Sherlock commented 10 months ago

OK. I guess the best way here is to optimize FSRS at every level of the deck hierarchy and save all the parameters in a table for the following tests.