Expertium closed this issue 1 year ago.
Even if the data isn't very useful for obtaining better default parameters, it can be useful for spaced repetition research.
By the way, here is the link to Dae's comment: https://github.com/open-spaced-repetition/fsrs-rs/pull/95#issuecomment-1742512359
@L-M-Sherlock once you and Dae aren't so busy, I suggest working on this. Aside from finding more accurate default parameters, this can also help to benchmark FSRS and other algorithms more accurately. Currently, the benchmark repo has around 70 collections. If that number increased to 1000, that would be amazing.
I wanted to mention the sample size that we need to achieve statistically significant results:
Assuming 10 million Anki users, with a 95% confidence level and 5% margin of error, we need 385 collections.
With 3% margin of error, we need 1067 collections.
At the current sample size (70) and a 95% confidence level, the margin of error is 11.71%.
If you want to play with the values, you can use this online calculator: http://www.raosoft.com/samplesize.html
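The figures above can be reproduced without the online calculator. As a sketch (assuming the standard Cochran formula with a finite-population correction, p = 0.5 for the most conservative estimate, and z = 1.96 for 95% confidence):

```python
import math

Z_95 = 1.96  # z-score for a 95% confidence level

def required_sample_size(population: int, margin: float, z: float = Z_95, p: float = 0.5) -> int:
    """Cochran's formula with finite-population correction."""
    n0 = z**2 * p * (1 - p) / margin**2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

def margin_of_error(population: int, n: int, z: float = Z_95, p: float = 0.5) -> float:
    """Margin of error for a sample of size n from a finite population."""
    fpc = (population - n) / (population - 1)
    return z * math.sqrt(p * (1 - p) / n * fpc)

print(required_sample_size(10_000_000, 0.05))              # 385
print(required_sample_size(10_000_000, 0.03))              # 1067
print(round(margin_of_error(10_000_000, 70) * 100, 2))     # 11.71
```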
I have prepared a sample set of 20k collections. You can extract it with 'tar xaf ...'. It is a random sample of collections with 5000+ revlog entries, so it should contain a mix of older (still active) users and newer users. Entries are pre-sorted in (cid, id) order. Please download a copy, as I'd like to remove it from the current location in a few weeks. You are welcome to re-host it elsewhere if you wish, but please preserve the LICENSE file if you do so.
That's great, thank you! @L-M-Sherlock
Great! I will update the benchmark tomorrow.
I downloaded and unzipped it. Its size is 56.6 GB. The main problem is that I don't know the timezone or next_day_start_at of these revlogs. Without that info, I can't convert the revlogs into a dataset that FSRS can process.
Dang. I was too focused on ensuring privacy, and forgot about that part. I will need to rebuild the archive.
Ok, I've replaced the archive with a new version. example.py has been updated, and you can now access next_day_at, which can be used to derive the cutoff hour (see RevlogEntry::days_elapsed).
What about the timezone?
next_day_at can be used to determine the day a review log falls on without ever considering timezone or rollover hour. If the Python optimizer requires a timezone + rollover hour, I presume you could feed it UTC, and then determine the rollover hour in UTC based on next_day_at.
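The idea can be sketched in a few lines. This is only an illustration of the assumed semantics (next_day_at is the Unix timestamp, in seconds, of the user's next day rollover at export time, which is how RevlogEntry::days_elapsed appears to use it); it is not the actual optimizer code:

```python
SECS_PER_DAY = 86_400

def day_index(review_ts_secs: int, next_day_at: int) -> int:
    """Number of whole 'Anki days' between a review and the final rollover.

    Two reviews fall on the same Anki day iff they get the same index,
    so no timezone or rollover hour is ever needed.
    """
    return (next_day_at - review_ts_secs) // SECS_PER_DAY

# A review one second before the rollover and one a full day earlier
# land on different days:
cutoff = 1_700_000_000  # example next_day_at (assumed value)
print(day_index(cutoff - 1, cutoff))        # 0
print(day_index(cutoff - 86_401, cutoff))   # 1
```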
I'm coding the pre-processing of the dataset.
The file size of the data has been reduced from 57.5 GB to 13.7 GB. The next step is refactoring the benchmark program.
Maybe rewrite all algorithms (and benchmarking code) in Rust? Of course, the Rust version of FSRS will be slightly different, and the Rust version of LSTM can be different too, but I think with a dataset this big, speed is more important.
> The file size of the data has been reduced from 57.5 GB to 13.7 GB.
@L-M-Sherlock, to reduce the size of the data, I think you would have filtered out many revlog entries, such as manual entries, entries before a Forget, outliers, etc. There is no doubt that this was important for the benchmark experiment. However, I think we should preserve a copy of the dataset without filtering any revlog entries for future research.
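One such filtering pass might look like the following. This is purely a guess at what the pre-processing could do, not the actual code: it assumes revlog entries for a single card are dicts with a 'type' field (where 4 is assumed to mark manual entries, including Forget) and an 'id' millisecond timestamp, pre-sorted by id:

```python
REVLOG_MANUAL = 4  # assumed revlog type code for manual entries (incl. Forget)

def filter_card_revlogs(entries: list[dict]) -> list[dict]:
    """Keep only reviews after the last manual entry for one card.

    A Forget shows up as a manual entry, so the history before it no
    longer reflects the card's memory state and is dropped.
    """
    last_manual = -1
    for i, entry in enumerate(entries):
        if entry["type"] == REVLOG_MANUAL:
            last_manual = i
    return entries[last_manual + 1:]

# Example: the review before the Forget is discarded.
history = [{"type": 1, "id": 1}, {"type": 4, "id": 2}, {"type": 1, "id": 3}]
print(filter_card_revlogs(history))  # [{'type': 1, 'id': 3}]
```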
But my Google Drive doesn't have enough storage space to preserve the copy.
@L-M-Sherlock, since Dae is now working on the Anki 23.12 beta and you have finished benchmarking FSRS v4, please give Dae the new default parameters based on 700+ million reviews.
Which module is related to your feature request? Scheduler, Optimizer
Is your feature request related to a problem? Please describe. I can't find the exact comment by @dae, but I'm sure I saw a comment saying that due to the way Anki licensing works, it's possible to use review data from way more users than just those who submitted their collections for research via the Google Form. So it's possible to run the optimizer on hundreds or even thousands of collections. This could help find the best default parameters. Of course, it's hard to say whether it's practically worth it because of diminishing returns as the number of collections increases.