Expertium closed this issue 1 year ago.
Even if the data isn't very useful for obtaining better default parameters, it can be useful for spaced repetition research.
By the way, here is the link to Dae's comment: https://github.com/open-spaced-repetition/fsrs-rs/pull/95#issuecomment-1742512359
@L-M-Sherlock once you and Dae aren't so busy, I suggest working on this. Aside from finding more accurate default parameters, this can also help to benchmark FSRS and other algorithms more accurately. Currently, the benchmark repo has around 70 collections. If that number increased to 1000, that would be amazing.
I wanted to mention the sample size that we need to achieve statistically significant results:
Assuming 10 million Anki users, with a 95% confidence level and 5% margin of error, we need 385 collections.
With 3% margin of error, we need 1067 collections.
At the current sample size (70) and a 95% confidence level, the margin of error is 11.71%.
If you want to play with the values, you can use this online calculator: http://www.raosoft.com/samplesize.html
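The figures above can be reproduced without the online calculator. As a sketch (assuming the standard Cochran formula with a finite-population correction, p = 0.5 for the most conservative estimate, and z = 1.96 for 95% confidence):

```python
import math

Z_95 = 1.96  # z-score for a 95% confidence level

def required_sample_size(population: int, margin: float, z: float = Z_95, p: float = 0.5) -> int:
    """Cochran's formula with finite-population correction."""
    n0 = z**2 * p * (1 - p) / margin**2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

def margin_of_error(population: int, n: int, z: float = Z_95, p: float = 0.5) -> float:
    """Margin of error for a sample of size n from a finite population."""
    fpc = (population - n) / (population - 1)
    return z * math.sqrt(p * (1 - p) / n * fpc)

print(required_sample_size(10_000_000, 0.05))              # 385
print(required_sample_size(10_000_000, 0.03))              # 1067
print(round(margin_of_error(10_000_000, 70) * 100, 2))     # 11.71
```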
I have prepared a sample set of 20k collections. You can extract it with 'tar xaf ...'. It is a random sample of collections with 5000+ revlog entries, so it should contain a mix of older (still active) users and newer users. Entries are pre-sorted in (cid, id) order. Please download a copy, as I'd like to remove it from the current location in a few weeks. You are welcome to re-host it elsewhere if you wish, but please preserve the LICENSE file if you do so.
That's great, thank you! @L-M-Sherlock
Great! I will update the benchmark tomorrow.
I downloaded and unzipped it. Its size is 56.6 GB. The main problem is that I don't know the timezone or next_day_start_at of these revlogs. Without that info, I can't convert the revlogs into a dataset that FSRS can process.
Dang. I was too focused on ensuring privacy, and forgot about that part. I will need to rebuild the archive.
Ok, I've replaced the archive with a new version. example.py has been updated, and you can now access next_day_at, which can be used to derive the cutoff hour (see RevlogEntry::days_elapsed).
What about the timezone?
next_day_at can be used to determine the day a review log falls on without ever considering timezone or rollover hour. If the Python optimizer requires a timezone + rollover hour, I presume you could feed it UTC, and then determine the rollover hour in UTC based on next_day_at.
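The idea can be sketched in a few lines. This is only an illustration of the assumed semantics (next_day_at is the Unix timestamp, in seconds, of the user's next day rollover at export time, which is how RevlogEntry::days_elapsed appears to use it); it is not the actual optimizer code:

```python
SECS_PER_DAY = 86_400

def day_index(review_ts_secs: int, next_day_at: int) -> int:
    """Number of whole 'Anki days' between a review and the final rollover.

    Two reviews fall on the same Anki day iff they get the same index,
    so no timezone or rollover hour is ever needed.
    """
    return (next_day_at - review_ts_secs) // SECS_PER_DAY

# A review one second before the rollover and one a full day earlier
# land on different days:
cutoff = 1_700_000_000  # example next_day_at (assumed value)
print(day_index(cutoff - 1, cutoff))        # 0
print(day_index(cutoff - 86_401, cutoff))   # 1
```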
I'm coding the pre-processing of the dataset.
The file size of the data has been reduced from 57.5 GB to 13.7 GB. The next step is refactoring the benchmark program.
Maybe rewrite all algorithms (and benchmarking code) in Rust? Of course, the Rust version of FSRS will be slightly different, and the Rust version of LSTM can be different too, but I think with a dataset this big, speed is more important.
> The file size of the data has been reduced from 57.5 GB to 13.7 GB.
@L-M-Sherlock, to reduce the size of the data, I think you would have filtered out many revlog entries, such as manual entries, entries before a Forget, outliers, etc. There is no doubt that this was important for the benchmark experiment. However, I think we should preserve a copy of the dataset without filtering any revlog entries for future research.
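One such filtering pass might look like the following. This is purely a guess at what the pre-processing could do, not the actual code: it assumes revlog entries for a single card are dicts with a 'type' field (where 4 is assumed to mark manual entries, including Forget) and an 'id' millisecond timestamp, pre-sorted by id:

```python
REVLOG_MANUAL = 4  # assumed revlog type code for manual entries (incl. Forget)

def filter_card_revlogs(entries: list[dict]) -> list[dict]:
    """Keep only reviews after the last manual entry for one card.

    A Forget shows up as a manual entry, so the history before it no
    longer reflects the card's memory state and is dropped.
    """
    last_manual = -1
    for i, entry in enumerate(entries):
        if entry["type"] == REVLOG_MANUAL:
            last_manual = i
    return entries[last_manual + 1:]

# Example: the review before the Forget is discarded.
history = [{"type": 1, "id": 1}, {"type": 4, "id": 2}, {"type": 1, "id": 3}]
print(filter_card_revlogs(history))  # [{'type': 1, 'id': 3}]
```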
But my Google Drive doesn't have enough storage space to preserve the copy.
@L-M-Sherlock, since Dae is now working on the Anki 23.12 beta and you have finished benchmarking FSRS v4, please give Dae the new default parameters based on 700+ million reviews.
Which module is related to your feature request? Scheduler, Optimizer
Is your feature request related to a problem? Please describe. I can't find the exact comment by @dae, but I'm sure I saw a comment saying that due to the way Anki licensing works, it's possible to use review data from way more users than just those who submitted their collections for research via the Google Form. So it's possible to run the optimizer on hundreds or even thousands of collections. This could help find the best default parameters. Of course, it's hard to say whether it's practically worth it because of diminishing returns as the number of collections increases.