open-spaced-repetition / srs-benchmark

A benchmark for spaced repetition schedulers/algorithms
https://github.com/open-spaced-repetition/fsrs4anki/wiki

[Feature Request] Train a gradient-boosted decision tree #28

Closed maxencefrenette closed 9 months ago

maxencefrenette commented 11 months ago

Although transformers would probably give the best performance with enough training and hyperparameter tuning, I suspect that a gradient-boosted decision tree ensemble might outperform FSRS with very little tweaking, using a methodology similar to this: https://machinelearningmastery.com/xgboost-for-time-series-forecasting/. It would, however, be a much heavier model, with many more parameters than even the LSTM that was attempted.
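
A minimal sketch of the sliding-window idea from the linked article, applied to review logs: each review becomes one fixed-width row built from the card's previous reviews, which is the tabular shape a tree ensemble expects. The function name `build_lag_features`, the `(card_id, delta_t, rating)` tuple layout, the `-1` padding, and the "rating > 1 means recalled" label are all assumptions for illustration, not something taken from the benchmark code.

```python
from typing import Dict, List, Tuple

# One review as (card_id, delta_t_in_days, rating). This layout is an assumption.
Review = Tuple[int, int, int]


def build_lag_features(
    reviews: List[Review], n_lags: int = 3
) -> Tuple[List[List[int]], List[int]]:
    """Turn per-card review histories into fixed-width rows for a tree model.

    Each row holds the last `n_lags` (delta_t, rating) pairs of the same card,
    left-padded with -1, followed by the current delta_t. The label is 1 when
    the current rating is above 1 (assumed to mean "recalled"), else 0.
    The first review of a card is skipped because it has no history.
    """
    history: Dict[int, List[Tuple[int, int]]] = {}
    features: List[List[int]] = []
    labels: List[int] = []
    for card_id, delta_t, rating in reviews:
        past = history.setdefault(card_id, [])
        if past:
            lags = past[-n_lags:]
            padded = [(-1, -1)] * (n_lags - len(lags)) + lags
            row = [value for pair in padded for value in pair] + [delta_t]
            features.append(row)
            labels.append(1 if rating > 1 else 0)
        past.append((delta_t, rating))
    return features, labels
```

The resulting `features` and `labels` lists could then be passed to any gradient-boosting library; which library and which hyperparameters to use is left open here.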

This is something I'd be interested in exploring if I could have access to the training data.

L-M-Sherlock commented 11 months ago

Here are 10 users' datasets: tiny_dataset.zip

You can use them for testing your model. PR is welcome. I can help you benchmark the model.

maxencefrenette commented 11 months ago

I'll see what sort of results I can get with this. Thanks for the data!

imrryr commented 10 months ago

So, I'm trying to run your script.py with this dataset, and it creates an evaluation directory, but it is empty. (I put the dataset in the dataset directory). Can you help me with the next steps, please? By the way, this is Pavlik, working with Hannah-Joy Simms

Expertium commented 10 months ago

Not sure if that helps, but I use cmd (Windows) and the following command: set DEV_MODE=1 && python script.py

imrryr commented 10 months ago

That doesn't produce changes. I think the problem is that it may not be finding the data, but I'm not sure how to check for that.

Expertium commented 10 months ago

Do you have the fsrs-optimizer repo downloaded too? script.py relies on fsrs_optimizer.py.

import os
import sys

if os.environ.get("DEV_MODE"):
    # for local development: import fsrs_optimizer from a sibling checkout
    sys.path.insert(0, os.path.abspath("../fsrs-optimizer/src/fsrs_optimizer/"))

from fsrs_optimizer import (
    Optimizer,
    Trainer,
    FSRS,
    Collection,
    power_forgetting_curve,
)

imrryr commented 10 months ago

I did it like this, is it right:

PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark> python -m pip install fsrs-optimizer
Collecting fsrs-optimizer
Using cached FSRS_Optimizer-4.20.8-py3-none-any.whl.metadata (4.2 kB)
Requirement already satisfied: matplotlib>=3.7.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (3.8.2)
Requirement already satisfied: numpy>=1.22.4 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (1.26.3)
Requirement already satisfied: pandas>=1.5.3 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (2.1.4)
Requirement already satisfied: pytz>=2022.7.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (2023.3.post1)
Requirement already satisfied: scikit-learn>=1.2.2 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (1.3.2)
Requirement already satisfied: torch>=1.13.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (2.1.2)
Collecting tqdm>=4.64.1 (from fsrs-optimizer)
Using cached tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)
Collecting statsmodels>=0.13.5 (from fsrs-optimizer)
Downloading statsmodels-0.14.1-cp311-cp311-win_amd64.whl.metadata (9.8 kB)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (1.2.0)
Requirement already satisfied: cycler>=0.10 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (4.47.2)
Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (1.4.5)
Requirement already satisfied: packaging>=20.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (23.2)
Requirement already satisfied: pillow>=8 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (10.2.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (3.1.1)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (2.8.2)
Requirement already satisfied: tzdata>=2022.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from pandas>=1.5.3->fsrs-optimizer) (2023.4)
Requirement already satisfied: scipy>=1.5.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn>=1.2.2->fsrs-optimizer) (1.11.4)
Requirement already satisfied: joblib>=1.1.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn>=1.2.2->fsrs-optimizer) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn>=1.2.2->fsrs-optimizer) (3.2.0)
Collecting patsy>=0.5.4 (from statsmodels>=0.13.5->fsrs-optimizer)
Downloading patsy-0.5.6-py2.py3-none-any.whl.metadata (3.5 kB)
Requirement already satisfied: filelock in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (3.13.1)
Requirement already satisfied: typing-extensions in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (4.9.0)
Requirement already satisfied: sympy in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (1.12)
Requirement already satisfied: networkx in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (3.2.1)
Requirement already satisfied: jinja2 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (3.1.3)
Requirement already satisfied: fsspec in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (2023.12.2)
Collecting colorama (from tqdm>=4.64.1->fsrs-optimizer)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Requirement already satisfied: six in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from patsy>=0.5.4->statsmodels>=0.13.5->fsrs-optimizer) (1.16.0)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from jinja2->torch>=1.13.1->fsrs-optimizer) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from sympy->torch>=1.13.1->fsrs-optimizer) (1.3.0)
Downloading FSRS_Optimizer-4.20.8-py3-none-any.whl (25 kB)
Downloading statsmodels-0.14.1-cp311-cp311-win_amd64.whl (9.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.9/9.9 MB 19.1 MB/s eta 0:00:00
Using cached tqdm-4.66.1-py3-none-any.whl (78 kB)
Downloading patsy-0.5.6-py2.py3-none-any.whl (233 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 233.9/233.9 kB 14.0 MB/s eta 0:00:00
Installing collected packages: patsy, colorama, tqdm, statsmodels, fsrs-optimizer
Successfully installed colorama-0.4.6 fsrs-optimizer-4.20.8 patsy-0.5.6 statsmodels-0.14.1 tqdm-4.66.1

Expertium commented 10 months ago

Try running this line in cmd again, and make sure that fsrs-benchmark and fsrs-optimizer have the same parent folder (for example, C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark and C:\Users\ppavl\Dropbox\Active projects\fsrs-optimizer):

set DEV_MODE=1 && python script.py

If that doesn't work, then I don't know; you'll have to wait for LMSherlock to respond.

L-M-Sherlock commented 10 months ago

So, I'm trying to run your script.py with this dataset, and it creates an evaluation directory, but it is empty. (I put the dataset in the dataset directory). Can you help me with the next steps, please? By the way, this is Pavlik, working with Hannah-Joy Simms

Did you see the result directory?

image

imrryr commented 10 months ago

Yes, it was there from the start. It is unchanged after running the script

L-M-Sherlock commented 10 months ago

Could you paste the output of script displayed in the terminal?

imrryr commented 10 months ago

Yes, but it is blank:

PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark> $env:DEV_MODE="1"; python script.py
PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark>

and

PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark> set DEV_MODE=1
PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark> python script.py
PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark>
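
The two attempts above are not necessarily equivalent. `set DEV_MODE=1` is cmd syntax, and in PowerShell `set` is, to my knowledge, an alias for `Set-Variable`, so it probably does not create an environment variable, whereas `$env:DEV_MODE="1"` does. A quick way to check what Python actually sees is a helper like the one below. `dev_mode_enabled` is a hypothetical name; it only mirrors the `os.environ.get("DEV_MODE")` test quoted earlier in this thread.

```python
import os
from typing import Mapping, Optional


def dev_mode_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when DEV_MODE is set to a non-empty value.

    Mirrors the `if os.environ.get("DEV_MODE"):` check quoted in this thread.
    Passing a mapping makes the check testable without touching the real
    process environment.
    """
    env = os.environ if environ is None else environ
    return bool(env.get("DEV_MODE"))
```

Running `python -c "import os; print(repr(os.environ.get('DEV_MODE')))"` in the same shell session shows the raw value, which would be `None` if the variable never reached the process.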

L-M-Sherlock commented 10 months ago

Weird. Nothing happened after the execution? I'm sorry I can't help you because I don't have a Windows device.

L-M-Sherlock commented 10 months ago

Could you check the file path of your dataset?

imrryr commented 10 months ago

You can see it on the left. I wasn't sure of the format, so I offered the tiny dataset as CSV, in the folder, and as a zip.

[image: script py - fsrs-benchmark - Visual Studio Code 1_12_2024 9_50_18 AM]

L-M-Sherlock commented 10 months ago

It's weird. Could you add print(os.getcwd()) below if __name__ == "__main__":? I guess it's a path-related problem.

imrryr commented 10 months ago

It says: C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark

L-M-Sherlock commented 10 months ago

Maybe you can print(unprocessed_files) to check whether the dataset has been read.

imrryr commented 10 months ago

So, in my setup, the script wasn't overwriting the old results directory that came with the GitHub repo. I renamed that directory to results2, and now the script creates the results directory as expected. I'll likely have some questions, so I'll send you an email unless you prefer that I post them here as new issues.
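
One behaviour that would be consistent with this fix: if the script skips every user that already has a result file, a results directory shipped with the repo makes the run a silent no-op. The sketch below illustrates that idea only. The name `find_unprocessed`, the `.csv` and `.json` suffixes, and the matching by file stem are assumptions, not the actual logic of script.py.

```python
from pathlib import PurePath
from typing import Iterable, List, Set


def find_unprocessed(
    dataset_files: Iterable[str],
    result_files: Iterable[str],
    dataset_suffix: str = ".csv",  # assumed dataset file type
    result_suffix: str = ".json",  # assumed result file type
) -> List[str]:
    """Return dataset stems that have no result file with the same stem."""

    def stems(names: Iterable[str], suffix: str) -> Set[str]:
        # Keep only files with the expected suffix and strip that suffix.
        return {PurePath(n).stem for n in names if PurePath(n).suffix == suffix}

    pending = stems(dataset_files, dataset_suffix) - stems(result_files, result_suffix)
    return sorted(pending)
```

Calling it as `find_unprocessed(os.listdir("dataset"), os.listdir("result"))` (directory names assumed) and printing the return value shows at a glance whether anything is left to process. An empty list would explain a run that prints nothing.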

Expertium commented 10 months ago

@imrryr how's the progress?

imrryr commented 10 months ago

Well, pretty good. I'm trying to get some appropriate data to compare this with some of our methods (e.g. https://scholar.google.com/citations?view_op=view_citation&hl=en&user=Ye48zsYAAAAJ&sortby=pubdate&citation_for_view=Ye48zsYAAAAJ:iyewoVqAXLQC ). I contacted Dae and am also looking at the MaiMemo data. I'm a little confused now since I realize I don't know the formal relationship of FSRS 4.5 and SSP-MMC. I'd be happy if someone could explain that... @Expertium

Could one simply use the MaiMemo data with the FSRS 4.5 algorithm? @L-M-Sherlock

L-M-Sherlock commented 10 months ago

I'm a little confused now since I realize I don't know the formal relationship of FSRS 4.5 and SSP-MMC

Both are based on the DSR model. But in SSP-MMC, the difficulty of cards is predetermined, because we have millions of users learning the same set of vocabulary.

Could one simply use the MaiMemo data with the FSRS 4.5 algorithm?

It's hard because the MaiMemo data doesn't contain every user's entire review data.

imrryr commented 10 months ago

@L-M-Sherlock OK, got it. So, DSR = difficulty, stability, retrievability... So when I unpack the SSP-MMC notation in your paper, I will see that it corresponds closely with the FSRS model, except that the difficulties are fixed in the SSP-MMC method? Also, I got the full data, so I may have more questions as I move forward on this with Hannah.

Expertium commented 10 months ago

My bad, imrryr. All this time I thought you were the person who is implementing a decision tree algorithm. @maxencefrenette any progress?

imrryr commented 10 months ago

@L-M-Sherlock I am looking at the revlog format in the data archive. Do you have existing code to convert it to your CSV format? I guess I need to do that.

L-M-Sherlock commented 10 months ago

Do you have existing code to convert it to your CSV format? I guess I need to do that.

Do you mean this?

https://github.com/open-spaced-repetition/fsrs-optimizer/blob/8ce183629bdd56cf6a4eced66df121caecaef92e/src/fsrs_optimizer/fsrs_optimizer.py#L476-L693

imrryr commented 10 months ago

@L-M-Sherlock Maybe I do, but the format this code creates is different from the one in the dataset folder. Do you know how to convert it into the format needed for input? E.g.:

card_id,review_th,delta_t,rating
0,1,-1,3
0,2,0,3
0,3,4,3

Can you elaborate on how to get to this final format? I may be able to write the code from what you sent already, but help is appreciated.

Also, review_th: is this the order the cards occurred in? And delta_t: is this the time between reviews of a card (with 0 indicating less than a day)?
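
A minimal sketch of one reading of these columns that agrees with the three sample rows: `review_th` as the 1-based position of a review in the user's whole history, and `delta_t` as whole days since the previous review of the same card, with -1 for a card's first review. This interpretation, the `(card_id, unix_seconds, rating)` input layout, and the `day_offset_s` parameter are assumptions. This is not the code of revlogs2dataset.py.

```python
from typing import Dict, List, Tuple

SECONDS_PER_DAY = 86_400

RevlogEntry = Tuple[int, int, int]       # (card_id, unix_seconds, rating), assumed layout
DatasetRow = Tuple[int, int, int, int]   # (card_id, review_th, delta_t, rating)


def to_dataset_rows(revlog: List[RevlogEntry], day_offset_s: int = 0) -> List[DatasetRow]:
    """Derive review_th and delta_t from raw timestamps.

    review_th: 1-based rank of the review across all cards, by time.
    delta_t:   difference in day index to the card's previous review,
               or -1 when the card has no previous review.
    day_offset_s shifts where a "day" starts (hypothetical parameter).
    """
    ordered = sorted(revlog, key=lambda entry: entry[1])
    last_day: Dict[int, int] = {}
    rows: List[DatasetRow] = []
    for review_th, (card_id, timestamp, rating) in enumerate(ordered, start=1):
        day = (timestamp - day_offset_s) // SECONDS_PER_DAY
        delta_t = day - last_day[card_id] if card_id in last_day else -1
        last_day[card_id] = day
        rows.append((card_id, review_th, delta_t, rating))
    # Group rows by card, as in the sample file.
    rows.sort(key=lambda row: (row[0], row[1]))
    return rows
```

Note that under this reading, the choice of day boundary changes `delta_t` for reviews made close to midnight, so two tools with different day-start settings can disagree by a day on some rows.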

L-M-Sherlock commented 10 months ago

Can you elaborate on how to get to this final format? I may be able to write the code from what you sent already, but help is appreciated.

The code used to generate data in that format is here: https://github.com/open-spaced-repetition/fsrs-benchmark/blob/main/revlogs2dataset.py

imrryr commented 10 months ago

So, this code seemed to work at first, but it doesn't produce the same results as the tiny dataset had. It's weirdly similar, with the number of card_ids and the length the same, just corrupted review_th and delta_t. For example, the correct file:

card_id,review_th,delta_t,rating
0,1,-1,3
0,2,0,3
0,3,4,3
0,163,6,4
0,237,1,2
0,380,11,4
1,4,-1,3
1,14,0,1
1,16,0,1
1,21,0,3
1,30,0,3
1,111,2,3
1,160,4,4
1,340,8,3

The output I get from revlogs2dataset.py:

card_id,review_th,delta_t,rating
card_id,review_th,delta_t,rating
0,4863,-1,3
0,4864,0,3
0,4997,4,3
0,5846,5,4
0,6105,2,2
0,6745,10,4
1,4998,-1,3
1,5008,0,1
1,5010,0,1
1,5015,0,3
1,5024,0,3
1,5276,1,3
1,5843,4,4
1,6371,9,3
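
To pin down where two such files diverge, a small comparison helper can count mismatches per column and check whether the relative review order is preserved. This is only a sketch. The function names are hypothetical, and skipping a repeated header row is an assumption based on the doubled header in the output above.

```python
import csv
import io
from typing import Dict, List

COLUMNS = ["card_id", "review_th", "delta_t", "rating"]


def read_rows(text: str) -> List[Dict[str, int]]:
    """Parse dataset CSV text into integer rows, skipping repeated header rows."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["card_id"] == "card_id":  # a second header line, as in the output above
            continue
        rows.append({column: int(row[column]) for column in COLUMNS})
    return rows


def column_mismatches(expected: List[Dict[str, int]], actual: List[Dict[str, int]]) -> Dict[str, int]:
    """Count, per column, how many rows differ between two equally long files."""
    if len(expected) != len(actual):
        raise ValueError("files have different numbers of rows")
    return {c: sum(e[c] != a[c] for e, a in zip(expected, actual)) for c in COLUMNS}


def same_review_order(expected: List[Dict[str, int]], actual: List[Dict[str, int]]) -> bool:
    """True when sorting by review_th yields the same row order in both files."""
    def order(rows: List[Dict[str, int]]) -> List[int]:
        return sorted(range(len(rows)), key=lambda i: rows[i]["review_th"])
    return order(expected) == order(actual)
```

On the first four rows quoted above, this reports differences in every `review_th` value and in one `delta_t` value, while `card_id`, `rating`, and the relative order all match.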

L-M-Sherlock commented 10 months ago

So, this code seemed to work at first, but doesn't produce the same results as the tiny dataset had.

Please open a new issue to report the details. I hope you can share the revlogs file and your script code.

Expertium commented 9 months ago

Well that's a bummer. Why did you close it?

L-M-Sherlock commented 9 months ago

Because I don't plan to implement the model and I have shared the dataset with the creator of this issue.

Expertium commented 9 months ago

Yeah, but did the creator of the issue himself say that he's not planning to work on it?

maxencefrenette commented 9 months ago

Hi all, I'm still working on this, but progress is slow since I don't have a ton of time to spend on it. I got what I wanted out of this issue, which is a public subset of the data. Thanks a lot for that. I'm okay with closing this; I don't need the issue to be open to work on it.

Expertium commented 9 months ago

@maxencefrenette I think it's best to keep the number of trainable parameters around 500-600, since that's roughly how many parameters our LSTM and Transformer have. Ideally, we want to see how much architecture affects the results. If the number of parameters across different algorithms is similar, then we can clearly see which architecture is superior.
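
For a tree ensemble, "number of parameters" is not as well defined as for a neural network, so any budget depends on a counting convention. The sketch below uses one possible convention, stated as an assumption: every internal split contributes two numbers (feature index and threshold) and every leaf contributes one value. Both function names are hypothetical.

```python
def gbdt_param_count(n_trees: int, leaves_per_tree: int) -> int:
    """Rough parameter count for an ensemble of binary trees.

    Assumed convention: a tree with L leaves has L - 1 splits, each storing
    a feature index and a threshold, plus L leaf values: 2 * (L - 1) + L.
    """
    if n_trees < 0 or leaves_per_tree < 1:
        raise ValueError("need n_trees >= 0 and leaves_per_tree >= 1")
    per_tree = 2 * (leaves_per_tree - 1) + leaves_per_tree
    return n_trees * per_tree


def max_trees_within_budget(budget: int, leaves_per_tree: int) -> int:
    """Largest number of trees whose rough parameter count fits the budget."""
    return budget // gbdt_param_count(1, leaves_per_tree)
```

Under this convention, a budget of 600 allows 27 trees with 8 leaves each (594 numbers). A different convention, for example counting only thresholds and leaf values, would give a different limit.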

Expertium commented 9 months ago

@maxencefrenette Hello again! LMSherlock and I have redefined RMSE and are finishing benchmarking the algorithms again. If you still want to participate (and I hope you do), now is a good time.