open-spaced-repetition / fsrs-rs

FSRS for Rust, including Optimizer and Scheduler
https://crates.io/crates/fsrs
BSD 3-Clause "New" or "Revised" License

[TODO] feature: build dataset from Anki's sqlite database #3

Closed L-M-Sherlock closed 1 year ago

L-M-Sherlock commented 1 year ago

Python implementation: https://github.com/open-spaced-repetition/fsrs-optimizer/blob/95694b787bb71ac9883db1201af09e334ee5ee0b/src/fsrs_optimizer/fsrs_optimizer.py#L319-L449

L-M-Sherlock commented 1 year ago

DataFrame library for Rust: https://docs.rs/polars/latest/polars/

dae commented 1 year ago

I wonder if this could perhaps be accomplished using standard Rust iterators/structures and SQL? Eg fetching the data from the revlog table in the desired order, and collecting it into a Vec<(CardId, Vec<RevlogEntry>)> or similar. Upsides would be avoiding the addition of another non-trivial dependency, and the use of typed structs would make the code somewhat more maintainable than looking up columns by string keys. I'm not sure how easily all of those dataframe calls could be translated though, so I don't know if it's practical or not.
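Something like the following sketch, for illustration only - it assumes the rusqlite crate and Anki's revlog column order (id, cid, usn, ease, ivl, lastIvl, factor, time, type); the names are not meant as a final API:

use rusqlite::{Connection, Result};

#[derive(Debug)]
pub struct RevlogEntry {
    pub id: i64,
    pub cid: i64,
    pub usn: i64,
    pub button_chosen: i64,
    pub interval: i64,
    pub last_interval: i64,
    pub ease_factor: i64,
    pub taken_millis: i64,
    pub review_kind: i64,
}

/// Read the revlog sorted by (cid, id) and group consecutive rows per card.
pub fn revlog_by_card(conn: &Connection) -> Result<Vec<(i64, Vec<RevlogEntry>)>> {
    let mut stmt = conn.prepare("SELECT * FROM revlog ORDER BY cid, id")?;
    let rows = stmt.query_map([], |row| {
        Ok(RevlogEntry {
            id: row.get(0)?,
            cid: row.get(1)?,
            usn: row.get(2)?,
            button_chosen: row.get(3)?,
            interval: row.get(4)?,
            last_interval: row.get(5)?,
            ease_factor: row.get(6)?,
            taken_millis: row.get(7)?,
            review_kind: row.get(8)?,
        })
    })?;

    // Rows arrive sorted by cid, so consecutive rows with the same cid belong to one card.
    let mut grouped: Vec<(i64, Vec<RevlogEntry>)> = Vec::new();
    for entry in rows {
        let entry = entry?;
        if grouped.last().map(|(cid, _)| *cid) == Some(entry.cid) {
            grouped.last_mut().unwrap().1.push(entry);
        } else {
            grouped.push((entry.cid, vec![entry]));
        }
    }
    Ok(grouped)
}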

L-M-Sherlock commented 1 year ago

OK, I will try. Maybe two main loops will be enough to implement it. I use pandas in Python only for efficiency and convenience.

L-M-Sherlock commented 1 year ago

I have tried to query revlog from Anki db in 5a051e9834abfa8309a50e2aadb751c46b0daa3c.

RevlogEntry {
    id: 1528956429777,
    cid: 1528900957309,
    usn: 1282,
    button_chosen: 4,
    interval: 3,
    last_interval: -60,
    ease_factor: 2500,
    taken_millis: 3918,
    review_kind: 0,
}

The next step is to rewrite this code:

import time

import pandas as pd

# `revlog` holds the rows queried from Anki's revlog table
df = pd.DataFrame(revlog)
df.columns = ['id', 'cid', 'usn', 'button_chosen', 'interval', 'last_interval', 'ease_factor', 'taken_millis', 'review_kind']

df = df[(df['cid'] <= time.time() * 1000) &
        (df['id'] <= time.time() * 1000)].copy() # remove revlog with incorrect timestamp

df_set_due_date = df[(df['review_kind'] == 4) & (df['interval'] > 0)]
df.drop(df_set_due_date.index, inplace=True) # remove revlog generated by `set due date`

df.sort_values(by=['cid', 'id'], inplace=True, ignore_index=True) # sort revlog by card_id, review_time

df['is_learn_start'] = (df['review_kind'] == 0) & (df['review_kind'].shift() != 0)  # find out the first review for each card
df['sequence_group'] = df['is_learn_start'].cumsum() # if the user never used `forget` on a card, the revlogs from the same card should have the same `sequence_group`
last_learn_start = df[df['is_learn_start']].groupby('cid')['sequence_group'].last() # get the `sequence_group` for each card's `last_learn_start`
df['last_learn_start'] = df['cid'].map(last_learn_start).fillna(0).astype(int)
df['mask'] = df['sequence_group'] >= df['last_learn_start'] # remove the revlogs which happen before the `last_learn_start`
df = df[df['mask'] == True].copy()

df = df[(df['review_kind'] != 4)].copy() # remove revlog generated by `reschedule_cards_as_new`
df = df[(df['review_kind'] != 3) | (df['ease_factor'] != 0)].copy() # remove revlog in filtered decks with rescheduling disabled
df['review_kind'] = df['review_kind'] + 1
df.loc[df['is_learn_start'], 'review_kind'] = New # New = 0
df.drop(columns=['is_learn_start', 'sequence_group', 'last_learn_start', 'mask', 'usn', 'interval', 'last_interval', 'ease_factor'], inplace=True)
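A rough per-card sketch of the same filtering in plain Rust, using the RevlogEntry fields above (illustrative only, not a final implementation):

// Illustrative per-card sketch of the pandas filtering above.
// `entries` is one card's revlog, already sorted by id (review time).
fn filter_revlog(mut entries: Vec<RevlogEntry>, now_millis: i64) -> Vec<RevlogEntry> {
    // remove revlog with incorrect timestamps and revlog generated by `set due date`
    entries.retain(|e| {
        e.id <= now_millis
            && e.cid <= now_millis
            && !(e.review_kind == 4 && e.interval > 0)
    });
    // keep only the reviews since the last learning start
    let last_learn_start = (0..entries.len())
        .rev()
        .find(|&i| entries[i].review_kind == 0 && (i == 0 || entries[i - 1].review_kind != 0))
        .unwrap_or(0);
    entries.drain(..last_learn_start);
    // remove `reschedule_cards_as_new` entries and filtered-deck reviews without rescheduling
    entries.retain(|e| e.review_kind != 4 && (e.review_kind != 3 || e.ease_factor != 0));
    entries
}
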
L-M-Sherlock commented 1 year ago

Done in 19dc456680d950191f1c929c89eeed46a00e7314.

@dae, I think the next step is to determine the API?

You can test the training here:

https://github.com/open-spaced-repetition/fsrs-optimizer-burn/blob/19dc456680d950191f1c929c89eeed46a00e7314/src/training.rs#L132-L146

The file collection.anki21 should be placed in the directory tests/data/.

dae commented 1 year ago

You work fast! :-)

I gave this a quick test on a collection with about 200k revlog entries, and it completed in around 90s. I tried a different collection with about 1M entries, and performance did not scale linearly - at the 8 minute mark it was still early on epoch 2. How have you found performance to compare to training with CPU/GPU in PyTorch? If you weren't already aware, using 'cargo test --release' will make sure it's compiled in release mode which should be faster.

In terms of integration into Anki, I imagine we could have a collection method inside the anki crate that dumps the revlog into a vec of FSRSItems (1). We could then call a method exported by this crate that takes a Vec of those items, and it would process it and return the results. We could call it on a background thread, so users could continue to use Anki for most of the training. Users usually won't have access to a console window, so ideally we'd find some way to poll for the current progress/completion ratio, so the UI can show it. If temporary files need to be written out, it would be nice if the desired folder could be passed into the API call. Any config options that might make sense for the user to adjust should also be passed in to the call, so the frontend can expose a UI to adjust them in the future.
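To make the idea concrete, a hypothetical sketch of the call's shape (all names and types here are placeholders, not this crate's actual exports):

use std::path::PathBuf;
use std::sync::{Arc, Mutex};

pub struct FSRSItem; // placeholder for the crate's real item type

pub struct TrainingConfig {
    pub num_epochs: usize,
    pub temp_dir: Option<PathBuf>, // where any temporary files may be written
}

#[derive(Default)]
pub struct TrainingProgress {
    pub epoch: usize,
    pub total_epochs: usize,
    pub finished: bool,
}

// Anki dumps the revlog into `items`, calls this on a background thread, and
// polls `progress` from the UI to display a completion ratio.
pub fn compute_parameters(
    items: Vec<FSRSItem>,
    config: TrainingConfig,
    progress: Arc<Mutex<TrainingProgress>>,
) -> Result<Vec<f32>, String> {
    let _ = (items, config, progress); // training would run here
    Ok(Vec::new())
}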

How are you feeling about the state of this at this point? Does it look like using burn might be a viable alternative to the current pytorch approach, at least if we can solve the clipping problem in #4?

(1) Re FSRSItem, is the length of t_history and r_history always the same? If so, perhaps it would make things clearer to store it as a `reviews: Vec<Review>`, where Review is a struct in the form `{ rating: i32, delta_time: i32 }`? Then when it comes time to create a tensor, it could be done like

            Data::new(
                items.iter().map(|i| i.rating).collect(),
                Shape::new([items.len()]),
            )
dae commented 1 year ago

Oh, and I checked the resulting size of a release build, and it's about 11-14MB (including sqlite). Much more viable than tch-rs!

I also played around with the wgpu backend, but it was much slower than ndarray on the Linux and M2 systems I tested with.

dae commented 1 year ago

so ideally we'd find some way to poll for the current progress/completion ratio

If you can provide a synchronous/blocking code example that prints some text for each iteration/epoch, I can convert it so that it writes the progress to a shared data structure protected by a mutex, which we could pass in to the API call.
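For illustration, a minimal sketch of that pattern (not the real training loop): the trainer updates a shared struct after each epoch, and another thread polls it:

use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

#[derive(Default)]
struct Progress {
    epoch: usize,
    total_epochs: usize,
    finished: bool,
}

fn main() {
    let progress = Arc::new(Mutex::new(Progress { total_epochs: 5, ..Default::default() }));

    let shared = Arc::clone(&progress);
    let trainer = thread::spawn(move || {
        for epoch in 1..=5 {
            // one epoch of training would run here
            thread::sleep(Duration::from_millis(200));
            shared.lock().unwrap().epoch = epoch;
        }
        shared.lock().unwrap().finished = true;
    });

    // The UI (or a test) polls the shared state instead of reading console output.
    loop {
        let (epoch, total, finished) = {
            let p = progress.lock().unwrap();
            (p.epoch, p.total_epochs, p.finished)
        };
        println!("epoch {}/{}", epoch, total);
        if finished {
            break;
        }
        thread::sleep(Duration::from_millis(100));
    }

    trainer.join().unwrap();
}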

L-M-Sherlock commented 1 year ago

I gave this a quick test on a collection with about 200k revlog entries, and it completed in around 90s. I tried a different collection with about 1M entries, and performance did not scale linearly - at the 8 minute mark it was still early on epoch 2.

Your observation matches my expectation. The performance doesn't scale linearly. The reason is:

  1. For a card with 10 revlogs, the convertor will generate ~10 FSRSItems.
  2. For these FSRSItems, the lengths of t_history/r_history range from 1 to ~10.
  3. In the training stage, the time complexity is linear in the seq_len.

So, for a card with n revlogs, the total cost is 1+2+3+...+n = n(n+1)/2, i.e. O(n^2).

Consider two extreme cases: a collection with n revlogs where each card has only one revlog, and a collection with a single card that contains all n revlogs. The computation costs would be O(n) and O(n^2), respectively.

L-M-Sherlock commented 1 year ago

In terms of integration into Anki, I imagine we could have a collection method inside the anki crate that dumps the revlog into a vec of FSRSItems (1). We could then call a method exported by this crate that takes a Vec of those items, and it would process it and return the results.

I have implemented a method to convert revlogs to FSRSItems:

https://github.com/open-spaced-repetition/fsrs-optimizer-burn/blob/19dc456680d950191f1c929c89eeed46a00e7314/src/convertor.rs#L210-L221

But it has to convert the revlogs card by card. If a card's revlogs are incomplete, its data is invalid and gets removed.
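For reference, a simplified illustration of what that per-card conversion produces (hypothetical helper types; the real code is in the convertor linked above): every review after the first becomes one item whose history is the full prefix of the earlier intervals and ratings.

struct SimpleReview {
    rating: i32,
    delta_t: i32, // days since the previous review
}

struct SimpleItem {
    t_history: Vec<i32>,
    r_history: Vec<i32>,
    delta_t: i32,
    label: f32, // 1.0 = recalled, 0.0 = forgot
}

fn card_to_items(reviews: &[SimpleReview]) -> Vec<SimpleItem> {
    (1..reviews.len())
        .map(|i| SimpleItem {
            t_history: reviews[..i].iter().map(|r| r.delta_t).collect(),
            r_history: reviews[..i].iter().map(|r| r.rating).collect(),
            delta_t: reviews[i].delta_t,
            label: if reviews[i].rating > 1 { 1.0 } else { 0.0 },
        })
        .collect()
}

This is also why the cost per card grows quadratically with its number of revlogs, as discussed above.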

L-M-Sherlock commented 1 year ago

How are you feeling about the state of this at this point? Does it look like using burn might be a viable alternative to the current pytorch approach, at least if we can solve the clipping problem in #4?

I'm pretty satisfied with burn, except for the over-encapsulation of the training process. I couldn't implement my own training loop. Fortunately, they plan to support it (https://github.com/burn-rs/burn/issues/662#issuecomment-1684280084).

Once that is supported, continuous learning will be feasible. For old collections, the full training only needs to run once. After that, the FSRS model could be trained on each review (or a batch of reviews) during subsequent learning.

L-M-Sherlock commented 1 year ago

(1) Re FSRSItem, is the length of t_history and r_history always the same? If so, perhaps it would make things clearer to store it as a reviews: Vec<Review>, where Review is a struct in the form { rating: i32; delta_time: i32; };? Then when it comes time to create a tensor, it could be done like

Yeah, len(t_history) == len(r_history) is always true. But if we store it as `reviews: Vec<Review>`, then we will get a tensor with shape [seq_len, batch_size, 2], i.e. a three-dimensional tensor. I don't know how difficult it is to deal with a 3D tensor in the forward pass. It's easy in PyTorch, but hard in Burn (partly due to my limited knowledge of Burn and Rust).

https://github.com/open-spaced-repetition/fsrs-optimizer-burn/blob/19dc456680d950191f1c929c89eeed46a00e7314/src/model.rs#L100-L119

dae commented 1 year ago

I may be missing something, but couldn't this perhaps be handled in the mapping of FSRSItem to Tensors? Eg in `fn batch(&self, items: Vec<FSRSItem>) -> FSRSBatch<B>`, the separate t_historys/r_historys could be created from the single reviews vector, like in https://github.com/open-spaced-repetition/fsrs-optimizer-burn/issues/3#issuecomment-1685159804?
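Roughly like this, with illustrative types only:

pub struct FSRSReview {
    pub rating: i32,
    pub delta_t: i32,
}

// Inside `batch`, each item's single `reviews` vector can be split back into
// the two per-field sequences, so the padded tensors stay 2D rather than 3D.
fn histories(reviews: &[FSRSReview]) -> (Vec<i32>, Vec<i32>) {
    reviews.iter().map(|r| (r.delta_t, r.rating)).unzip()
}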

L-M-Sherlock commented 1 year ago

Oh, I forgot about that. Thanks for the reminder. Do you mean rewriting FSRSItem as the following code?

pub struct FSRSItem {
    pub reviews: Vec<Review>,
    pub delta_t: f32,
    pub label: f32,
}

pub struct Review {
    pub rating: i32,
    pub delta_time: i32,
}

dae commented 1 year ago

Yep, exactly. You could call it FSRSReview if you found it clearer.

One question is how FSRS treats reviews in different SRS programs. Does it compare the different rating values, or does it just treat them as true/false? Does it assume the rating will be 1-4?

L-M-Sherlock commented 1 year ago

One question is how FSRS treats reviews in different SRS programs. Does it compare the different rating values, or does it just treat them as true/false? Does it assume the rating will be 1-4?

I have made a draft for the schema of review logs:

https://github.com/open-spaced-repetition/fsrs-optimizer#review-logs-schema

If other SRS programs also use FSRS, handling different rating values is not a problem. If they don't, we can map their rating values to 1-4. That's what I did in the comparison between FSRS and SM-15:

https://github.com/open-spaced-repetition/fsrs-vs-sm15#fsrs-vs-sm-15
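For example, something like this hypothetical mapping of a 0-5 SuperMemo-style grade onto 1-4 (the cut-offs are only for illustration, not necessarily what the comparison used):

// Hypothetical mapping from a 0-5 grade to FSRS's 1-4 rating scale;
// the exact cut-offs are an assumption for illustration.
fn map_grade_to_rating(grade: u8) -> u8 {
    match grade {
        0..=2 => 1, // Again
        3 => 2,     // Hard
        4 => 3,     // Good
        _ => 4,     // Easy
    }
}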

dae commented 1 year ago

How about we mark this one as done and continue tracking things in #27?