[TODO] Add DASH and its variants

L-M-Sherlock commented 8 months ago

Jones, M. N. (Ed.). (2016). Predicting and Improving Memory Retention: Psychological Theory Matters in the Big Data Era. In Big Data in Cognitive Science (0 ed., pp. 43–73). Psychology Press. https://doi.org/10.4324/9781315413570-8

Randazzo, Giacomo. (2020-21). Memory Models for Spaced Repetition Systems (Tesi di Laurea Magistrale in Mathematical Engineering - Ingegneria Matematica, Politecnico di Milano). Advisor: Marco D. Santambrogio. Retrieved from https://hdl.handle.net/10589/186407

Expertium commented 8 months ago

Interesting. But I suggest working on this issue first, ACT-R seems to be simpler.

L-M-Sherlock commented 8 months ago

According to my research, the basic DASH is very simple. I will take a look at ACT-R tomorrow.

Expertium commented 8 months ago

Now I'm curious whether DASH will be cheating. I'm looking forward to seeing the graphs!

L-M-Sherlock commented 8 months ago

Initial results:

Model: DASH
Total number of users: 537
Total number of reviews: 18685690
Weighted average by reviews:
DASH LogLoss (mean±std): 0.337±0.155
DASH RMSE(bins) (mean±std): 0.049±0.039

Weighted average by log(reviews):
DASH LogLoss (mean±std): 0.377±0.162
DASH RMSE(bins) (mean±std): 0.078±0.058

Weighted average by users:
DASH LogLoss (mean±std): 0.383±0.162
DASH RMSE(bins) (mean±std): 0.084±0.062

Model: FSRS-4.5
Total number of users: 537
Total number of reviews: 18685690
Weighted average by reviews:
FSRS-4.5 LogLoss (mean±std): 0.318±0.153
FSRS-4.5 RMSE(bins) (mean±std): 0.041±0.031

Weighted average by log(reviews):
FSRS-4.5 LogLoss (mean±std): 0.346±0.162
FSRS-4.5 RMSE(bins) (mean±std): 0.062±0.043

Weighted average by users:
FSRS-4.5 LogLoss (mean±std): 0.348±0.163
FSRS-4.5 RMSE(bins) (mean±std): 0.065±0.045

weights: [0.5614, 1.4046, 3.8707, 10.3723, 5.1491, 1.2271, 0.8804, 0.0465, 1.6598, 0.1405, 1.0407, 2.1135, 0.0886, 0.3247, 1.4143, 0.2151, 2.8857]
Model: FSRSv4
Total number of users: 537
Total number of reviews: 18685690
Weighted average by reviews:
FSRSv4 LogLoss (mean±std): 0.322±0.157
FSRSv4 RMSE(bins) (mean±std): 0.049±0.037

Weighted average by log(reviews):
FSRSv4 LogLoss (mean±std): 0.353±0.169
FSRSv4 RMSE(bins) (mean±std): 0.073±0.051

Weighted average by users:
FSRSv4 LogLoss (mean±std): 0.357±0.171
FSRSv4 RMSE(bins) (mean±std): 0.077±0.052
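
For readers unfamiliar with the three summary rows: they differ only in how much weight each user's metric receives. A minimal sketch, assuming one metric value and one review count per user; the function name and sample numbers are hypothetical, the weighted standard deviation is an assumption, and the benchmark's own aggregation code may differ in detail:

```python
import math

def weighted_mean_std(values, weights):
    """Weighted mean and weighted (population) standard deviation."""
    total = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(var)

# Hypothetical per-user log loss and review counts (not benchmark data).
log_loss = [0.30, 0.40, 0.35]
reviews = [100_000, 1_000, 10_000]

by_reviews = weighted_mean_std(log_loss, reviews)
by_log_reviews = weighted_mean_std(log_loss, [math.log(n) for n in reviews])
by_users = weighted_mean_std(log_loss, [1] * len(reviews))

for name, (mean, std) in [
    ("reviews", by_reviews),
    ("log(reviews)", by_log_reviews),
    ("users", by_users),
]:
    print(f"Weighted average by {name}: {mean:.3f}±{std:.3f}")
```

Weighting by reviews lets large collections dominate, weighting by users treats every collection equally, and log(reviews) sits in between.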

The calibration graph:

DASH.zip

By the way, it's pretty fast. It only takes 5 minutes to optimize 350 collections.

It's time to sleep in China. Good night.

Expertium commented 8 months ago

x = torch.log(x + 1)

If x can be small, then it's better to use torch.log1p(x) to avoid the loss of precision.

Btw, I'm assuming this is the simplest version of DASH, not DASH[MCM] and not DASH[ACT-R]?

EDIT: your code doesn't really look like DASH. But I'm not sure, I find these formulas to be very difficult to read.

L-M-Sherlock commented 8 months ago

If x can be small, then it's better to use torch.log1p(x) to avoid the loss of precision.

x is a non-negative integer.
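
Both points can be checked quickly. The sketch below uses plain Python floats (float64) rather than torch tensors so that it has no dependencies; the sample values are hypothetical. It shows that log(1 + x) loses a tiny x entirely, while for non-negative integer counts the two forms agree to rounding error:

```python
import math

tiny = 1e-18
print(math.log(1 + tiny))  # 0.0, because 1 + tiny rounds to exactly 1.0
print(math.log1p(tiny))    # keeps the small value

# For non-negative integer counts the difference is negligible.
counts = [0, 1, 2, 10, 1000]
max_gap = max(abs(math.log(1 + n) - math.log1p(n)) for n in counts)
print(max_gap)
```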

I'm assuming this is the simplest version of DASH, not DASH[MCM] and not DASH[ACT-R]?

Yeah. I will implement the DASH[MCM] and DASH[ACT-R] later.

your code doesn't really look like DASH.

The equation is very complicated, but I'm sure my code is correct. I just merged $a_s$ and $d_c$ into the bias term of the linear layer and removed the first time window.
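
A minimal sketch of what that merge means, assuming the basic DASH form $P = \sigma(a_s - d_c + \sum_w [\theta_{2w-1}\log(1 + c_w) + \theta_{2w}\log(1 + n_w)])$, where $n_w$ and $c_w$ are the numbers of reviews and of correct reviews in time window $w$. When the model is fitted per collection, $a_s - d_c$ can be absorbed into one bias. The function name, the feature layout, and the placeholder weights are hypothetical and are not taken from the benchmark code:

```python
import math

def dash_predict(features, weights, bias):
    """P(recall) = sigmoid(bias + sum_k weights[k] * log(1 + features[k])).

    `bias` stands in for a_s - d_c in this sketch.
    """
    z = bias + sum(w * math.log1p(x) for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical window counts laid out as [seen, correct] per time window.
features = [1, 1, 3, 2, 5, 4, 7, 5]
weights = [0.0] * 8  # untrained placeholder weights
print(dash_predict(features, weights, bias=0.0))  # 0.5 with all-zero parameters
```

With a single linear layer over the log-count features, the layer's bias plays the role of `bias` here.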

L-M-Sherlock commented 8 months ago

@giacomoran, sorry for bothering you. Could you share your code for the DASH[MCM] and DASH[ACT-R] models? I know you compared them with your R-17 and DASH[RNN] models. I also want to compare them with FSRS. I have implemented DASH.

Edit: I think I have figured out the implementation of DASH[MCM]. The only difference between DASH and DASH[MCM] is the time-window features:

import numpy as np

def dash_tw_features_optimized_no_accumulator(r_history, t_history, enable_decay=False):
    features = np.zeros(8)
    r_history = np.array(r_history) > 1
    tau_w = np.array([0.2434, 1.9739, 16.0090, 129.8426])
    time_windows = np.array([1, 7, 30, np.inf])

    # Compute the cumulative sum of t_history in reverse order
    cumulative_times = np.cumsum(t_history[::-1])[::-1]

    for j, time_window in enumerate(time_windows):
        # Calculate decay factors for each time window
        if enable_decay:
            decay_factors = np.exp(-cumulative_times / tau_w[j])
        else:
            decay_factors = np.ones_like(cumulative_times)

        # Identify the indices where cumulative times are within the current time window
        valid_indices = cumulative_times <= time_window

        # Update features using decay factors where valid
        features[j * 2] += np.sum(decay_factors[valid_indices])
        features[j * 2 + 1] += np.sum(r_history[valid_indices] * decay_factors[valid_indices])

    return features

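r_history = [1, 4, 3, 2, 1, 3]
t_history = [4, 4, 15, 10, 1, 3]
# NOTE: dash_tw_features is a different variant that also takes delta_t; it is
# not the function defined above. delta_t = 1 is inferred from the printed
# output, whose first value equals exp(-1 / 0.2434).
delta_t = 1
features = dash_tw_features(r_history, t_history, delta_t, True)
print(features)
features = dash_tw_features(r_history, t_history, delta_t, False)
print(features)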

[0.01643301 0.01643301 0.8137531  0.73433718 2.99542927 2.2636851
 6.12471083 4.41621267]
[1. 1. 3. 2. 5. 4. 7. 5.]

When enable_decay is True, the features are for DASH[MCM].
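
Note that the printed numbers come from a call to `dash_tw_features(r_history, t_history, delta_t, ...)`, which is not the function shown above and whose body is not posted. The following is a hypothetical reconstruction, not the author's code. It assumes `delta_t = 1`, that one extra entry is appended for the upcoming review and counted as both seen and correct, and that the elapsed time for each entry is the sum of the intervals from that entry onward. Under those assumptions it reproduces both printed arrays:

```python
import math

TAU_W = [0.2434, 1.9739, 16.0090, 129.8426]
TIME_WINDOWS = [1, 7, 30, math.inf]

def dash_tw_features_sketch(r_history, t_history, delta_t, enable_decay=False):
    # Assumption: append delta_t as a final interval and a final "correct" flag.
    intervals = list(t_history) + [delta_t]
    correct = [r > 1 for r in r_history] + [True]

    # Elapsed time of entry i = sum of intervals[i:].
    elapsed = []
    running = 0.0
    for interval in reversed(intervals):
        running += interval
        elapsed.append(running)
    elapsed.reverse()

    features = [0.0] * 8
    for j, (window, tau) in enumerate(zip(TIME_WINDOWS, TAU_W)):
        for t, ok in zip(elapsed, correct):
            if t > window:
                continue
            decay = math.exp(-t / tau) if enable_decay else 1.0
            features[2 * j] += decay          # reviews seen in the window
            if ok:
                features[2 * j + 1] += decay  # correct reviews in the window
    return features

r_history = [1, 4, 3, 2, 1, 3]
t_history = [4, 4, 15, 10, 1, 3]
decayed = dash_tw_features_sketch(r_history, t_history, 1, True)
plain = dash_tw_features_sketch(r_history, t_history, 1, False)
print(decayed)
print(plain)
```

Calling the posted `dash_tw_features_optimized_no_accumulator` on the same histories gives smaller counts, since it has no `delta_t` argument and no extra entry.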

L-M-Sherlock commented 8 months ago

I tried implementing DASH[ACT-R]. The initial weights are arbitrary.

import torch
from torch import nn

class DASH_ACTR(nn.Module):
    init_w = [1, 1, 1, 1, 1]

    def __init__(self, w=init_w):
        super(DASH_ACTR, self).__init__()
        self.w = nn.Parameter(torch.tensor(w, dtype=torch.float32))
        self.sigmoid = nn.Sigmoid()

    def forward(self, inputs):
        """
        :param inputs: shape[seq_len, batch_size, 2], 2 means r and t
        """
        return self.sigmoid(self.w[0] * torch.log(
            1 + torch.sum(inputs[:, :, 1] ** -self.w[1] *
                          torch.where(inputs[:, :, 0] == 0, self.w[2], self.w[3]))
        ) + self.w[4])

t_history = [0, 1, 2, 4, 8]
r_history = [0, 1, 0, 1, 0]
delta_t = 1

t_history = torch.tensor(t_history[1:] + [delta_t], dtype=torch.float32)
cumsum = torch.cumsum(t_history, dim=0)

inputs = torch.tensor([r_history, t_history - cumsum + cumsum[-1:None]], dtype=torch.float32).transpose(0, 1)
inputs = inputs.unsqueeze(1)
print(inputs)
print(inputs.shape)
model = DASH_ACTR()
output = model(inputs)
print(output.item())

tensor([[[ 0., 16.]],

        [[ 1., 15.]],

        [[ 0., 13.]],

        [[ 1.,  9.]],

        [[ 0.,  1.]]])
torch.Size([5, 1, 2])
0.862991213798523
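
The printed probability can be checked by hand without torch. This sketch repeats the forward pass in plain Python for the same five reviews and the same all-ones weights; the variable names are chosen here for illustration:

```python
import math

w = [1.0, 1.0, 1.0, 1.0, 1.0]           # same arbitrary initial weights
elapsed = [16.0, 15.0, 13.0, 9.0, 1.0]  # second input column: time since each review
ratings = [0, 1, 0, 1, 0]               # first input column: 0 = failed, 1 = recalled

# Sum of power-law decayed traces, scaled by a per-outcome weight.
activation = sum(
    t ** -w[1] * (w[2] if r == 0 else w[3])
    for t, r in zip(elapsed, ratings)
)
p = 1.0 / (1.0 + math.exp(-(w[0] * math.log(1.0 + activation) + w[4])))
print(round(p, 4))  # 0.863
```

The small difference in the trailing digits relative to the torch output comes from float32 versus float64 arithmetic.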

giacomoran commented 8 months ago

I'm not 100% sure the following code is the most up-to-date and the one I've used for the thesis results. I should have kept the code and repo cleaner... Hopefully it will be useful anyways.

I implemented DASH[MCM] in R by casting it as a Generalized Additive Model (GAM) and using the bam function from the mgcv package.

library(mgcv)   # bam
library(dplyr)  # %>%, group_by, mutate, ungroup
library(purrr)  # map_dbl

pop_mean <- mean(data_train$ratingB)

cntSeen <- function(intervalDays, threshold, tau) {
  N <- length(intervalDays)
  ret <- rep(0, length(intervalDays))
  for (n in 1:N) {
    # head(intervalDays, n)                    1 ,  4, 15
    # rev(head(intervalDays, n))               15,  4,  1
    ts <- cumsum(rev(head(intervalDays, n))) # 15, 19, 20
    ts <- ts[ts <= threshold]
    ret[n] <- sum(map_dbl(ts, ~ exp(- . / tau)))
  }
  ret
}

cntCorrect <- function(intervalDays, rating, threshold, tau) {
  N <- length(intervalDays)
  ret <- rep(0, length(intervalDays))
  for (n in 1:N) {
    if (n == 1) {
      rs <- c(T)
    } else {      
      rs <- c(T, head(rating, n-1) == "GOOD")
    }
    ts <- cumsum(rev(head(intervalDays, n)))
    ts <- ts[rs & ts <= threshold]
    ret[n] <- sum(map_dbl(ts, ~ exp(- . / tau)))
  }
  ret
}

tau_1 <- 0.2434
tau_7 <- 1.9739
tau_30 <- 16.0090
tau_infty <- 129.8426

data_train_mcm <- 
  data_train %>% 
  group_by(idHistory) %>%
  mutate(
    cntDaySeen = log1p(cntSeen(intervalDays, 1, tau_1)),
    cntDayCorrect = log1p(cntCorrect(intervalDays, rating, 1, tau_1)),
    cntWeekSeen = log1p(cntSeen(intervalDays, 7, tau_7)),
    cntWeekCorrect = log1p(cntCorrect(intervalDays, rating, 7, tau_7)),
    cntMonthSeen = log1p(cntSeen(intervalDays, 30, tau_30)),
    cntMonthCorrect = log1p(cntCorrect(intervalDays, rating, 30, tau_30)),
    cntHistorySeen = log1p(cntSeen(intervalDays, 10000, tau_infty)),
    cntHistoryCorrect = log1p(cntCorrect(intervalDays, rating, 10000, tau_infty))
  ) %>%
  ungroup()

fit_data_mcm <- bam(rating ~ -1 + idPrompt + s(idUser, bs="re") +
                             cntHistorySeen + cntHistoryCorrect +
                             cntDaySeen + cntDayCorrect + 
                             cntWeekSeen + cntWeekCorrect + 
                             cntMonthSeen + cntMonthCorrect,
                    family="binomial",
                    data=data_train_mcm,
                    cluster=cluster)

As for DASH[ACT-R], I implemented it in Python using Keras because of the tricky parameter appearing as an exponent.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class DASHACTR_review(layers.Layer):
    def __init__(self, **kwargs):
      super(DASHACTR_review, self).__init__(**kwargs)

      self.theta_2 = tf.Variable(initial_value=1., trainable=True, name="theta_2", constraint=tf.keras.constraints.NonNeg())
      self.theta_3 = tf.Variable(initial_value=1., trainable=True, name="theta_3")
      self.theta_4 = tf.Variable(initial_value=1., trainable=True, name="theta_4")

    def compute_output_shape(self, input_shape):
        return input_shape[:-1] + (1, )

    def call(self, inputs):
      input_delta, inputs_r = tf.split(inputs, [1, 1], axis=-1)

      return tf.math.pow(input_delta, (-self.theta_2) * (2 * tf.math.sign(input_delta) - 1)) * ( self.theta_3 + inputs_r * self.theta_4)

    def get_config(self):
      return {}

class DASHACTR_history(layers.Layer):
    def __init__(self, **kwargs):
      super(DASHACTR_history, self).__init__(**kwargs)

    def compute_output_shape(self, input_shape):
      return input_shape[:-2] + (1, )

    def call(self, inputs):
      return tf.math.log1p(tf.math.reduce_sum(inputs, axis=-1))

    def get_config(self):
      return {}

inputs_sequences = layers.Input(shape=(max_history_length, 2), name="inputs_sequences")

h_1 = layers.TimeDistributed(DASHACTR_review())(inputs_sequences)
h = DASHACTR_history()(h_1)

input_user = layers.Input(shape=(1), name="input_user", dtype="string")
layer_onehot_user = tf.keras.layers.StringLookup(output_mode='one_hot')
layer_onehot_user.adapt(x_train_user)
onehot_user = layer_onehot_user(input_user)

input_card = layers.Input(shape=(1), name="input_card", dtype="string")
layer_onehot_card = tf.keras.layers.StringLookup(output_mode='one_hot')
layer_onehot_card.adapt(x_train_card)
onehot_card = layer_onehot_card(input_card)

concatenated = layers.concatenate([h, onehot_user, onehot_card])

output = layers.Dense(1, activation='sigmoid', use_bias=False, kernel_regularizer=tf.keras.regularizers.l2(1e-4), name="sigmoid_out")(concatenated)

model_0 = keras.Model(inputs=[inputs_sequences, input_user, input_card], outputs=output, name="model_dash_act_r")

L-M-Sherlock commented 8 months ago

return tf.math.pow(input_delta, (-self.theta_2) * (2 * tf.math.sign(input_delta) - 1)) * ( self.theta_3 + inputs_r * self.theta_4)

@giacomoran I think this line of code is inconsistent with the equation 2.12:

By the way, I find that this term can be negative if $\theta_3$ is negative, in which case the argument of the ln can drop to zero or below and the ln runs into a math error.
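
A small illustration of that failure mode, assuming the per-review term has the form $t^{-\theta_2}(\theta_3 + r\,\theta_4)$ inside $\ln(1 + \sum \ldots)$. The function name, the parameter values, and the clamp are hypothetical; the clamp is only one possible guard, not something either implementation in this thread uses. In plain Python `math.log` raises, whereas `torch.log` would return `nan` for a negative argument:

```python
import math

def actr_log_term(elapsed, recalled, theta_2, theta_3, theta_4, clamp=False):
    """ln(1 + sum_i t_i^(-theta_2) * (theta_3 + r_i * theta_4))."""
    total = sum(
        t ** -theta_2 * (theta_3 + r * theta_4)
        for t, r in zip(elapsed, recalled)
    )
    arg = 1.0 + total
    if clamp:
        arg = max(arg, 1e-6)  # hypothetical guard against non-positive arguments
    return math.log(arg)

# Three recent failures with a negative theta_3 push the argument below zero.
elapsed = [1.0, 1.0, 1.0]
recalled = [0, 0, 0]
try:
    actr_log_term(elapsed, recalled, theta_2=0.5, theta_3=-0.452, theta_4=2.0)
except ValueError as err:
    print("math error:", err)

print(actr_log_term(elapsed, recalled, 0.5, -0.452, 2.0, clamp=True))
```

Constraining $\theta_3$ to be non-negative would be another way to avoid the problem, at the cost of restricting the model.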

L-M-Sherlock commented 8 months ago

The initial results of DASH[ACT-R]:

Model: DASH[ACT-R]
Total number of users: 191
Total number of reviews: 5847195
Weighted average by reviews:
DASH[ACT-R] LogLoss (mean±std): 0.318±0.170
DASH[ACT-R] RMSE(bins) (mean±std): 0.039±0.032

Weighted average by log(reviews):
DASH[ACT-R] LogLoss (mean±std): 0.360±0.169
DASH[ACT-R] RMSE(bins) (mean±std): 0.057±0.050

Weighted average by users:
DASH[ACT-R] LogLoss (mean±std): 0.362±0.171
DASH[ACT-R] RMSE(bins) (mean±std): 0.060±0.053

weights: [1.5332, 0.4815, -0.452, 2.0, 1.0422]
Model: FSRS-4.5
Total number of users: 191
Total number of reviews: 5847195
Weighted average by reviews:
FSRS-4.5 LogLoss (mean±std): 0.310±0.168
FSRS-4.5 RMSE(bins) (mean±std): 0.044±0.032

Weighted average by log(reviews):
FSRS-4.5 LogLoss (mean±std): 0.351±0.160
FSRS-4.5 RMSE(bins) (mean±std): 0.064±0.044

Weighted average by users:
FSRS-4.5 LogLoss (mean±std): 0.354±0.160
FSRS-4.5 RMSE(bins) (mean±std): 0.067±0.046

weights: [0.5441, 1.4455, 3.8863, 11.5647, 5.1589, 1.2303, 0.8881, 0.0465, 1.629, 0.1588, 1.019, 2.1135, 0.0928, 0.337, 1.3907, 0.2225, 2.9044]
Model: DASH
Total number of users: 191
Total number of reviews: 5847195
Weighted average by reviews:
DASH LogLoss (mean±std): 0.333±0.160
DASH RMSE(bins) (mean±std): 0.058±0.057

Weighted average by log(reviews):
DASH LogLoss (mean±std): 0.382±0.156
DASH RMSE(bins) (mean±std): 0.081±0.058

Weighted average by users:
DASH LogLoss (mean±std): 0.386±0.157
DASH RMSE(bins) (mean±std): 0.085±0.060

The calibration graphs: DASH[ACT-R].zip

@Expertium, could you check them when you're available?

Expertium commented 8 months ago

Graphs look good to me. EDIT: actually, I'm not so sure. We really need some sort of quantitative measure of cheating. EDIT 2: @L-M-Sherlock https://github.com/open-spaced-repetition/fsrs-benchmark/issues/57

giacomoran commented 8 months ago

@giacomoran I think this line of code is inconsistent with the equation 2.12

Yeah, that looks like a mistake, I don't know what I was thinking.

open-spaced-repetition / srs-benchmark

[TODO] Add DASH and its variants #51