Closed: L-M-Sherlock closed this issue 8 months ago
Interesting. But I suggest working on this issue first, ACT-R seems to be simpler.
According to my research, the basic DASH is very simple. I will take a look at the ACT-R tomorrow.
Now I'm curious whether DASH will be cheating. I'm looking forward to seeing the graphs!
Initial results:
Model: DASH
Total number of users: 537
Total number of reviews: 18685690
Weighted average by reviews:
DASH LogLoss (mean±std): 0.337±0.155
DASH RMSE(bins) (mean±std): 0.049±0.039
Weighted average by log(reviews):
DASH LogLoss (mean±std): 0.377±0.162
DASH RMSE(bins) (mean±std): 0.078±0.058
Weighted average by users:
DASH LogLoss (mean±std): 0.383±0.162
DASH RMSE(bins) (mean±std): 0.084±0.062
Model: FSRS-4.5
Total number of users: 537
Total number of reviews: 18685690
Weighted average by reviews:
FSRS-4.5 LogLoss (mean±std): 0.318±0.153
FSRS-4.5 RMSE(bins) (mean±std): 0.041±0.031
Weighted average by log(reviews):
FSRS-4.5 LogLoss (mean±std): 0.346±0.162
FSRS-4.5 RMSE(bins) (mean±std): 0.062±0.043
Weighted average by users:
FSRS-4.5 LogLoss (mean±std): 0.348±0.163
FSRS-4.5 RMSE(bins) (mean±std): 0.065±0.045
weights: [0.5614, 1.4046, 3.8707, 10.3723, 5.1491, 1.2271, 0.8804, 0.0465, 1.6598, 0.1405, 1.0407, 2.1135, 0.0886, 0.3247, 1.4143, 0.2151, 2.8857]
Model: FSRSv4
Total number of users: 537
Total number of reviews: 18685690
Weighted average by reviews:
FSRSv4 LogLoss (mean±std): 0.322±0.157
FSRSv4 RMSE(bins) (mean±std): 0.049±0.037
Weighted average by log(reviews):
FSRSv4 LogLoss (mean±std): 0.353±0.169
FSRSv4 RMSE(bins) (mean±std): 0.073±0.051
Weighted average by users:
FSRSv4 LogLoss (mean±std): 0.357±0.171
FSRSv4 RMSE(bins) (mean±std): 0.077±0.052
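For readers unfamiliar with the three summary rows, here is a rough sketch of how such averages can be computed from per-user metrics. The benchmark's actual aggregation code is not shown in this thread, so the function name, the sample numbers, and the use of the natural log of the review count as the weight are all assumptions.

```python
import math

def weighted_means(per_user):
    """per_user: list of (n_reviews, metric) pairs, one pair per user (hypothetical layout)."""
    sizes = [n for n, _ in per_user]

    def wmean(weights):
        total = sum(weights)
        return sum(w * m for w, (_, m) in zip(weights, per_user)) / total

    return {
        "reviews": wmean(sizes),                               # big collections dominate
        "log(reviews)": wmean([math.log(n) for n in sizes]),   # assumed: natural log as weight
        "users": wmean([1.0] * len(sizes)),                    # every user counts equally
    }

# Made-up numbers: one large collection with low loss, one small one with higher loss.
result = weighted_means([(10_000, 0.30), (100, 0.40)])
print(result)
```

This is why the "by reviews" rows tend to look best: users with many reviews usually have better-fitted parameters, and they carry most of the weight there.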
The calibration graph:
By the way, it's pretty fast: optimizing 350 collections takes only 5 minutes.
It's time to sleep in China. Good night.
x = torch.log(x + 1)
If x can be small, then it's better to use torch.log1p(x) to avoid the loss of precision. Btw, I'm assuming this is the simplest version of DASH, not DASH[MCM] and not DASH[ACT-R]?
EDIT: your code doesn't really look like DASH. But I'm not sure, I find these formulas to be very difficult to read.
If x can be small, then it's better to use torch.log1p(x) to avoid the loss of precision.
x is a non-negative integer.
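A small illustration of both points, using the standard-library equivalents `math.log` and `math.log1p` rather than the torch functions (an assumption made only to keep the snippet dependency-free): the precision loss shows up for tiny x, while for non-negative integer counts the two forms agree to rounding error.

```python
import math

tiny = 1e-18
naive = math.log(1 + tiny)   # 1 + 1e-18 rounds to exactly 1.0 in float64, so this is 0.0
stable = math.log1p(tiny)    # keeps the leading term, roughly 1e-18

# For integer counts the difference is negligible.
counts = [0, 1, 5, 1000]
max_gap = max(abs(math.log(1 + c) - math.log1p(c)) for c in counts)

print(naive, stable, max_gap)
```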
I'm assuming this is the simplest version of DASH, not DASH[MCM] and not DASH[ACT-R]?
Yeah. I will implement the DASH[MCM] and DASH[ACT-R] later.
your code doesn't really look like DASH.
The equation is very complicated, but I'm sure my code is correct. I just merged $a_s$ and $d_c$ into the bias term of the linear layer and removed the first time windows.
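To make that description concrete, here is a minimal sketch of basic DASH read as logistic regression on log-transformed counts, with $a_s$ and $d_c$ folded into one bias. The function name, the feature ordering (seen/correct pairs per time window), and all weight values are hypothetical; this is not the benchmark's code.

```python
import math

def dash_predict(features, weights, bias):
    """Recall probability from time-window counts (sketch).

    features: counts such as [seen_1d, correct_1d, seen_7d, correct_7d, ...]
    weights:  one weight per feature (hypothetical values)
    bias:     single term standing in for a_s - d_c
    """
    z = bias + sum(w * math.log1p(x) for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# With no history and zero bias the prediction is exactly 0.5.
p_empty = dash_predict([0] * 8, [0.5] * 8, bias=0.0)

# Made-up weights: negative on "seen", positive on "correct".
w = [-0.2, 0.6] * 4
p_history = dash_predict([1, 1, 3, 2, 5, 4, 7, 5], w, bias=0.3)
print(p_empty, p_history)
```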
@giacomoran, sorry to bother you. Could you share your code for the DASH[MCM] and DASH[ACT-R] models? I know you compared them with your R-17 and DASH[RNN] models, and I would like to compare them with FSRS as well. I have already implemented DASH.
Edit: I think I have figured out the implementation of DASH[MCM]. The only difference between DASH and DASH[MCM] is the time-window features:
import numpy as np

def dash_tw_features_optimized_no_accumulator(r_history, t_history, enable_decay=False):
    features = np.zeros(8)
    r_history = np.array(r_history) > 1
    tau_w = np.array([0.2434, 1.9739, 16.0090, 129.8426])
    time_windows = np.array([1, 7, 30, np.inf])

    # Compute the cumulative sum of t_history in reverse order
    cumulative_times = np.cumsum(t_history[::-1])[::-1]

    for j, time_window in enumerate(time_windows):
        # Calculate decay factors for each time window
        if enable_decay:
            decay_factors = np.exp(-cumulative_times / tau_w[j])
        else:
            decay_factors = np.ones_like(cumulative_times)

        # Identify the indices where cumulative times are within the current time window
        valid_indices = cumulative_times <= time_window

        # Update features using decay factors where valid
        features[j * 2] += np.sum(decay_factors[valid_indices])
        features[j * 2 + 1] += np.sum(r_history[valid_indices] * decay_factors[valid_indices])

    return features
r_history = [1, 4, 3, 2, 1, 3]
t_history = [4, 4, 15, 10, 1, 3]
features = dash_tw_features(r_history, t_history, delta_t, True)
print(features)
features = dash_tw_features(r_history, t_history, delta_t, False)
print(features)
[0.01643301 0.01643301 0.8137531 0.73433718 2.99542927 2.2636851
6.12471083 4.41621267]
[1. 1. 3. 2. 5. 4. 7. 5.]
When enable_decay is True, the features are for DASH[MCM].
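As a dependency-free walk-through of the no-decay branch, the sketch below mirrors the function as defined above (`dash_tw_features_optimized_no_accumulator`), not the `dash_tw_features` variant whose printed output is shown. It assumes, as that function does, that `t_history[i]` is the gap following review `i`, so the elapsed time of a review is a suffix sum, and that a rating above 1 counts as a successful recall. The helper name and sample data are made up.

```python
import math

def window_counts(r_history, t_history, windows=(1, 7, 30, math.inf)):
    # Elapsed time from each review to "now": suffix sums of t_history.
    elapsed, total = [], 0.0
    for gap in reversed(t_history):
        total += gap
        elapsed.append(total)
    elapsed.reverse()

    features = []
    for window in windows:
        inside = [i for i, e in enumerate(elapsed) if e <= window]
        features.append(len(inside))                                 # reviews seen in window
        features.append(sum(1 for i in inside if r_history[i] > 1))  # of those, recalled
    return elapsed, features

elapsed, features = window_counts([3, 1, 3], [10, 5, 2])
print(elapsed)   # elapsed days per review
print(features)  # [seen, recalled] for the 1-day, 7-day, 30-day and unbounded windows
```

With decay enabled, each counted review would contribute `exp(-elapsed / tau)` instead of 1, which is the only change DASH[MCM] makes to these features.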
I tried to implement DASH[ACT-R]. The initial weights are arbitrary.
import torch
from torch import nn

class DASH_ACTR(nn.Module):
    init_w = [1, 1, 1, 1, 1]

    def __init__(self, w=init_w):
        super(DASH_ACTR, self).__init__()
        self.w = nn.Parameter(torch.tensor(w, dtype=torch.float32))
        self.sigmoid = nn.Sigmoid()

    def forward(self, inputs):
        """
        :param inputs: shape[seq_len, batch_size, 2], 2 means r and t
        """
        # Sum over the sequence dimension only, so each batch item keeps its own activation.
        return self.sigmoid(self.w[0] * torch.log(
            1 + torch.sum(inputs[:, :, 1] ** -self.w[1] *
                          torch.where(inputs[:, :, 0] == 0, self.w[2], self.w[3]), dim=0)
        ) + self.w[4])
t_history = [0, 1, 2, 4, 8]
r_history = [0, 1, 0, 1, 0]
delta_t = 1
t_history = torch.tensor(t_history[1:] + [delta_t], dtype=torch.float32)
cumsum = torch.cumsum(t_history, dim=0)
inputs = torch.tensor([r_history, t_history - cumsum + cumsum[-1:None]], dtype=torch.float32).transpose(0, 1)
inputs = inputs.unsqueeze(1)
print(inputs)
model = DASH_ACTR()
output = model(inputs)
print(output.item())
tensor([[[ 0., 16.]],
[[ 1., 15.]],
[[ 0., 13.]],
[[ 1., 9.]],
[[ 0., 1.]]])
torch.Size([5, 1, 2])
0.862991213798523
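The printed value can be checked by hand. The sketch below redoes the same arithmetic in plain Python (no torch), assuming all five weights are 1 as in `init_w`; the variable names are mine.

```python
import math

intervals = [1, 2, 4, 8]      # t_history[1:] from the example
delta_t = 1
gaps = intervals + [delta_t]

# Elapsed time from each review to the prediction time: suffix sums of the gaps.
elapsed = [sum(gaps[i:]) for i in range(len(gaps))]

r_history = [0, 1, 0, 1, 0]
w = [1.0, 1.0, 1.0, 1.0, 1.0]

# Each review contributes t^(-w1), scaled by w2 after a failure and w3 after a success.
activation = sum(
    t ** -w[1] * (w[2] if r == 0 else w[3])
    for r, t in zip(r_history, elapsed)
)
p = 1.0 / (1.0 + math.exp(-(w[0] * math.log(1 + activation) + w[4])))
print(elapsed, round(p, 6))
```

The elapsed times match the second column of the printed tensor, and the probability agrees with the model output up to float32 rounding.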
I'm not 100% sure the following code is the most up-to-date and the one I've used for the thesis results. I should have kept the code and repo cleaner... Hopefully it will be useful anyways.
I implemented DASH[MCM] in R by casting it as a Generalized Additive Model (GAM) and using the bam function from the mgcv package.
pop_mean <- mean(data_train$ratingB)
cntSeen <- function(intervalDays, threshold, tau) {
  N <- length(intervalDays)
  ret <- rep(0, length(intervalDays))
  for (n in 1:N) {
    # head(intervalDays, n)       1, 4, 15
    # rev(head(intervalDays, n))  15, 4, 1
    ts <- cumsum(rev(head(intervalDays, n)))  # 15, 19, 20
    ts <- ts[ts <= threshold]
    ret[n] <- sum(map_dbl(ts, ~ exp(- . / tau)))
  }
  ret
}

cntCorrect <- function(intervalDays, rating, threshold, tau) {
  N <- length(intervalDays)
  ret <- rep(0, length(intervalDays))
  for (n in 1:N) {
    if (n == 1) {
      rs <- c(T)
    } else {
      rs <- c(T, head(rating, n-1) == "GOOD")
    }
    ts <- cumsum(rev(head(intervalDays, n)))
    ts <- ts[rs & ts <= threshold]
    ret[n] <- sum(map_dbl(ts, ~ exp(- . / tau)))
  }
  ret
}
tau_1 <- 0.2434
tau_7 <- 1.9739
tau_30 <- 16.0090
tau_infty <- 129.8426
data_train_mcm <-
  data_train %>%
  group_by(idHistory) %>%
  mutate(
    cntDaySeen = log1p(cntSeen(intervalDays, 1, tau_1)),
    cntDayCorrect = log1p(cntCorrect(intervalDays, rating, 1, tau_1)),
    cntWeekSeen = log1p(cntSeen(intervalDays, 7, tau_7)),
    cntWeekCorrect = log1p(cntCorrect(intervalDays, rating, 7, tau_7)),
    cntMonthSeen = log1p(cntSeen(intervalDays, 30, tau_30)),
    cntMonthCorrect = log1p(cntCorrect(intervalDays, rating, 30, tau_30)),
    cntHistorySeen = log1p(cntSeen(intervalDays, 10000, tau_infty)),
    cntHistoryCorrect = log1p(cntCorrect(intervalDays, rating, 10000, tau_infty))
  ) %>%
  ungroup()

fit_data_mcm <- bam(rating ~ -1 + idPrompt + s(idUser, bs="re") +
                      cntHistorySeen + cntHistoryCorrect +
                      cntDaySeen + cntDayCorrect +
                      cntWeekSeen + cntWeekCorrect +
                      cntMonthSeen + cntMonthCorrect,
                    family="binomial",
                    data=data_train_mcm,
                    cluster=cluster)
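For anyone following along in Python, here is a rough port of the `cntSeen` helper above. It is a sketch of my reading of the R code, not the thesis implementation: for each prefix of the interval history it takes cumulative sums of the reversed intervals, keeps those within the window, and sums their exponential decay. The snake_case name and the sample arguments are hypothetical.

```python
import math
from itertools import accumulate

def cnt_seen(interval_days, threshold, tau):
    """Decayed count of reviews inside a time window, one value per prefix of the history."""
    out = []
    for n in range(1, len(interval_days) + 1):
        # Cumulative sums of the first n intervals, most recent first.
        ts = accumulate(reversed(interval_days[:n]))
        out.append(sum(math.exp(-t / tau) for t in ts if t <= threshold))
    return out

# Made-up window and time constant, using the intervals from the R comments (1, 4, 15).
decayed = cnt_seen([1, 4, 15], threshold=7, tau=2.0)
print(decayed)
```

As in the R pipeline, the result would then be passed through `log1p` before entering the regression.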
As for DASH[ACT-R], I implemented it in Python using Keras because of the tricky parameter appearing as an exponent.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class DASHACTR_review(layers.Layer):
    def __init__(self, **kwargs):
        super(DASHACTR_review, self).__init__(**kwargs)
        self.theta_2 = tf.Variable(initial_value=1., trainable=True, name="theta_2", constraint=tf.keras.constraints.NonNeg())
        self.theta_3 = tf.Variable(initial_value=1., trainable=True, name="theta_3")
        self.theta_4 = tf.Variable(initial_value=1., trainable=True, name="theta_4")

    def compute_output_shape(self, input_shape):
        return input_shape[:-1] + (1, )

    def call(self, inputs):
        input_delta, inputs_r = tf.split(inputs, [1, 1], axis=-1)
        return tf.math.pow(input_delta, (-self.theta_2) * (2 * tf.math.sign(input_delta) - 1)) * ( self.theta_3 + inputs_r * self.theta_4)

    def get_config(self):
        return {}

class DASHACTR_history(layers.Layer):
    def __init__(self, **kwargs):
        super(DASHACTR_history, self).__init__(**kwargs)

    def compute_output_shape(self, input_shape):
        return input_shape[:-2] + (1, )

    def call(self, inputs):
        return tf.math.log1p(tf.math.reduce_sum(inputs, axis=-1))

    def get_config(self):
        return {}

inputs_sequences = layers.Input(shape=(max_history_length, 2), name="inputs_sequences")
h_1 = layers.TimeDistributed(DASHACTR_review())(inputs_sequences)
h = DASHACTR_history()(h_1)

input_user = layers.Input(shape=(1), name="input_user", dtype="string")
layer_onehot_user = tf.keras.layers.StringLookup(output_mode='one_hot')
layer_onehot_user.adapt(x_train_user)
onehot_user = layer_onehot_user(input_user)

input_card = layers.Input(shape=(1), name="input_card", dtype="string")
layer_onehot_card = tf.keras.layers.StringLookup(output_mode='one_hot')
layer_onehot_card.adapt(x_train_card)
onehot_card = layer_onehot_card(input_card)

concatenated = layers.concatenate([h, onehot_user, onehot_card])
output = layers.Dense(1, activation='sigmoid', use_bias=False, kernel_regularizer=tf.keras.regularizers.l2(1e-4), name="sigmoid_out")(concatenated)
model_0 = keras.Model(inputs=[inputs_sequences, input_user, input_card], outputs=output, name="model_dash_act_r")
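To see what the expression in `DASHACTR_review.call` evaluates to, here is a scalar re-implementation in plain Python (a sketch for reading the formula, not a replacement for the layer). For a positive delta the factor `2 * sign(delta) - 1` is 1, so the term is `delta ** -theta_2`; for a delta of exactly 0 it is -1, so the term becomes `0 ** theta_2`, which is 0 when `theta_2 > 0`. One plausible reading is that this makes zero-padded sequence positions contribute nothing, but that is my inference from the code, not something stated in the thread.

```python
def review_term(delta, r, theta_2=1.0, theta_3=1.0, theta_4=1.0):
    """Scalar version of the per-review expression in the Keras layer above."""
    sign = (delta > 0) - (delta < 0)        # same role as tf.math.sign for real inputs
    exponent = -theta_2 * (2 * sign - 1)
    return delta ** exponent * (theta_3 + r * theta_4)

real_review = review_term(4.0, r=1, theta_2=0.5)   # 4 ** -0.5 * (1 + 1)
padded_slot = review_term(0.0, r=0, theta_2=0.5)   # 0 ** 0.5 * 1
print(real_review, padded_slot)
```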
return tf.math.pow(input_delta, (-self.theta_2) * (2 * tf.math.sign(input_delta) - 1)) * ( self.theta_3 + inputs_r * self.theta_4)
@giacomoran I think this line of code is inconsistent with Equation 2.12.
By the way, I find that this term could be negative if $\theta_3$ is negative, so the ln will run into a math error.
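A quick way to reproduce the concern with made-up numbers: a few recent failures with a negative $\theta_3$ push the argument of the logarithm below zero. The sketch uses plain Python, where `math.log` raises `ValueError`; `torch.log` would return `nan` instead. The `clamp` option is one hypothetical guard, not necessarily what any implementation in this thread does.

```python
import math

def log_activation(elapsed, recalled, theta_2, theta_3, theta_4, clamp=False):
    """ln(1 + sum_i t_i^(-theta_2) * w_i), with w_i = theta_4 if recalled else theta_3."""
    total = sum(
        t ** -theta_2 * (theta_4 if r else theta_3)
        for t, r in zip(elapsed, recalled)
    )
    if clamp:
        # Hypothetical guard: never let the sum drop below zero.
        total = max(total, 0.0)
    return math.log(1 + total)

# Three failed reviews, each one day old, with a negative theta_3 (made-up values).
args = ([1.0, 1.0, 1.0], [0, 0, 0], 0.5, -0.5, 2.0)
try:
    log_activation(*args)
    raised = False
except ValueError:
    raised = True  # the argument is 1 + (-1.5), which math.log rejects

guarded = log_activation(*args, clamp=True)  # ln(1 + 0)
print(raised, guarded)
```

Constraining $\theta_3$ to be non-negative would be another option, at the cost of not letting failures reduce the activation.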
The initial results of DASH[ACT-R]:
Model: DASH[ACT-R]
Total number of users: 191
Total number of reviews: 5847195
Weighted average by reviews:
DASH[ACT-R] LogLoss (mean±std): 0.318±0.170
DASH[ACT-R] RMSE(bins) (mean±std): 0.039±0.032
Weighted average by log(reviews):
DASH[ACT-R] LogLoss (mean±std): 0.360±0.169
DASH[ACT-R] RMSE(bins) (mean±std): 0.057±0.050
Weighted average by users:
DASH[ACT-R] LogLoss (mean±std): 0.362±0.171
DASH[ACT-R] RMSE(bins) (mean±std): 0.060±0.053
weights: [1.5332, 0.4815, -0.452, 2.0, 1.0422]
Model: FSRS-4.5
Total number of users: 191
Total number of reviews: 5847195
Weighted average by reviews:
FSRS-4.5 LogLoss (mean±std): 0.310±0.168
FSRS-4.5 RMSE(bins) (mean±std): 0.044±0.032
Weighted average by log(reviews):
FSRS-4.5 LogLoss (mean±std): 0.351±0.160
FSRS-4.5 RMSE(bins) (mean±std): 0.064±0.044
Weighted average by users:
FSRS-4.5 LogLoss (mean±std): 0.354±0.160
FSRS-4.5 RMSE(bins) (mean±std): 0.067±0.046
weights: [0.5441, 1.4455, 3.8863, 11.5647, 5.1589, 1.2303, 0.8881, 0.0465, 1.629, 0.1588, 1.019, 2.1135, 0.0928, 0.337, 1.3907, 0.2225, 2.9044]
Model: DASH
Total number of users: 191
Total number of reviews: 5847195
Weighted average by reviews:
DASH LogLoss (mean±std): 0.333±0.160
DASH RMSE(bins) (mean±std): 0.058±0.057
Weighted average by log(reviews):
DASH LogLoss (mean±std): 0.382±0.156
DASH RMSE(bins) (mean±std): 0.081±0.058
Weighted average by users:
DASH LogLoss (mean±std): 0.386±0.157
DASH RMSE(bins) (mean±std): 0.085±0.060
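For reference, the LogLoss rows are mean binary cross-entropy between the predicted recall probability and the actual outcome, computed per user before averaging. A minimal sketch follows; the `eps` clipping is my own guard against `log(0)` and the sample data is invented.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; y is 1 for a successful recall, 0 otherwise."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # hypothetical guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

loss = log_loss([1, 0, 1], [0.9, 0.2, 0.6])
print(round(loss, 4))
```

A model that always predicts 0.5 scores ln 2 (about 0.693), so the values above sit well below that baseline.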
The calibration graphs: DASH[ACT-R].zip
@Expertium, could you check them when you're available?
Graphs look good to me. EDIT: actually, I'm not so sure. We really need some sort of quantitative measure of cheating. EDIT 2: @L-M-Sherlock https://github.com/open-spaced-repetition/fsrs-benchmark/issues/57
@giacomoran I think this line of code is inconsistent with the equation 2.12
Yeah, that looks like a mistake, I don't know what I was thinking.
Jones, M. N. (Ed.). (2016). Predicting and Improving Memory Retention: Psychological Theory Matters in the Big Data Era. In Big Data in Cognitive Science (pp. 43–73). Psychology Press. https://doi.org/10.4324/9781315413570-8
Randazzo, Giacomo. (2020-21). Memory Models for Spaced Repetition Systems (Tesi di Laurea Magistrale in Mathematical Engineering - Ingegneria Matematica, Politecnico di Milano). Advisor: Marco D. Santambrogio. Retrieved from https://hdl.handle.net/10589/186407