twchen / lessr

Handling Information Loss of Graph Neural Networks for Session-based Recommendation
MIT License

This code chooses best model on test set instead of validation set. #2

Closed: zzxn closed this issue 3 years ago

zzxn commented 3 years ago

In utils/train.py, the training loop returns the best MRR and the best HIT measured on the TEST SET, each maximized over epochs independently:

def train(self, epochs, log_interval=100):
    # ...
    for epoch in range(epochs):
        self.model.train()
        for batch in self.train_loader:
            # ... training step ...
        mrr, hit = evaluate(self.model, self.test_loader, self.device)  # MRR and HIT on the test set
        # ...
        max_mrr = max(max_mrr, mrr)  # best MRR on the test set so far
        max_hit = max(max_hit, hit)  # best HIT on the test set so far
        self.epoch += 1
    return max_mrr, max_hit  # <- NOT OK: the best MRR and the best HIT on the TEST SET, possibly from different epochs

I believe this code is not the final version; please update it to the code used in your paper.

twchen commented 3 years ago

This is actually a common practice in session-based recommendation. The same setting was adopted in previous works such as STAMP, SR-GNN, and FGNN, and in recently published works such as DHCN and TAGNN. It even appears in other areas of recommender systems research, for example in LightGCN and NGCF.

Despite the popularity of our setting, I agree with you that, in principle, we should choose the best model on the validation set and then evaluate it on the test set. The common approach is to take the most recent fraction of the data as the test set and then take the most recent fraction of the remaining data as the validation set. However, this approach has its own problem: the validation and test sets are collected at different times, so their distributions differ and the performance on the validation set may not reflect the true performance on the test set. I guess this is part of the reason some existing methods did not adopt the correct-in-principle approach. I think a better approach is to first take the most recent fraction of the data and then randomly split it into the validation and test sets, but I did not think of this approach when we published this work.
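For concreteness, a minimal sketch of the two splitting strategies (the session data format and the helper names here are illustrative, not from this repo):

import random

def split_by_time(sessions, test_frac=0.1, valid_frac=0.1):
    # Common approach: most recent fraction -> test set,
    # most recent fraction of the remainder -> validation set.
    sessions = sorted(sessions, key=lambda s: s["end_time"])
    n_test = max(1, int(len(sessions) * test_frac))
    train_valid, test = sessions[:-n_test], sessions[-n_test:]
    n_valid = max(1, int(len(train_valid) * valid_frac))
    train, valid = train_valid[:-n_valid], train_valid[-n_valid:]
    return train, valid, test

def split_recent_then_random(sessions, holdout_frac=0.2, seed=0):
    # Alternative: take the most recent fraction, then split it randomly
    # into validation and test so the two sets share the same distribution.
    sessions = sorted(sessions, key=lambda s: s["end_time"])
    n_holdout = max(2, int(len(sessions) * holdout_frac))
    train, holdout = sessions[:-n_holdout], sessions[-n_holdout:]
    random.Random(seed).shuffle(holdout)
    half = len(holdout) // 2
    return train, holdout[:half], holdout[half:]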

In conclusion, I think our setting is acceptable because it is very common and, more importantly, it is applied to all models in our experiments, so the comparison between models is fair. You could update our code to select the best model on the validation set; it should be quite easy. However, we will not update our code, in order to keep it consistent with our paper.
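For reference, a minimal sketch of that change, assuming a self.valid_loader is added alongside the existing loaders (apart from evaluate, train_loader, test_loader, and device, which appear in the original code, the names are illustrative):

def train(self, epochs, log_interval=100):
    best_valid_mrr = float("-inf")
    test_mrr, test_hit = 0.0, 0.0
    for epoch in range(epochs):
        self.model.train()
        for batch in self.train_loader:
            ...  # training step
        # Select the checkpoint by validation MRR ...
        valid_mrr, valid_hit = evaluate(self.model, self.valid_loader, self.device)
        if valid_mrr > best_valid_mrr:
            best_valid_mrr = valid_mrr
            # ... and report the test metrics of that single checkpoint.
            test_mrr, test_hit = evaluate(self.model, self.test_loader, self.device)
        self.epoch += 1
    return test_mrr, test_hit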

zzxn commented 3 years ago

Still confused by this common practice, but thanks for your reply.