hjf1997 / STGNP

KDD'23 --- Graph Neural Processes for Spatio-Temporal Extrapolation

Test data leakage during validation #1

Open · jlidw opened this issue 6 months ago

jlidw commented 6 months ago

Hi authors, thanks for your excellent work and code!

I found a bug in `base_dataset.py` that may significantly affect the model's performance. In lines 243-247, when `phase` is `"val"`, the model takes the test nodes as the targets; however, no information about the test nodes should be exposed during the validation stage.

Could you please check it again? Thanks.

https://github.com/hjf1997/STGNP/blob/master/data/base_dataset.py#L243-L247
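The pattern in question looks roughly like this (a paraphrased sketch with illustrative names, not the exact source at the link above):

```python
import random

# Paraphrased sketch of the branch in question (illustrative names,
# not the repository's actual code). The issue is the "val" case:
# the nodes reserved for testing are used as prediction targets.
train_nodes = list(range(40))
test_nodes = list(range(40, 50))
phase = "val"

if phase == "train":
    # a fresh random subset of training nodes plays the target role
    target_nodes = random.sample(train_nodes, k=3)
elif phase == "val":
    target_nodes = test_nodes  # test nodes exposed to the validation stage
```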

hjf1997 commented 6 months ago

Hi,

I understand your concern. There were several considerations behind our decision to use the target nodes for validation. The main one is that if we don't use the target nodes, it is unclear how to divide the context nodes into known and unknown parts.

During training, we randomly select three nodes from the training set as unknown nodes to be inferred in each iteration. However, this strategy cannot be applied to the validation phase, because the evaluation would not be deterministic. As a trade-off, we have to use the target nodes there. One possible alternative is to divide the nodes themselves into training, validation, and test sets, just like the division of the data segments; however, I think this would bring other potential problems.
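For concreteness, such a node-level split would look something like this (a rough sketch with made-up numbers, not code from the repository):

```python
import numpy as np

# Rough sketch of the alternative mentioned above (illustrative numbers):
# splitting the nodes themselves, not just the time axis, into
# training / validation / test sets.
rng = np.random.default_rng(0)
nodes = rng.permutation(50)
train_nodes = nodes[:36]   # observed during training
val_nodes = nodes[36:43]   # fixed, deterministic validation targets
test_nodes = nodes[43:]    # touched only at final evaluation

# This would make validation deterministic without exposing the test
# nodes, but fewer observed nodes remain for training -- one of the
# "other potential problems" alluded to above.
```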

Note that there is no data leakage here, as the target nodes are not involved in the training phase (following the inductive setting). Meanwhile, the validation phase only uses node data from the validation time segment; the model has no access to the test segment during the entire training process. Moreover, since all the baselines are implemented within this general framework and follow the same training procedure, the comparisons are fair.
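Concretely, the two independent splits look like this (a rough sketch with illustrative shapes and names):

```python
import numpy as np

# Sketch of the two splits described above (illustrative, not the
# repository's code). Nodes are split once into observed context nodes
# and held-out target nodes (inductive setting); the time axis is
# split into train / validation / test segments.
data = np.random.randn(50, 10_000)  # (num_nodes, num_timesteps)

context_nodes = np.arange(0, 40)    # observed during training
target_nodes = np.arange(40, 50)    # never seen during training

train_seg = slice(0, 7_000)         # time steps used for training
val_seg = slice(7_000, 8_500)       # time steps used for validation
test_seg = slice(8_500, 10_000)     # time steps used for testing

# Validation as implemented: target-node values from the validation
# segment only; the test segment itself is never read before testing.
val_inputs = data[context_nodes, val_seg]
val_targets = data[target_nodes, val_seg]
```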

I think one way to bypass the problem is to discard the validation phase, so that the models are evaluated on the test segment directly. Some inductive kriging papers, such as IGNNK, follow this setting.

I will add a note about this to the README later on. Thanks for your comments, which help improve the quality of the project :)

jlidw commented 6 months ago

Hi,

Thanks for your quick reply.

Yes, I understand your concerns -- the evaluation would not be deterministic if nodes were randomly selected as unknowns during validation. And if we divide the nodes into training, validation, and test sets, new problems arise; for example, less spatial information is available when training the model.

However, there is still a risk of data leakage. Although the test nodes are not used to drive the model's convergence, in the validation phase they are used for early stopping, i.e., to select the final model parameters.
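Concretely, the leakage path is in the model-selection loop (a rough sketch with stub functions, not the actual training code):

```python
import random

def train_one_epoch(model):
    """Stand-in for the real training step (context nodes only)."""

def target_node_val_error(model):
    """Stand-in for the error measured on the held-out target nodes
    over the validation time segment."""
    return random.random()

# Gradients never see the target nodes, but their validation error
# decides which checkpoint is kept and when training stops, so the
# held-out nodes still influence model selection.
model = None
best_err, bad_epochs, patience = float("inf"), 0, 5
for epoch in range(100):
    train_one_epoch(model)
    err = target_node_val_error(model)  # target-node data used here
    if err < best_err:
        best_err, bad_epochs = err, 0   # checkpoint chosen via target-node error
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                       # early stop driven by target-node error
```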

That said, discarding the validation phase may not be a good option either, since it is hard to determine when to stop training to achieve good convergence.

This seems like an awkward problem... But it is fine as long as all the baselines follow the same training procedure, ensuring a fair comparison.

Thank you very much!