snap-stanford / relbench

RelBench: Relational Deep Learning Benchmark
https://relbench.stanford.edu
MIT License

Theoretical question about test_timestamp #267

Closed: breakanalysis closed this 1 month ago

breakanalysis commented 1 month ago

Hello!

Thank you for building this cool tool :)

In tutorials/custom_dataset.ipynb it is mentioned that

only database rows upto test_timestamp should be used by the model to predict over the test set

My understanding is that, typically, seed_times for rows in the test_table will be greater than dataset.test_timestamp. (That belief comes from two observations: 1) it seems restrictive to assume that all seed times in the test set are identical, and 2) it seems risky to have test examples occurring before the global test_timestamp. Together these imply that some seed times will exceed test_timestamp.)

If this understanding is correct, masking the entire entity graph up to test_timestamp makes some test examples very difficult, or even impossible, to predict. For example, if test_timestamp is 2010-01-01 and a test example occurs in 2020 with less than ten years of historical data attached to it, pruning the graph at that single cutoff leaves the test case with an empty history.

Moreover, I don't see why pruning the entity graph globally is necessary to prevent test leakage. Doesn't PyTorch Geometric's NeighborLoader already prune away future data in the computation graph associated with each test example, thus preventing time leakage entirely? Its docs mention the following:

 time_attr (str, optional): The name of the attribute that denotes
            timestamps for either the nodes or edges in the graph.
            If set, temporal sampling will be used such that neighbors are
            guaranteed to fulfill temporal constraints, *i.e.* neighbors have
            an earlier or equal timestamp than the center node.
            (default: :obj:`None`)
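For concreteness, here is how I understand that mechanism would work (a minimal sketch on a toy graph; the graph, fanout, and timestamps are made up, but `time_attr` and `input_time` are, as far as I can tell, the relevant NeighborLoader parameters; temporal sampling also needs a backend that supports it, e.g. pyg-lib):

```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import NeighborLoader

# Toy bidirectional chain 0 <-> 1 <-> 2 <-> 3 <-> 4, one timestamp per node.
data = Data(
    x=torch.randn(5, 8),
    edge_index=torch.tensor([
        [0, 1, 2, 3, 1, 2, 3, 4],
        [1, 2, 3, 4, 0, 1, 2, 3],
    ]),
    time=torch.arange(5),  # node i has timestamp i
)

# Temporal sampling: sampled neighbors must satisfy time <= seed time.
loader = NeighborLoader(
    data,
    num_neighbors=[2, 2],           # two-hop fanout, illustrative
    time_attr="time",               # per-node timestamp attribute
    input_nodes=torch.tensor([2]),  # seed entity
    input_time=torch.tensor([2]),   # its seed time
    batch_size=1,
)

batch = next(iter(loader))
# Node 3 points to node 2 but has timestamp 3 > 2, so it is never
# sampled, even though the graph itself was never globally pruned.
assert (batch.time <= 2).all()
```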

TL;DR: Can pruning the entire graph at test_timestamp hurt test performance, and is it really necessary to prevent leakage?

Best regards, Jacob

rishabh-ranjan commented 1 month ago

Thanks for your questions.

  1. We built the benchmark such that the seed_times in the test table are always equal to test_timestamp (rel-f1 is the only exception, because otherwise there would be very few test samples). So the hypothetical scenario where pruning the graph at test_timestamp hurts test performance does not typically arise. One way to think of it: we train the model with information up to test_timestamp, and test it on predictions just beyond test_timestamp. Since predictions can be made for many entities, there are still many samples to test against. (A quick way to check this property is sketched after point 2 below.)

  2. Yes, under the current implementation this pruning is not necessary to prevent leakage. But the benchmark is more general and is intended to be used by people developing their own methods. We make the conscious choice not to show the test rows to the model at all, to guard against bugs. Under the current setup, a temporal-leakage bug would lead to worse test performance and would therefore be detectable. Without pruning, such a bug would lead to better test performance and would hinder fair comparison, for example on the leaderboard. (The pruning itself is simple; see the second sketch below.)
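For point 1, a quick check of the seed-time property (a minimal sketch assuming the current relbench API; the rel-amazon / user-churn pair is just an example):

```python
from relbench.datasets import get_dataset
from relbench.tasks import get_task

# Example dataset/task pair; any pair except rel-f1 should show the property.
dataset = get_dataset("rel-amazon", download=True)
task = get_task("rel-amazon", "user-churn", download=True)

test_table = task.get_table("test")
seed_times = test_table.df[test_table.time_col]

# Every test seed time coincides with the global test_timestamp.
assert (seed_times == dataset.test_timestamp).all()
```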
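For point 2, the pruning amounts to dropping every database row stamped after test_timestamp. A pandas sketch of the idea (the prune_db helper and the per-table loop are illustrative, not relbench's actual implementation):

```python
import pandas as pd

def prune_db(tables: dict[str, pd.DataFrame],
             time_cols: dict[str, str],
             cutoff: pd.Timestamp) -> dict[str, pd.DataFrame]:
    """Keep only rows with timestamp <= cutoff in each temporal table.

    Tables without a time column (static tables) are kept whole.
    """
    pruned = {}
    for name, df in tables.items():
        time_col = time_cols.get(name)
        if time_col is None:
            pruned[name] = df  # static table: nothing to drop
        else:
            pruned[name] = df[df[time_col] <= cutoff]
    return pruned

# Usage: this is all the model is allowed to see at test time.
# pruned = prune_db(tables, {"races": "date"}, dataset.test_timestamp)
```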

Hope this clarifies your doubts.

breakanalysis commented 1 month ago

Thank you Rishabh!

Your answer clarifies all of my questions, and it is understandable that you want the public leaderboard to be protected from data snooping :)