Closed · PlusRoss closed this issue 3 years ago
Hi,
For the link prediction task, the paper uses triples to construct a graph G as the input to the GNN encoder. In that case, G already contains the information about the missing objects/subjects of the triples to be predicted. I did not find the code that separates the triples to be predicted from the triples used to construct the graph. It seems that during training, the model is asked to complete triples that are already contained in the input?
Hi,
From the original files, we generate train/val/test splits as here: https://github.com/migalkin/StarE/blob/f40a5ee082d61851477e9870c21e991c7d91deb3/run.py#L175
Then we use only train_data to generate the input graph G for the GNN encoder (this might be the line you were looking for): https://github.com/migalkin/StarE/blob/f40a5ee082d61851477e9870c21e991c7d91deb3/run.py#L218
And then we initialize the model with this training graph only: https://github.com/migalkin/StarE/blob/f40a5ee082d61851477e9870c21e991c7d91deb3/run.py#L263
That is, we train on the train split and evaluate on validation/test (depending on the flag), so there is no leak.
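To make that data flow concrete, here is a tiny self-contained sketch of the same idea (toy integer triples rather than the actual run.py code, and qualifiers omitted): only the training statements become edges of the encoder's input graph, while validation/test statements are held out purely as evaluation queries.

```python
import random

# Toy statements (s, r, o); in StarE each statement may also carry qualifier pairs.
statements = [(s, r, (s + r) % 20) for s in range(20) for r in range(3)]
random.seed(0)
random.shuffle(statements)

n = len(statements)
train_data = statements[: int(0.8 * n)]
valid_data = statements[int(0.8 * n): int(0.9 * n)]
test_data = statements[int(0.9 * n):]

# Only the training statements become edges of the message-passing graph.
graph_edges = set(train_data)

# Sanity check mirroring the point above: no val/test edge is in the input graph.
assert all(e not in graph_edges for e in valid_data + test_data)
print(f"{len(graph_edges)} edges in the encoder graph, "
      f"{len(valid_data) + len(test_data)} held-out query edges")
```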
Hi,
Thanks for your quick reply and sorry for the confusion in my question. Yes, there's no test data leakage problem in your code as you stated.
But the problem is that you construct the graph with the training edges, each of the form (s_i, r_i, o_i, {(k_ij, v_ij)}), where (s, r, o) is the main triple, (k, v) are the qualifiers, and i in {1, 2, ..., N} indexes the edges (N is the number of training edges). During training, you might for example want to predict the object of the third edge, (s_3, r_3, ?, {(k_3j, v_3j)}). But this edge has already been used to construct the graph. This means that simply memorizing all the edges in the graph could give 100% accuracy in training without any generalization. I'm just wondering whether this makes sense or not.
Thanks!
Well, that's the crux of the link prediction task and the LP literature in general: technically, we could memorize the edges, but that would not generalize to validation/test edges. That's why during training we optimize the GNN encoder weights, the decoder weights (and entity/relation embeddings if node/edge features are not given), and predict training edges.
That said, I don't see any issue here; it's a standard training procedure.
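As a rough illustration of that procedure (a DistMult-style stand-in with random toy edges, not the StarE encoder/decoder; in the actual model the entity representations come from the GNN run over the training graph), the trainable pieces are the embeddings and decoder weights, and the supervision targets are the training edges themselves:

```python
import torch
import torch.nn as nn

# Toy stand-in: entity/relation embeddings plus a DistMult-style scoring function.
# (StarE instead runs a GNN encoder over the training graph and has its own
#  decoder, but the overall training loop has the same shape.)
num_entities, num_relations, dim = 50, 4, 32
ent = nn.Embedding(num_entities, dim)
rel = nn.Embedding(num_relations, dim)
opt = torch.optim.Adam(list(ent.parameters()) + list(rel.parameters()), lr=1e-2)

# Random toy training edges; they double as supervision targets: we hide the
# object of an edge that is also part of the input graph and ask the model
# to score the correct entity highly.
train_s = torch.randint(0, num_entities, (256,))
train_r = torch.randint(0, num_relations, (256,))
train_o = torch.randint(0, num_entities, (256,))

for _ in range(5):
    opt.zero_grad()
    # Score (s, r, ?) against every entity -> shape [batch, num_entities]
    logits = (ent(train_s) * rel(train_r)) @ ent.weight.t()
    loss = nn.functional.cross_entropy(logits, train_o)
    loss.backward()
    opt.step()

# At evaluation time the same learned parameters score val/test queries,
# which is where generalization (or the lack of it) actually shows up.
```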
But if the model is only trained on training edges that are also used to construct the graph, how can we make sure the model will generalize to edges that are not in the graph (i.e. validation/test edges)? There's no loss with respect to the edges not in the graph to optimize the model for generalization.
> There's no loss with respect to the edges not in the graph to optimize the model for generalization.
There are two scoring procedures that take non-existing or incorrect edges (some of which might turn out to be true in val/test) into account when training link prediction models (a small sketch of both follows after the list):

1. 1-N scoring (the one we use): https://github.com/migalkin/StarE/blob/f40a5ee082d61851477e9870c21e991c7d91deb3/models/models_statements.py#L98
   Our result tensor has shape [batch_size, num_entities], so we score a given statement (s, r, [quals]) against all known entities. Think of it as building edges from a given node s to all nodes in the graph. The target labels have the same shape, so the loss takes both existing and non-existing edges into account.
2. Negative sampling: a similar methodology used for bigger graphs when N is way too large, so one takes 50-500 negative samples (non-existing edges) and checks whether the predictions for true edges are higher than for the negatives.
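For concreteness, here is a minimal sketch of both schemes with toy tensors (not the repo's code; batching, the actual decoder, and details such as filtering the sampled negatives are omitted):

```python
import torch
import torch.nn as nn

batch_size, num_entities = 4, 10
# Pretend these are decoder outputs for 4 queries (s, r, [quals]) scored
# against every entity in the graph -> shape [batch_size, num_entities].
scores = torch.randn(batch_size, num_entities)
true_objects = torch.tensor([2, 5, 5, 7])

# 1-N scoring: multi-hot targets with a 1 for every correct object of a query,
# so the BCE loss sees both existing edges (the 1s) and non-existing ones (the 0s).
labels = torch.zeros(batch_size, num_entities)
labels[torch.arange(batch_size), true_objects] = 1.0
loss_1_to_n = nn.functional.binary_cross_entropy_with_logits(scores, labels)

# Negative sampling: for large graphs, compare each true object's score against
# only K randomly sampled entities instead of all N of them.
K = 5
pos_scores = scores[torch.arange(batch_size), true_objects]   # [batch]
neg_idx = torch.randint(0, num_entities, (batch_size, K))     # may need filtering
neg_scores = scores.gather(1, neg_idx)                        # [batch, K]
loss_neg_sampling = nn.functional.margin_ranking_loss(
    pos_scores.unsqueeze(1).expand_as(neg_scores),
    neg_scores,
    torch.ones_like(neg_scores),
    margin=1.0,
)
print(loss_1_to_n.item(), loss_neg_sampling.item())
```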
Thanks for your answer. Although I'm not entirely convinced, I really appreciate your replies. Overall, this is a very good paper!