thunlp / OpenKE

An Open-Source Package for Knowledge Embedding (KE)

Test data leakage when using type constraints #232

Closed dschaehi closed 4 years ago

dschaehi commented 4 years ago

It seems that the triples in the test data are also used when generating the type constraints. Isn't it then quite obvious that type constraints will improve the performance of the models, because of this test data leakage? Wouldn't it be better (and fairer) to use only the triples in the training data when generating type constraints?

https://github.com/thunlp/OpenKE/blob/12bfd9c68b911a9ce34cd6917cb1af884219f23f/benchmarks/FB15K237/n-n.py#L44-L59

THUCSTHanxu13 commented 4 years ago

The type constraints are not a method for improving the performance of the models. They are just a testing approach that reduces the number of candidate entities.
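To illustrate what "reducing the number of candidate entities" means at test time, here is a minimal sketch, not OpenKE's actual evaluation code. The function name `rank_of_true_tail`, the toy scores, and the `allowed` set are all hypothetical, and the sketch assumes that a lower score means a more plausible triple (as in distance-based models).

```python
def rank_of_true_tail(scores, true_tail, candidates=None):
    """Return the rank (1 = best) of true_tail among the candidate entities.

    scores[e] is the model score for entity e as the tail; lower is assumed better.
    If candidates is None, every entity is a candidate (no type constraint).
    """
    if candidates is None:
        candidates = range(len(scores))
    true_score = scores[true_tail]
    # Count the competing candidates that score strictly better than the true tail.
    better = sum(1 for e in candidates if e != true_tail and scores[e] < true_score)
    return 1 + better


# Hypothetical scores for five entities as the tail of one test triple.
scores = [0.9, 0.2, 0.5, 0.1, 0.4]
true_tail = 2

# Hypothetical type constraint: only entities 2 and 4 may be tails of this relation.
allowed = {2, 4}

raw_rank = rank_of_true_tail(scores, true_tail)
constrained_rank = rank_of_true_tail(scores, true_tail, allowed)
print(raw_rank, constrained_rank)
```

The model scores are the same in both calls; only the candidate set changes, so the constrained rank can never be worse than the raw rank.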

THUCSTHanxu13 commented 4 years ago

In fact, the data generated by n-n.py is not used to train models. If you want to use this information for training your models, running n-n.py on the training triples only is more suitable.
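A minimal sketch of the difference between the two ways of building the constraints, assuming triples are given as `(head, relation, tail)` tuples of integer IDs. The helper `build_type_constraints` and the toy data are hypothetical and do not reproduce n-n.py.

```python
from collections import defaultdict


def build_type_constraints(triples):
    """Collect, per relation, the entities observed as head and as tail."""
    heads = defaultdict(set)
    tails = defaultdict(set)
    for h, r, t in triples:
        heads[r].add(h)
        tails[r].add(t)
    return heads, tails


# Hypothetical toy splits; each triple is (head, relation, tail).
train = [(0, 0, 1), (2, 0, 3), (1, 1, 0)]
test = [(4, 0, 5)]

# Constraints built from the training triples only (no test information).
train_heads, train_tails = build_type_constraints(train)

# Constraints built from all splits, which also include test entities.
all_heads, all_tails = build_type_constraints(train + test)

print(sorted(train_tails[0]))  # tails seen for relation 0 in training
print(sorted(all_tails[0]))    # additionally contains the test tail
```

With training-only constraints, a test triple whose true entity was never seen in that position for the relation falls outside the candidate set, so an evaluation script would need to decide how to handle that case.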

dschaehi commented 4 years ago

Oh I see. Thanks for the clarification @THUCSTHanxu13.

It is weird, though, that the QuatE paper reports scores obtained with type constraints, when using type constraints does not seem to contribute much to the interpretation of a model's performance.