malllabiisc / EmbedKGQA

ACL 2020: Improving Multi-hop Question Answering over Knowledge Graphs using Knowledge Base Embeddings
Apache License 2.0

MetaQA KG embedding with data leakage #94

Closed. AndDoIt closed this issue 3 years ago

AndDoIt commented 3 years ago

Thanks for your excellent work on multi-hop KGQA! I found that the MetaQA KG dataset you provided has serious data leakage among train.txt, valid.txt and test.txt, so I would like to confirm whether your pre-trained embeddings were trained on this dataset.

apoorvumang commented 3 years ago

Yes, there is overlap between the triples in valid.txt and train.txt. test.txt should not have leakage, AFAIK.

This is because our aim is to train embeddings for QA, not for KG completion. Ideally, there would be no valid.txt or test.txt at all, since we should train on all the triples. Those two files exist only to keep the format that most KGE libraries (e.g. LibKGE, PyKEEN) expect when creating embeddings.

AndDoIt commented 3 years ago

Thanks for your reply, I got it. However, any triple I pick at random from test.txt in the MetaQA KG dataset also appears in train.txt, so could you please check the dataset again?

apoorvumang commented 3 years ago

I mixed up valid.txt and test.txt in my last message: test.txt contains triples from train.txt, while valid.txt shouldn't have any overlap.
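
For anyone who wants to check the overlap between the split files locally, here is a minimal sketch. It assumes each file holds one triple per line with tab-separated head, relation and tail fields. That layout is an assumption, so adjust `sep` if your copy differs. The helper names `load_triples` and `split_overlap` are made up for this example and are not part of the repository.

```python
from pathlib import Path


def load_triples(path, sep="\t"):
    """Read one triple per line into a set of field tuples.

    Assumption: fields are separated by `sep` (tab by default).
    Blank lines are skipped.
    """
    triples = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        triples.add(tuple(field.strip() for field in line.split(sep)))
    return triples


def split_overlap(train, other):
    """Return (shared triples, fraction of `other` also present in `train`)."""
    shared = train & other
    fraction = len(shared) / len(other) if other else 0.0
    return shared, fraction
```

Calling `split_overlap(load_triples("train.txt"), load_triples("test.txt"))` and then doing the same for valid.txt shows which split shares triples with train.txt and how large the shared fraction is.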