malllabiisc / CompGCN

ICLR 2020: Composition-Based Multi-Relational Graph Convolutional Networks
Apache License 2.0

Data leakage? #14

guolingbing closed 4 years ago

guolingbing commented 4 years ago

Hi, I have a question about the function load_data in Runner.

Specifically, in lines 50-76:

sr2o obviously contains the test and validation data.

        self.data = ddict(list)
        sr2o = ddict(set)

        for split in ['train', 'test', 'valid']:
            for line in open('./data/{}/{}.txt'.format(self.p.dataset, split)):
                sub, rel, obj = map(str.lower, line.strip().split('\t'))
                sub, rel, obj = self.ent2id[sub], self.rel2id[rel], self.ent2id[obj]
                self.data[split].append((sub, rel, obj))

                if split == 'train': 
                    sr2o[(sub, rel)].add(obj)
                    sr2o[(obj, rel+self.p.num_rel)].add(sub)

        self.data = dict(self.data)

        self.sr2o = {k: list(v) for k, v in sr2o.items()}
        for split in ['test', 'valid']:
            for sub, rel, obj in self.data[split]:
                sr2o[(sub, rel)].add(obj)
                sr2o[(obj, rel+self.p.num_rel)].add(sub)

Then you generate the labels based on sr2o.

        self.sr2o_all = {k: list(v) for k, v in sr2o.items()}
        self.triples  = ddict(list)

        for (sub, rel), obj in self.sr2o.items():
            self.triples['train'].append({'triple':(sub, rel, -1), 'label': self.sr2o[(sub, rel)], 'sub_samp': 1})

You use self.triples['train'] to obtain data_iter:

        self.data_iter = {
            'train':        get_data_loader(TrainDataset, 'train',      self.p.batch_size),
            'valid_head':   get_data_loader(TestDataset,  'valid_head', self.p.batch_size),
            'valid_tail':   get_data_loader(TestDataset,  'valid_tail', self.p.batch_size),
            'test_head':    get_data_loader(TestDataset,  'test_head',  self.p.batch_size),
            'test_tail':    get_data_loader(TestDataset,  'test_tail',  self.p.batch_size),
        }

and finally train the model.

        train_iter = iter(self.data_iter['train'])

        for step, batch in enumerate(train_iter):
            self.optimizer.zero_grad()
            sub, rel, obj, label = self.read_batch(batch, 'train')

Did I misunderstand something?

guolingbing commented 4 years ago

Sorry, I overlooked this:

self.sr2o = {k: list(v) for k, v in sr2o.items()}
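
For anyone who runs into the same confusion, here is a minimal sketch of why the order matters. It uses toy triples; the ids and the names data, num_rel, sr2o_train and sr2o_all are made up for illustration and are not taken from the repository. The dict comprehension copies every set into a new list before the test and valid triples are added, so the snapshot should contain only training objects, while the later sr2o_all contains everything.

```python
from collections import defaultdict as ddict

# Toy triples with hypothetical ids, not from any real dataset.
data = {
    'train': [(0, 0, 1), (0, 0, 2)],
    'valid': [(0, 0, 3)],
    'test':  [(0, 0, 4), (5, 0, 6)],
}
num_rel = 1  # assumed number of relations, used to index inverse relations

sr2o = ddict(set)
for sub, rel, obj in data['train']:
    sr2o[(sub, rel)].add(obj)
    sr2o[(obj, rel + num_rel)].add(sub)

# Snapshot taken BEFORE test/valid are added; list(v) copies each set,
# so later changes to sr2o do not affect this dict.
sr2o_train = {k: list(v) for k, v in sr2o.items()}

for split in ['test', 'valid']:
    for sub, rel, obj in data[split]:
        sr2o[(sub, rel)].add(obj)
        sr2o[(obj, rel + num_rel)].add(sub)

# Snapshot taken AFTER: this one contains all splits.
sr2o_all = {k: list(v) for k, v in sr2o.items()}

print(sorted(sr2o_train[(0, 0)]))  # [1, 2]
print(sorted(sr2o_all[(0, 0)]))    # [1, 2, 3, 4]
print((5, 0) in sr2o_train)        # False
```

In this sketch the training labels come from sr2o_train, which plays the role of self.sr2o in the snippet above, so the test and valid objects never reach the training labels.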