shehzaadzd / MINERVA

Meandering In Networks of Entities to Reach Verisimilar Answers

Apache License 2.0

310 stars, 87 forks

train/dev/test split #20

Closed: ghost closed this issue 5 years ago

ghost commented 5 years ago

Why are your dev triples included in the training data?

In code/data/preprocessing_scripts/nell.py:

    out_file.write(e1+'\t'+r+'\t'+e2+'\n')
    if np.random.normal() > 0.2:
        dev.write(e1+'\t'+r+'\t'+e2+'\n')

Theoretically you are supposed to split it into 2 datasets (train/test) or 3 (train/dev/test) without overlaps. Please explain the reason behind this. Thank you.
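The overlap itself is easy to measure. Below is a minimal sketch, assuming the split files hold one tab-separated `e1<TAB>r<TAB>e2` triple per line, as the `write` calls quoted above suggest. The helper names `parse_triples` and `overlap` are made up for illustration and are not part of the MINERVA code; the inline lists stand in for the contents of the real train and dev files.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def parse_triples(lines: Iterable[str]) -> Set[Triple]:
    """Parse tab-separated 'e1<TAB>r<TAB>e2' lines into a set of triples."""
    triples: Set[Triple] = set()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:  # silently skip blank or malformed lines
            triples.add((parts[0], parts[1], parts[2]))
    return triples


def overlap(train: Set[Triple], dev: Set[Triple]) -> Set[Triple]:
    """Return the triples that appear in both splits."""
    return train & dev


# Toy data standing in for the lines of train.txt and dev.txt.
train = parse_triples(["a\tr1\tb\n", "b\tr2\tc\n", "c\tr1\td\n"])
dev = parse_triples(["b\tr2\tc\n"])

shared = overlap(train, dev)
print(len(shared), "of", len(dev), "dev triples also appear in train")
```

To run it on real files, pass an open file handle to `parse_triples` instead of a list; a clean split should give an empty `shared` set.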

shehzaadzd commented 5 years ago

The NELL dataset consisted of a train/test split only. The dev set was created for hyperparameter tuning. The preprocessing script is not complete: a separate script was used to remove the duplicates and inverse duplicates. You can use the dev set we created (https://github.com/shehzaadzd/MINERVA/blob/master/datasets/data_preprocessed/nell/dev.txt).

ghost commented 5 years ago

Thank you for your response. I understand that you split the NELL train data into train and dev sets. Would you please let me know what proportion you used for the train/dev split in your paper? I am trying to reproduce the paper's experimental results, and I noticed that the proportion is not mentioned there. Thank you.
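As a side note on the proportion: the condition quoted from nell.py earlier in this thread, `np.random.normal() > 0.2`, draws from a standard normal distribution (mean 0, standard deviation 1 by default), so it selects a triple with probability P(Z > 0.2). That tail probability can be checked with the standard library alone; `normal_tail` is a helper name made up for this sketch.

```python
import math


def normal_tail(threshold: float) -> float:
    """P(Z > threshold) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


p_dev = normal_tail(0.2)
print(f"P(np.random.normal() > 0.2) is about {p_dev:.3f}")
```

This only describes the sampling step in the quoted snippet, before any later filtering, so it says nothing about the size of the final dev set.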

shehzaadzd commented 5 years ago

We tried to extract 20%, but after removing duplicates (and inverse duplicates) and removing triples that contained the only occurrence of an entity, we were left with ~500 triples. You can use https://github.com/shehzaadzd/MINERVA/blob/master/datasets/data_preprocessed/nell/dev.txt to reproduce our results.
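For anyone who wants to build a comparable dev set themselves, here is a rough sketch of one possible reading of the filtering described above. It is not the script that was actually used, and it rests on several assumptions: the candidate dev triples are held out of train, an inverse duplicate is a held-out triple whose reversed form `(e2, inverse(r), e1)` still sits in train, and inverse relations are named with an `_inv` suffix. The names `toggle_inv` and `filter_dev` are hypothetical.

```python
from typing import Callable, Set, Tuple

Triple = Tuple[str, str, str]


def toggle_inv(relation: str) -> str:
    """Assumed naming convention: 'r' <-> 'r_inv'. Adjust to the dataset at hand."""
    return relation[:-4] if relation.endswith("_inv") else relation + "_inv"


def filter_dev(
    train: Set[Triple],
    dev_candidates: Set[Triple],
    inverse: Callable[[str], str] = toggle_inv,
) -> Tuple[Set[Triple], Set[Triple]]:
    """Return (train without dev candidates, filtered dev set)."""
    # 1. Hold the candidates out of train so no exact duplicate remains.
    train_wo = train - dev_candidates

    # 2. Drop candidates whose inverse triple is still present in train.
    kept = {
        (e1, r, e2)
        for (e1, r, e2) in dev_candidates
        if (e2, inverse(r), e1) not in train_wo
    }

    # 3. Drop candidates with an entity that no longer occurs in train,
    #    i.e. the candidate held that entity's only occurrence.
    entities = {e for (e1, _, e2) in train_wo for e in (e1, e2)}
    kept = {t for t in kept if t[0] in entities and t[2] in entities}
    return train_wo, kept


# Toy data: one candidate leaks through its inverse, one has unseen entities.
train = {
    ("a", "r", "b"), ("b", "r_inv", "a"),
    ("a", "s", "c"), ("c", "s", "d"), ("d", "s", "e"),
    ("x", "s", "y"),
}
candidates = {("a", "r", "b"), ("c", "s", "d"), ("x", "s", "y")}

new_train, dev = filter_dev(train, candidates)
print(sorted(dev))
```

Each filter can only shrink the dev set, which fits a 20% target ending up at a much smaller final set.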

ghost commented 5 years ago

I see, appreciate it.