uclnlp / inferbeddings

Injecting Background Knowledge in Neural Models via Adversarial Set Regularisation
MIT License

Synthetic dataset #12

Closed pminervini closed 7 years ago

pminervini commented 7 years ago

Decide synthetic dataset, and which experiments to do on it.

tdmeeste commented 7 years ago

Revived @rockt's scripts to create artificial datasets: in inferbeddings/scripts/synth we now have kb.py and sample_kb.py, both updated from the older naga versions. Calling sample_kb.py with various arguments produces synthetic datasets (train, test, and clause files for use with inferbeddings) with controlled properties and different types of rules.
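For illustration, here is a minimal sketch of how such a controlled synthetic KB could be sampled. The entity/relation names, rule format, and function signature are made up for this example and do not reflect the actual sample_kb.py interface:

```python
import random

def sample_kb(n_entities=50, n_relations=5, n_facts=200, seed=0):
    """Sample random triples, then inject the consequences of a simple
    implication rule r0(X, Y) => r1(X, Y), so the dataset has controlled
    evidence for that clause. (Hypothetical sketch, not sample_kb.py.)"""
    rng = random.Random(seed)
    entities = [f"e{i}" for i in range(n_entities)]
    relations = [f"r{i}" for i in range(n_relations)]
    facts = set()
    while len(facts) < n_facts:
        s, o = rng.sample(entities, 2)
        facts.add((rng.choice(relations), s, o))
    # Inject the rule: every r0 fact entails a corresponding r1 fact.
    entailed = {("r1", s, o) for (p, s, o) in facts if p == "r0"}
    return sorted(facts | entailed), "r1(X, Y) :- r0(X, Y)"

triples, clause = sample_kb()
```

Splitting the resulting triples into train/test then gives a dataset where we know exactly which clause holds and how much ground evidence for it the model sees.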

@rockt @pminervini So far I haven't added rules with negated heads, would this be useful too?

Currently running some experiments with @pminervini's code. I'll add suggestions for specific experiments in #15.

pminervini commented 7 years ago

@tdmeeste I'm currently babysitting the X-Shot Learning experiments in https://github.com/uclmr/inferbeddings/issues/15 - it would be really great if you could handle the "experiments with synthetic datasets" subsection (maybe with @rockt)

Those are the hyperparams I'm currently using for XSL: https://github.com/uclmr/inferbeddings/blob/12b753848bd798fe74cc9ce7a0be5b4aad55920b/scripts/wn18/UCL_WN18_adv_xshot_v1.py#L25

In case you want to count the number of "ground errors" (the number of ground violations obtained by randomly replacing variables with entities), the flags are `--adv-ground-samples ADV_GROUND_SAMPLES` (number of ground samples on which to compute the ground loss) and `--adv-ground-tol ADV_GROUND_TOL` (epsilon-tolerance when calculating the ground loss).
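As a rough illustration of what counting ground violations means: sample random groundings of a clause's variables and check whether the body scores higher than the head by more than the tolerance. The scoring function and rule representation below are toy placeholders, not the actual inferbeddings code; only the two hyperparameters mirror the flags above:

```python
import random

def count_ground_errors(score, rule, entities, n_samples=100, tol=0.0, seed=0):
    """Estimate how often a clause body(X, Y) => head(X, Y) is violated,
    by sampling random groundings of its variables
    (cf. --adv-ground-samples and --adv-ground-tol)."""
    rng = random.Random(seed)
    body, head = rule
    errors = 0
    for _ in range(n_samples):
        x, y = rng.choice(entities), rng.choice(entities)
        # A grounding violates the clause if the body's score exceeds
        # the head's score by more than the epsilon tolerance.
        if score(body, x, y) > score(head, x, y) + tol:
            errors += 1
    return errors
```

Dividing the count by `n_samples` gives a Monte Carlo estimate of the clause's violation rate under the current model.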

tdmeeste commented 7 years ago

Here's the max vs. min approach on the synthetic datasets (sum should be similar to the result in the paper). It appears max is better in general, especially for the less complex formulae. I imagine that by increasing adv_batch_size it could become better for the complex ones as well.
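To make the comparison concrete, a toy sketch of the three aggregation modes, under my reading that max/min/sum refer to how per-sample clause losses are combined over the adversarial batch (illustrative only, not the actual implementation):

```python
def aggregate_adversarial_loss(losses, mode="max"):
    """Aggregate per-sample inconsistency losses over an adversarial batch.
    mode="max" penalises only the worst violation, "min" the mildest,
    "sum" all of them together. (Assumed semantics for illustration.)"""
    if mode == "max":
        return max(losses)
    if mode == "min":
        return min(losses)
    return sum(losses)
```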

I completely agree with @riedelcastro's idea to create smarter synthetic datasets (with controlled evidence of the clauses, which is already foreseen, and with similar entities/relations, which isn't there yet).

[image: max vs. min results on the synthetic datasets]