Closed: pminervini closed this issue 7 years ago.
Revived @rockt's scripts for creating artificial datasets. In `inferbeddings/scripts/synth` we now have `kb.py` and `sample_kb.py`; I updated both from older naga versions. Calling `sample_kb.py` with various arguments creates synthetic datasets (train, test, and clause files for use with inferbeddings) with controlled properties and different types of rules:
```shell
TAG="exp_trans_diff"
DIR="../../data/synth/sampled"

python3 sample_kb.py \
  --entities 30 \
  --predicates 10 \
  --test-prob 0.7 \
  --arg-density 0.1 \
  --fact-prob 0.1 \
  --symm 0 \
  --impl 0 \
  --impl-inv 0 \
  --impl-conj 0 \
  --trans-single 0 \
  --trans-diff 5 \
  --tag $TAG \
  --dir $DIR
```
I verified that the different rules work correctly, and that the inferred facts are added as expected (a fraction `test_prob` of them ends up in the test data, and `1 - test_prob` in the training data).
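That train/test assignment of inferred facts could be sketched as follows (a minimal sketch with a hypothetical helper name, not the actual `sample_kb.py` code):

```python
import random

def split_inferred_facts(inferred_facts, test_prob, seed=0):
    """Assign each inferred fact to the test set with probability
    test_prob, and to the training set otherwise (hypothetical helper;
    sample_kb.py may implement this differently)."""
    rng = random.Random(seed)
    train, test = [], []
    for fact in inferred_facts:
        (test if rng.random() < test_prob else train).append(fact)
    return train, test

# Facts inferred from, e.g., a transitivity rule p(X,Y), q(Y,Z) => r(X,Z)
facts = [("r", "e1", "e3"), ("r", "e2", "e5"), ("r", "e4", "e6")]
train, test = split_inferred_facts(facts, test_prob=0.7)
```

With `--test-prob 0.7` as in the command above, roughly 70% of the inferred facts would land in the test file.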
@rockt @pminervini So far I haven't added rules with negated heads; would those be useful too?
Currently running some experiments with @pminervini's code. I'll add suggestions for specific experiments in #15.
@tdmeeste I'm currently babysitting the X-Shot Learning experiments in https://github.com/uclmr/inferbeddings/issues/15 - it would be really great if you could handle the "experiments with synthetic datasets" subsection (maybe with @rockt)
Those are the hyperparams I'm currently using for XSL: https://github.com/uclmr/inferbeddings/blob/12b753848bd798fe74cc9ce7a0be5b4aad55920b/scripts/wn18/UCL_WN18_adv_xshot_v1.py#L25
In case you want to count the number of "ground errors" (the number of ground violations obtained by randomly replacing variables with entities), the relevant flags are `--adv-ground-samples ADV_GROUND_SAMPLES` (the number of ground samples on which to compute the ground loss) and `--adv-ground-tol ADV_GROUND_TOL` (the epsilon-tolerance used when computing the ground loss).
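To illustrate what such a count measures, here is a minimal sketch (made-up function and clause representation, not the inferbeddings implementation): sample groundings of a clause by replacing its variables with random entities, and count those where the head's score falls short of the body's score by more than the tolerance.

```python
import random

def count_ground_violations(score, clause, entities, num_samples=100, tol=1e-3, seed=0):
    """Count sampled groundings of body => head that violate
    score(head) >= score(body) - tol (hypothetical sketch, not the
    inferbeddings implementation)."""
    rng = random.Random(seed)
    body, head = clause  # each a (predicate, subject_var, object_var) template
    variables = {body[1], body[2], head[1], head[2]}
    violations = 0
    for _ in range(num_samples):
        # Randomly replace every variable with an entity
        binding = {v: rng.choice(entities) for v in variables}
        body_score = score(body[0], binding[body[1]], binding[body[2]])
        head_score = score(head[0], binding[head[1]], binding[head[2]])
        if head_score < body_score - tol:
            violations += 1
    return violations

# Toy scoring function where the head predicate always scores lower than
# the body predicate, so every sampled grounding is a violation
def toy_score(predicate, subj, obj):
    return 1.0 if predicate == "p" else 0.0

clause = (("p", "X", "Y"), ("q", "X", "Y"))  # p(X, Y) => q(X, Y)
num_errors = count_ground_violations(toy_score, clause, ["a", "b", "c"], num_samples=50)
```

`ADV_GROUND_SAMPLES` and `ADV_GROUND_TOL` would correspond to `num_samples` and `tol` here.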
Here's the `max` vs. `min` approach on the synthetic datasets (`sum` should be similar to the result in the paper). It appears `max` is better in general, especially for less complex formulae. I imagine that by increasing `adv_batch_size` it could become better for the complex ones as well.
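For reference, the three aggregation modes being compared can be sketched as follows (a sketch over hypothetical per-sample losses, not the actual inferbeddings code):

```python
def aggregate_adversarial_loss(per_sample_losses, mode="sum"):
    """Aggregate per-sample clause-violation losses over the adversarial
    batch: 'max' keeps only the worst violation, 'min' the mildest,
    and 'sum' accumulates all of them (sketch, not the actual code)."""
    if mode == "max":
        return max(per_sample_losses)
    if mode == "min":
        return min(per_sample_losses)
    return sum(per_sample_losses)

losses = [0.2, 0.7, 0.1]
worst = aggregate_adversarial_loss(losses, mode="max")  # -> 0.7
```

Under `max`, a larger `adv_batch_size` gives the adversary more candidates, so the worst violation it finds can only get worse, which would be consistent with the intuition above.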
I completely agree with @riedelcastro's idea to create smarter synthetic datasets (with controlled evidence of the clauses, which is already foreseen, and with similar entities/relations, which isn't there yet).
Decide on a synthetic dataset, and which experiments to run on it.