Closed: pminervini closed this issue 7 years ago.
Revived @rockt's scripts for creating artificial datasets. In `inferbeddings/scripts/synth` we now have `kb.py` and `sample_kb.py`; I updated both from older naga versions. Calling `sample_kb.py` with various arguments creates synthetic datasets (train, test, and clause files for use with inferbeddings) with controlled properties and different types of rules:
```shell
TAG="exp_trans_diff"
DIR="../../data/synth/sampled"

python3 sample_kb.py \
  --entities 30 \
  --predicates 10 \
  --test-prob 0.7 \
  --arg-density 0.1 \
  --fact-prob 0.1 \
  --symm 0 \
  --impl 0 \
  --impl-inv 0 \
  --impl-conj 0 \
  --trans-single 0 \
  --trans-diff 5 \
  --tag $TAG \
  --dir $DIR
```
I verified that the different rules work correctly, and that the inferred facts are added as expected (a fraction `test_prob` of them ends up in the test data, and `1 - test_prob` in the training data).
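That train/test assignment of inferred facts could be sketched as follows (a minimal sketch with a hypothetical helper name, not the actual `sample_kb.py` code):

```python
import random

def split_inferred_facts(inferred_facts, test_prob, seed=0):
    """Assign each inferred fact to the test set with probability
    test_prob, and to the training set otherwise (hypothetical helper;
    sample_kb.py may implement this differently)."""
    rng = random.Random(seed)
    train, test = [], []
    for fact in inferred_facts:
        (test if rng.random() < test_prob else train).append(fact)
    return train, test

# Facts inferred from, e.g., a transitivity rule p(X,Y), q(Y,Z) => r(X,Z)
facts = [("r", "e1", "e3"), ("r", "e2", "e5"), ("r", "e4", "e6")]
train, test = split_inferred_facts(facts, test_prob=0.7)
```

With `--test-prob 0.7` as in the command above, roughly 70% of the inferred facts would land in the test file.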
@rockt @pminervini So far I haven't added rules with negated heads; would those be useful too?
Currently running some experiments with @pminervini's code. I'll add suggestions for specific experiments in #15.
@tdmeeste I'm currently babysitting the X-Shot Learning experiments in https://github.com/uclmr/inferbeddings/issues/15 - it would be really great if you could handle the "experiments with synthetic datasets" subsection (maybe with @rockt)
Those are the hyperparams I'm currently using for XSL: https://github.com/uclmr/inferbeddings/blob/12b753848bd798fe74cc9ce7a0be5b4aad55920b/scripts/wn18/UCL_WN18_adv_xshot_v1.py#L25
In case you want to count the number of "ground errors" (the number of ground violations obtained by randomly replacing variables with entities), the relevant flags are `--adv-ground-samples ADV_GROUND_SAMPLES` (the number of ground samples on which to compute the ground loss) and `--adv-ground-tol ADV_GROUND_TOL` (the epsilon-tolerance used when computing the ground loss).
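To illustrate what such a count measures, here is a minimal sketch (made-up function and clause representation, not the inferbeddings implementation): sample groundings of a clause by replacing its variables with random entities, and count those where the head's score falls short of the body's score by more than the tolerance.

```python
import random

def count_ground_violations(score, clause, entities, num_samples=100, tol=1e-3, seed=0):
    """Count sampled groundings of body => head that violate
    score(head) >= score(body) - tol (hypothetical sketch, not the
    inferbeddings implementation)."""
    rng = random.Random(seed)
    body, head = clause  # each a (predicate, subject_var, object_var) template
    variables = {body[1], body[2], head[1], head[2]}
    violations = 0
    for _ in range(num_samples):
        # Randomly replace every variable with an entity
        binding = {v: rng.choice(entities) for v in variables}
        body_score = score(body[0], binding[body[1]], binding[body[2]])
        head_score = score(head[0], binding[head[1]], binding[head[2]])
        if head_score < body_score - tol:
            violations += 1
    return violations

# Toy scoring function where the head predicate always scores lower than
# the body predicate, so every sampled grounding is a violation
def toy_score(predicate, subj, obj):
    return 1.0 if predicate == "p" else 0.0

clause = (("p", "X", "Y"), ("q", "X", "Y"))  # p(X, Y) => q(X, Y)
num_errors = count_ground_violations(toy_score, clause, ["a", "b", "c"], num_samples=50)
```

`ADV_GROUND_SAMPLES` and `ADV_GROUND_TOL` would correspond to `num_samples` and `tol` here.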
Here's the `max` vs. `min` approach on the synthetic datasets (`sum` should be similar to the result in the paper). It appears `max` is better in general, especially for less complex formulae. I imagine that by increasing `adv_batch_size` it could become better for the complex ones as well.
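For reference, the three aggregation modes being compared can be sketched as follows (a sketch over hypothetical per-sample losses, not the actual inferbeddings code):

```python
def aggregate_adversarial_loss(per_sample_losses, mode="sum"):
    """Aggregate per-sample clause-violation losses over the adversarial
    batch: 'max' keeps only the worst violation, 'min' the mildest,
    and 'sum' accumulates all of them (sketch, not the actual code)."""
    if mode == "max":
        return max(per_sample_losses)
    if mode == "min":
        return min(per_sample_losses)
    return sum(per_sample_losses)

losses = [0.2, 0.7, 0.1]
worst = aggregate_adversarial_loss(losses, mode="max")  # -> 0.7
```

Under `max`, a larger `adv_batch_size` gives the adversary more candidates, so the worst violation it finds can only get worse, which would be consistent with the intuition above.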
I completely agree with @riedelcastro's idea to create smarter synthetic datasets (with controlled evidence of the clauses, which is already foreseen, and with similar entities/relations, which isn't there yet).
Decide on a synthetic dataset, and which experiments to run on it.