Collect Hypotheses to Test

riedelcastro commented 7 years ago

For the paper and the experiment section, it would be good to be precise about the hypotheses we like to test, and how to test them. Here is a start:

Adversarial learning is more efficient than random sampling (NAACL). Tests:
- lower ground rule violation after same amount of training time (or less time for same ground rule violation counts), ideally on real and synthetic datasets
- better accuracy after same time (this test somehow conflates things a little)
Adversarial learning is more general than EMNLP approach ...
and generality is useful in practice
- Test: show some improvements using types of formulae not supported by EMNLP, ideally over SOTA but at least for ZSL
Adversarial learning for rules works: by finding "synthetic" violators, and pushing them down, real violators disappear (presumably because they are similar to the synthetic violators)

Feel free to comment, edit and add more...

rockt commented 7 years ago

Experiments:

X-Shot Relational Learning
1. Extract rules on full dataset using AIMEE
2. Subsample X% of training triplets in dataset while keeping all test triplets
3. Inject rules using a) Inferbeddings and b) NAACL-like approach Do this for X in [10,20,...,100] and plot "learning" curve
Debug an Embedding Model using Inferbeddings
1. Take a small dataset (like UW)
2. Run Embedding model
3. Investigate dev set predictions and find systematic errors
4. Add rules that would fix systematic errors
5. Retrain using Embedding model + Inferbeddings
6. Hope for improvements ;)
(Optional) Reinforcing Rules
1. Take large dataset (like Yago/DBPedia)
2. Extract many (potentially low-confidence rules with less support) using AIMEE
3. Manually annotate rules
4. Inject remaining rules using Inferbeddings)
(EMNLP Idea) Get External Rules
1. Take large dataset
2. Use external resource such as WordNet
3. Find alignments between words in WordNet and predicates in dataset
4. Add rules based on WordNet relations (problem: relations are probably too simple)

pminervini commented 7 years ago

Setting up the X-Shot Relational Learning experiment right now.

On a side note - I'm wondering if the current approach might be useful also to "mine" rules, e.g. by trying out different rules and checking whether they are violated in the embedding space

tdmeeste commented 7 years ago

Suggestions for the part on NYT experiments:

show average number of zero-one errors per clause (out of 100) as function of epochs (starts small, goes up, and goes nicely to 0 after a while), to show that the model can indeed learn to satisfy the rules,
show influence of initializing violating embeddings with actual embeddings vs. randomly
context: I would take the zero-shot setting (with different fractions of visible FreeBase facts, as in NAACL), otherwise the facts contain too much evidence for the rules. The used rules are those mined from the training data for the NAACL paper, all right?
the actual numbers on the total data appear to be lower than in the EMNLP paper (wMAP 0.62 vs. 0.65 for the model on all train facts). I would put no emphasis on them, given the difficult evaluation context (only using the prior work up to Model F results in the pool, and no further annotations) and differences in the model, even for the DistMult model with dummy object: unit sphere normalization etc).
we could show the difference between the pairwise logistic vs pairwise hinge loss (logistic = bit better), but I would prefer keeping things simple, and just use the logistic loss.

By the way, @pminervini, It appears that forcing the dummy embeddings to ones has a negative impact on the results. By not doing that, we have a somewhat 'extended' model F with an additional global weighting of all dimensions - all right if we do that? It's logical from the point of view of DistMult, and I think it will suffice if we shortly mention how we mimic model F based on DistMult.

For the part on the synthetic data experiments: (is there going to be space for that?)

Simple description (I kept the construction of the artificial data as simple as possible to explain).
I feel like the only added value would be to explicitly show that the model can actually learn to satisfy other types of rules than simple implications. A simple table would suffice for that. Alternatively, we could show the learning process, based on violated examples as a function of epochs again. What do you say? Because of the controlled environment, that may allow us to show that, e.g., clauses involving 3 different predicates, are harder than simpler clauses.

rockt commented 7 years ago

For NYT, I think one of the main experiments that would be good to have and that we discussed briefly yesterday is to

take the entire dataset
extract the clauses
gather the set of head predicates H of all clauses
subsample the dataset, but only for facts with predicates in H, i.e., for every predicate appearing as head in one of the clauses, drop 10%, 20%, ..., 100% of the facts

This should increase the margin we see for using rules vs. not using rules quite dramatically as it is closer to the NAACL and EMNLP zero and x-shot experiments.

tdmeeste commented 7 years ago

Not sure I understand exactly what you mean.

extract the clauses

do you mean: take those 36 NAACL simple implications (and as model: the Model F simplification of DistMult)?

Or: use Amie+ to extract a wider set of rules of the forms q(X0, X1) :- p(X0, X1) or even r(X0, X1) :- p(X0, X1), q(X0, X1) (to be manually pruned?) for which the Model F simplification of DistMult can still be used?

Or: use Amie+ to extract a new rule set with more general rules (to be manually pruned) from the training data with Amie+, involving both FB and NYT heads and body's, and don't model embeddings for (subj, obj) together as in NAACL/EMNLP, but use TransE, ComplEx, DistMult (possibly with worse results compared to learning entity pair embeddings).

Completely agree with the setting of dropping head facts, that's what I head in mind for the experiment with the 36 simple implication clauses with FreeBase head predicates.

riedelcastro commented 7 years ago

Can’t we use the same rules and clauses we used for the NAACL/EMNLP experiments?

S

On 21 Feb 2017, at 19:27, Tim Rocktäschel notifications@github.com wrote:

For NYT, I think one of the main experiments that would be good to have and that we discussed briefly yesterday is to

take the entire dataset extract the clauses gather the set of head predicates H of all clauses subsample the dataset, but only for facts with predicates in H, i.e., for every predicate appearing as head in one of the clauses, drop 10%, 20%, ..., 100% of the facts This should increase the margin we see for using rules vs. not using rules quite dramatically as it is closer to the NAACL and EMNLP zero and x-shot experiments.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/uclmr/inferbeddings/issues/15#issuecomment-281452980, or mute the thread https://github.com/notifications/unsubscribe-auth/AAKMzCuAJX32Uw09jxs8YbKB_J2ZAmLzks5rezq1gaJpZM4L-LJc.

tdmeeste commented 7 years ago

sure! Only disadvantage: they're very simple. I would prefer the 36 pruned NAACL clauses, because for these there are results as a function of fraction of head facts in both papers.

@riedelcastro How about adding synthetic data experiments to for analyzing the more complex rules (see suggestions above)?

rockt commented 7 years ago

Yes, I think they would be too simple and I don't know what we would expect to see. The EMNLP approach is probably very efficient for these simple clauses.

uclnlp / inferbeddings

Collect Hypotheses to Test #15