Closed: weissenh closed this issue 2 years ago
Thanks for bringing this up, Pia! I agree that OOV in the generalization set isn't ideal, although I'm not surprised that a small number of such cases exist, given the sampling procedure (it doesn't guarantee that every vocabulary item gets sampled). Since the number of affected datapoints is relatively small, for the time being you can exclude the 22 cases from your evaluation (I don't think this will qualitatively affect the reported results).
In the longer run, I'm thinking of adding more dataset samples (same sampling procedure, but with different random seeds) to examine the variation there, and for those I'll try to fix this issue.
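A minimal sketch of that exclusion step (assuming the COGS TSV layout of tab-separated source sentence, logical form, and generalization type; file names and the helper are placeholders, not part of the released code):

```python
# Sketch: drop gen-set samples whose source sentence contains one of the
# affected lemmas before computing exact-match accuracy.
# Assumes tab-separated lines of (source sentence, logical form, type).
AFFECTED = {"monastery", "gardner"}

def filter_oov(in_path, out_path, affected=AFFECTED):
    """Copy in_path to out_path, skipping lines whose source side
    contains any token from `affected`. Returns the number of lines kept."""
    kept = 0
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            source = line.split("\t")[0]
            if affected.isdisjoint(source.split()):
                fout.write(line)
                kept += 1
    return kept
```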
TL;DR
It can be hard to generalize to sentences containing words never seen during training. Is this part of your definition of what it means to compositionally generalize?
Background
One way to interpret the Principle of Compositionality is that knowing the meaning of all words (let's ignore problems of disambiguation and idioms for now) is a necessary prerequisite for 'understanding' (more precisely, computing the meaning of) a sentence. In other words, no out-of-vocabulary (OOV) words are allowed.
Imagine you are asked to provide the meaning of a sentence (e.g. as a logical form) containing a word that you've never encountered before: would you be able to do so? The COGS logical forms make it tempting to say yes, at least for nouns (only singular nouns occur here), because their morphological form on the input side is character-for-character identical to a token of the logical form (e.g. 'The boy' translates to `* boy ( x _ 1 ) ;`). But do we want to rely on that cheap copying trick? I can see some justification for proper nouns ('names'), but not really for common nouns like 'gardner' (sic!) or 'monastery'.
Whether or not your definition of compositional generalization includes dealing with OOV words, I would rather have it made explicit; that's why I am raising the issue here.
I actually only stumbled upon this because a couple of sentences, across all the different approaches I tried, never succeeded according to the exact-match criterion; this even prevented me from reaching 100% dev set accuracy no matter how long I trained (it always stayed at 99.97% due to that one 'gardner' sentence, see below).
Concrete numbers
Using the commit version from April 2021 (6f66383), I got the following numbers, counting lines (= samples) rather than word occurrences, e.g. with `grep -c word ./data/*.tsv`. The last number was obtained with `grep -c 'monastery\|gardner' train.tsv`.
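For a systematic check (rather than grepping for individual words), one could enumerate all OOV tokens at once. A sketch, assuming the COGS TSV layout with the source sentence in the first column and whitespace tokenization (the function name is mine, not from the repo):

```python
from collections import Counter

def oov_report(train_path, eval_path):
    """Count tokens in eval_path's source sentences (first TSV column)
    that never occur in train_path's source sentences.
    Assumes whitespace-tokenized, tab-separated files."""
    def sources(path):
        with open(path, encoding="utf-8") as f:
            return [line.split("\t")[0] for line in f if line.strip()]
    train_vocab = {tok for sent in sources(train_path) for tok in sent.split()}
    oov = Counter(tok for sent in sources(eval_path)
                  for tok in sent.split() if tok not in train_vocab)
    return dict(oov)
```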
In the generalization set, the PP recursion generalization type (`pp_recursion`) seems to be affected most (6 samples with 'monastery', 9 with 'gardner'; no overlapping samples). As a consequence, a model which builds its vocabulary from its training set only will struggle with 1 sentence each on dev and test, and with 22 or 10 samples on the gen set (depending on whether it was trained on `train.tsv` or `train_100.tsv`). If the dev set is included in the vocabulary, at least for the `train.tsv` training the problem of OOV words in the gen set (12 'monastery' samples) remains to some degree.
Question
Long story short, my question is: do you require models to deal with OOV words in order to solve COGS' generalization set and to count as succeeding at compositional generalization?
I've read your EMNLP paper which introduced the COGS dataset, but I haven't found any comment on this. I would be very glad if you could point me to it in case I missed it.
Thank you in advance!