Closed: weissenh closed this issue 2 years ago
Thanks for bringing this up, Pia! I agree that OOV in the generalization set isn't ideal, although I'm not surprised that a small number of such cases exist, given the sampling procedure (it doesn't guarantee that every vocabulary item gets sampled). Since the number of affected datapoints is relatively small, for the time being you can exclude the 22 cases from your evaluation (I don't think this will qualitatively affect the reported results).
In the longer run, I'm thinking of adding more dataset samples (same sampling procedure, but with different random seeds) to examine the variation there, and for those I'll try to fix this issue.
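A minimal sketch of that exclusion step (assuming the COGS TSV layout of tab-separated source sentence, logical form, and generalization type; file names and the helper are placeholders, not part of the released code):

```python
# Sketch: drop gen-set samples whose source sentence contains one of the
# affected lemmas before computing exact-match accuracy.
# Assumes tab-separated lines of (source sentence, logical form, type).
AFFECTED = {"monastery", "gardner"}

def filter_oov(in_path, out_path, affected=AFFECTED):
    """Copy in_path to out_path, skipping lines whose source side
    contains any token from `affected`. Returns the number of lines kept."""
    kept = 0
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            source = line.split("\t")[0]
            if affected.isdisjoint(source.split()):
                fout.write(line)
                kept += 1
    return kept
```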
TL;DR
It can be hard to generalize to sentences containing words never seen during training. Is this part of your definition of what it means to compositionally generalize?
Background
One way to interpret the Principle of Compositionality is that knowing the meaning of all words (let's ignore problems of disambiguation and idioms for now) is a necessary prerequisite for 'understanding' (more precisely, computing the meaning of) a sentence. In other words, no out-of-vocabulary (OOV) words are allowed.
Imagine you are asked to provide the meaning of a sentence (e.g. as a logical form) containing a word that you've never encountered before: would you be able to do so? The COGS logical forms make it tempting to say yes, at least for nouns (only singular nouns occur here), because their morphological form on the input side is character-for-character identical to a token of the logical form (e.g. 'The boy' translates to `* boy ( x _ 1 ) ;`). But do we want to rely on that cheap copying trick? I can see some justification for proper nouns ('names'), but not really for common nouns like 'gardner' (sic!) or 'monastery'.
Whether or not your definition of compositional generalization includes dealing with OOV words, I would rather have it made explicit; that's why I am raising the issue here.
I actually only stumbled upon this because a couple of sentences, across all the different approaches I tried, never succeeded according to the exact-match criterion; this even prevented me from reaching 100% dev set accuracy no matter how long I trained (it always stayed at 99.97% due to that one 'gardner' sentence, see below).
Concrete numbers
Using the commit version from April 2021 (6f66383), I got the following numbers, counting lines (= samples) rather than word occurrences, e.g. with `grep -c word ./data/*.tsv`. The last number was obtained with `grep -c 'monastery\|gardner' train.tsv`.
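For a systematic check (rather than grepping for individual words), one could enumerate all OOV tokens at once. A sketch, assuming the COGS TSV layout with the source sentence in the first column and whitespace tokenization (the function name is mine, not from the repo):

```python
from collections import Counter

def oov_report(train_path, eval_path):
    """Count tokens in eval_path's source sentences (first TSV column)
    that never occur in train_path's source sentences.
    Assumes whitespace-tokenized, tab-separated files."""
    def sources(path):
        with open(path, encoding="utf-8") as f:
            return [line.split("\t")[0] for line in f if line.strip()]
    train_vocab = {tok for sent in sources(train_path) for tok in sent.split()}
    oov = Counter(tok for sent in sources(eval_path)
                  for tok in sent.split() if tok not in train_vocab)
    return dict(oov)
```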
In the generalization set, the PP recursion generalization type (`pp_recursion`) seems to be affected most (6 samples with 'monastery', 9 with 'gardner'; no overlapping samples). As a consequence, a model which builds its vocabulary from its training set only will struggle with 1 sentence each on dev and test, and with 22 or 10 samples on the gen set (depending on whether it was trained on `train.tsv` or `train_100.tsv`). If the dev set is included in the vocabulary, at least for the `train.tsv` training the problem of OOV words in the gen set (12 'monastery' samples) remains to some degree.
Question
Long story short, my question is: do you require models to deal with OOV words in order to solve COGS' generalization set and to count as succeeding at compositional generalization?
I've read your EMNLP paper which introduced the COGS dataset, but I haven't found any comment on this. I would be very glad if you could point me to it in case I missed it.
Thank you in advance!