Closed jimz7 closed 1 month ago
Thanks for your interest in our work! Indeed, the naming of the metrics on the fuzzy logic task is a bit confusing: what we call OOD in the paper (out-of-distribution combinations of terms) is reported as Test R2 in the code. What the code calls OOD R2 is an even harder setting in which we hold out whole terms. For easy reference, the last paragraph on page 5 describes this setting:
Our compositional split only contains novel combinations of known terms. A priori, it might be possible that the system learns to compose the basic fuzzy logic operations in a way that allows it to generalize to any fuzzy logic function, including those that contain novel combinations of unknown terms. However, testing on functions obtained as novel combinations of K unknown terms, we find that none of the models considered here is able to solve such a task.
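As general context (not specific to this repo's implementation): a negative R2 is not an error in the metric itself. Since R2 = 1 - SS_res/SS_tot, it drops below zero whenever a model's predictions are worse than simply predicting the mean of the targets, which is exactly what the quoted paragraph reports for the held-out-terms setting. A minimal sketch with a hypothetical `r2_score` helper:

```python
import numpy as np

def r2_score(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot; it is negative whenever the model
    # predicts worse than the constant mean-of-targets baseline.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([0.1, 0.5, 0.9, 0.3])
print(r2_score(y, y))        # perfect fit -> 1.0
print(r2_score(y, 1.0 - y))  # anti-correlated predictions -> negative
```

So a negative OOD R2 on the held-out-terms split simply means the model does worse than the mean baseline there.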
I tried
python run.py --config 'configs/logic.py:logic_4var_2term;linear_hypatt'
and obtained a negative OOD R2 (around -0.18), while the train and test R2 are both positive (around 0.87) on the fuzzy logic task. Running
python run.py --config 'configs/logic.py:logic_4var_2term;transformer'
also gives a negative OOD R2 (around -0.49). How can I fix this?