joshua-decoder / thrax

Hadoop-based tool for extraction of large scale synchronous grammars for paraphrasing and machine translation
joshua-decoder.org

extraction: clearer default label policies #5

Closed jweese closed 11 years ago

jweese commented 11 years ago

This patch introduces the configuration key "allow-default-nt" to let the user specify when default labels should be allowed for extracted rules.

The possible values are "always", "phrases", or "never", with "always" being the default. These options correspond to the following policies:

ALWAYS: the default NT is always allowed to be used for the LHS or gap in any rule. (Hiero policy.)

PHRASES: the default NT is only allowed for the LHS of non-hierarchical rules. (This is the default SAMT policy.)

NEVER: never use the default NT for any rule.
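As a rough illustration of the three policies, here is a minimal Java sketch. The enum `DefaultNtPolicy` and its methods `fromConfig` and `allowsDefault` are hypothetical names made up for this example and are not taken from the Thrax source; the sketch only encodes the rules as they are described above.

```java
import java.util.Locale;

// Hypothetical sketch of the three "allow-default-nt" policies.
// None of these names come from the Thrax code base.
enum DefaultNtPolicy {
    ALWAYS, PHRASES, NEVER;

    // Parse a config value such as "always", "phrases", or "never".
    // Throws IllegalArgumentException for any other value.
    static DefaultNtPolicy fromConfig(String value) {
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    // isLhs: true if the label is for the rule's LHS, false if it is for a gap.
    // ruleIsHierarchical: true if the rule contains at least one gap.
    boolean allowsDefault(boolean isLhs, boolean ruleIsHierarchical) {
        switch (this) {
            case ALWAYS:
                return true;                          // Hiero policy
            case PHRASES:
                return isLhs && !ruleIsHierarchical;  // default SAMT policy
            default:
                return false;                         // NEVER
        }
    }
}
```

Under this reading, PHRASES rejects the default NT both for gaps and for the LHS of any rule that has gaps, which is what distinguishes it from ALWAYS.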

The previous configuration setting for this policy was the poorly named "allow-nonlexical-x". This was a boolean key, where "true" meant the default NT could be used anywhere, and "false" meant it would be allowed only for the LHS of a non-hierarchical rule. The old key is still accepted for backward compatibility. The mapping is thus

"allow-nonlexical-x" true => "allow-default-nt" "always"
"allow-nonlexical-x" false => "allow-default-nt" "phrases"

Previously, there was no way to set "never".
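The backward-compatible lookup could be sketched in Java roughly as follows. The class `DefaultNtConfig` and the method `resolve` are hypothetical names for this example. Letting the new key win when both keys are present is an assumption of this sketch; the description above does not say how that case is handled.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of resolving the effective "allow-default-nt" value.
// Not taken from the Thrax source.
final class DefaultNtConfig {
    private DefaultNtConfig() {}

    static String resolve(Map<String, String> conf) {
        // Assumption: an explicit new-style key takes precedence.
        String explicit = conf.get("allow-default-nt");
        if (explicit != null) {
            return explicit.trim().toLowerCase(Locale.ROOT);
        }
        // Legacy boolean key: true => "always", false => "phrases".
        String legacy = conf.get("allow-nonlexical-x");
        if (legacy != null) {
            return Boolean.parseBoolean(legacy.trim()) ? "always" : "phrases";
        }
        // Neither key is set: "always" is the documented default.
        return "always";
    }
}
```

Note that "never" is reachable only through the new key, since the legacy boolean has just two states.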