ufal / treex

Treex NLP framework

Multivalue gender and number grammatemes #61

Closed michnov closed 7 years ago

michnov commented 7 years ago

Until now, gram/gender and gram/number were always assigned a single value. Even when morphology and syntax left the value ambiguous, e.g. gram/gender=fem|neut, either one of the candidate values or the universal nr (not recognized) had to be picked. This discards information provided by morphological analysis (e.g. Hajic's tags support multiple values of gender and number). For tasks such as coreference resolution of pronouns, where the anaphor usually agrees in gender and number with its antecedent, both premature disambiguation (picking one of the possible values) and dropping the restriction altogether (allowing any value via nr) may be harmful.
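To illustrate why multivalue grammatemes help coreference resolution, here is a minimal Python sketch of an agreement check between an anaphor and an antecedent candidate. It is not Treex code: the function name `grammatemes_agree` is hypothetical, and treating nr as compatible with everything is an assumption that mirrors the "any value" behaviour described above.

```python
def grammatemes_agree(anaphor_value: str, antecedent_value: str) -> bool:
    """Return True if two grammateme values share at least one single value.

    Hypothetical helper, not part of Treex. Values are either a single
    value ("fem"), a combined value ("fem|neut"), or "nr".
    """
    # Assumption: "nr" (not recognized) imposes no restriction at all.
    if anaphor_value == "nr" or antecedent_value == "nr":
        return True
    anaphor_options = set(anaphor_value.split("|"))
    antecedent_options = set(antecedent_value.split("|"))
    # The two nodes agree if their sets of possible values overlap.
    return bool(anaphor_options & antecedent_options)
```

With a combined value such as fem|neut, a masculine animate candidate is still ruled out, whereas nr would let it through and a premature choice of fem would wrongly reject a neuter antecedent.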

This pull request allows predefined combinations of single values. The combinations are based on Czech and can be extended if another language requires it. Each new value consists of the original single values separated by a | symbol. All permitted combined values have been added to the Treex PML schema, so they are treated as atomic. Consequently, a combined value must always be written in its predefined order (e.g. fem|neut, not neut|fem).
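Because combined values are atomic strings, any code that builds them has to emit the parts in the predefined order. The following Python sketch shows one way to do this. The order in `GENDER_ORDER`, the set `ALLOWED_COMBINATIONS`, and the fallback to nr for unlisted combinations are all assumptions made for illustration; the authoritative list lives in the Treex PML schema.

```python
# Assumed canonical order of single gender values (hypothetical; the real
# order is defined by the Treex PML schema).
GENDER_ORDER = ["anim", "inan", "fem", "neut"]

# Assumed subset of predefined combined values (hypothetical).
ALLOWED_COMBINATIONS = {"anim|inan", "fem|neut"}


def combine_gender_values(values):
    """Join candidate values into one atomic string in canonical order.

    Hypothetical helper, not part of Treex. Falls back to "nr" when there
    is no candidate or the combination is not predefined (an assumption).
    """
    unique = set(values)
    unknown = unique - set(GENDER_ORDER)
    if unknown:
        raise ValueError(f"unknown gender values: {sorted(unknown)}")
    if not unique:
        return "nr"
    # Emit the parts in canonical order regardless of the input order.
    combined = "|".join(v for v in GENDER_ORDER if v in unique)
    if len(unique) > 1 and combined not in ALLOWED_COMBINATIONS:
        return "nr"
    return combined
```

Under these assumptions, `combine_gender_values(["neut", "fem"])` yields `fem|neut`, which is the only spelling the schema would accept.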

The Czech blocks that set these grammatemes (A2T::CS::AddPersPron and A2T::CS::SetGrammatemes) have been adjusted to set the combined values in case of ambiguity. We also adjusted (and moved) the disambiguating block A2T::DisambiguateGrammatemes, which resolves a combined value either by following a coreferential link or by applying a simple rule. This block can be called wherever the subsequent blocks do not expect multiple values for these grammatemes, e.g. in translation.
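The two-step strategy of the disambiguating block (coreferential link first, simple rule second) can be sketched in Python as follows. This is not the implementation of A2T::DisambiguateGrammatemes: the function name is hypothetical, and the source does not say what the simple rule is, so the sketch assumes "take the first listed value".

```python
def disambiguate_grammateme(value, antecedent_value=None):
    """Reduce a possibly combined grammateme value to a single value.

    Hypothetical sketch, not the actual Treex block.
    """
    options = value.split("|")
    if len(options) == 1:
        # Already a single value (or "nr"); nothing to disambiguate.
        return value
    candidates = options
    if antecedent_value is not None and antecedent_value != "nr":
        # Step 1: keep only the values shared with the coreferential antecedent.
        antecedent_options = antecedent_value.split("|")
        shared = [v for v in options if v in antecedent_options]
        if shared:
            candidates = shared
    # Step 2 (assumed simple rule): take the first remaining value.
    return candidates[0]
```

For example, `disambiguate_grammateme("fem|neut", "neut")` resolves to `neut` through the antecedent, while the same call without an antecedent falls back to the assumed rule and returns `fem`.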

With this pull request applied, cs->en translation improved marginally, by 0.01 BLEU on each test set:

- QTLeap news: 14.39 -> 14.4
- QTLeap Batch3q: 21.27 -> 21.28