nert-nlp / cgel

CGEL trees.
Creative Commons Attribution 4.0 International

Pronoun/determinative/possessive lemmas #128

Open nschneid opened 2 months ago

nschneid commented 2 months ago

We should standardize these and enforce in the validator. As is, e.g. "its" is sometimes lemmatized as "it".

The UD lemmatization policies have evolved and are summarized here for pronouns. Basically,

(discussion at https://github.com/UniversalDependencies/docs/issues/517)

We could simply adopt the UD policies; or, because they potentially diverge from CGEL at least with regard to possessives, and as pronouns and determinatives are closed classes, we could simply omit the lemmas from the CGELBank trees, and provide a lookup table for anyone who wants them.
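To make the lookup-table option concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the table entries, the policy of mapping possessives like "its" to "it", the POS labels `N_pro` and `D`, and the function name `lookup_lemma` are all hypothetical, and the real table would follow whichever policy is chosen.

```python
from typing import Optional

# Hypothetical lookup table. The entries encode ONE possible policy
# (possessive and accusative forms map to the nominative/plain form);
# this is an illustrative assumption, not the CGELBank or UD standard.
CLOSED_CLASS_LEMMAS = {
    "it": "it",
    "its": "it",
    "they": "they",
    "them": "they",
    "their": "they",
    "theirs": "they",
    "this": "this",
    "these": "this",
    "that": "that",
    "those": "that",
}

# Assumed POS labels for pronouns and determinatives.
CLOSED_CLASS_POS = {"N_pro", "D"}


def lookup_lemma(form: str, pos: str) -> Optional[str]:
    """Return the table lemma for a closed-class word, or None.

    None means either the POS is not a closed class covered by the
    table, or the form is missing from the table.
    """
    if pos not in CLOSED_CLASS_POS:
        return None
    return CLOSED_CLASS_LEMMAS.get(form.lower())


print(lookup_lemma("Its", "N_pro"))  # it
print(lookup_lemma("its", "N"))      # None (not a closed-class POS)
```

Keeping the policy in a single dictionary means that switching between a UD-compatible and a CGEL-specific treatment of possessives would only require changing the table, not the trees.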

Also, for full nouns with a possessive ending, whether that is lemmatized to the non-possessive form should be consistent. (The possessive ending is considered a separate syntactic word in UD, but not in CGEL; in UD-derived data this is made explicit with :subt features.)

nschneid commented 2 months ago

Possible solution that would minimize manual annotation effort:

BrettRey commented 2 months ago

I guess I'm not following. You write, "We should standardize these and enforce in the validator. As is, e.g. "its" is sometimes lemmatized as "it"." That seems fine. Is the issue that it is only sometimes lemmatized as "it"? Or is there some reason it shouldn't ever be lemmatized?

nschneid commented 2 months ago

A lemma is only sometimes provided explicitly for "its"—the annotations are inconsistent across files.

We have to decide: (1) For pronouns and determinatives, which are a closed set, do we want to ask annotators to specify the lemma explicitly in the .cgel file, or compute it automatically as part of the API? (2) If their lemmas are specified explicitly, do we want to be compatible with UD lemmas?
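Whichever option is chosen, the current inconsistency could be surfaced with a check along these lines. This is only a sketch: it assumes tokens have already been extracted from the .cgel files as `(form, pos, lemma)` tuples with `lemma` set to `None` when no lemma was annotated, and the function name `find_inconsistent_lemmas` is hypothetical rather than part of the existing validator.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

Token = Tuple[str, str, Optional[str]]  # (form, pos, lemma or None)


def find_inconsistent_lemmas(
    tokens: Iterable[Token],
) -> Dict[Tuple[str, str], Set[Optional[str]]]:
    """Group tokens by (lowercased form, pos) and report every group
    that has more than one distinct lemma value. A missing lemma
    (None) counts as a value, so "sometimes annotated, sometimes not"
    is flagged as well."""
    seen: Dict[Tuple[str, str], Set[Optional[str]]] = defaultdict(set)
    for form, pos, lemma in tokens:
        seen[(form.lower(), pos)].add(lemma)
    return {key: lemmas for key, lemmas in seen.items() if len(lemmas) > 1}


# Made-up example data, not taken from the treebank.
sample = [
    ("its", "N_pro", "it"),
    ("Its", "N_pro", None),
    ("the", "D", None),
    ("the", "D", None),
]
for (form, pos), lemmas in find_inconsistent_lemmas(sample).items():
    print(form, pos, lemmas)
```

In the sample, only "its" is reported, because it appears once with the lemma "it" and once with no lemma at all.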

BrettRey commented 2 months ago

OK, I get it.

  1. I see no need for annotators to specify the lemma, but it would be good if they were computed automatically.
  2. I don't have a strong opinion on UD compatibility.

nschneid commented 1 month ago

A reminder to myself that we DO want hand-specified lemmas not just for nouns and verbs, but also adjectives/adverbs inflected for grade (comparative/superlative).

Coordinators, subordinators, and prepositions are not normally expected to inflect or to need lemmas, though a lemma is conceivable in cases of spelling variation ("&" / "and", "@" / "at", etc.). Alternatively, the non-abbreviated form could be indicated as the :correct form.