unipv-larl / UD4HL


Derived causatives #9

Open pkocharov opened 10 months ago

pkocharov commented 10 months ago

I would like to address the issue of annotating derived causatives, which seem to correspond to the Cau value of the Voice feature in UD. I will use Classical Armenian as an example, but I think it may be relevant for other treebanks as well, e.g. Sanskrit. Classical Armenian has an oppositional voice (Act and Pass, where Pass may be conventionally used to tag the whole range of non-active meanings) and causatives, which are formed with the help of a dedicated causative suffix -owcՙ-, and can be additionally characterized for the oppositional voice, e.g. base verb pass. owsan-im 'I learn' > caus. act. ows-owc'-an-em 'I teach'. The question is how to map this pattern in UD. In my view, it is reasonable to use a layered feature Voice[caus] next to Voice (the latter being reserved for the oppositional voice). Ideally one would want to keep the value Cau for the Voice[caus] feature to be consistent with the annotation of causatives across treebanks. However, if I understand the feedback of the validator correctly, features with a single value seem to be allowed only if the value is 'Yes' (e.g. Reflex=Yes). What would be a preferable solution here: to use Voice[caus]=Yes (excluding it from the universal values of the Voice feature), or to introduce a dummy value of the layered feature for base verbs, which would allow for Voice[caus]=Cau? Are there yet other options? I will be grateful for any feedback.

amir-zeldes commented 10 months ago

Typically when two feature values apply to the same feature key, UD uses a comma to separate them, so since passive and causative diathesis can exist concurrently in many languages (e.g. a passive causative verb meaning "I was made to eat it" or similar), I would have expected:

Voice=Cau,Pass

This is similar to combined values for gender as described here for Fem,Masc. I think layered features are mainly used when there are two different underlying things being annotated, e.g. a possessor gender and a possessed gender. Here, if they differ, something like Fem,Masc would be wrong, because the word being annotated doesn't really have both genders simultaneously. But for a passive causative verb, I think it is truly simultaneously passive and causative, so the comma notation should apply.

pkocharov commented 10 months ago

Many thanks for the clarification! I have overlooked this option, because I erroneously thought that such notation implies a range of values, associated with a morphological category, only one of which is valid in a given context, e.g. Voice=Pass,Mid where both values are expressed by the same verbform but only one of the two voices would hold true for a specific use of a verb. Now it all makes sense.

Stormur commented 10 months ago

Typically when two feature values apply to the same feature key, UD uses a comma to separate them, so since passive and causative diathesis can exist concurrently in many languages (e.g. a passive causative verb meaning "I was made to eat it" or similar), I would have expected:

Voice=Cau,Pass

The comma annotation is however (at least officially) intended as "either one". So I would understand Voice=Cau,Pass as "this form could be either Causative or Passive, but morphologically it is not possible to distinguish it".

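So, cases of application for this notation would be, for Latin (not implemented yet):

  • lupis with Case=Abl,Dat -> the two cases are never distinguished in the plural, so it's actually impossible to assign one if not given a wider context, so this is information beyond morphology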

as described here for Fem,Masc

(PS: this link seems to be broken)


This case in Classical Armenian seems to vouch for the separation of Active/Passive from Causative/other: if they can happen at the same time, then they refer to different features. In fact, I can think of a combination of both in Italian, too, though periphrastically:

Using layers, a solution like Voice[caus] would be too ad hoc in my opinion: it does not really identify the level at which the causative is. And then why not Voice[pass] instead? It could be a temporary solution, though. Or maybe something like Voice[valency]? But I am quite convinced we need to differentiate between two features, Voice and ... ?

which are formed with the help of a dedicated causative suffix -owcՙ-, and can be additionally characterized for the oppositional voice, e.g. base verb pass. owsan-im 'I learn' > caus. act. ows-owc'-an-em 'I teach'.

Sorry, can I ask you how exactly active/passive and causative interact in this example?

pkocharov commented 10 months ago

The comma annotation is however (at least officially) intended as "either one". So I would understand Voice=Cau,Pass as "this form could be either Causative or Passive, but morphologically it is not possible to distinguish it". So, cases of application for this notation would be, for Latin (not implemented yet):

  • lupis with Case=Abl,Dat -> the two cases are never distinguished in the plural, so it's actually impossible to assign one if not given a wider context, so this is information beyond morphology
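To make the "either one" reading of the comma concrete, here is a minimal Python sketch that parses a FEATS string and reports which features are left ambiguous. The helper names `parse_feats` and `ambiguous_features` are hypothetical and not part of any UD tool. The `Number=Plur` feature is added only for illustration.

```python
from typing import Dict, List, Set


def parse_feats(feats: str) -> Dict[str, Set[str]]:
    """Parse a CoNLL-U FEATS string into {feature: set of values}."""
    if feats == "_":  # "_" marks an empty FEATS column
        return {}
    parsed: Dict[str, Set[str]] = {}
    for pair in feats.split("|"):
        name, _, values = pair.partition("=")
        parsed[name] = set(values.split(","))
    return parsed


def ambiguous_features(feats: str) -> List[str]:
    """Return the features that carry more than one comma-separated value."""
    parsed = parse_feats(feats)
    return sorted(name for name, values in parsed.items() if len(values) > 1)


# lupis: ablative or dative, not distinguishable in the plural
print(ambiguous_features("Case=Abl,Dat|Number=Plur"))  # ['Case']
```

Under this reading, a consumer treats `Case=Abl,Dat` as a set of alternatives, not as two cases that hold at once.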

It seems that the comma annotation does have the interpretation which I originally had in mind. Thank you for this addition.

This case in Classical Armenian seems to vouch for the separation of Active/Passive from Causative/other: if they can happen at the same time, then they refer to different features.

I think that the very definition of the 'Cau' value of the universal Voice feature points to its different status compared to the values 'Act' and 'Pass' ("Causative forms of verbs are classified as a voice category because, when compared to the basic active form, they change the number of participants and their mapping on semantic roles."). CArm. seems to fully conform to this universal definition of the 'Cau' value.

Using layers, a solution like Voice[caus] would be too ad hoc in my opinion: it does not really identify the level at which the causative is. And then why not Voice[pass] instead?

Because Voice[caus] can be combined with either Voice=Act or Voice=Pass, whereas Voice[pass] would exclude a combination of the values Act and Cau.
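As a rough illustration of how the layered proposal would look in practice, the following Python sketch splits a feature name into its base name and optional layer, and shows the two Classical Armenian forms annotated under it. The names `FEATURE_KEY` and `split_layer` are hypothetical. The value `Yes` for `Voice[caus]` is only one of the two options raised in the opening post, and the name pattern is my assumption about what the CoNLL-U format allows.

```python
import re
from typing import Optional, Tuple

# Assumed shape of a feature name: a capitalized name plus an optional
# lowercase layer in square brackets, e.g. Voice[caus].
FEATURE_KEY = re.compile(r"^([A-Z][A-Za-z0-9]*)(?:\[([a-z0-9]+)\])?$")


def split_layer(key: str) -> Tuple[str, Optional[str]]:
    """Split 'Voice[caus]' into ('Voice', 'caus'); plain names get layer None."""
    match = FEATURE_KEY.match(key)
    if match is None:
        raise ValueError(f"not a well-formed feature name: {key!r}")
    return match.group(1), match.group(2)


# Hypothetical annotations of the two forms under the layered proposal.
owsanim = {"Voice": "Pass"}                          # 'I learn'
owsowcanem = {"Voice": "Act", "Voice[caus]": "Yes"}  # 'I teach'

print(split_layer("Voice"))        # ('Voice', None)
print(split_layer("Voice[caus]"))  # ('Voice', 'caus')
```

The point of the sketch is that the oppositional voice and the causative sit under separate keys, so either Act or Pass can co-occur with the causative layer.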

Or maybe something like Voice[valency]?

This may indeed be a better option, which would allow anticausative derivational markers, for example, to be tagged by the same feature. One might even think of introducing a universal feature Valency with values Cau (for valency-increasing derivation) and Anticau (for valency-decreasing derivation). One might then cover the morphological features of languages which combine either of these values, or both, with the inflectional voice (the latter may be the case in Hittite).

which are formed with the help of a dedicated causative suffix -owcՙ-, and can be additionally characterized for the oppositional voice, e.g. base verb pass. owsan-im 'I learn' > caus. act. ows-owc'-an-em 'I teach'.

Sorry, can I ask you how exactly active/passive and causative interact in this example?

owsan-im has a mediopassive ending -im added to the present stem of the base verb, while owsowc'an-em has an active ending -em added to the causative stem, derived from the base verb with the help of -owc'.

amir-zeldes commented 10 months ago

Adding @dan-zeman - any thoughts?

dan-zeman commented 10 months ago

Using layered features for this seems to be off, as Amir has noted. (Plus, I am puzzled by the layer name "valency" - in my opinion, the feature name Voice itself suggests it will be very much about valency.)

Also, Flavio rightly noted that the comma notation is for something else.

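There is a semi-standard way to do this (meaning it is not sufficiently promoted in the universal guidelines, but it has emerged as a de facto standard in agglutinating languages and nothing better has been invented since then, so it should probably be officially mentioned in the guidelines, too): Voice=CauPass. You have to define the combined value as language-specific, and document the "normal" values along with it, too (but you would have to do the same with layered features). See, for example, Turkish Voice. You can then define various combinations:

* `Voice=Act` ... I learned it

* `Voice=Pass` ... it was learned

* `Voice=Cau` ... I caused you to learn it = I taught you it (you could also understand it as equivalent to `Voice=ActCau`)

* `Voice=CauPass` ... I was caused to learn it = I was taught it

* `Voice=PassCau` ... I caused it to be learned = I taught it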

In Turkish, as I understand it, the sequence of values simply reflects the sequence of passive and causative morphemes suffixed to the word. But I think the combinations can be defined even if they are reflected in the morphology less straightforwardly.
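For tools that need to recover the individual operations from such a combined value, a minimal Python sketch could split it on capital letters. The function name `split_combined_voice` is hypothetical. The sketch assumes that every atomic value starts with one uppercase letter followed only by lowercase letters or digits, which holds for Act, Pass and Cau but is not guaranteed for every language-specific value.

```python
import re
from typing import List


def split_combined_voice(value: str) -> List[str]:
    """Split a combined value such as 'CauPass' into its atomic parts.

    Assumption: each part is one uppercase letter followed by
    lowercase letters or digits.
    """
    return re.findall(r"[A-Z][a-z0-9]*", value)


for value in ["Cau", "CauPass", "PassCau"]:
    print(value, split_combined_voice(value))
# Cau ['Cau']
# CauPass ['Cau', 'Pass']
# PassCau ['Pass', 'Cau']
```

The order of the parts is preserved, so CauPass and PassCau stay distinct, which matters if the sequence is meant to mirror the order of morphemes as in Turkish.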

pkocharov commented 10 months ago

Thank you very much for the clarifications!

In the case of Armenian, a minor complication would be that some inflectional forms in the paradigm of each verb, including the derived causatives, are labile, e.g. the imperfect tense forms: ows-owc'-an-ei 'I taught/I was taught'. I guess these should then be tagged using a comma like Voice=Cau,CauPass, even though the oppositional voice is not expressed in such forms at all (so no Voice feature is evoked in the tagging of the respective forms of base verbs).

dan-zeman commented 10 months ago

so no Voice feature is evoked in the tagging of the respective forms of base verbs

Whether voice-specific morphology is observable, and what to do if it isn't, is a different question. I think that in most languages Voice=Act would denote the forms unmarked for voice. Treebanks use it because they want to say that each verb form is either active or passive; but it does not mean that there is a position in the string of morphemes that is always filled either with a passive or an active morpheme. And other treebanks simply don't use Voice=Act at all, i.e., some verb forms will have Voice=Pass (or Cau, CauPass etc.) and others will have no Voice feature. This approach is taken in the Turkish treebanks. UD does not dictate which of the two approaches should be preferred, so you can choose one, document it for Classical Armenian and then use it consistently. (You may want to synchronize with what is done in the Modern Armenian treebanks if it makes sense, i.e., if this part of the grammar has not changed significantly.)
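The difference between the two conventions can be sketched in a few lines of Python. The function `effective_voice` and its `unmarked_is_active` flag are hypothetical names used only to illustrate how a consumer might read the Voice of a form under either convention; they are not part of any UD tool.

```python
from typing import Dict, Optional


def effective_voice(feats: Dict[str, str], unmarked_is_active: bool) -> Optional[str]:
    """Return the Voice of a verb form under one of the two conventions.

    unmarked_is_active=True:  a missing Voice feature is read as Act.
    unmarked_is_active=False: a missing Voice feature stays unspecified.
    """
    voice = feats.get("Voice")
    if voice is None and unmarked_is_active:
        return "Act"
    return voice


unmarked_form = {"Tense": "Past"}   # a form with no Voice feature
passive_form = {"Voice": "Pass"}

print(effective_voice(unmarked_form, unmarked_is_active=True))   # Act
print(effective_voice(unmarked_form, unmarked_is_active=False))  # None
print(effective_voice(passive_form, unmarked_is_active=True))    # Pass
```

Whichever convention is chosen, explicitly marked values are returned unchanged, so the choice only affects forms that carry no Voice feature.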

Stormur commented 10 months ago

Using layered features for this seems to be off, as Amir has noted. (Plus, I am puzzled by the layer name "valency" - in my opinion, the feature name Voice itself suggests it will be very much about valency.)

This was just the first name that came to my mind :grimacing: I am a little confused about the exact definition of "voice". More or less any operation with verbs has or can have to do with valency, but it seems that passive and causative act at different levels. But probably the feature is always the same.

There is a semi-standard way to do this [...]

* `Voice=Act` ... I learned it

* `Voice=Pass` ... it was learned

* `Voice=Cau` ... I caused you to learn it = I taught you it (you could also understand it as equivalent to `Voice=ActCau`)

* `Voice=CauPass` ... I was caused to learn it = I was taught it

* `Voice=PassCau` ... I caused it to be learned = I taught it

This looks like a convenient temporary solution, but at the same time it highlights the problem that many valency-changing operations can take place at once on the same verb (looking at "monsters" like CauCauPass...). This does seem like something for layers, but what should they be called? Could we envision a simple enumeration like Voice[2], Voice[3]... (of course a way to generate patterns for documentation would help here...)?


It is not really the same thing, but I can see something similar happening with other markings. For example, Latin verbs have a so-called frequentative (a diminutive degree, actually) form, and sometimes it can be repeated:

Currently, I see no way to annotate this, but it could be something like Degree=Dim|Degree[2]=Dim. In general, diminutives are a kind of marking which can be stacked.
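To show what such an enumeration could look like, here is a minimal Python sketch that turns a list of stacked markers into numbered layers. The function name `stack_layers` is hypothetical, and the numbered-layer scheme itself is only the idea floated in this comment, not an existing UD convention.

```python
from typing import List


def stack_layers(feature: str, values: List[str]) -> str:
    """Serialize stacked values as Feature=V1|Feature[2]=V2|...

    The first value goes on the plain feature; later values go on
    numbered layers, following the enumeration idea sketched above.
    """
    parts = []
    for index, value in enumerate(values, start=1):
        key = feature if index == 1 else f"{feature}[{index}]"
        parts.append(f"{key}={value}")
    return "|".join(parts)


print(stack_layers("Degree", ["Dim", "Dim"]))  # Degree=Dim|Degree[2]=Dim
print(stack_layers("Voice", ["Cau", "Pass"]))  # Voice=Cau|Voice[2]=Pass
```

The same helper would cover both the repeated diminutive and the stacked valency operations, which is the appeal of a purely numeric layer name.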


UD does not dictate which of the two approaches should be preferred

Is it not almost mandatory not to annotate negatively defined features (proprietates ad absentiam :grimacing: )?

pkocharov commented 10 months ago

Whether voice-specific morphology is observable, and what to do if it isn't, is a different question. I think that in most languages Voice=Act would denote the forms unmarked for voice.

I aim to maintain a strictly morphological principle in the FEATS field for Classical Armenian, which can be useful for tracing how syntax is mapped onto morphology. For that reason I care which forms are marked for the active or passive voice, or unmarked, and whether they are causatives or not at the same time. So I would certainly maintain both Act and Pass values to set the respective forms apart from the unmarked forms. The convention of using Cau for *ActCau creates a problem; otherwise I could specify the causative while leaving the oppositional voice untagged. More generally, splitting the "inflectional" and "derivational" voice would work perfectly for at least some ancient Indo-European languages, including CArm. The current conventions require one to look for ways around this in order to map that morphological (and maybe also functional) contrast accurately.

You may want to synchronize with what is done in the Modern Armenian treebanks if it makes sense, i.e., if this part of the grammar has not changed significantly.

The valency-coding morphology has changed a lot in Modern Armenian. Eastern Modern Armenian uses a transitivizing suffix -c'n- and an intransitivizing one -v- while the endings as such do not express the oppositional voice. Moreover, the "ArmTDP" treebank does not follow the morphological principle in assigning the Voice values. For example, one finds both forms with -v- and without it tagged as Mid, etc. I am afraid that this part of the annotation will not be entirely compatible across the Armenian treebanks.

amir-zeldes commented 10 months ago

If "," in values can only mean "either/or but not both at once", then maybe we need a canonical way to mean "multiple simultaneous values"? How about canonizing "+" for this:

Voice=Cau+Pass

I could imagine this might come in handy in other scenarios as well.

dan-zeman commented 10 months ago

How about canonizing "+" for this:

That would be a UDv3 type of change for me because it would negate a very low-level assumption required by the guidelines/validator (and thus potentially built into other UD-related tools) – the "+" character cannot occur in feature values.
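A small Python sketch of the kind of check this assumption leads to is given below. The pattern is my paraphrase of the constraint on feature values (an uppercase letter or digit followed by ASCII letters or digits), not code copied from the validator, and `is_valid_value_list` is a hypothetical name.

```python
import re

# Assumed shape of a single feature value: starts with an uppercase letter
# or digit and continues with ASCII letters or digits only.
VALUE = re.compile(r"[A-Z0-9][A-Za-z0-9]*")


def is_valid_value_list(values: str) -> bool:
    """Check every comma-separated value against the assumed pattern."""
    return all(VALUE.fullmatch(value) is not None for value in values.split(","))


for candidate in ["Cau,Pass", "CauPass", "Cau+Pass"]:
    print(candidate, is_valid_value_list(candidate))
# Cau,Pass True
# CauPass True
# Cau+Pass False
```

Any tool that tokenizes values with a similar pattern would reject or mis-split `Cau+Pass`, which is why the change would ripple beyond the guidelines themselves.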

But I guess there are other reasons why I don't like it:

pkocharov commented 10 months ago

Thank you once again for your comments. I will stick to the Cau / CauPass model for now, in the hope that additional flexibility will eventually be introduced, allowing one to tag derived causatives (or anticausatives, for that matter) that are unmarked for inflectional voice in languages with inflectional voice.