tibetan-nlp / annotation-docs

Tibetan annotation docs

Change arg1 to arg2 for yod #18

Closed heacu closed 6 years ago

heacu commented 6 years ago

We have been tagging the unmarked argument of the existential / locative / possessive verb yod as arg1, and the case-marked argument (e.g. the location or the possessor) as obl:arg. See example 4 at: https://tibetan-nlp.github.io/lim-annodoc/deprel/arg1/

This approach is not ideal. To illustrate I will use a Tinglish example based on the above example.

myuncle-to sons many yod "My uncle has many sons" (= To my uncle many sons exist.)

The problem concerns the possibility of variable or changing case-marking patterns on the possessor. If "my uncle" appears without case at some stage or in some texts, then it would become arg1 and "many sons" would shift to arg2, violating our premise that the arguments of a verb should retain the same label when the verb is used with the same sense. For the same semantic role to be sometimes arg1 and sometimes arg2 without evidence of valency shifting operations would make searches difficult. In such a case, it would be better if arg2 remained the same and only the "my uncle" part changed. The new proposal is to analyse it as obl:arg now, and arg1 if it didn't have a case marker.

Compare this with example 1 at https://tibetan-nlp.github.io/lim-annodoc/deprel/arg3/. Here, what we would expect to be an obl:arg recipient occurs without a case marker and so is marked arg3 instead. However, this doesn't change the other argument labels, so it is not problematic in the same way.

The particular scenario described above is hypothetical - we don't have evidence yet of the argument structure of yod changing in this way. However, there are other verbs for which the scenario is more plausible. In Lhasa Tibetan there are verbs like rag "to obtain", rnyed "to find", and byung "to get" which occur with oblique case-marked "subjects", e.g.

myuncle-to book find "My uncle found a book."

Our current analysis of yod would lead us to annotate "book" as arg1 and "my uncle" as obl:arg. This analysis would make it difficult to distinguish this case from a verb that has a subject and an oblique recipient. Although I can't think of any such examples where there isn't also an understood arg2, I suppose it could happen. If we instead take "book" as arg2, then "to my uncle" can remain as obl:arg.

Essentially the new proposal consists of two parts:

  1. Make the theme/patient the arg2 if you can. "Many sons" and "book" do not really seem like arg1 in other respects, so follow the semantics.
  2. Build in flexibility for broadening our definition of arg1 in the long term. Right now the only case-marked nominals that we allow as arg1 are agentive-case marked nominals. But research on similar languages shows that subjects can have other case markers, like dative case and so on. If we discover evidence of this for Tibetan, then we can make a change in our scheme without requiring us to relabel the arg2.
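
To make the label-stability argument concrete, here is a minimal sketch comparing the two analyses of the "my uncle / many sons" example. The function names and the dictionary layout are hypothetical and only for illustration; the labels are the ones used in this thread.

```python
# Hypothetical illustration of the label-stability argument.
# Function names and data layout are assumptions; labels come from the thread.

def old_scheme(possessor_has_case: bool) -> dict:
    """Current analysis: the unmarked argument of yod is arg1."""
    if possessor_has_case:
        return {"my uncle": "obl:arg", "many sons": "arg1"}
    # An unmarked possessor would take arg1 and push the theme to arg2.
    return {"my uncle": "arg1", "many sons": "arg2"}


def new_scheme(possessor_has_case: bool) -> dict:
    """Proposed analysis: the theme is arg2 regardless of possessor marking."""
    possessor = "obl:arg" if possessor_has_case else "arg1"
    return {"my uncle": possessor, "many sons": "arg2"}


def theme_labels(scheme) -> set:
    """Collect every label the theme receives across both marking patterns."""
    return {scheme(has_case)["many sons"] for has_case in (True, False)}
```

Under the old analysis `theme_labels` yields two labels for the same semantic role; under the proposal it yields only arg2, so a search for the theme needs one label.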

heacu commented 6 years ago

One pleasing consequence of this proposal is that we could eliminate arg1:lvc from our repertoire because all such cases would be recast as arg2:lvc.

heacu commented 6 years ago

I presented the above proposal to Miriam Butt and she responded as below:


What would be important to have and to be able to search for in a dependency bank, would be the morphosyntactic information and to see how that plays out with respect to the grammatical relations.

In that sense, your annotation scheme is quite confusing because it makes a big deal about what is actually the most uninformative case -- the unmarked nominative. But you relegate into one "obl" box all the stuff where there is potentially a lot of action going on that would be interesting to analyze and understand better.

I got that you said the nominative is most ambiguous and therefore you have 3 labels for it, but it does seem to place undue importance on this to the detriment of the ambiguity found with oblique marked arguments, for which you have only one category (plus the almost invisible erg distinction within arg1).

What would be more useful, I think, is to have an annotation scheme where you did something like:

arg1:nom/erg/obl arg2:nom/obl arg3:nom/obl

whereby the arg1 roughly corresponds to nsubj of UD and arg2 roughly corresponds to obj and arg3 roughly corresponds to iobj/obl of UD (i.e., the "to X" argument in English)

For adjuncts, you could then do adv:obl, or even adv:nom if you ever found that (which you might), rather than your obl:adv.

In this way, if one were looking to see what ergatives did in the language, one would be able to fish those out with a single query, namely arg1:erg. And one could contrast that with all the other arg1 very simply.

If one were interested in Differential Object Marking, one could check whether Tibetan had that by fishing out all the arg2s and seeing whether they are nom or obl.

So this would be an extremely useful way of cutting up the pie.

... The problem concerns the possibility of variable or changing case-marking patterns on the possessor. If "my uncle" appears without case at some stage or in some texts, then it would become arg1 and "many sons" would shift to arg2, violating our premise that the arguments of a verb should retain the same label when the verb is used with the same sense. For the same semantic role to be sometimes arg1 and sometimes arg2 without evidence of valency shifting operations would make searches difficult. In such a case, it would be better if arg2 remained the same and only the "my uncle" part changed. The proposal is to analyse it as obl:arg now, and arg1 if it didn't have a case marker.

Well, these are difficult because it is not clear which is the arg1 and which is the arg2. But in a way, if you made explicit what the case marking information is, then it wouldn't be so crucial to fight about it. So you could either go arg1:obl and arg2:nom or you could go arg1:nom and arg2:obl.

According to what you propose below, the solution arg1:obl and arg2:nom seems the better one if one tells annotators to think of arg2 as more patientlike and arg1 as more agentlike. Then one would have that as an additional guideline. Without, however, committing to whether "uncle" is a subject or not (which it may or may not be).

The particular scenario described above is hypothetical - we don't have evidence yet of the argument structure of yod changing in this way. However, there are other verbs for which the scenario is more plausible. In Lhasa Tibetan there are verbs like rag "obtain", rnyed "find", and byung "get" which occur with oblique case-marked "subjects".

These are, btw, prototypical examples of verbs that tend to change their subcat frame over time. We've been working on Icelandic and these are exactly the ones that seem to start out as: book (subj), uncle (obl) and then switch to uncle (subj), book (obj). Many of them turn into experiencer predicates, so that "find" in Icelandic now means "like". So you have "I find this good" --> "I like this".

Get and obtain also do this.

So, from my perspective it would be really cool if one could "access" or find these kinds of patterns easily. That is, it would be great if they stood out in some way and didn't look like the usual arg1 and obl pattern that one might find with agentive verbs.

Essentially our proposal consists of two parts:

  1. Make the theme/patient the arg2 if you can. "Many sons" and "book" do not really seem like arg1s in other respects, so follow the semantics.

Yes, I agree

  2. Build in flexibility for broadening our definition of arg1 in the long term. Right now the only case-marked nominals that we allow as arg1 are agentive-case marked nominals. But research on similar languages shows that subjects can have other case markers, like dative case and so on. If we discover evidence of this for Tibetan, then we can make a change in our scheme without requiring us to relabel the arg2.

Here I think the proposal I floated above would work well and naturally.

heacu commented 6 years ago

And my response to Miriam.


in terms of the arc labels themselves, it's not really necessary at this point to differentiate between, say, [arg1:nom] and [arg1:erg]. because the case marker itself (erg) depends on the nominal head in the dependency tree, a hypothetical query for [arg1:erg] would be tantamount to searching for an arg1 which has an ergative case marker depending on it. so we could sort that out.
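
As a rough sketch of that equivalence: a combined label such as arg1:erg can be derived at query time from the arc label plus any case marker that depends on the nominal. Everything about the data layout here is an assumption for illustration: the `Token` class, the `case` deprel for the marker's arc, the `case` attribute holding its value, and the use of "nom" for an unmarked nominal (following Miriam's terminology).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    id: int
    form: str
    head: int                   # id of the head token, 0 for the root
    deprel: str                 # arc label, e.g. "arg1", "arg2", "obl"
    case: Optional[str] = None  # assumed: set only on case-marker tokens


def derived_label(tok: Token, tokens: List[Token]) -> str:
    """Combine the arc label with the case marker depending on the token.

    Assumes the marker attaches to the nominal with deprel "case";
    a nominal with no marker is reported as "nom".
    """
    markers = [t.case for t in tokens
               if t.head == tok.id and t.deprel == "case" and t.case]
    suffix = markers[0] if markers else "nom"
    return f"{tok.deprel}:{suffix}"


def find(tokens: List[Token], label: str) -> List[str]:
    """Return the forms of all tokens whose derived label matches."""
    return [t.form for t in tokens if derived_label(t, tokens) == label]


# Toy sentence in the thread's "Tinglish" style: uncle-ERG book read
sentence = [
    Token(1, "uncle", head=4, deprel="arg1"),
    Token(2, "ERG", head=1, deprel="case", case="erg"),
    Token(3, "book", head=4, deprel="arg2"),
    Token(4, "read", head=0, deprel="root"),
]
```

With this, `find(sentence, "arg1:erg")` picks out the ergative-marked arg1 without the arc label itself having to carry the case.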

as regards the other part of your proposal, since there are often 2 (and sometimes 3) unmarked arguments for a single verb, the arg roles are essential now to distinguish them. similarly, it's helpful for us to distinguish ergatives from instrumentals, which are homophonous; we do so by labeling one as arg1 and the other as obl. as for [obl:arg], i think we can follow your advice and phase them into the arg positions. however, to the extent that [obl:arg] is not interfering with the other argument assignments, this phase-in looks like it isn't urgent. i count 147 instances of [obl:arg] in the annotated corpus so far, but no predicate has two of them (let alone, two with the same case). most of these would probably map to arg1 (subjects of yod "have") or arg3 (recipients), maybe some to arg2. so our use of [obl:arg] doesn't look like it's suppressing information, at least not yet.
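
The check described above (count the obl:arg arcs, then confirm that no predicate has more than one) could be sketched as follows. The triple layout `(head_id, deprel, dependent_id)` and the function name are assumptions for illustration, and the toy data is made up; it is not drawn from the corpus.

```python
from collections import Counter


def obl_arg_report(arcs):
    """Count obl:arg arcs and list heads that carry more than one.

    arcs: iterable of (head_id, deprel, dependent_id) triples (assumed layout).
    Returns (total number of obl:arg arcs, sorted ids of heads with 2+).
    """
    per_head = Counter(head for head, deprel, _ in arcs if deprel == "obl:arg")
    total = sum(per_head.values())
    crowded = sorted(head for head, n in per_head.items() if n > 1)
    return total, crowded


# Made-up toy arcs: two predicates, each with a single obl:arg.
toy_arcs = [
    (3, "obl:arg", 1),
    (3, "arg2", 2),
    (7, "arg1", 4),
    (7, "obl:arg", 5),
    (7, "arg2", 6),
]
```

An empty second element in the result means no predicate has two obl:arg dependents, which is the condition under which phasing them into the arg positions is not urgent.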

heacu commented 6 years ago

Starting with the easy bit, I replaced all instances of arg1:lvc with arg2:lvc. Most such instances involve the verb yod. This meant also changing arg2 to arg3, as in the following example:

[Image: annotated example 018b]

I asked the yogin, “You know the region where my relatives live. Where is it?” (Mila 018b:T5706)

Incidentally, in the above example I also added in the obl:arg which hadn't been linked.
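
For reference, the relabeling described above can be sketched as a small transformation over arcs. This is an illustration only: the triple layout `(head_id, deprel, dependent_id)` and the function name are hypothetical, and it assumes that the arg2 to be bumped to arg3 is always a sister of the recast arg1:lvc (same head), which matches the example but has not been checked against every case.

```python
def recast_lvc(arcs):
    """Recast arg1:lvc as arg2:lvc, bumping a sister arg2 to arg3.

    arcs: list of (head_id, deprel, dependent_id) triples (assumed layout).
    Only heads that had an arg1:lvc dependent have their arg2 changed.
    """
    lvc_heads = {head for head, deprel, _ in arcs if deprel == "arg1:lvc"}
    out = []
    for head, deprel, dep in arcs:
        if deprel == "arg1:lvc":
            deprel = "arg2:lvc"
        elif deprel == "arg2" and head in lvc_heads:
            deprel = "arg3"
        out.append((head, deprel, dep))
    return out
```

Arcs under heads without an arg1:lvc dependent are left untouched, so ordinary arg2 labels elsewhere are not affected.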

heacu commented 6 years ago

I searched BRAT for relation type = arg1 with Arg1 = yod or med (using BRAT's Search Relation feature), and replaced all of these cases with arg2. There were a lot of them! Also, I updated the documentation, removing the arg1:lvc page, and updating the arg1, arg2, arg3 and arg2:lvc pages in consequence of this change. I'll now mark this issue as closed.
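
The change above was made through BRAT's search interface. Purely as an alternative sketch, the same relabeling could be scripted over BRAT standoff (.ann) text, where a text-bound annotation line has the form `T1<TAB>type start end<TAB>text` and a relation line has the form `R1<TAB>type Arg1:T2 Arg2:T1`. The function name is hypothetical, and it assumes the annotated token text is exactly the transliterated form "yod" or "med", which may not hold for the actual corpus files.

```python
def relabel_yod_arg1(ann_text, verbs=("yod", "med")):
    """Change relation type arg1 to arg2 when Arg1 (the head) is yod or med.

    ann_text: contents of a BRAT standoff .ann file as one string.
    Assumes token text matches the transliterations in `verbs`.
    """
    lines = ann_text.splitlines()

    # Map text-bound annotation ids (T1, T2, ...) to their covered text.
    forms = {}
    for line in lines:
        if line.startswith("T"):
            tid, _, text = line.split("\t", 2)
            forms[tid] = text

    out = []
    for line in lines:
        if line.startswith("R"):
            fields = line.split("\t")
            rtype, arg1, arg2 = fields[1].split()
            head_id = arg1.split(":", 1)[1]
            if rtype == "arg1" and forms.get(head_id) in verbs:
                fields[1] = f"arg2 {arg1} {arg2}"
                line = "\t".join(fields)
        out.append(line)
    return "\n".join(out)


# Made-up sample: R1 is headed by yod, R2 by a different verb.
sample = "\n".join([
    "T1\tn 0 2\tbu",
    "T2\tv 3 6\tyod",
    "T3\tv 7 11\tbyed",
    "R1\targ1 Arg1:T2 Arg2:T1",
    "R2\targ1 Arg1:T3 Arg2:T1",
])
```

Only relations headed by one of the listed verbs are rewritten; arg1 arcs of other verbs are left as they are.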