unipv-larl / UD4HL

Anacoluthon and other ill-formed sentences #6

Open timokorkiakangas opened 1 year ago

timokorkiakangas commented 1 year ago

Dear all,

with non-literary historical language data we often have to make do with what we have, and what we have may sometimes be rather ungrammatical in terms of ill-formed sentences. The UD framework is optimal for well-formed grammatical sentences, but what to do with those others?

In our specific case, we, Hanna-Mari Kupari (@HannaKoo) and Timo Korkiakangas, are trying to annotate a set of 13th-to-16th-century Latin documents from the Apostolic Penitentiary. Timo has discussed the challenge of the non-standard morphology and morphosyntax in a few papers and suggested some solutions for the annotation of forms and constructions that reflect a diachronically advanced stage (~Romance) of a language with a codified grammar (Latin). This time, however, the problem is not specific forms or constructions but clearly ill-formed syntactic constructions, mainly sentences that end with a syntactic strategy other than the one they started with (anacolutha). The cause of an anacoluthon is, by definition, a contamination of two (or more) constructions.

Do you have suggestions on how such cases should be marked within UD? How have you coped with similar issues? Does someone know if there are modern-language treebanks of spoken/informal language with similar instances?

We have used the dep label so far to mark the latter part of the anacoluthon. Please find below a (terrible) example. The indentation marks the subordination levels. If you wish, you can visualize (one interpretation of) the sentence for example in Conllu viewer using this file.

Best,

Timo Korkiakangas and Hanna-Mari Kupari

Supplicatur humiliter vestre sanctitati ex parte devoti vestri Roberti Hylle , apostolica auctoritate notarii et clerici coniugati Battoniensis diocesis ,

quod

cum ipse pro sua sustentatione officii [i.e., officium, obj of exercere] notariatus in aliqua curiarum provincie Cantuariensis et in causis spiritualibus ,

etiam ubi ad correctionem anime ,

etiam si ex iudicis officio , proceditur ,

coram iudice quocumque ecclesiastico seu notionem et iurisdictionem ecclesiasticum habente , etiam inferiore episcopo , ut scriba , exercere ac registrationis seu registri custos existere desideret ,

et obstantibus certis constitutionibus synodalibus seu provincialibus a sede apostolica tamen non confirmatis impune non possit ,

[the sentence seems to break here]

quatinus sedes apostolica [sbj] super premissis

ut non obstantibus constitutionibus predictis exponens ipse dictum officium exercere ac iudex , vicarius et officialis quibuscumque , etiam , ut prefertur , coram episcopo inferiore , [missing verb, probably at least existere]

eum ad dictum notoriatus et custodis officium admittere recipereque impune libere ac licite valeat et possit

secum necnon cum huiusmodi iudicibus , officialibus et vicariis misericorditer dispensare dignemini de gratia speciali .

Some observations:

→ More than two contaminated constructions, but we think one major break is detectable after “impune non possit”.

→ It is difficult to decide where the anacoluthic part(s) begin(s).

→ We have so far interpreted the dignemini part (the latter half of the sentence) to be dependent on the supplicatur part (the first half of the sentence) and labeled this dependency as dep.

→ There is also the possibility that the clause introduced by quatinus is in fact an “anaphoric” repetition of the quod clause. In that case, the anacoluthic break would be after valeat et possit.

→ Also, in similar sentences, the sbj of valeat et possit is typically the supplicant, not the grantor. So, perhaps valeat et possit should have finished the non obstantibus clause, but the scribe has added another anacoluthon, the infinitival eum admittere recipereque, in between.

EricaBiagetti commented 1 year ago

Dear Timo and Hanna-Mari,

Thank you for opening this issue! I remember encountering cases of anacoluthon while annotating the Rigveda. I searched a bit but could only find cases that are a little different from (and less problematic than) yours, like the following, which occurs in a relative construction:

RV 5.60.2 a ā́ yé tasthúḥ pŕ̥ṣatīṣu śrutā́su  b sukhéṣu rudrā́ marúto rátheṣu  c vánā cid ugrāḥ jihate ní vaḥ bhiyā́  d pr̥thivī́ cid rejate párvataś cit 

ab: Those who have mounted on the famed dappled mares, on the well-naved chariots—the Rudras, the Maruts— cd: even the trees duck down with fear of you, powerful ones. Even the earth trembles, even the mountain. (click here for interlinear glosses and different translations)

The hymn is dedicated to the Maruts, who are referred to in the 3rd person (ab: yé tasthúḥ ’those who have mounted’) in the relative clause and addressed in the 2nd person (cd: vocative ugrā … vaḥ bhiyā́ ’with fear of you, oh powerful ones’) in the main clause. In Jamison’s commentary on this verse, we read: “This is hardly unusual in the RV. The standard translations register this anacoluthon in various ways, WG most sharply, by supplying a main clause for ab: “(Sie sind es), die …” and separating the two hemistichs into two sentences. This seems unnecessary.”

I am inclined to follow Jamison and assume a dependency relation holding between the two hemistichs, with cd governing ab. I have been wondering whether cases like this should be annotated as regular relative clauses (see conllu here) or simply with dep, but I would go for the former option since, following Jamison's interpretation, it is clear that the relative clause attaches to the pronoun vaḥ.

From what I understand, in your case it is much more difficult to reconstruct the relationship between the different clauses, so it seems to me that using dep is the best solution.

Sorry I can't be of much help, if I find more examples in the Rigveda I will come back to you and of course I will follow the discussion if other colleagues have different suggestions.

All best, Erica

timokorkiakangas commented 1 year ago

Dear Erica,

thank you for your response. The Rigveda instance highlights the fact that the concept of anacoluthon indeed comprises a wide range of different and differently motivated phenomena - which are rarely discussed in detail.

As far as I understand, your choice to adhere to the relative clause interpretation is very plausible. I think that as long as no additional disambiguation of constructions of this kind is provided, the safest way is to label a dubious construction with the label that is closest to its canonical use (instead of dep, which only tells us that it's a dependency).

Best wishes,

Timo

amir-zeldes commented 1 year ago

An alternative is to use parataxis, which is basically saying 'there is no proper syntactic relationship between these, but they are standing next to each other all the same'.

timokorkiakangas commented 1 year ago

Dear Amir,

thank you! We initially also considered parataxis, but since all the constructions listed in the Guidelines under parataxis are normal well-formed constructions (of English) and because the word parataxis usually suggests that it is somehow about the opposite of hypotaxis, we decided to go for dep, which at least underlines the fact that there is something "wrong" there ... On the other hand, there are certainly several subtypes of potentially ill-formed or anomalous sentences, some of which could in fact be best described using parataxis.

amir-zeldes commented 1 year ago

Sure, I'm not saying dep is necessarily wrong here, and it's certainly not an area where there are very well established UD practices. From my perspective, dep says "we know what the head is, and there is a syntactic relation, but we don't have a name for this relationship" (for example, for "Page 5", I have good reasons to think "Page" is the head, but I don't think the relation type falls neatly into one of the existing labels).

I interpret the parataxis label to mean 'here are two things that appear in the same sentence, but there is no real relation between them'. This label is routinely applied in UD for three things: two full, independent predications (or 'sentences' really) which happen to stand in the same orthographic sentence, as in 1.; parentheticals, as in 2.; medial speech predicates, as in 3.

  1. First they want to do it, now they don't want/parataxis to do it... (note we could replace the "," with a ".")
  2. Galois (this is also the name/parataxis of another mathematician) decided/root to have dinner
  3. "I am sick/root and tired," she said/parataxis "of all these discussions about mangos."

These practices are more or less enshrined in the guidelines, but anacoluthons are not. So the question is, are they somehow similar to 1. or 2. (two things cohabit in the same sentence, but have no proper syntactic relation), or are they a proper but unnamed syntactic relation, which would speak for dep. But of course, this is just my take on the situation!
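To see how a given treebank actually divides the work between the two labels, one could list every dep and parataxis edge together with its head. The following is a minimal sketch, assuming plain CoNLL-U input; the helper names `make_conllu` and `find_edges` are hypothetical, and the two toy sentences are illustrative annotations that follow the analyses described in this comment ("Page 5" with dep, example 1 with parataxis), not data from a released treebank.

```python
def make_conllu(rows):
    """Build one CoNLL-U sentence from (id, form, head, deprel) tuples.

    Columns not needed for this sketch (LEMMA, UPOS, XPOS, FEATS, DEPS, MISC)
    are filled with the underscore placeholder.
    """
    return "\n".join(
        "\t".join([str(i), form, "_", "_", "_", "_", str(head), deprel, "_", "_"])
        for i, form, head, deprel in rows
    )


def find_edges(conllu_text, labels):
    """Return (sentence index, dependent, deprel, head) for the given universal labels."""
    hits = []
    for sent_idx, block in enumerate(conllu_text.strip().split("\n\n")):
        rows = [
            line.split("\t")
            for line in block.splitlines()
            if line and not line.startswith("#")
        ]
        # Keep plain word lines only: skip multiword ranges (1-2) and empty nodes (1.1).
        words = [r for r in rows if r[0].isdigit()]
        forms = {r[0]: r[1] for r in words}
        forms["0"] = "ROOT"
        for r in words:
            deprel = r[7]
            # Compare on the universal part so that subtypes are matched too.
            if deprel.split(":")[0] in labels:
                hits.append((sent_idx, r[1], deprel, forms[r[6]]))
    return hits


# Toy annotations (assumptions made for illustration only).
page = make_conllu([(1, "Page", 0, "root"), (2, "5", 1, "dep")])
run_on = make_conllu([
    (1, "First", 3, "advmod"), (2, "they", 3, "nsubj"), (3, "want", 0, "root"),
    (4, "to", 5, "mark"), (5, "do", 3, "xcomp"), (6, "it", 5, "obj"),
    (7, ",", 12, "punct"), (8, "now", 12, "advmod"), (9, "they", 12, "nsubj"),
    (10, "do", 12, "aux"), (11, "n't", 12, "advmod"), (12, "want", 3, "parataxis"),
    (13, "to", 14, "mark"), (14, "do", 12, "xcomp"), (15, "it", 14, "obj"),
    (16, "...", 3, "punct"),
])
sample = page + "\n\n" + run_on

for hit in find_edges(sample, {"dep", "parataxis"}):
    print(hit)
```

Run over a whole treebank file, a listing of this kind would show whether dep and parataxis are being used for comparable constructions, which is the question raised above.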

Stormur commented 1 year ago

The problem of parataxis, as said, is that it actually signals the absence of a relation, as it stands in the guidelines. So it is problematic to have cases such as two clauses A and B

because the legitimate question is: how come they are not simply segmented as distinct sentences?

Since:

In the previous examples:

  • First they want to do it, now they don't want/parataxis to do it... (note we could replace the "," with a ".")

This is a case of co-ordination without an explicit conjunction, which is not uncommon. This is made clear by the "correlative" first... now. There is syntactic coherence.

The substitution in my opinion is a dubious argument, since it is valid for any co-ordination or more generally horizontal construction.

  • "I am sick/root and tired," she said/parataxis "of all these discussions about mangos."

I think this is treated by means of ccomp/csubj by current UD guidelines. In previous releases, parataxis was used, but it was not ideal, therefore the amendment.

  • Galois (this is also the name/parataxis of another mathematician) decided/root to have dinner

Here I could agree on parataxis and I would say this is the prototypical case: the main problem is that we do not admit forests (i.e. a graph with more than one tree, with more than one root) as representation of sentences, so we have to find a way to make them stick together through a non-relation.

The passage from the Rigveda

> RV 5.60.2 a ā́ yé tasthúḥ pŕ̥ṣatīṣu śrutā́su b sukhéṣu rudrā́ marúto rátheṣu c vánā cid ugrāḥ jihate ní vaḥ bhiyā́ d pr̥thivī́ cid rejate párvataś cit
>
> ab: Those who have mounted on the famed dappled mares, on the well-naved chariots—the Rudras, the Maruts— cd: even the trees duck down with fear of you, powerful ones. Even the earth trembles, even the mountain. (click here for interlinear glosses and different translations)

from what I understand appears in my eyes as a rather typical case of topic-comment structure, and thus should be treated by means of dislocated. We have a heavy "phrase" pushed at the margin of the sentence which acts as theme and is then reprised in the upcoming rheme (fear of you). I see the relative construction as a clue to a similar interpretation, as it is something which creates syntactic coherence.


I remember terrible sentences like the one reported here by @timokorkiakangas and @HannaKoo from the work we did on the Tuscan chartulae (the LLCT corpus). In general, I would say: as far as we can, let's use regular relations, e.g. advcl; if some piece appears out of place, for example a redundant ut 'as, that' (SCONJ), then reparandum seems to be the best fit (and LLCT uses this strategy, in agreement with a semi-official UD stance about that). This relation could be extended to heads of "rogue subordinations".

I agree though that we might conceive a subtype to attach to anacoluthising relations to be able to retrieve similar fractures, for example advcl:anacoluthon, or reparandum:anacoluthon. It would be helpful to point out that some fractures are not just grammatical variations.
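A minimal sketch of the retrieval such a subtype would allow, assuming CoNLL-U input in which the proposed labels (advcl:anacoluthon, reparandum:anacoluthon) have been used. These subtypes are only the suggestion made above, not existing UD relations; the function name `subtype_counts`, the helper `word` and the placeholder tokens w1 to w4 are hypothetical.

```python
from collections import Counter


def word(i, form, head, deprel):
    """Build one CoNLL-U word line; unused columns get the '_' placeholder."""
    return "\t".join([str(i), form, "_", "_", "_", "_", str(head), deprel, "_", "_"])


def subtype_counts(conllu_text, subtype):
    """Count relations carrying the given subtype, keyed by universal relation."""
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Word lines have ten tab-separated columns and an integer ID;
        # comments, blank lines, multiword ranges and empty nodes are skipped.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        universal, _, sub = cols[7].partition(":")
        if sub == subtype:
            counts[universal] += 1
    return counts


# Toy sentence with placeholder tokens (an assumption, not real corpus data).
sample = "\n".join([
    "# text = w1 w2 w3 w4",
    word(1, "w1", 0, "root"),
    word(2, "w2", 1, "advcl:anacoluthon"),
    word(3, "w3", 2, "reparandum:anacoluthon"),
    word(4, "w4", 1, "advcl"),
])

print(subtype_counts(sample, "anacoluthon"))
```

Because the subtype is attached to a regular relation, the plain advcl in the toy sentence is left out of the count while both marked fractures are found, whatever their universal label.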

Please pardon me @timokorkiakangas if I did not understand 100% this sentence, but looking at it:

I am sorry if I did not understand this sentence well (I fear I am too tired now and I am already rambling :grin: ); take the above suggestions as tentative. But I could also suggest rethinking its structure in terms of two co-ordinated trunks and/or a dislocated "introduction" (dislocations are quite common in LLCT).

timokorkiakangas commented 1 year ago

Thank you, Amir and Flavio, for these intriguing viewpoints on the topic. This is again a demonstration that the Guidelines are open to differing interpretations, something that makes it difficult for beginners to approach them. It also calls for careful reflection on what we are actually representing when we annotate (historical) texts: whether it is the text of a scholarly edition with its modern punctuation, a diplomatic edition of a single manuscript witness of the text, or something else. I guess I’ll return to this issue once we have had time to ponder all the alternatives proposed (i.e., after the Finnish holiday season between June and mid-August!).

Best,

Timo

amir-zeldes commented 1 year ago

> This is a case of co-ordination without an explicit conjunction, which is not uncommon. This is made clear by the "correlative" first... now. There is syntactic coherence.

I don't think the 'correlative' phenomenon here is syntactic - I think it's a semantic phenomenon. After all, we can easily separate them into separate sentences, and there is no syntactic argument structure that requires 'first' to have a 'now', or 'second' or anything else, and vice versa. It's just a semantic expectation, similar to how if something 'begins' we might expect it to 'end' later, but that relationship is not part of the syntax tree.

> This is a case of co-ordination without an explicit conjunction

I don't think that's wrong, but the UD guidelines specifically state that such cases are analyzed using parataxis if there is no explicit coordination, for example:

https://universaldependencies.org/u/dep/parataxis.html#side-by-side-sentences-run-on-sentences

> I think this is treated by means of ccomp/csubj by current UD guidelines.

No, for medial speech verb it is still treated the same way as a parenthetical. From the guidelines:

https://universaldependencies.org/u/dep/parataxis.html#reported-speech

I'm definitely sympathetic to the idea that valency should be a priority, so I understand why we could want ccomp here, but for the medial cases the guidelines go against this analysis explicitly.

Stormur commented 1 year ago

> > This is a case of co-ordination without an explicit conjunction, which is not uncommon. This is made clear by the "correlative" first... now. There is syntactic coherence.

> I don't think the 'correlative' phenomenon here is syntactic - I think it's a semantic phenomenon. After all, we can easily separate them into separate sentences, and there is no syntactic argument structure that requires 'first' to have a 'now', or 'second' or anything else, and vice versa. It's just a semantic expectation, similar to how if something 'begins' we might expect it to 'end' later, but that relationship is not part of the syntax tree.

> > This is a case of co-ordination without an explicit conjunction

> I don't think that's wrong, but the UD guidelines specifically state that such cases are analyzed using parataxis if there is no explicit coordination, for example:
>
> * Bearded dragons are sight hunters , they need`/parataxis` to see the food to move
>
> https://universaldependencies.org/u/dep/parataxis.html#side-by-side-sentences-run-on-sentences

Looking again at the guidelines, there is indeed a lot of indeterminacy left. Under conj, the case of asyndetic co-ordination is explicitly illustrated with the example veni, vidi, vici. The description of parataxis does not say that it has to be used when no conjunction is present; this is just listed as one of the possible cases. Actually, it is said that parataxis is a relation between a "word and other elements", so this covers the examples of parentheses and of an interposed verb of saying, but apparently not the first... next or bearded dragons ones, where we have two more or less balanced clauses. By the way, it would be very inconvenient to use parataxis as a kind of conj subtype.

I think that the correlative phenomenon is at least partially syntactic (but surely not exclusively) if we can observe a symmetric structure in terms of argument positions and morphosyntactic elements realising them. Symmetry is probably one key concept which is not so easy to pinpoint.

> > I think this is treated by means of ccomp/csubj by current UD guidelines.

> No, for medial speech verb it is still treated the same way as a parenthetical. From the guidelines:
>
> * When a speech verb interrupts reported speech content, the interruption is treated as a parenthetical parataxis
>
>   * The guy , John said`/parataxis` , left early in the morning
>
> https://universaldependencies.org/u/dep/parataxis.html#reported-speech
>
> I'm definitely sympathetic to the idea that valency should be a priority, so I understand why we could want ccomp here, but for the medial cases the guidelines go against this analysis explicitly.

I agree for this parenthetical use of speech verbs, but the previous example had the verb at the margin of the clause, so I think that the ccomp/csubj analysis is fully viable and even preferable there. Still, I personally do not find the parataxis annotation satisfactory... I wonder if we cannot simply accept a kind of systematic non-projectivity in these constructions (avoiding it is, I suppose, the whole point of preferring parataxis).

amir-zeldes commented 1 year ago

> the previous example had the verb at the margin of the clause

👍 The current guidelines call for a regular ccomp if the verb is at the beginning/end, so indeed it is only the medial case which is treated as a parenthetical.

> asyndetic co-ordination is explicitly stated at the example of veni, vidi, vici.... but apparently not the first... next or bearded dragons' ones, where we have two more or less balanced clauses

That's true - right now the guidelines appear to be contradictory, with both the universal conj and parataxis pages laying claim to zero-coordinated full sentences. Let me add @dan-zeman - do you have some thoughts on this situation?

dan-zeman commented 1 year ago

parataxis is a mess and I was never fond of this relation.

Coordination works for nominals as well as clauses — there is no doubt about this. And unlike other constructions, UD uses the same relation type, conj, in both situations.

Languages use different means to signal coordination, and the asyndetic coordination — simply placing conjuncts side by side, without an overt conjunction morpheme — is one of them. There is no good reason why asyndetic coordination should be treated differently from the regular one, or worse, why the special treatment should apply only to coordination of clauses and not to coordination of nominals or other words. In my opinion, it was a mistake to introduce parataxis as an alternative here.

That said, I myself have been telling people that parataxis can be used if they feel their sentence segmentation is wrong but for some reason they cannot fix the segmentation. But I don't have good deciding criteria between parataxis and conj, and if UD did not have parataxis with this problematic use suggestion, I would be instead advising people to use conj.

amir-zeldes commented 1 year ago

Thanks @dan-zeman - this is in line with the current state of the guidelines as well. I for one would like to see more differentiation between parataxis meaning 'parenthetical' (a la PTB PRN) and other uses, but maybe that's something to think of for an eventual UD v3. In any case, I think the guidelines should not simultaneously say that "veni, vidi, vici" is conj while "Bearded dragons are sight hunters , they need to see the food move" is parataxis, so maybe we can discuss this further. Adding @jnivre and @nschneid in case you have more to add.
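One way to gauge how much data this overlap touches is to count conj dependents that have no cc child, i.e. coordination annotated as asyndetic. The following is a minimal sketch, assuming a single CoNLL-U sentence as input and the UD v2 convention that cc attaches to the conjunct it introduces; the names `word` and `asyndetic_conjuncts` are hypothetical and the toy annotations are illustrative. It is deliberately crude: in a series like "A, B and C" it would also report B, which lacks its own cc.

```python
def word(i, form, head, deprel):
    """Build one CoNLL-U word line; unused columns get the '_' placeholder."""
    return "\t".join([str(i), form, "_", "_", "_", "_", str(head), deprel, "_", "_"])


def asyndetic_conjuncts(conllu_sentence):
    """Return the forms of `conj` dependents that govern no `cc`."""
    words = [
        cols
        for cols in (line.split("\t") for line in conllu_sentence.splitlines())
        if len(cols) == 10 and cols[0].isdigit()
    ]
    # IDs of words that govern at least one coordinating conjunction.
    heads_with_cc = {w[6] for w in words if w[7].split(":")[0] == "cc"}
    return [
        w[1]
        for w in words
        if w[7].split(":")[0] == "conj" and w[0] not in heads_with_cc
    ]


# "veni , vidi , vici" annotated with conj, as on the conj page (toy rendering).
veni = "\n".join([
    word(1, "veni", 0, "root"),
    word(2, ",", 3, "punct"),
    word(3, "vidi", 1, "conj"),
    word(4, ",", 5, "punct"),
    word(5, "vici", 1, "conj"),
])

# A syndetic contrast case (invented for illustration).
syndetic = "\n".join([
    word(1, "came", 0, "root"),
    word(2, "and", 3, "cc"),
    word(3, "saw", 1, "conj"),
])

print(asyndetic_conjuncts(veni))
print(asyndetic_conjuncts(syndetic))
```

Comparing such counts with the number of clause-level parataxis edges in the same treebank would give a rough picture of which of the two competing analyses its annotators have followed.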

nschneid commented 1 year ago

IIUC the issue is blurriness of parataxis vs. conj for side-by-side items in a parallel structure where "and" or similar could be inserted. I have wondered about this too. The parataxis page gives this as the explanation of one of its use cases, Side-by-side sentences ("run-on sentences"):

> The parataxis relation is used for a pair of what could have been standalone sentences, but which are being treated together as a single sentence. This may happen because sentence segmentation of the sentence was done primarily following the presence of sentence-final punctuation, and these clauses are joined by punctuation such as a colon or comma, or not delimited by punctuation at all. In a spoken corpus, it may happen because what is labeled as a sentence is more commonly an utterance turn. Even if the treebanker is doing the sentence division, it may happen because there seems to be a clear discourse relation linking two clauses. Sometimes there are more than two sentences joined in this way. In this case we make all the later sentences dependents of the first one, to maximize similarity to the analysis used for conjunction.

The last sentence hints at the overlap with coordination. I am not sure whether there is a sharp boundary that can be drawn here, but there could be heuristics like:

This fits with the framing of "run-on sentences"—based on my experience in grade school English classes, this means segments for which teachers have to exhort students to use separating punctuation stronger than a comma, as commas are not usually supposed to link independent clauses without a conjunction. (But a skilled writer might do so in certain places for rhetorical effect.)

That said, I would imagine that heuristics would have to be somewhat language-specific depending on different spelling conventions, on top of which there is the challenge of texts without formal standards of sentence-separating punctuation.

I suspect that no matter how we slice it, there will be various kinds of parallel structures/headless juxtapositions within and across languages with different prosodic, functional, and stylistic properties—and whether we want to slot them under one universal relation, or two, or three, there will be a lot of fuzziness and some language/treebank-specific judgment calls will be necessary.

nschneid commented 1 year ago

Stepping away from not-always-reliable cues like punctuation, another idea is to define conj vs. parataxis in terms of prototypes:

This doesn't say what to do in non-prototypical cases. But for universal guidelines, perhaps that's a feature rather than a bug?

dan-zeman commented 1 year ago

Instead of trying to fix this blurriness, I would prefer to jump over it by saying that all of that is conj, while parataxis is reserved for parentheticals.

nschneid commented 1 year ago

That would be a major change. I wouldn't object to rethinking it for UDv3 (with a name change if narrowing it to parentheticals). There will still be difficult cases; for example, here is a "sentence" from EWT:

  • ---= 19 East/West-Coast Specialized Servers - Total Privacy via Encryption =---

Not sure I'd feel comfortable calling that coordination or a main part + parenthetical. I'd call it textual juxtaposition of two related items that are formatted as one unit but do not cohere syntactically. "Parataxis" is pretty vague, and therefore broad enough to cover this. See also some of the items in UniversalDependencies/docs#933. And there's the question of what to do with list.

Stormur commented 1 year ago

It is also conceivable to use subtypes, such as for conj:expl.

Defining criteria hinging on punctuation is an absolute no-go in my opinion. Also:

  • conj is prototypically between small units (like words or nominal phrases), and prototypically (at least in many languages) includes an explicit coordinating conjunction, cc

I think it actually is the contrary typologically. Anyway, it is not a universal criterion and I see no use in artificially delimiting co-ordination to undefined "small units", especially not when the interest in co-ordination lies exactly in its flexibility.

> That would be a major change.

Would it really be one? I think this is already the line of some treebanks, at least the Latin ones.

  • ---= 19 East/West-Coast Specialized Servers - Total Privacy via Encryption =---

I do not think this relates to the cases discussed here, or the "prototypical" cases of conj/parataxis we have in mind. There is something extralinguistic going on in this example, at a textual level, as you notice.

amir-zeldes commented 1 year ago

> That would be a major change. I wouldn't object to rethinking it for UDv3

I'm with @nschneid on this: very many, if not most, UD corpora use parataxis to connect two predications which end up next to each other in a sentence without 'and'. I can easily name 10 that do it, and there is no trivial and reliable way to automatically distinguish which cases of the label are parenthetical. So while I have always been irked at parataxis covering both these uses, and would love to see a parataxis:prn or similar, I don't think it's feasible right now; maybe in the context of a V3 upgrade, if we have a lot of people willing to work on it.

I also don't actually agree that it's the same as conj, so ideally I would have liked three labels, or subtypes indeed. The plain parataxis seems best suited to two things standing next to each other based on its meaning, so I would have done parataxis:prn for the other one and left conj out of it.