unipv-larl / UD4HL

Tokenization of Ancient Greek #8

Open francescomambrini opened 1 year ago

francescomambrini commented 1 year ago

This post relates to the effort to harmonize the Ancient Greek treebanks, as per Issue 7.

One of the first issues to solve is tokenization itself. The original AGLDT, which was strongly based on the analytical layer of the Prague Dependency Treebank, assigned a crucial structural role to coordinating conjunctions (as heads of coordinated constructions). For this reason, we chose to split compound conjunctions (negative + coordination, as in English "neither" or "nor") into two tokens.

This is carried over to UD too, although not systematically (I suspect it was already not fully implemented in the original annotation). This is the situation:

I did a very quick check in PROIEL, and it seems that all the words above are kept as single tokens.

Given that in UD the conjunctions are leaf nodes and neither component plays a structural role in the trees, I would normalize tokenization as in PROIEL: one single token for each of the words quoted above. It may be possible to use the negative polarity feature to mark that they have a negative part.

Another case that complicates tokenization in classical Greek is crasis, as in ἐγᾦδα = ἐγὼ οἶδα, or even ἁνὴρ (mind the rough breathing!) = ὁ ἀνὴρ.

In this case, I would definitely make use of the multiword token notation of CoNLL-U, as in my annotation of Sophocles, Philoctetes:

3-4 ἁνὴρ    _   _   _   _   _   _   _   _
3   ὁ   ὁ   DET l-s---mn-   Case=Nom|Definite=Def|Gender=Masc|Number=Sing|PronType=Art   4   det _   Ref=866
4   ἀνὴρ    ἀνήρ    NOUN    n-s---mn-   Case=Nom|Gender=Masc|Number=Sing    1   nsubj   _   Ref=866
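
As a rough illustration of the multiword token layout above, here is a minimal Python sketch that emits the range line plus one line per syntactic word. The function name `crasis_to_mwt` and its parameters are hypothetical and not part of any UD tool; columns are tab-separated as in CoNLL-U files, and only ID, FORM, LEMMA and UPOS are filled in.

```python
def crasis_to_mwt(start_id, surface, parts):
    """Return CoNLL-U lines for one crasis form split into syntactic words.

    start_id: ID of the first syntactic word in the sentence
    surface:  the fused form as printed in the text (e.g. "ἁνὴρ")
    parts:    list of (form, lemma, upos) tuples, one per syntactic word
    """
    end_id = start_id + len(parts) - 1
    # Range line: only ID and FORM are set, the remaining 8 columns are "_".
    lines = ["\t".join([f"{start_id}-{end_id}", surface] + ["_"] * 8)]
    for offset, (form, lemma, upos) in enumerate(parts):
        # Word line: ID, FORM, LEMMA, UPOS, then 6 unspecified columns.
        cols = [str(start_id + offset), form, lemma, upos] + ["_"] * 6
        lines.append("\t".join(cols))
    return lines


if __name__ == "__main__":
    for line in crasis_to_mwt(3, "ἁνὴρ", [("ὁ", "ὁ", "DET"), ("ἀνὴρ", "ἀνήρ", "NOUN")]):
        print(line)
```

The morphological features, heads and relations would still have to be filled in by the annotator; the sketch only fixes the ID arithmetic and the column count.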

This, I think, covers the greatest problems in AG tokenization!

mr-martian commented 1 year ago

I agree with splitting all of those cases.

It may be possible to use the negative polarity feature to mark that they have a negative part.

I'm not entirely sure what you mean by this. The ου/μη will of course have a polarity feature, but are you referring to something else?

Also, should we try to standardize the surface forms of the pieces of the tokens? Your example opts for just using the lemmas, but we could also split the surface characters.

francescomambrini commented 1 year ago

I think that, in UD, we should not split οὔτε, οὐδέ, μήτε, μηδέ and εἴτε: there is nothing to gain by splitting them into two tokens, and we would make life more difficult for an automatic tokenizer (potentially adding inconsistencies and errors).

My proposal is to adopt PROIEL's approach, as in Matth. 6.26 (14857 in grc_proiel-ud-test.conllu):

8   οὐ  οὐ  ADV Df  Polarity=Neg    9   advmod  _   ref=MATT_6.26
9   σπείρουσιν  σπείρω  VERB    V-  Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 1   ccomp   _   ref=MATT_6.26
10  οὐδὲ    οὐδέ    CCONJ   C-  _   11  cc  _   ref=MATT_6.26
11  θερίζουσιν  θερίζω  VERB    V-  Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 9   conj    _   ref=MATT_6.26
12  οὐδὲ    οὐδέ    CCONJ   C-  _   13  cc  _   ref=MATT_6.26
13  συνάγουσιν  συνάγω  VERB    V-  Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 9   conj    _   ref=MATT_6.26

Only, I would add Polarity=Neg to the two οὐδέ (10, 12).
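
A minimal Python sketch of that addition, under some assumptions: the input is one tab-separated CoNLL-U line, the lemma column is reliable, and the set of negative conjunctions is the one listed in this thread (εἴτε is left out because it has no negative part). The names `NEG_CONJ_LEMMAS` and `add_neg_polarity` are hypothetical.

```python
import unicodedata


def _nfc(text):
    # Normalize so that tonos/oxia and composed/decomposed accents compare equal.
    return unicodedata.normalize("NFC", text)


# Assumption: negative compound conjunctions as named in this thread.
NEG_CONJ_LEMMAS = {_nfc(lemma) for lemma in ("οὔτε", "οὐδέ", "μήτε", "μηδέ")}


def add_neg_polarity(line):
    """Add Polarity=Neg to FEATS if the line is a negative CCONJ; else return it unchanged."""
    line = line.rstrip("\n")
    cols = line.split("\t")
    # Skip comments, blank lines, multiword token ranges (3-4) and empty nodes (1.1).
    if len(cols) != 10 or not cols[0].isdigit():
        return line
    lemma, upos, feats = _nfc(cols[2]), cols[3], cols[5]
    if upos != "CCONJ" or lemma not in NEG_CONJ_LEMMAS:
        return line
    pairs = [] if feats == "_" else feats.split("|")
    if not any(pair.startswith("Polarity=") for pair in pairs):
        pairs.append("Polarity=Neg")
    # Keep features in case-insensitive alphabetical order.
    cols[5] = "|".join(sorted(pairs, key=str.lower))
    return "\t".join(cols)
```

Matching on the lemma rather than the form means that accent variants such as οὐδὲ and οὐδέ are treated alike.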

Also, should we try to standardize the surface forms of the pieces of the tokens? Your example opts for just using the lemmas, but we could also split the surface characters.

Yes, my example was with two nominatives, but we should reconstruct the appropriate forms. E.g. (simplified example, with POS and lemma only):

1-2 θἠμέρᾳ  _   _   _   _   _   _   _   _
1   τῇ  ὁ   DET _   _   _   _   _   _
2   ἡμέρᾳ   ἡμέρα   NOUN    _   _   _   _   _   _

mr-martian commented 1 year ago

Ah, I misread your proposal.

So your idea is to split crasis into a MWT with the components having their expected non-crasis forms. Meanwhile the various combinations of negative+conjunction are tagged as single conjunctions with negative polarity?

I slightly prefer splitting, on the grounds that it carries more information, but I'm willing to accept this proposal and wait for a proper morphological layer for the conjunctions.

francescomambrini commented 1 year ago

Yes, exactly! It's always a matter of balancing two conflicting needs:

It's true that negative particles may have their specific scope, i.e. they modify a token that is not the head of the coordination. However, UD gives us an elegant way to express the fact that a token carries a negative component in it with the Polarity feature, so I'd make use of that.

Crasis, on the other hand, is a prosodic phenomenon where two proper words (that are normally kept apart) are fused. In this case, I would definitely prefer not to lose the syntactic information that e.g. in ἁνὴρ we have DET + NOUN.

amir-zeldes commented 1 year ago

I also think the proposal by @francescomambrini sounds right, and FWIW we also kept οὔτε etc. as one token in UD Coptic (of course, there it is a rather opaque loanword).

mr-martian commented 1 year ago

The conjunction part of this is now implemented in the Arborator copy of Ancient Greek-PTNK: https://arboratorgrew.elizia.net/#/projects/Ancient_Greek_Septuagint

I haven't thought of a good way to query for crasis in the current data, and I may have to fix those as they come up.
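
One possible heuristic, offered only as a sketch: crasis is usually written with a coronis, which is encoded like a smooth breathing, on a vowel inside the word. A form with a smooth breathing on a vowel that follows a consonant is therefore a candidate. The function name `looks_like_crasis` is hypothetical, and the check assumes polytonic Unicode text. It will miss forms such as ἁνὴρ, where the only breathing is word-initial, and its hits still need manual review.

```python
import unicodedata

SMOOTH = "\u0313"  # COMBINING COMMA ABOVE: smooth breathing / coronis after NFD
VOWELS = set("αεηιουω")


def looks_like_crasis(form):
    """Heuristic: True if a smooth breathing sits on a vowel that follows a consonant."""
    seen_consonant = False
    base = ""
    for ch in unicodedata.normalize("NFD", form.lower()):
        if unicodedata.combining(ch):
            # A diacritic: check it against the base letter it attaches to.
            if ch == SMOOTH and base in VOWELS and seen_consonant:
                return True
            continue
        base = ch
        if ch.isalpha() and ch not in VOWELS:
            seen_consonant = True
    return False


if __name__ == "__main__":
    for form in ["θἠμέρᾳ", "ἐγᾦδα", "ἀνὴρ", "οὐδὲ", "ἁνὴρ"]:
        print(form, looks_like_crasis(form))
```

Running this over the FORM column of the treebank would give a list of candidates to inspect by hand rather than a definitive query.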