unipv-larl / UD4HL


harmonising Latin treebanks #5

Open TCGWim opened 1 year ago

TCGWim commented 1 year ago

Currently, we have five (relatively small) treebanks for Latin. It would be nice if we could join these treebanks into one bigger treebank for training neural network models. But the annotations of these five treebanks are not consistent. Different annotation decisions have been made for the different treebanks. Also, some of the treebanks are converted from other, non-UD, annotation guidelines, with possible limitations in following the UD guidelines.
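One way to make such inconsistencies visible before merging is to compare which dependency labels each treebank actually uses. The following is only a minimal sketch: it assumes the treebanks are in standard 10-column CoNLL-U format, the function names `deprel_counts` and `labels_not_shared` are hypothetical, and the two tiny samples are invented toy data, not excerpts from any of the Latin treebanks.

```python
from collections import Counter


def deprel_counts(conllu_text):
    """Count dependency relations over the word lines of a CoNLL-U string."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        # Skip multiword-token ranges ("2-3") and empty nodes ("1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        counts[cols[7]] += 1  # column 8 (index 7) is DEPREL
    return counts


def labels_not_shared(counts_by_treebank):
    """For each treebank, list the labels that are not used by all treebanks."""
    label_sets = {name: set(c) for name, c in counts_by_treebank.items()}
    shared = set.intersection(*label_sets.values())
    return {name: sorted(labels - shared) for name, labels in label_sets.items()}


# Invented toy samples (assumption: not real treebank data).
TOY_A = "\n".join([
    "# text = rosa floret",
    "1\trosa\trosa\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tfloret\tfloreo\tVERB\t_\t_\t0\troot\t_\t_",
    "",
])
TOY_B = "\n".join([
    "# text = rosa amatur",
    "1\trosa\trosa\tNOUN\t_\t_\t2\tnsubj:pass\t_\t_",
    "2\tamatur\tamo\tVERB\t_\t_\t0\troot\t_\t_",
    "",
])

report = labels_not_shared({"A": deprel_counts(TOY_A), "B": deprel_counts(TOY_B)})
print(report)
```

On real data, the same comparison could be run over the UPOS or FEATS columns; a label that appears in only one treebank is a candidate for closer inspection, not proof of an annotation error.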

amir-zeldes commented 1 year ago

Relevant paper: https://aclanthology.org/2023.udw-1.2.pdf

Stormur commented 1 year ago

Hi, Wim!

I am commenting on this as one of the curators of the active Latin treebanks at CIRCSE at the Università Cattolica of Milan (cf. the documentation page: https://universaldependencies.org/la/index.html).

On the issue, in case you don't already know it, I can first of all point you (and anybody who is interested) to this excellent work by Gamba & Zeman (https://aclanthology.org/2023.udw-1.2/) as a sort of introduction.

In short I could summarise the current situation as follows:

  • the three active treebanks (IT-TB, LLCT, UDante) are slowly getting closer to each other with each release, as they are curated by the same team and each correction is implemented with a common logic
  • we will discuss how to include (and I am personally convinced this is a necessary step) the harmonising changes detailed in the previously mentioned paper in the next release (unfortunately we missed the timing for the current one); this will bring the three treebanks even closer to each other
  • the two neglected treebanks (PROIEL and Perseus) are not to be considered at all at the moment, not until some intervention takes place
  • a real challenge, however, is the treatment of morphological features in LLCT, as the peculiar register of Latin there led to an annotation style which slightly deviates from UD's standard.

All in all, once the aforementioned harmonisation has taken place, I would say that most differences will lie in lemmatisation and, to a lesser extent, in part-of-speech tagging choices. Also, LLCT might provide some "interferences".

But I think that some experiments with the three active treebanks, even in the current situation, can already lead to interesting results, as they are "converging". We would of course be very happy to help you with them!

hanneme commented 1 year ago

The PROIEL conversion has been updated, so it will no longer be neglected after the upcoming release. The new conversion takes on board many points from the article Amir cites. However, it also needs to stay in sync with the other PROIEL treebanks (Greek, Old Church Slavonic, Gothic also exist in UD conversion), so language-internal consistency isn't the only issue at hand here.

On Fri, 5 May 2023 at 17:24, Flavio @.***> wrote:

Hi, Wim!

I am commenting on this as one of the curators of the active Latin treebanks at CIRCSE at the Università Cattolica of Milan (cf. documentation page https://universaldependencies.org/la/index.html).

On the issue, if you don't already know it, I can first of all also point you (and anybody who is interested) to this excellent work https://aclanthology.org/2023.udw-1.2/ by Gamba & Zeman as a sort of introduction.

In short I could summarise the current situation as follows:

  • the three active treebanks (IT-TB, LLCT, UDante) are slowly getting closer to each other with each release, as they are curated by the same team and each correction is implemented with a common logic
  • we will discuss how to include (and I am personally convinced this is a necessary step) the harmonising changes detailed in the previously mentioned paper in the next release (unfortunately we missed the timing for the current one); this will bring the three treebanks even closer to each other
  • the two neglected treebanks (PROIEL and Perseus) are not to be considered at all at the moment, not until some intervention takes place
  • a real challenge, however, is the treatment of morphological features in LLCT, as the peculiar register of Latin there led to an annotation style which slightly deviates from UD's standard.

All in all, once the aforementioned harmonisation has taken place, I would say that most differences will lie in lemmatisation and, to a lesser extent, in part-of-speech tagging choices. Also, LLCT might provide some "interferences".

But I think that some experiment even in the current situation with the three active treebanks can already bring interesting results. We would of course be very happy to help you with them!


Stormur commented 1 year ago

Ah, very good to know!

TCGWim commented 1 year ago

Dear Amir, Flavio, Hanneme,

Thank you for the quick replies. I will study the Gamba/Zeman paper.

Some years ago, I worked with Marco Passarotti (ITTB), Dag Haug (PROIEL) and Giuseppe Celano (Perseus) on harmonisation of the three treebanks. I am glad action has now been taken to modify the treebanks based on common agreements on the annotations.

Wim.

timokorkiakangas commented 1 year ago

Dear Wim, Flavio, Hanne and Amir,

This theme of harmonizing is very important, and of course something that the LiLa team has been preparing the ground for in recent years. The Gamba & Zeman paper was new to me as well, so I need to take time to read it.

As for LLCT, it is a bit misleading to consider it on a par with the other Latin treebanks, as i) it is based on a very Late Latin (8th and 9th c.) charter corpus, with hundreds of formulaic texts repeating the same passages, hence the completely unnatural distributions of linguistic features in comparison to other treebanks, and ii) the morphology and syntax of LLCT are not "Latin" in the traditional sense. This is why I'm always alarmed when I see LLCT being used at face value alongside other treebanks. LLCT's main raison d'être is to provide comparative data for diachronic change in Latin.

We are actually planning to use the existing Latin treebanks to parse two large parsebanks on medieval Latin in the upcoming months. Preliminary tests show how difficult that will be, as the treebanks are so different in various ways. In fact, the problems with the UD Latin models begin with the tokenization, with highly differing sentence boundaries and enclitic divisions with each model. (We also plan to train our own models though.)
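To illustrate the kind of tokenization difference mentioned above, here is a minimal sketch that profiles sentence segmentation and enclitic splitting in CoNLL-U output. It assumes standard CoNLL-U conventions (multiword tokens encoded as ID ranges such as `2-3`, blank lines between sentences); the function name `tokenization_profile` is hypothetical and both samples are invented toy data, not output from any actual UD Latin model.

```python
def tokenization_profile(conllu_text):
    """Summarise sentence and token segmentation in a CoNLL-U string."""
    sentences = words = multiword = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False  # a blank line closes the current sentence
            continue
        if line.startswith("#"):
            continue  # comment lines carry no tokens
        token_id = line.split("\t")[0]
        if not in_sentence:
            sentences += 1
            in_sentence = True
        if "-" in token_id:
            multiword += 1  # range line, e.g. "2-3" for populus + que
        elif token_id.isdigit():
            words += 1  # ordinary syntactic word (empty nodes are ignored)
    return {
        "sentences": sentences,
        "words": words,
        "multiword_tokens": multiword,
        "words_per_sentence": words / sentences if sentences else 0.0,
    }


# Invented toy samples: the same phrase with and without enclitic splitting.
SPLIT = "\n".join([
    "# text = senatus populusque",
    "1\tsenatus\tsenatus\tNOUN\t_\t_\t0\troot\t_\t_",
    "2-3\tpopulusque\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tpopulus\tpopulus\tNOUN\t_\t_\t1\tconj\t_\t_",
    "3\tque\tque\tCCONJ\t_\t_\t2\tcc\t_\t_",
    "",
])
UNSPLIT = "\n".join([
    "# text = senatus populusque",
    "1\tsenatus\tsenatus\tNOUN\t_\t_\t0\troot\t_\t_",
    "2\tpopulusque\tpopulus\tNOUN\t_\t_\t1\tconj\t_\t_",
    "",
])

split_profile = tokenization_profile(SPLIT)
unsplit_profile = tokenization_profile(UNSPLIT)
print(split_profile)
print(unsplit_profile)
```

Running such a profile over the output of each model on the same raw text would give a rough first indication of how far sentence boundaries and enclitic handling diverge, before any comparison of the annotation itself.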

Timo