A CDLI-ATF to CDLI-Conll Python converter. The CDLI-ATF format is described here.
The full atf corpus is in this repository here.
The goal is to take an atf file and convert it into one cdli-conll file per text.
Cdli-conll isn't an official format. It is our in-house conll version that we will then convert to conll-u format after the texts are annotated.
If you want to know more about conll-u, see here.
Here is an example of what the results should look like :
#new_text=P101049
# ID FORM SEGM XPOSTAG HEAD DEPREL MISC
o.1.1 2(disz)
o.1.2 ma2
o.1.3 1(gesz2)
o.1.4 gur
o.1.5 2(ban2)-ta
o.1.6 ma2-lah5-bi
o.1.7 i3-ib2-u3
o.2.1 u4
o.2.2 3(u)
o.2.3 2(disz)-sze3
o.3.1 sze-bi
o.3.2 4(asz)
o.3.3 1(barig)
o.3.4 2(ban2)
o.3.5 gur
r.1.1 a-pi4-sal4{ki}-ta
r.2.1 nibru{ki}-sze3
r.2.2 siki
r.2.3 ba-a-si
r.3.1 giri3
r.3.2 ur-e11-e
r.4.1 mu
r.4.2 ur-bi2-lum{ki}
r.4.3 ba-hul
This is from the corresponding ATF :
&P101049 = AnOr 01, 058
#atf: lang sux
@tablet
@obverse
1. 2(disz) ma2 1(gesz2) gur 2(ban2)-ta ma2-lah5-bi i3-ib2-u3
#tr.en: 2 barges of 60 gur (capacity), 2 ban2 (per day) each, their skippers piloting,
2. u4 3(u) 2(disz)-sze3
#tr.en: for 32 days,
3. sze-bi 4(asz) 1(barig) 2(ban2) gur
#tr.en: its barley: 4 gur 1 barig 2 ban2;
# calculation: 32 × 2 × 0;0,2 = 4;1,2
@reverse
1. a-pi4-sal4{ki}-ta
#tr.en: from Apisal
2. nibru{ki}-sze3 siki ba-a-si
#tr.en: to Nippur, with wool filled,
3. giri3 ur-e11-e
#tr.en: via Ur-e’e,
4. mu ur-bi2-lum{ki} ba-hul
#tr.en: year: “Urbilum was destroyed.”
# Šulgi 45
You can see that the headers are not the same as in Conll-u.
The file should be names Pnnnnnn.conll , replace the Ns by the actual ID number of the text. The ID is compose of 4 elements, : surface, column, line, word. the surface code is the abbreviation of the surface name (eg. obverse = o), if there are columns, there are none in this text, then add "col" and the column number, then line and word number.
Python version 2.7.x
Pass the filename argument to the code like this:
$ python Convertor.py input.txt
For more information, contact me.