Basic Formats for Annotated Rhyme Collections

LinguList commented 2 years ago

We use different formats to annotate rhyme collections, depending on the workflow, but we always use cldfbench to handle datasets.

A first example is now available in hanproj/baxterocrhymes.

I'll add an explanation later; I need to go now.

LinguList commented 2 years ago

@ashhenson, to replicate what I just did, please git-clone the repository, cd into it, and then run the following:

$ pip install -e .

This installs dependencies you will need.

Then, to download the SHIJING data, which I used in the 2016 paper:

$ cldfbench download cldfbench_baxterocrhymes.py

This downloads the data again (a copy is already in the folder raw).

Then, the conversion to CLDF is done with the command:

$ cldfbench makecldf cldfbench_baxterocrhymes.py

The resulting CLDF data is in the folder cldf.

LinguList commented 2 years ago

The CLDF data itself is a little relational database in CSV format, connected by the metadata file in JSON. We have our main file called examples.csv (we can also change the name), which contains the SHIJING corpus.

It looks like this:

ID,Language_ID,Primary_Text,Analyzed_Word,Gloss,Translated_Text,Meta_Language_ID,Comment,Poem_ID,Entry_IDS,Stanza_Number,Line_Number,Phrase_Number,Rhyme_Words,Rhyme_Word_Indices,Rhyme_IDS
1,OldChinese,關關雎a鳩,關\t關\t雎\t鳩,,,,,1,,1,1,1,鳩,4,1-1-a
2,OldChinese,在河之a洲。,在\t河\t之\t洲,,,,,1,,1,1,2,洲,4,1-1-a
3,OldChinese,窈宨淑x女,窈\t宨\t淑\t女,,,,,1,,1,2,1,,,

LinguList commented 2 years ago

So we have a language (which is OldChinese), a primary text, and an analyzed word (which is the segmented text, using \t as segmentation character here, but this can also be changed). There is also the possibility to provide word-level glosses and a translation of the whole text. These are all standard terms in CLDF, but we don't need them. Essential are the other elements I added now:

Poem_ID (the ID of the poem in our collection)
Entry_IDS (empty, I'll fill that later, giving a reference to the entries referenced here)
Stanza_Number, Line_Number, Phrase_Number (terms that tell you how to reconstruct the poem from the data, as they are different levels of segmentation)
Rhyme_Words (the rhyme words in the phrase, in these cases only one per phrase)
Rhyme_Word_Indices (the index of the rhyme word in the phrase: position 4 here, which would be 3 in Python)
Rhyme_IDS (a combined string of Poem_ID, Stanza_Number, and rhyme identifier, the a; I'll later modify this to use only a combination of Poem_ID and rhyme identifier, as we will annotate globally across a poem)
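
To make the column descriptions concrete, here is a minimal sketch that reads the first two sample rows with Python's standard csv module. It assumes that the \t shown in Analyzed_Word is a real tab character in the file, that each phrase has a single rhyme word, and that Rhyme_IDS has the poem-stanza-rhyme shape (1-1-a) of the sample; the helper name describe_row is made up for illustration.

```python
import csv
import io

# Columns reduced to the ones discussed here; rows copied from the sample.
SAMPLE = (
    "ID,Primary_Text,Analyzed_Word,Poem_ID,Stanza_Number,Line_Number,"
    "Phrase_Number,Rhyme_Words,Rhyme_Word_Indices,Rhyme_IDS\n"
    "1,關關雎a鳩,關\t關\t雎\t鳩,1,1,1,1,鳩,4,1-1-a\n"
    "2,在河之a洲。,在\t河\t之\t洲,1,1,1,2,洲,4,1-1-a\n"
)


def describe_row(row):
    """Hypothetical helper: recover the rhyme word and split the rhyme ID."""
    words = row["Analyzed_Word"].split("\t")  # assumed tab-segmented
    index = int(row["Rhyme_Word_Indices"]) - 1  # 1-based in the data, 0-based in Python
    poem, stanza, rhyme = row["Rhyme_IDS"].split("-")
    return {"rhyme_word": words[index], "poem": poem, "stanza": stanza, "rhyme": rhyme}


rows = [describe_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
for r in rows:
    print(r)
```

Looking the rhyme word up by index rather than trusting Rhyme_Words alone gives a cheap consistency check between the two columns.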

LinguList commented 2 years ago

How to load the data now? Easy: since the dataset is a Python package, you can install the package and thus gain access to the path where the data is on your system:


from collections import defaultdict
import itertools

import networkx as nx
from pycldf import Dataset
from cldfbench_baxterocrhymes import Dataset as Baxter

# Load the CLDF dataset via the path exposed by the installed package.
ds = Dataset.from_metadata(Baxter().cldf_dir / "cldf-metadata.json")
G = nx.Graph()
nodes = defaultdict(list)  # rhyme ID -> rhyme words annotated with it
for row in ds.objects("ExampleTable"):
    # Pair each rhyme word of the phrase with its rhyme ID.
    for char, rid in zip(row.data["Rhyme_Words"], row.data["Rhyme_IDS"]):
        if char in G:
            G.nodes[char]["occurrence"] += 1
        else:
            G.add_node(char, occurrence=1)
        nodes[rid].append(char)
# Link all words sharing a rhyme ID; the weight counts co-occurrences.
for rid, chars in nodes.items():
    for charA, charB in itertools.combinations(chars, r=2):
        try:
            G[charA][charB]["weight"] += 1
        except KeyError:
            G.add_edge(charA, charB, weight=1)
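
To see what this loop computes without installing the dataset or networkx, here is a rough standard-library sketch of the same counting idea. The first two entries come from the sample rows shown earlier; the other entries, including the characters 甲 and 乙 and the rhyme ID 2-1-a, are invented placeholders and not part of the corpus.

```python
from collections import Counter, defaultdict
from itertools import combinations

# (rhyme words, rhyme IDs) per phrase. Only the first two entries are real
# sample data; the rest are invented placeholders for illustration.
phrases = [
    (["鳩"], ["1-1-a"]),
    (["洲"], ["1-1-a"]),
    (["甲"], ["1-1-a"]),
    (["鳩"], ["2-1-a"]),
    (["乙"], ["2-1-a"]),
]

occurrence = Counter()      # node attribute: how often a rhyme word occurs
groups = defaultdict(list)  # rhyme ID -> rhyme words annotated with it
for words, rids in phrases:
    for char, rid in zip(words, rids):
        occurrence[char] += 1
        groups[rid].append(char)

weights = Counter()         # edge attribute: how often two words rhyme
for chars in groups.values():
    for a, b in combinations(chars, 2):
        weights[tuple(sorted((a, b)))] += 1

print(len(occurrence), "nodes,", len(weights), "edges")
```

Sorting each pair makes the edge key independent of the order in which the two words appear, which mirrors the undirected graph used above.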

LinguList commented 2 years ago

This yields, by the way, 1848 nodes.

LinguList commented 2 years ago

I'll leave you to play around with this and digest it a bit. I can provide more explanations later, and will also double-check with colleagues working on CLDF.

This could now be made into a paper where we introduce how to handle rhyme data in CLDF, which we may submit to Open Research Europe as an open-review paper with a clear intention, and as a start-up for the Han project. Robert Forkel would also help us, I assume, as he's totally into these text examples for CLDF now.

ashhenson commented 2 years ago

Cool. I'll try and get to this today. Thanks, Mattis!


ashhenson commented 2 years ago

I installed cldfbench according to your instructions. No errors were thrown. But when I went to run it:

ash:baxterocrhymes:>cldfbench makecldf cldfbench_baxterocrhymes.py

I got this error:

'Config C:\Users\ash\AppData\Local\cldf\cldf\catalog.ini has no entry for glottolog'

I'm assuming this means I need to install glottolog from here: https://glottolog.org/meta/downloads

Which of these do I need? (This is the list from their site.)

glottolog.sql.gz: gzipped PostgreSQL 9.x database dump
glottolog_languoid.csv.zip: tabular description of Glottolog languoids, one row per languoid, in (zipped) CSV
glottolog_source.bib.zip: zipped BibTeX file containing all Glottolog references
languages_and_dialects_geo.csv: CSV file containing geographic locations for Glottolog languages and dialects
tree_glottolog_newick.txt: trees for all Glottolog top-level families encoded in Newick format


LinguList commented 2 years ago

For these cases, we have a blog post. Can I ask you to check there first, and to get back to me if it does not help? You will need the Glottolog DATA, which is different from the Python package pyglottolog, which is on PyPI and already installed.
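
As a quick diagnostic for the error reported above, the following sketch checks whether an INI file contains a glottolog entry in any section. It is only an illustration: the function name has_catalog_entry, the section name clones, and the path /path/to/glottolog are assumptions, not taken from the cldfbench documentation, and the blog post remains the reference for the actual setup.

```python
import configparser


def has_catalog_entry(ini_text, name="glottolog"):
    """Return True if any section of the INI text defines the option `name`."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return any(parser.has_option(section, name) for section in parser.sections())


# Hypothetical file contents; section name and path are placeholders.
missing = "[clones]\n"
present = "[clones]\nglottolog = /path/to/glottolog\n"

print(has_catalog_entry(missing))
print(has_catalog_entry(present))
```

To inspect a real file, read its text (for example with pathlib.Path(...).read_text()) and pass it to the function.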

nh36 commented 2 years ago

@LinguList Can you please take a look at https://github.com/hanproj/received-shi/blob/main/processed-Lu-1983-%E5%85%88%E7%A7%A6%E6%BC%A2%E9%AD%8F%E6%99%89%E5%8D%97%E5%8C%97%E6%9C%9D%E8%A9%A9.txt? This is the first text Ash has parsed. Please don't tell us all the ways it makes you unhappy vis-à-vis CLDF; I know, and we will get there. I just have one question.

Q: Right now there is one poem per line, but we want one line of poetry per line. However, there is a lot of stuff (short intro, references to secondary literature, etc.) that pertains to the poem and not to the line of poetry. So, I give you two options.

We have two versions of the file: we keep one version of the data with one poem per line and columns for this 'metadata', and one version of the data with one line of poetry per line, where this 'metadata' simply does not appear.
We redundantly include poem 'metadata' for each line of the poem. So, if it is a 10 line poem, then the notes about the author appear in the relevant column repeated ten times, once for each line of the poem.

My guess is that you will want the second type, since it is sort of safer, although it would get very cluttered. (Of course we are only talking about archival standards; for on-the-fly work all bets are off.)
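
The trade-off between the two options can be sketched in a few lines of Python. This is only an illustration: the column names (Poem_ID, Author, Line_Number, Text) and the helper names split_tables and flatten are made up, and the first option is approximated here as a poem-level record plus a line-level table linked by Poem_ID.

```python
# One poem with poem-level metadata and its lines (hypothetical column names).
poem = {
    "Poem_ID": "1.1",
    "Author": "劉邦",
    "Lines": ["大風起兮雲飛揚。", "威加海內兮歸故鄉。", "安得猛士兮守四方。"],
}


def split_tables(poem):
    """Option 1: metadata stays at poem level, lines go into their own table."""
    meta = {key: value for key, value in poem.items() if key != "Lines"}
    lines = [
        {"Poem_ID": poem["Poem_ID"], "Line_Number": number, "Text": text}
        for number, text in enumerate(poem["Lines"], start=1)
    ]
    return meta, lines


def flatten(poem):
    """Option 2: repeat the poem metadata on every line of the poem."""
    meta, lines = split_tables(poem)
    return [{**meta, **line} for line in lines]


meta, lines = split_tables(poem)
flat = flatten(poem)
print(len(lines), "line rows;", len(flat), "flattened rows")
```

Since the flattened rows can be derived from the two linked tables, keeping the linked form loses nothing, while the flattened form repeats each metadata value once per line.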

LinguList commented 2 years ago

There is a third option that I would like us to maintain here: metadata is added in front of the poem with the help of the @metadata: content construct, which we describe in our paper with Chris as an alternative format that can be used for quick annotation.

This format would then render a poem as shown there in a txt file (no TSV!):

@number: 1.1
@title: 大風  
@date: 漢高帝
@author: 劉邦
@author_information:〈邦。字季。沛豐邑中陽里人。初為泗上亭長。秦二世元年。起兵。稱沛公。明年。楚懷王以為碭郡長。封武安侯。以子嬰元年西入關。立為漢王。都南鄭。以漢五年破項羽。即皇帝位。都長安。漢十二年卒。年五十三。謚曰高皇帝。〉    
@references:〖《史記》又名三侯之章。〗
@commentary:〖《漢書》曰：上欲廢太子。立戚夫人子趙王如意。漢十二年。上從破布歸。疾益甚。愈欲易太子。及宴。置酒。太子侍。四人者從太子。年皆八十有餘。鬚眉皓白。衣冠甚偉。為壽已畢。趨去。上目送之。召戚夫人指視曰：我欲易之。四人為之輔。羽翼已成。難動矣。戚夫人泣涕。上曰：為我楚舞。吾為若楚歌。歌曰：〗 
@aftermath:（○《漢書》高帝紀。《史記》高祖紀。《文選》二十八。《史記》樂書三侯之章下司馬貞索隱。《書鈔》一百六。《類聚》四十三。《白帖》十八。《御覽》八、八十七、二百四十一、五百三十九、五百九十一。《樂府詩集》五十八。《詩紀》一。又魏志蔣濟傳作高祖引方一韻。倭名《類聚》一引揚一韻。）

大 風 起 兮 雲 飛 揚。
威 加 海 內 兮 歸 故 鄉。
安 得 猛 士 兮 守 四 方。

LinguList commented 2 years ago

This is the input format we use for rhyme detection and annotation. If there is more than one stanza, one blank line separates the stanzas. To separate poems from each other, I recommend two blank lines.

LinguList commented 2 years ago

Note that the @metadata:content construct is free: the metadata names just need to consist of alphanumeric characters plus underscore, with no dashes etc. But there should be a readme describing the abbreviations and what we expect in each metadata point.
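
A minimal sketch of a reader for this format, assuming exactly the conventions stated above: metadata lines of the form @key: value with alphanumeric keys plus underscore, one blank line between stanzas, and two blank lines between poems. The function name parse_poems is made up, and the second poem in the sample is an invented placeholder, not corpus data.

```python
import re

# Metadata line: "@" + key (alphanumeric or underscore) + ":" + content.
META = re.compile(r"^@([A-Za-z0-9_]+):\s*(.*)$")

SAMPLE = """@number: 1.1
@title: 大風
@author: 劉邦

大 風 起 兮 雲 飛 揚。
威 加 海 內 兮 歸 故 鄉。
安 得 猛 士 兮 守 四 方。


@number: X
@title: placeholder

甲 乙 丙。

丁 戊 己。
"""


def parse_poems(text):
    """Hypothetical parser: returns one dict per poem with meta and stanzas."""
    poems = []
    # Two (or more) blank lines separate poems.
    for chunk in re.split(r"\n[ \t]*\n[ \t]*\n+", text.strip()):
        poem = {"meta": {}, "stanzas": []}
        # One blank line separates the metadata block and the stanzas.
        for block in re.split(r"\n[ \t]*\n", chunk):
            stanza = []
            for line in block.splitlines():
                line = line.strip()
                if not line:
                    continue
                match = META.match(line)
                if match:
                    poem["meta"][match.group(1)] = match.group(2).strip()
                else:
                    stanza.append(line.split())  # characters are space-separated
            if stanza:
                poem["stanzas"].append(stanza)
        poems.append(poem)
    return poems


poems = parse_poems(SAMPLE)
for poem in poems:
    print(poem["meta"].get("title"), len(poem["stanzas"]), "stanza(s)")
```

Punctuation such as 。 stays attached to the last character of each line in this sketch; whether to strip it is a decision for the annotation workflow.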

hanproj / hanproject

Basic Formats for Annotated Rhyme Collections #18