sasansom / sedes

Metrical position in Greek hexameter.

Investigate incomplete direct quotations #73

Open sasansom opened 2 years ago

sasansom commented 2 years ago

Iliad 24.558 is missing a closing quote tag (</q>) at the end of the line. Is there a way to find out whether other quote tags are missing?

<q direct="unspecified"><l>mh/ pw m' e)s qro/non i(/ze diotrefe\s o)/fra/ ken *(/ektwr</l>
<l>kei=tai e)ni\ klisi/h|sin a)khdh/s, a)lla\ ta/xista</l>
<l n="555">lu=son i(/n' o)fqalmoi=sin i)/dw: su\ de\ de/cai a)/poina</l>
<l>polla/, ta/ toi fe/romen: su\ de\ tw=nd' a)po/naio, kai\ e)/lqois</l>
<l>sh\n e)s patri/da gai=an, e)pei/ me prw=ton e)/asas</l>
<l>au)to/n te zw/ein kai\ o(ra=n fa/os h)eli/oio.</l>
<l><milestone ed="P" unit="para" />to\n d' a)/r' u(po/dra i)dw\n prose/fh po/das w)ku\s *)axilleu/s:</l>
<q direct="unspecified"><l n="560">mhke/ti nu=n m' e)re/qize ge/ron: noe/w de\ kai\ au)to\s</l>

whoopsedesy commented 2 years ago

The q elements in the XML have to be balanced; otherwise the XML would not be well-formed and any tool processing it would fail to parse it. There is always an equal number of <q> start tags and </q> end tags, so in general it's not possible to say where a tag has been omitted. Any </q> you insert to end a quotation needs to be balanced by adding a <q> to start a quotation somewhere else, at the very least.

Take this for example:

Alice says <q>Knock, knock. Bob says Who's there?</q>

Notice that while incorrect, the nesting is still well-formed; it just looks like one long quotation from Alice. Correcting it requires a change that cannot be inferred automatically:

Alice says <q>Knock, knock.</q> Bob says <q>Who's there?</q>
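
A minimal sketch of this point in Python, assuming the standard-library xml.etree.ElementTree parser (chosen only for illustration; the helper name is_well_formed and the <text> wrapper are hypothetical). Both the incorrect and the corrected markup parse without complaint, so a well-formedness check alone cannot flag the first one. Only a genuinely unbalanced tag is rejected.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment):
    """Return True if the fragment parses as XML once wrapped in a root element."""
    try:
        # The <text> wrapper is a hypothetical root so mixed content has one parent.
        ET.fromstring("<text>" + fragment + "</text>")
        return True
    except ET.ParseError:
        return False

# Incorrect but well-formed: Bob's line is swallowed by Alice's quotation.
wrong = "Alice says <q>Knock, knock. Bob says Who's there?</q>"
# Corrected markup.
right = "Alice says <q>Knock, knock.</q> Bob says <q>Who's there?</q>"
# Genuinely unbalanced: the second <q> is never closed.
unbalanced = "Alice says <q>Knock, knock.</q> Bob says <q>Who's there?"

for label, fragment in [("wrong", wrong), ("right", right), ("unbalanced", unbalanced)]:
    print(label, is_well_formed(fragment))
```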

But in one specific circumstance we can tell that something has gone wrong, at least if we assume quotations cannot be nested. If we see a new q element start before a previous q element has finished, it means a quotation is nested, and there are probably missing tags that need to be added. For example:

Alice says <q>Knock, knock. Bob says <q>Who's there?</q> Carol says It's me.</q>

This is still well-formed, but it contains a nested quotation, so someone could go in and manually add the missing tags (note that both a </q> and a <q> need to be added):

Alice says <q>Knock, knock.</q> Bob says <q>Who's there?</q> Carol says <q>It's me.</q>

There are probably many tools that can be used to find nested quotations, but one I know is XMLStarlet. Here, //q//q is an XPath expression that means "a q element contained in another q element." First we can count how many nested q elements occur in each file (-f -n prints the file name once per match, and uniq -c tallies the repeats):

$ xmlstarlet sel -t -m '//q//q' -f -n corpus/*.xml | uniq -c
    220 corpus/iliad.xml
     49 corpus/nonnusdionysiaca.xml
     84 corpus/odyssey.xml
      1 corpus/theocritus.xml

nonnusdionysiaca.xml contains double-nested quotations and iliad.xml contains up to sextuple-nested quotations:

$ xmlstarlet sel -t -m '//q//q//q' -f -n corpus/*.xml | uniq -c
    106 corpus/iliad.xml
      2 corpus/nonnusdionysiaca.xml
$ xmlstarlet sel -t -m '//q//q//q//q' -f -n corpus/*.xml | uniq -c
     45 corpus/iliad.xml
$ xmlstarlet sel -t -m '//q//q//q//q//q' -f -n corpus/*.xml | uniq -c
     22 corpus/iliad.xml
$ xmlstarlet sel -t -m '//q//q//q//q//q//q' -f -n corpus/*.xml | uniq -c
     10 corpus/iliad.xml
$ xmlstarlet sel -t -m '//q//q//q//q//q//q//q' -f -n corpus/*.xml | uniq -c
      3 corpus/iliad.xml

I'm not saying these are all necessarily errors, but this is where to look first for missing quotation tags.
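
Note that the XPath tallies above are cumulative: //q//q also matches everything that //q//q//q matches, and so on (for iliad.xml, 220 - 106 = 114 q elements sit at exactly one level of nesting). As a rough alternative, the following Python sketch computes the whole depth breakdown in one pass. It assumes the q elements carry no namespace, as the //q expression suggests, and that the file parses with the standard-library ElementTree; the function name q_depth_histogram is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def q_depth_histogram(root):
    """Map nesting depth (1 = not nested) to the number of q elements at that depth."""
    histogram = Counter()
    # Explicit stack of (element, number of q elements enclosing it).
    stack = [(root, 0)]
    while stack:
        elem, depth = stack.pop()
        if elem.tag == "q":
            depth += 1
            histogram[depth] += 1
        # Children inherit the depth of their parent.
        stack.extend((child, depth) for child in elem)
    return histogram

# The nested Alice/Bob/Carol example, wrapped in a hypothetical <text> root.
sample = ("<text>Alice says <q>Knock, knock. Bob says <q>Who's there?</q> "
          "Carol says It's me.</q></text>")
hist = q_depth_histogram(ET.fromstring(sample))
print(dict(sorted(hist.items())))
```

For a real file, something like q_depth_histogram(ET.parse("corpus/iliad.xml").getroot()) should work under the same assumptions. Every entry with depth 2 or more corresponds to a //q//q match.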

To pull out the nested quotations from one of the texts for inspection (only shows the enclosed quotation, not the enclosing quotation):

$ xmlstarlet sel -t -m '//q//q' -f -n -c '.' -n -n corpus/theocritus.xml 
corpus/theocritus.xml
<q direct="unspecified">tou=ton e)/rws e)/kteinen. o(doipo/re, mh\ parodeu/sh|s,
<lb/>a)lla\ sta\s to/de le/con: a)phne/a ei)=xen e(tai=ron.</q>
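
Because the XMLStarlet output shows only the enclosed quotation, it can help to see the start of the enclosing one as well. Here is a hedged Python sketch that does this with the standard-library ElementTree; the function name nested_quotation_pairs and the width parameter are hypothetical, and it again assumes un-namespaced q elements.

```python
import xml.etree.ElementTree as ET

def nested_quotation_pairs(root, width=40):
    """Return (enclosing_text, enclosed_text) snippets for every nested q element."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent = {child: p for p in root.iter() for child in p}

    def snippet(elem):
        # Collapse whitespace and truncate so long speeches stay readable.
        return " ".join("".join(elem.itertext()).split())[:width]

    pairs = []
    for q in root.iter("q"):
        # Walk upward to the nearest enclosing q, if there is one.
        ancestor = parent.get(q)
        while ancestor is not None and ancestor.tag != "q":
            ancestor = parent.get(ancestor)
        if ancestor is not None:
            pairs.append((snippet(ancestor), snippet(q)))
    return pairs

# The nested Alice/Bob/Carol example, wrapped in a hypothetical <text> root.
sample = ("<text>Alice says <q>Knock, knock. Bob says <q>Who's there?</q> "
          "Carol says It's me.</q></text>")
for outer, inner in nested_quotation_pairs(ET.fromstring(sample)):
    print("enclosing:", outer)
    print("enclosed: ", inner)
```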

The missing </q> on line 24.558 causes the <q> on line 24.560 to become one of the triple-nested quotations in iliad.xml:

$ xmlstarlet sel -t -m '//q//q//q//q' -f -n -c '.' -n -n corpus/iliad.xml
...
corpus/iliad.xml
<q direct="unspecified"><l n="560">mhke/ti nu=n m' e)re/qize ge/ron: noe/w de\ kai\ au)to\s</l>
<l>*(/ektora/ toi lu=sai, *dio/qen de/ moi a)/ggelos h)=lqe</l>
<l>mh/thr, h(/ m' e)/teken, quga/thr a(li/oio ge/rontos.</l>
<l>kai\ de/ se gignw/skw *pri/ame fresi/n, ou)de/ me lh/qeis,</l>
<l>o(/tti qew=n ti/s s' h)=ge qoa\s e)pi\ nh=as *)axaiw=n.</l>
<l n="565">ou) ga/r ke tlai/h broto\s e)lqe/men, ou)de\ ma/l' h(bw=n,</l>
<l>e)s strato/n: ou)de\ ga\r a)\n fula/kous la/qoi, ou)de/ k' o)xh=a</l>
<l>r(ei=a metoxli/sseie qura/wn h(metera/wn.</l>
<l>tw\ nu=n mh/ moi ma=llon e)n a)/lgesi qumo\n o)ri/nh|s,</l>
<l>mh/ se ge/ron ou)d' au)to\n e)ni\ klisi/h|sin e)a/sw</l>
<l n="570">kai\ i(ke/thn per e)o/nta, *dio\s d' a)li/twmai e)fetma/s.</l></q>
...

In general, you can search for //q//q matches and, for each match, scan backward for the most recent <q> tag. If the nesting is an error, there must be a missing </q> somewhere in between (and a missing <q> elsewhere to balance it).
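
The backward scan can be roughly automated. The sketch below uses Python's standard-library xml.parsers.expat to record the source line of every <q> start tag, and reports each nested q together with the line where the innermost still-open q began, so the candidate range for a missing </q> is between those two lines. The function name nested_q_locations is hypothetical, and the sketch assumes un-namespaced q elements and input that expat can parse as-is (for example, no undefined entities).

```python
import xml.parsers.expat

def nested_q_locations(xml_text):
    """Return (outer_line, inner_line) pairs for every q that starts inside an open q."""
    parser = xml.parsers.expat.ParserCreate()
    open_q_lines = []  # start-tag line numbers of q elements not yet closed
    found = []

    def start(name, attrs):
        if name == "q":
            line = parser.CurrentLineNumber
            if open_q_lines:
                # A q is already open: a </q> may be missing between these lines.
                found.append((open_q_lines[-1], line))
            open_q_lines.append(line)

    def end(name):
        if name == "q":
            open_q_lines.pop()

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return found

# Hypothetical miniature of the Iliad 24.558 situation: the first speech is
# never closed before the narration and the second speech begin.
sample = """<text>
<q><l>first speech</l>
<l>last line of first speech</l>
<l>narration</l>
<q><l>second speech</l></q>
</q>
</text>"""

for outer_line, inner_line in nested_q_locations(sample):
    print(f"q opened on line {outer_line} is still open at <q> on line {inner_line}")
```

The line numbers refer to lines of the XML file, not to verse numbers, so they still have to be mapped back to the text by hand.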