Open lukehollis opened 3 years ago
Comment on output for supplied: In the spreadsheet for Latin, cell 15D the output is:
<w xml:id="jeru0554-34" xml:lang="la"><expan><abbr><w xml:id="jeru0554-35" xml:lang="la"><supplied reason="lost">Antoninia</supplied></w>na</abbr><ex>e</ex></expan></w>
it should be
<w xml:id="jeru0554-34" xml:lang="la"><expan><abbr><supplied reason="lost">Antoninia</supplied>na</abbr><ex>e</ex></expan></w>
Anything inside an <expan> should be copied as is. No added <w> elements. The <expan> is by definition a single word.
This is where the regex has to have some understanding of the XML structure. One possibility is to check that the regex only matches a <supplied> when that element doesn't match the XPath `//expan//supplied`, if that makes sense. It depends on the order you are running things in.
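A minimal sketch of that check, assuming Python's standard `xml.etree.ElementTree` and un-namespaced tags as in the snippets above (real files may carry a TEI namespace, which would need to be added to the paths). The function name `supplied_outside_expan` and the sample fragment are made up for illustration:

```python
import xml.etree.ElementTree as ET

def supplied_outside_expan(root):
    """Return the <supplied> elements that are not nested in any <expan>."""
    # ElementTree has no parent pointers, so collect the nested ones first.
    # Note: ".//expan" only finds <expan> elements below `root`, not `root` itself.
    nested = {id(el) for el in root.findall(".//expan//supplied")}
    return [el for el in root.iter("supplied") if id(el) not in nested]

# Hypothetical sample: one <supplied> inside <expan>, one outside.
sample = ET.fromstring(
    '<ab><expan><abbr><supplied reason="lost">Antoninia</supplied>na</abbr>'
    '<ex>e</ex></expan> <supplied reason="lost">Au</supplied>relio</ab>'
)
candidates = supplied_outside_expan(sample)
print([el.text for el in candidates])
```

Only the elements returned here would then be handed to the word-wrapping step, so nothing under an <expan> gets an extra <w>.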
Looking through the CSV file, it looks like words with supplied parts in the middle are still getting split up, when they should be a single word. Looking at askh0003a.xml:
Ant<supplied reason="lost">o</supplied>nini is being rendered as three distinct words: <w>Ant</w> <w><supplied reason="lost">o</supplied></w> <w>nini</w>. Instead, it should be <w>Ant<supplied reason="lost">o</supplied>nini</w>.
This can be seen with a few other instances in the same file, such as <supplied reason="lost">Au</supplied>relio being split into two words, one being the supplied portion, the other being the "regular" portion of the word.
If there is no space or
between the <supplied></supplied> tag and adjacent letters, they should be a single word.
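As a rough illustration of that rule, here is a regex sketch that treats <supplied> tags as part of the surrounding word and breaks only on whitespace or on other tags. It assumes that a <supplied> never contains whitespace in its text and that the input is a flat fragment with no existing <w> elements. `TOKEN` and `wrap_words` are hypothetical names, not the project's code:

```python
import re

# Group 1: a run of <supplied> tags and non-space characters (one word).
# Second alternative: any other tag, which is passed through untouched.
TOKEN = re.compile(r'((?:</?supplied\b[^>]*>|[^\s<])+)|<[^>]*>')

def wrap_words(fragment):
    """Wrap each whitespace-delimited word, including inline <supplied>, in <w>."""
    def repl(match):
        word = match.group(1)
        return f"<w>{word}</w>" if word else match.group(0)
    return TOKEN.sub(repl, fragment)

print(wrap_words('Ant<supplied reason="lost">o</supplied>nini'))
print(wrap_words('<supplied reason="lost">Au</supplied>relio'))
```

A regex like this knows nothing about nesting, so it would still need the <expan> guard described earlier, or a tree-based pass instead.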
For other examples, see, for instance, caes0528, which is one Latin word, but gets divided up into two.
<supplied reason="lost" cert="low">Cae</supplied>sari<lb/> is rendered as <w><supplied reason="lost" cert="low">Cae</supplied></w> <w>sari</w><lb/> instead of what would be correct: <w><supplied reason="lost" cert="low">Cae</supplied>sari</w><lb/>.
<supplied> goes inside <w>. Examples: tbas0001
This consists of 2 words:
<w><supplied reason="omitted">ε</supplied>ἴσ<supplied reason="lost">οδόν</supplied></w> <w><supplied reason="lost">σου</supplied></w>
The first string has no spaces between the two <supplied> elements and the other characters of the word that are in between: ἴσ
The second word is separated by a space from the first word, and it has no spaces inside the <supplied> element. Everything including the supplied can be deep-copied into the <w>.
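The deep-copy idea could look roughly like the following tree-based sketch, again assuming `xml.etree.ElementTree` and un-namespaced tags. It starts a new <w> only at whitespace in the text between child elements and copies each child whole, so an <expan> or <supplied> is never split or given inner <w> elements. It treats every child element as part of a word, so something like <lb/> would need extra handling. `group_into_words` and the `<ab>` wrapper are illustrative assumptions:

```python
import copy
import re
import xml.etree.ElementTree as ET

def _append_text(el, text):
    """Append text at the end of an element's mixed content."""
    if len(el):
        el[-1].tail = (el[-1].tail or "") + text
    else:
        el.text = (el.text or "") + text

def group_into_words(parent):
    """Return a copy of `parent` whose content is regrouped into <w> elements."""
    out = ET.Element(parent.tag, dict(parent.attrib))
    word = None  # the <w> currently being filled, if any

    def feed(text):
        nonlocal word
        for piece in re.split(r"(\s+)", text or ""):
            if not piece:
                continue
            if piece.isspace():
                word = None                 # whitespace closes the current word
                _append_text(out, piece)
            else:
                if word is None:
                    word = ET.SubElement(out, "w")
                _append_text(word, piece)

    feed(parent.text)
    for child in parent:
        clone = copy.deepcopy(child)        # deep copy, children included
        tail, clone.tail = clone.tail, None
        if word is None:
            word = ET.SubElement(out, "w")
        word.append(clone)
        feed(tail)                          # text after the child decides the split
    return out

src = ET.fromstring(
    '<ab><supplied reason="omitted">ε</supplied>ἴσ'
    '<supplied reason="lost">οδόν</supplied> '
    '<supplied reason="lost">σου</supplied></ab>'
)
result = ET.tostring(group_into_words(src), encoding="unicode")
print(result)
```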
caes0683 should be