lukehollis / iip-word-lists

Python utility for creating word lists from epidoc files

<supplied> #12

Open lukehollis opened 3 years ago

lukehollis commented 3 years ago

<supplied> goes inside <w>.

Examples: tbas0001

<supplied reason="omitted">ε</supplied>ἴσ<supplied reason="lost">οδόν</supplied>
<supplied reason="lost">σου</supplied>

This consists of 2 words: <w><supplied reason="omitted">ε</supplied>ἴσ<supplied reason="lost">οδόν</supplied></w> <w><supplied reason="lost">σου</supplied></w>

The first string has no spaces between the two <supplied> elements and the other characters of the word that sit between them (ἴσ). The second word is separated from the first by a space, and it has no spaces inside the <supplied> element. Everything, including the <supplied> elements, can be deep-copied into the <w>.
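A minimal sketch of that rule, assuming the tokenizer works on the serialized text of a text block. The names `TOKEN` and `wrap_words` are hypothetical and are not taken from the repository. A token is treated as a maximal run of tags and non-space characters, so whitespace inside a tag's attributes does not split a word. It also assumes there is no whitespace inside a <supplied> element, as in the example above.

```python
import re

# A token is a run of complete tags and non-space characters.
# Whitespace inside a tag (between attributes) is consumed by the tag branch,
# so only whitespace outside tags separates words.
TOKEN = re.compile(r"(?:<[^>]+>|[^\s<])+")


def wrap_words(fragment: str) -> str:
    """Wrap each whitespace-separated token in <w>, copying inner markup as is."""
    return " ".join(f"<w>{token}</w>" for token in TOKEN.findall(fragment))


if __name__ == "__main__":
    sample = (
        '<supplied reason="omitted">ε</supplied>ἴσ'
        '<supplied reason="lost">οδόν</supplied>\n'
        '<supplied reason="lost">σου</supplied>'
    )
    print(wrap_words(sample))
```

A <supplied> that itself contains a space would be cut in two by this approach, so that case would need separate handling.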

caes0683

<p>Sanct<supplied reason="lost">o</supplied>
      <lb/>Genio fru<supplied reason="lost">m</supplied>
      <lb break="no"/>entarioru<supplied reason="lost">m</supplied>
      <lb/>omnia
       <lb/>felicia
</p>

should be

<p><w>Sanct<supplied reason="lost">o</supplied></w>
       <w>Genio</w> <w>fru<supplied reason="lost">m</supplied>entarioru<supplied reason="lost">m</supplied></w> <w>omnia</w> <w>felicia</w>
</p>
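One way the caes0683 case could be handled, as a sketch only: drop `<lb break="no"/>` together with the whitespace around it so the two halves of the word join, turn every other `<lb/>` into a word boundary, then wrap the tokens. The function and pattern names are hypothetical, and the sketch assumes the input is the inner content of the <p> as a string.

```python
import re

# <lb break="no"/> marks a line break inside a word: remove it and the
# surrounding whitespace so the word parts become adjacent.
NO_BREAK_LB = re.compile(r'\s*<lb\s+break="no"\s*/>\s*')
# A plain <lb/> falls between words: replace it with a single space.
PLAIN_LB = re.compile(r"\s*<lb\s*/>\s*")
# A token is a run of complete tags and non-space characters.
TOKEN = re.compile(r"(?:<[^>]+>|[^\s<])+")


def tokenize_paragraph(inner: str) -> str:
    """Return the paragraph content with each word wrapped in <w>."""
    joined = NO_BREAK_LB.sub("", inner)
    spaced = PLAIN_LB.sub(" ", joined)
    return " ".join(f"<w>{token}</w>" for token in TOKEN.findall(spaced))
```

Note that this sketch discards the <lb/> elements, which matches the expected output shown above; keeping them would need a different strategy.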

emylonas commented 3 years ago

Comment on output for supplied: in the Latin spreadsheet, cell 15D, the output is:

<w   xml:id="jeru0554-34" xml:lang="la"><expan><abbr><w xml:id="jeru0554-35" xml:lang="la"><supplied reason="lost">Antoninia</supplied></w>na</abbr><ex>e</ex></expan></w>

it should be

<w   xml:id="jeru0554-34" xml:lang="la"><expan><abbr><supplied reason="lost">Antoninia</supplied>na</abbr><ex>e</ex></expan></w>

Anything inside an <expan> should be copied as is, with no added <w> elements: the <expan> is by definition a single word. This is where the regex has to have some understanding of the XML structure. One possibility is to add a check so that the regex only matches a <supplied> when it does not match the XPath `//expan//supplied`, if that makes sense. It depends on the order you are running things in.
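A rough sketch of that check using the standard library's `xml.etree.ElementTree`. The helper names are hypothetical, and the sketch assumes un-namespaced element names; real EpiDoc files normally use the TEI namespace, so the tag names would need to be qualified accordingly.

```python
import xml.etree.ElementTree as ET


def supplied_inside_expan(root):
    """Collect every <supplied> that has an <expan> ancestor."""
    protected = set()
    for expan in root.iter("expan"):
        protected.update(expan.iter("supplied"))
    return protected


def supplied_to_wrap(root):
    """Return the <supplied> elements that are not inside any <expan>."""
    protected = supplied_inside_expan(root)
    return [s for s in root.iter("supplied") if s not in protected]


if __name__ == "__main__":
    doc = ET.fromstring(
        "<ab><w><expan><abbr>"
        '<supplied reason="lost">Antoninia</supplied>na'
        "</abbr><ex>e</ex></expan></w> "
        'Ant<supplied reason="lost">o</supplied>nini</ab>'
    )
    print([s.text for s in supplied_to_wrap(doc)])
```

Only the <supplied> elements returned by `supplied_to_wrap` would then be candidates for wrapping; anything under an <expan> is left untouched.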

zeichman commented 3 years ago

Looking through the CSV file, it looks like words with supplied parts in the middle are still getting split up, when they should be a single word. Looking at askh0003a.xml:

Ant<supplied reason="lost">o</supplied>nini is being rendered as three distinct words: <w>Ant</w> <w><supplied reason="lost">o</supplied></w> <w>nini</w>. Instead, it should be <w>Ant<supplied reason="lost">o</supplied>nini</w>.

This can be seen with a few other instances in the same file, such as <supplied reason="lost">Au</supplied>relio being two words, one being the supplied portion, the other being the "regular" portion of the word.

If there is no space between the <supplied></supplied> tag and the adjacent letters, they should form a single word.

For other examples, see caes0528, where a single Latin word gets divided into two: <supplied reason="lost" cert="low">Cae</supplied>sari<lb/> is rendered as <w><supplied reason="lost" cert="low">Cae</supplied></w> <w>sari</w><lb/> instead of the correct <w><supplied reason="lost" cert="low">Cae</supplied>sari</w><lb/>.
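A possible sanity check for catching these splits, sketched under the assumption that both the source fragment and the tokenized output are available as well-formed, un-namespaced XML strings. The function names are hypothetical. The idea is to compare the whitespace-separated words of the plain text against the text content of each <w>.

```python
import xml.etree.ElementTree as ET


def expected_words(source_xml: str) -> list[str]:
    """Words as they appear in the plain text, ignoring all markup."""
    root = ET.fromstring(source_xml)
    return "".join(root.itertext()).split()


def tokenized_words(output_xml: str) -> list[str]:
    """Text content of every <w> element in the tokenized output."""
    root = ET.fromstring(output_xml)
    return ["".join(w.itertext()) for w in root.iter("w")]


def is_split_correctly(source_xml: str, output_xml: str) -> bool:
    """True when the <w> elements match the whitespace-separated words."""
    return expected_words(source_xml) == tokenized_words(output_xml)


if __name__ == "__main__":
    source = '<ab>Ant<supplied reason="lost">o</supplied>nini</ab>'
    bad = (
        "<ab><w>Ant</w> "
        '<w><supplied reason="lost">o</supplied></w> '
        "<w>nini</w></ab>"
    )
    good = '<ab><w>Ant<supplied reason="lost">o</supplied>nini</w></ab>'
    print(is_split_correctly(source, bad), is_split_correctly(source, good))
```

This check does not account for words joined across `<lb break="no"/>`, and nested <w> elements would be counted twice, so it is only a first-pass filter for the CSV output.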