lukehollis / iip-word-lists

Python utility for creating word lists from epidoc files

elements and markup for word breaks (original issue) #1

Open emylonas opened 3 years ago

emylonas commented 3 years ago

<supplied> as part of a word. When the <supplied> element has no space between it and a text node, it's part of the same word, so it goes INSIDE the <w> element. <supplied> should always be inside the <w>.

 <p>Ἐπ’ ἱερέως <w>Θρασυδ<supplied reason="lost">άμου</supplied></w>
               <lb/><w>Δαλ<supplied reason="lost">ίου</supplied></w>
            </p>

<supplied> with multiple or partial words in it. Split it so that each word gets its own <w> with its own <supplied> inside.

Input:

ἁγ<supplied reason="lost">ίῳ τόπῳ προσήνεγκ</supplied>
<lb break="no"/>α

Expected output:

<w>ἁγ<supplied reason="lost">ίῳ</supplied></w> <w><supplied reason="lost">τόπῳ</supplied></w> <w><supplied reason="lost">προσήνεγκ</supplied>
               <lb break="no"/>α</w>
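
A minimal sketch of the splitting step only, using the standard library's xml.etree.ElementTree. The function name split_supplied is hypothetical, and the sketch assumes the <supplied> element holds plain text with no child elements and no TEI namespace, as in the snippet above. Attaching the first and last pieces to the neighbouring text (ἁγ before, α after the <lb break="no"/>) is not covered here.

```python
import xml.etree.ElementTree as ET


def split_supplied(supplied):
    """Return one <supplied> copy per whitespace-separated word.

    Assumption: the element contains only text, no child elements.
    """
    pieces = []
    for word in (supplied.text or "").split():
        # Copy the attributes (e.g. reason="lost") onto every piece.
        piece = ET.Element(supplied.tag, dict(supplied.attrib))
        piece.text = word
        pieces.append(piece)
    return pieces


if __name__ == "__main__":
    element = ET.fromstring('<supplied reason="lost">ίῳ τόπῳ προσήνεγκ</supplied>')
    for piece in split_supplied(element):
        print(ET.tostring(piece, encoding="unicode"))
```

Each returned piece would then be placed inside its own <w>, together with any adjacent text that has no space between it and the piece.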

<gap>: throw gaps away. But words following them are treated in one of two ways:

  1. If there is a space after the gap, it's business as usual.
  2. If there is no space after the gap, the text that follows is a partial word and is wrapped in <w type="partial">:
    <gap reason="lost" extent="unknown" unit="character"/>ου
    becomes
    <gap reason="lost" extent="unknown" unit="character"/><w type="partial">ου</w>
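
The two gap cases could be sketched roughly as follows with xml.etree.ElementTree. The function name wrap_partial_after_gaps is hypothetical, tags are assumed to be un-namespaced as in the snippets above, and the sketch only looks at the text directly after the gap; a word that continues into a following element such as <supplied> is not handled.

```python
import xml.etree.ElementTree as ET


def wrap_partial_after_gaps(root):
    """Wrap text that directly follows a <gap> in <w type="partial">."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for gap in list(root.iter("gap")):
        parent = parents.get(gap)
        tail = gap.tail or ""
        # Case 1: a space (or nothing) follows the gap, so leave it alone.
        if parent is None or not tail or tail[0].isspace():
            continue
        # Case 2: the run up to the next whitespace is a partial word.
        fragment = tail.split(None, 1)[0]
        w = ET.Element("w", {"type": "partial"})
        w.text = fragment
        w.tail = tail[len(fragment):] or None
        gap.tail = None
        parent.insert(list(parent).index(gap) + 1, w)
    return root


if __name__ == "__main__":
    p = ET.fromstring(
        '<p><gap reason="lost" extent="unknown" unit="character"/>ου</p>'
    )
    print(ET.tostring(wrap_partial_after_gaps(p), encoding="unicode"))
```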

Also, an <orig> whose parent is <p> (that is, not a child of <choice>) should be deep-copied together with its tag. Same for <num>.

<unclear> is always part of a word. Keep it inside <w>.

Spaces: interword spaces don't need to be preserved, as far as we know.

What the id should look like: aaaa0000-l-w, where l is the line number and w is the position of the word in the line.
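
A small sketch of how ids in this shape might be generated. The names word_id and assign_ids are hypothetical; the aaaa0000 part is simply passed in as a prefix, and counting lines and words from 1 is an assumption, since the issue does not say where numbering starts.

```python
def word_id(prefix, line, word):
    """Build an id of the form aaaa0000-l-w."""
    return f"{prefix}-{line}-{word}"


def assign_ids(prefix, lines):
    """Pair every word with its id.

    `lines` is a list of lines, each a list of word strings.
    Assumption: line and word numbers both start at 1.
    """
    return [
        (word_id(prefix, line_no, word_no), word)
        for line_no, words in enumerate(lines, start=1)
        for word_no, word in enumerate(words, start=1)
    ]


if __name__ == "__main__":
    for identifier, word in assign_ids("aaaa0000", [["Ἐπ’", "ἱερέως"], ["Δαλίου"]]):
        print(identifier, word)
```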