lukehollis / iip-word-lists

Python utility for creating word lists from epidoc files

<lb> #7

Open emylonas opened 3 years ago

emylonas commented 3 years ago

Line break elements don't need to be preserved, but they do need to be used in counting the words and lines for the @xml:id attribute.

Simple <lb/> elements always indicate word breaks. <lb break="no"/> elements appear inside words that break over lines. The word id should indicate the line it starts on and its number on that line. The first full word on the new line will be word number 1.
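
The numbering rule above could be sketched roughly as follows. This is only an illustration, not the repository's code: `number_words` is a hypothetical helper, and the fragment is assumed to be flat (only `<lb>` children inside one parent), un-namespaced, and free of whitespace around `<lb break="no"/>`. Real EpiDoc files use the TEI namespace, so the tag comparison would need adjusting. The sketch returns (line, position, text) tuples and leaves the actual @xml:id format open.

```python
import re
import xml.etree.ElementTree as ET


def number_words(xml_fragment):
    """Return (line, position, text) for each word in a flat fragment.

    Hypothetical helper: assumes un-namespaced <lb> elements that are
    direct children of the root element.
    """
    root = ET.fromstring(xml_fragment)
    words = []        # every word seen so far, as [line, position, text]
    line = 0          # current line number; the first <lb/> makes it 1
    position = 0      # number of words started on the current line
    open_word = None  # word still being assembled, if any

    def feed(text):
        nonlocal position, open_word
        for piece in re.findall(r"\s+|\S+", text or ""):
            if piece.isspace():
                open_word = None           # whitespace ends the word
            elif open_word is not None:
                open_word[2] += piece      # continuation after break="no"
            else:
                position += 1              # a new word starts on this line
                open_word = [line, position, piece]
                words.append(open_word)

    feed(root.text)
    for child in root:
        if child.tag == "lb":
            line += 1
            position = 0
            if child.get("break") != "no":
                open_word = None           # a plain <lb/> is a word break
        feed(child.tail)
    return [tuple(word) for word in words]
```

For `<ab><lb/>alpha be<lb break="no"/>ta gamma<lb/>delta</ab>`, the word `beta` keeps the id of the line it starts on (line 1, word 2), and `gamma` becomes word 1 of line 2.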

zeichman commented 3 years ago

Is it possible to eliminate white space preceding/following <lb break="no"/>? I see a few examples where words are being broken up incorrectly. For instance, naza0001: <lb/>Ἱεροσολ <lb break="no"/>ύμων

This is being treated as two words (<w>Ἱεροσολ</w>, <w>ύμων</w>), when it should be one word (<w>Ἱεροσολύμων</w>).

It's possible there is some other reason this word isn't being treated properly, but this is my best guess.
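
One possible way to do this is to normalize the raw XML text before tokenizing. The snippet below is a minimal sketch under that assumption, not a description of how the utility currently works: `strip_space_around_nobreak` is a hypothetical name, and it treats the file as a plain string rather than parsing it, so it relies on `break="no"` being written with double quotes on a self-closing `<lb>`.

```python
import re

# Matches a self-closing <lb> that carries break="no", together with any
# whitespace on either side of it. Other attributes (e.g. n="2") are allowed.
NOBREAK_LB = re.compile(r'\s*(<lb\b[^>]*\bbreak="no"[^>]*/>)\s*')


def strip_space_around_nobreak(xml_text):
    """Remove whitespace directly before and after <lb break="no"/>.

    Hypothetical helper; plain <lb/> elements are left untouched because
    they mark genuine word breaks.
    """
    return NOBREAK_LB.sub(r"\1", xml_text)
```

Applied to the naza0001 example, `<lb/>Ἱεροσολ <lb break="no"/>ύμων` becomes `<lb/>Ἱεροσολ<lb break="no"/>ύμων`, so the two halves are adjacent and can be joined into a single word.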