lukehollis / iip-word-lists

Python utility for creating word lists from epidoc files

<gap> #5

Open emylonas opened 3 years ago

emylonas commented 3 years ago

Generally, we want to ignore <gap> elements; they do not have to be copied. They do, however, play a role in determining word breaks:

  1. If there is a space after (or before) the <gap>, use the spaces to determine word breaks.
  2. If there is no space between the <gap> and the adjacent characters, the string of characters is not a full word: it is missing letters. In this case, the letters should be enclosed in a <w> element with the attribute part="y", i.e. <w part="y">.
    For example:
    <gap reason="lost" extent="unknown" unit="character"/>ου

    becomes:
    <gap reason="lost" extent="unknown" unit="character"/><w part="y">ου</w>
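
The two rules above could be sketched roughly as follows. This is only an illustration, not the tool's actual code: the names `tokenize_with_gaps` and `to_w_elements` are hypothetical, and it assumes the input is a flat text fragment in which <gap> is always self-closing.

```python
import re

# Assumption: <gap> always appears as a self-closing element in the fragment.
GAP_RE = re.compile(r"<gap\b[^>]*/>")
# Private-use placeholder; assumed never to occur in a transcription.
GAP_MARK = "\uE000"


def tokenize_with_gaps(fragment):
    """Return (letters, is_partial) pairs for a text fragment.

    Whitespace decides word breaks. A token that touches a <gap>
    without an intervening space is flagged as a partial word.
    """
    marked = GAP_RE.sub(GAP_MARK, fragment)
    tokens = []
    for chunk in marked.split():
        letters = chunk.replace(GAP_MARK, "")
        if not letters:
            # A gap standing alone between spaces: nothing to copy.
            continue
        tokens.append((letters, GAP_MARK in chunk))
    return tokens


def to_w_elements(tokens):
    """Serialize tokens as <w> elements, adding part="y" to partial words."""
    return [
        '<w part="y">%s</w>' % text if partial else "<w>%s</w>" % text
        for text, partial in tokens
    ]
```

With the example above, `tokenize_with_gaps` flags `ου` as partial because no space separates it from the <gap>; when a space does separate them, the word is treated as complete.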
zeichman commented 3 years ago

Update: this should probably also cover the case where a <gap> element is followed by <lb break="no"/>, which often has white space preceding it. For instance, in kede0004: Ϊο<gap reason="lost" unit="character" extent="unknown"/> <lb break="no"/>πτης.

The white space should always be ignored when adjacent to a <lb break="no"/>. In this case, that means a <gap> element is in the middle of a word, so the word itself should be ignored for lemmatization.
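
A rough sketch of that rule, under the same caveats as any illustration here: `words_for_lemmatization` is a hypothetical name, and the regular expressions assume self-closing <gap> and <lb> elements in a flat text fragment.

```python
import re

# Assumption: <lb break="no"/> and <gap/> are self-closing in the fragment.
LB_NO_BREAK_RE = re.compile(r'\s*<lb\b[^>]*\bbreak="no"[^>]*/>\s*')
GAP_RE = re.compile(r"<gap\b[^>]*/>")
# Private-use placeholder; assumed never to occur in a transcription.
GAP_MARK = "\uE000"


def words_for_lemmatization(fragment):
    """Return only the complete words of a fragment.

    White space around <lb break="no"/> is dropped so the two halves of
    a word are rejoined; any word that then touches a <gap> is skipped.
    """
    joined = LB_NO_BREAK_RE.sub("", fragment)
    marked = GAP_RE.sub(GAP_MARK, joined)
    return [token for token in marked.split() if GAP_MARK not in token]
```

For the kede0004 fragment, the space before <lb break="no"/> is discarded, the <gap> ends up inside the rejoined word, and the word is left out of the result.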

emylonas commented 3 years ago

Chris and Luke: I will go in and change the <lb break="no"/> so they don't have spaces around them. That will make things a bit easier for Luke.