<gap>

emylonas commented 3 years ago

Generally, we want to ignore <gap> elements. They do not have to be copied. They do, however, have a role in determining word breaks:

If there is a space after (or before) the <gap>, then you should use the spaces to determine word breaks.
If there is no space after the <gap>, then the string of characters is not a full word: it is missing letters. In this case, the letters should be enclosed in a <w> element with the attribute part="y", i.e. <w part="y">.
```
<gap reason="lost" extent="unknown" unit="character"/>ου
```
```
<gap reason="lost" extent="unknown" unit="character"/><w part="y">ου</w>
```
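As a rough illustration of the rule above, here is a minimal Python sketch that flags partial words next to a <gap>. The function name `tokenize` and the `<ab>` wrapper are hypothetical, and the sketch assumes the fragment is flat (no nested elements carrying text); it is not the project's actual implementation.

```python
import xml.etree.ElementTree as ET


def tokenize(fragment):
    """Return (token, is_partial) pairs for a flat XML fragment.

    Hypothetical helper: a token is partial when it touches a <gap>
    with no white space in between.
    """
    # Wrap the fragment so it parses as a single element (assumed wrapper).
    root = ET.fromstring(f"<ab>{fragment}</ab>")

    # Flatten into text chunks and element markers, in document order.
    pieces = [("text", root.text or "")]
    for child in root:
        pieces.append(("gap" if child.tag == "gap" else "other", ""))
        pieces.append(("text", child.tail or ""))

    tokens = []
    for i, (kind, text) in enumerate(pieces):
        if kind != "text":
            continue
        words = text.split()
        if not words:
            continue
        # A gap directly before the first character makes the first word partial.
        gap_before = i > 0 and pieces[i - 1][0] == "gap" and not text[0].isspace()
        # A gap directly after the last character makes the last word partial.
        gap_after = (
            i + 1 < len(pieces)
            and pieces[i + 1][0] == "gap"
            and not text[-1].isspace()
        )
        for j, word in enumerate(words):
            partial = (j == 0 and gap_before) or (j == len(words) - 1 and gap_after)
            tokens.append((word, partial))
    return tokens
```

Tokens flagged as partial are the ones that would be wrapped as <w part="y">; the rest are ordinary words delimited by spaces.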

zeichman commented 3 years ago

Update: this should probably also cover the case where a <gap> element is followed by <lb break="no"/>, which often has white space preceding it. For instance, in kede0004: Ϊο<gap reason="lost" unit="character" extent="unknown"/> <lb break="no"/>πτης.

The white space should always be ignored when adjacent to an <lb break="no"/>, and in this case that means a <gap> element is in the middle of a word, so the word itself should be ignored for lemmatization.
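A minimal sketch of this case in Python, assuming plain-string processing with regular expressions. The function name `lemmatizable_words` and both patterns are hypothetical; the sketch assumes `break="no"` is the only attribute on <lb>, and it does not strip punctuation.

```python
import re

# Assumed pattern: <lb break="no"/> plus any white space on either side.
LB_NO_BREAK = re.compile(r'\s*<lb\s+break="no"\s*/>\s*')
# Assumed pattern: any self-closing <gap .../> element.
GAP = re.compile(r"<gap\b[^>]*/>")

# Placeholder character that is not white space and should not occur in the text.
GAP_MARK = "\x00"


def lemmatizable_words(fragment):
    """Return the words in a fragment that do not touch a <gap>."""
    # Remove <lb break="no"/> and its surrounding white space,
    # so the two halves of the broken word are rejoined.
    joined = LB_NO_BREAK.sub("", fragment)
    # Replace each gap with the placeholder so it survives tokenization.
    marked = GAP.sub(GAP_MARK, joined)
    # Split on white space; drop bare gaps and any word containing a gap.
    return [tok for tok in marked.split() if GAP_MARK not in tok]
```

With the kede0004 example, the white space before <lb break="no"/> is dropped, the gap ends up inside a single token, and that token is excluded, so nothing from that word reaches lemmatization.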

emylonas commented 3 years ago

Chris and Luke: I will go in and change the <lb break="no"/> elements so they don't have spaces around them. That will make things a bit easier for Luke.

lukehollis / iip-word-lists

<gap> #5