kshawkin / Best-Practices-for-TEI-in-Libraries

Best Practices for TEI in Libraries: A guide for mass digitization, automated workflows, and promotion of interoperability with XML using the TEI
http://purl.oclc.org/NET/teiinlibraries

discussion of hyphenation: mismatch between code sample and note #96

Open kshawkin opened 2 years ago

kshawkin commented 2 years ago

In the third row of our table, we have:

| Colloquial name | Appearance in source document | Encoding | Note |
| --- | --- | --- | --- |
| Soft hyphen | UTF-8 is a char- acter encoding for Unicode. | `UTF-8 is a char<pc force="strong">-</pc><lb break="yes"/>acter encoding for Unicode.` | As in the first example, the use of weak as the value of force indicates that the encoder considers "character" to be a single orthographic token where the hyphen is only indicating that the word is broken across a line. The use of no as the value of break also indicates that the line break occurs inside an orthographic token (single word) which is broken across a line. |

The code sample involves force="strong" and break="yes", but the note implies that it has force="weak" and break="no". It's been too long since I thought about any of this, so I'm not even sure what is correct here. I vaguely recall that @sydb wrote this section?

emylonas commented 2 years ago

The Note is correct and the encoding is incorrect. It should be <lb break="no"/> when the line break is inside a word. I just checked the Guidelines on <pc>, and again the encoded example is backwards: the @force attribute is "strong" when the punctuation mark is a word separator and "weak" when it is not. In this case, the hyphen appears inside the word "character", so it doesn't serve as a word separator.

I think it should be:

char<pc force="weak">-</pc><lb break="no"/>acter
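To make the difference concrete, here is a rough Python sketch of how a downstream processor might read the two encodings. The `flatten` function is hypothetical, not part of any TEI tool, and its rules (drop a hyphen marked force="weak", emit no space for break="no") are assumptions taken from the reading of @force and @break given in this thread.

```python
import xml.etree.ElementTree as ET


def flatten(fragment):
    """Return the running text of a small, namespace-free TEI fragment.

    Sketch only: handles <pc> and <lb> that are direct children of the fragment.
    """
    root = ET.fromstring(f"<wrapper>{fragment}</wrapper>")
    out = [root.text or ""]
    for el in root:
        if el.tag == "pc":
            # Assumption: a weak hyphen only marks the line break, so drop it.
            if el.get("force") != "weak":
                out.append(el.text or "")
        elif el.tag == "lb":
            # Assumption: break="no" means the line break falls inside a word,
            # so no space is emitted; any other value becomes a space.
            if el.get("break") != "no":
                out.append(" ")
        out.append(el.tail or "")
    return "".join(out)


original = 'UTF-8 is a char<pc force="strong">-</pc><lb break="yes"/>acter encoding for Unicode.'
proposed = 'UTF-8 is a char<pc force="weak">-</pc><lb break="no"/>acter encoding for Unicode.'

print(flatten(original))
print(flatten(proposed))
```

Under these assumptions, the encoding currently in the table keeps "char-" and "acter" as two separate tokens, while the proposed encoding rejoins them as "character", which is what the Note describes.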

Also, we might try to force a line break after the hyphen in the "Appearance in source document" column, so the hyphen doesn't look odd.

kshawkin commented 2 years ago

Thank you for the quick detective work!