n-t-roff / heirloom-doctools

The Heirloom Documentation Tools: troff, nroff, and related utilities
http://n-t-roff.github.io/heirloom/doctools.html

troff: wrong characters are output for tilde and circumflex #74

Closed reffort closed 6 years ago

reffort commented 6 years ago

In troff's output, the ASCII tilde ~ (U+007E) and circumflex ^ (U+005E) characters are changed to the diacritics ˜ (\[tilde], U+02DC, "small tilde") and ˆ (\[circumflex], U+02C6, "modifier letter circumflex accent"). As a result, examples of program code, paths relative to a home directory, etc., show the wrong characters.

The Unicode 8.0 standard states (section 7.9, p. 329) that the ASCII tilde and circumflex are used only as spacing characters, not as combining characters (diacritics). http://www.unicode.org/versions/Unicode8.0.0/
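
To keep the four characters apart, here is a minimal C sketch that pairs each PostScript glyph name with its Unicode code point. The struct, table, and function names are hypothetical and are not taken from the Heirloom sources; the name/code pairs follow the usual Adobe glyph naming.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type, for illustration only. */
struct glyphcode {
    const char   *name;   /* PostScript glyph name */
    unsigned long code;   /* Unicode code point */
};

/* The two ASCII spacing characters and the two spacing diacritics. */
const struct glyphcode tilde_circ[] = {
    { "asciicircum", 0x005E },   /* ^  ASCII circumflex */
    { "asciitilde",  0x007E },   /* ~  ASCII tilde */
    { "circumflex",  0x02C6 },   /* modifier letter circumflex accent */
    { "tilde",       0x02DC },   /* small tilde */
    { NULL,          0 }
};

/* Return the code point for a glyph name, or 0 if the name is not listed. */
unsigned long
code_for_glyph(const char *name)
{
    const struct glyphcode *g;

    for (g = tilde_circ; g->name != NULL; g++)
        if (strcmp(g->name, name) == 0)
            return g->code;
    return 0;
}
```

A font's /tilde glyph is therefore the wrong target for input byte 0x7E; the glyph to ask for is /asciitilde.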

I found three different ways of handling these characters:

A fourth mechanism, which I have no firsthand way to verify, is described in section 6.2 (p. 269) of the Unicode spec: that a raised version of the tilde was "common in older implementations, particularly for terminal emulation and typewriter-style fonts." I have some printed books published between 1985 and about 1997 that show the diacritic forms in the Courier font. One was formatted with AT&T troff, one with EROFF, and several with SQtroff (EROFF and SQtroff were modified derivatives of the AT&T version). It is possible that these are the variant fonts mentioned in the Unicode spec and that troff itself was outputting the ASCII characters, but there is probably no way to determine that now. Most books I have that were made with AT&T troff show the ASCII characters.

To get the correct characters, modify afm.c:

diff --git a/troff/troff.d/afm.c b/troff/troff.d/afm.c
index b969f65..7804153 100644
@@ -488,8 +488,8 @@ static const struct asciimap    punctascii[] = {
    { 0x003E,   "greater" },
    { 0x0040,   "at" },
    { 0x005C,   "backslash" },
-   { 0x005E,   "circumflex" },
-   { 0x007E,   "tilde" },
+   { 0x005E,   "asciicircum" },
+   { 0x007E,   "asciitilde" },
    { 0,        NULL }
 };
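
As a rough illustration of what the patched table is for, the sketch below looks up a glyph name by ASCII code. The struct layout, the field names, and the glyph_for_code() helper are my assumptions; the diff shows only the initializers, and the real lookup code in afm.c may differ.

```c
#include <stddef.h>

/* Assumed layout: the diff shows only { code, name } initializers. */
struct asciimap {
    int         code;   /* ASCII code point */
    const char *psc;    /* PostScript glyph name (hypothetical field name) */
};

/* Excerpt of the table with the two corrected entries. */
const struct asciimap punctascii_fixed[] = {
    { 0x003E, "greater" },
    { 0x0040, "at" },
    { 0x005C, "backslash" },
    { 0x005E, "asciicircum" },
    { 0x007E, "asciitilde" },
    { 0,      NULL }
};

/* Return the glyph name mapped to code, or NULL if the table has no entry. */
const char *
glyph_for_code(const struct asciimap *map, int code)
{
    for (; map->psc != NULL; map++)
        if (map->code == code)
            return map->psc;
    return NULL;
}
```

With the corrected entries, glyph_for_code(punctascii_fixed, '~') yields "asciitilde" rather than the diacritic name "tilde".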

(The ASCII tilde and circumflex are already mapped correctly in troff/troff.d/otf.c for Mac-encoded fonts.)

At some point before Heirloom 1.482, eqn had been modified to expect the diacritics when it specifies the ASCII characters, but it should specify the characters it needs:

diff --git a/eqn/diacrit.c b/eqn/diacrit.c
index e8265a2..89bc44d 100644
@@ -63,10 +63,18 @@ diacrit(int p1, int type) {
 #endif /* !NEQN */
            break;
        case HAT:
-           printf(".ds %d ^\n", c);
+#ifdef NEQN
+           printf(".ds %d ^\n", c);    // \[asciicircum]
+#else
+           printf(".ds %d \\[circumflex]\n", c);
+#endif
            break;
        case TILDE:
-           printf(".ds %d ~\n", c);
+#ifdef NEQN
+           printf(".ds %d ~\n", c);    // \[asciitilde]
+#else
+           printf(".ds %d \\[tilde]\n", c);
+#endif
            break;
        case DOT:
 #ifndef NEQN
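
The same choice can be written as a standalone function, which makes it easier to check outside of eqn. This is only a sketch: the enum, the function name, and the run-time for_neqn flag are hypothetical stand-ins for eqn's HAT and TILDE cases and the compile-time NEQN macro, and the output goes to a buffer instead of printf.

```c
#include <stdio.h>

/* Hypothetical stand-ins for eqn's HAT and TILDE case labels. */
enum accent { ACCENT_HAT, ACCENT_TILDE };

/*
 * Format the .ds request that defines string number reg as the accent.
 * neqn keeps the plain ASCII character; eqn asks for the diacritic by name.
 * Returns snprintf's result, or -1 for an accent this sketch does not handle.
 */
int
format_accent_def(char *buf, size_t size, int reg, enum accent type, int for_neqn)
{
    const char *glyph;

    switch (type) {
    case ACCENT_HAT:
        glyph = for_neqn ? "^" : "\\[circumflex]";
        break;
    case ACCENT_TILDE:
        glyph = for_neqn ? "~" : "\\[tilde]";
        break;
    default:
        return -1;
    }
    return snprintf(buf, size, ".ds %d %s\n", reg, glyph);
}
```

For example, format_accent_def(buf, sizeof buf, 11, ACCENT_HAT, 0) writes the line `.ds 11 \[circumflex]` followed by a newline into buf.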

When testing, note that eqn does not position accents and other symbols (vec, bar, etc.) correctly for tall characters ('b', [A-Z], etc.) with .otf OpenType fonts, but they are correct with Type 1 fonts. This is caused by an unrelated troff font handling problem.

neqn does not require any changes as far as I can tell.

Accented characters, such as ê or ñ, are not affected; the diacritic is part of the glyph's design. If a standalone diacritic is required, it can be identified using the same procedure as for the other diacritics, e.g., \[tilde] or \(a~, \[circumflex] or \(a^, or with an ISO keyboard.

reffort commented 6 years ago

I somehow managed to get the first three words wrong. Troff itself outputs the ASCII characters (they are evident in the ditroff output), but those characters are mapped to the wrong glyph names in dpost because of the incorrect mapping in troff/troff.d/afm.c. Thus, the PostScript file displays the wrong characters.

reffort commented 6 years ago

The changes look correct to me.

Digging into this some more to try to make some sense of it, it looks like these characters were deliberately remapped--there is a script in the Plan 9 distribution to change the mapping of the ASCII characters to the diacritics when building the font description files for the standard printer-resident fonts. However, for fonts that do reside on the system, no remapping is done. The troff-generated system documentation was done using the system-resident fonts in the Lucida family, and they show the ASCII characters where they exist in the document source file.

The current n-t-roff font configuration is now set up the same way as Plan 9. In the legacy fonts that are accessed through the "post" driver (with troff -Tpost), the description files for the proportional fonts do not have /asciicircum or /asciitilde defined in them, and as a result they fall back to the S1 punctuation font, which maps these and several other punctuation characters (in all fonts) to the Times font. This produces some strange results; for instance, an ASCII tilde in Helvetica Bold becomes a much smaller regular weight Times diacritic. However, the monospaced fonts, CW and relatives, are mapped in the font description file and do not fall back to S1; that problem seems to have been fixed in Plan 9 and in our current configuration. (groff is still using the legacy configuration, but at 72,000 dpi instead of 720.)

This suggests the possibility of making all of the standard PostScript fonts readily available to the default troff -Tps device by placing the .afm files in the devps directory. Versions of these fonts are included with the GhostScript distribution.