hackerb9 / vt340test

Tests of VT340 compatibility
Creative Commons Zero v1.0 Universal

Unicode notes regarding the DEC Technical character set #29

Open j4james opened 1 year ago

j4james commented 1 year ago

I don't know if this is of any use to you, but these are my notes on the DEC Technical character set, and the mappings (or lack of mappings) to Unicode. You're welcome to use any bits that you think might be helpful in your documentation.

Component Characters

The first 23 glyphs of the DEC Technical character set are known as component characters, intended to be used in the construction of larger mathematical symbols, such as integral and summation signs. This is explained in the Digital ANSI-Compliant Printing Protocol level 2 reference manual (appendix A.4), but can also be inferred from the character names referenced in the DEC STD 070 manual (section 7.5.5).

In the image below, you can see how the various glyphs are intended to connect in order to produce symbols of varying sizes.

image

Between 1998 and 2000, proposals were made to add some of these characters to the Unicode standard as part of the Terminal Graphics for Unicode set. However, what we eventually got from that effort was unfortunately not adequate to satisfy the needs of the component symbol structures.

Initially there were code points proposed for the extended square brackets, parentheses, and braces (02/07 to 03/00), but those were ultimately withdrawn in favor of similar characters in the STIX Math set. But the DEC characters were intended to share a single connecting vertical line, whereas the STIX proposal had separate connectors for left and right (and unique for each bracket type), so they don't align correctly in the DEC use case.

Then there are the summation characters, which the Terminal Graphics proposal never covered very well to start with. It appears they weren't aware of all the parts that were intended to combine, so only proposed the top left and bottom left glyphs (03/01 and 03/02), and again these were withdrawn in favor of STIX code points (U+23B2 and U+23B3). Unfortunately those glyphs can't really be used to construct the larger summation symbols that require connectors.

And even if they could, many of the other summation parts weren't defined anyway. The diagonal connectors (03/03 and 03/04) could potentially be mapped to existing diagonal code points in Unicode (U+2572 and U+2571), but there is nothing for the center connector. And the top right and bottom right ends of the symbol were mistakenly thought to be the right half of the ceiling and floor functions, for which there were already code points in Unicode (U+2309 and U+230B). Those glyphs are nothing like what is needed for the summation symbol, though.

On the positive side, there were already reasonably suitable code points for the integral sign (U+2320 and U+2321), the horizontal and vertical connectors (U+2500 and U+2502), and the top left of the radical symbol (U+250C). And the bottom left of the radical symbol was actually included in the Terminal Graphics proposal, and ultimately accepted as U+23B7 (unfortunately many font faces don't render it correctly, but that's still better than nothing).

To summarize, the table below lists the placeholder code points from the original Terminal Graphics proposal, the existing Unicode code points for those characters that were already deemed to be included in the standard, the code points assigned in the STIX proposal, and finally the one code point from the Terminal Graphics proposal that actually made it into the standard.

| TCS | Name | Proposed | Existing | STIX | Added |
|---|---|---|---|---|---|
| 02/01 | left radical | E0B0 | | | U+23B7 |
| 02/02 | top left radical | | U+250C | | |
| 02/03 | horizontal connector | | U+2500 | | |
| 02/04 | top integral | | U+2320 | | |
| 02/05 | bottom integral | | U+2321 | | |
| 02/06 | vertical connector | | U+2502 | | |
| 02/07 | top left square bracket | E0A4 | | U+23A1 | |
| 02/08 | bottom left square bracket | E0A3 | | U+23A3 | |
| 02/09 | top right square bracket | E0AB | | U+23A4 | |
| 02/10 | bottom right square bracket | E0AA | | U+23A6 | |
| 02/11 | top left parenthesis | E0A2 | | U+239B | |
| 02/12 | bottom left parenthesis | E0A1 | | U+239D | |
| 02/13 | top right parenthesis | E0A9 | | U+239E | |
| 02/14 | bottom right parenthesis | E0A8 | | U+23A0 | |
| 02/15 | left middle curly brace | E0A0 | | U+23A8 | |
| 03/00 | right middle curly brace | E0A5 | | U+23AC | |
| 03/01 | top left summation | E0AD | | U+23B2† | |
| 03/02 | bottom left summation | E0AC | | U+23B3† | |
| 03/03 | top vertical summation connector | | U+2572 | | |
| 03/04 | bottom vertical summation connector | | U+2571 | | |
| 03/05 | top right summation | E0AE | U+2309‡ | | |
| 03/06 | bottom right summation | E0AF | U+230B‡ | | |
| 03/07 | right middle summation | | | | |

† These glyphs could work for a simple two-character summation, but are inadequate for building larger symbols with connector elements.

‡ Although these code points were originally proposed for 03/05 and 03/06, the glyphs are wholly inappropriate for use in a summation symbol.
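As a rough illustration of how the table above could be used in a converter, here is a minimal Python sketch. The names `COMPONENT_TO_UNICODE`, `position_to_byte`, and `tcs_to_unicode` are hypothetical, not taken from any existing tool. It assumes the usual column/row notation, where 02/07 means byte 0x27, and it deliberately leaves 03/05 to 03/07 unmapped, since the table lists no suitable code point for them.

```python
# Hypothetical mapping built from the component character table above.
# Keys use the same column/row notation as the table.
COMPONENT_TO_UNICODE = {
    "02/01": "\u23B7", "02/02": "\u250C", "02/03": "\u2500",
    "02/04": "\u2320", "02/05": "\u2321", "02/06": "\u2502",
    "02/07": "\u23A1", "02/08": "\u23A3", "02/09": "\u23A4",
    "02/10": "\u23A6", "02/11": "\u239B", "02/12": "\u239D",
    "02/13": "\u239E", "02/14": "\u23A0", "02/15": "\u23A8",
    "03/00": "\u23AC",
    # Only adequate for a simple two-character summation (see the first footnote).
    "03/01": "\u23B2", "03/02": "\u23B3",
    # Diagonal connectors mapped to box-drawing diagonals.
    "03/03": "\u2572", "03/04": "\u2571",
}


def position_to_byte(position: str) -> int:
    """Convert column/row notation such as '02/07' to its 7-bit code (0x27)."""
    column, row = (int(part) for part in position.split("/"))
    if not (2 <= column <= 7 and 0 <= row <= 15):
        raise ValueError(f"not a graphic character position: {position}")
    return column * 16 + row


TCS_BYTE_TO_UNICODE = {
    position_to_byte(pos): ch for pos, ch in COMPONENT_TO_UNICODE.items()
}


def tcs_to_unicode(data: bytes, missing: str = "\uFFFD") -> str:
    """Translate TCS bytes to Unicode, using `missing` for unmapped codes.

    The high bit is masked off so the same table works whether the set
    was invoked into GL or GR.
    """
    return "".join(TCS_BYTE_TO_UNICODE.get(b & 0x7F, missing) for b in data)
```

For example, `tcs_to_unicode(b"\x24\x25")` yields the top and bottom integral halves, while the unmapped 03/07 (byte 0x37) comes back as U+FFFD. The alignment problems described above still apply: the STIX bracket pieces will not join up with U+2502 in most fonts.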

Greek Characters and Mathematical Operators

The remaining characters in the DEC Technical set are a mix of Greek characters and mathematical operators. All of these have existing code points in the Unicode standard - in some cases there are even multiple code points to choose from.

| TCS | Name | Code Point |
|---|---|---|
| 03/08 | - reserved - | |
| 03/09 | - reserved - | |
| 03/10 | - reserved - | |
| 03/11 | - reserved - | |
| 03/12 | less than or equal | U+2264 |
| 03/13 | not equal | U+2260 |
| 03/14 | greater than or equal | U+2265 |
| 03/15 | integral | U+222B |
| 04/00 | therefore | U+2234 |
| 04/01 | variation, proportional to | U+221D |
| 04/02 | infinity | U+221E |
| 04/03 | division, divided by | U+00F7 |
| 04/04 | capital delta, triangle | U+0394/2206 |
| 04/05 | nabla, del | U+2207 |
| 04/06 | capital phi | U+03A6 |
| 04/07 | capital gamma | U+0393 |
| 04/08 | is approximate to | U+223C |
| 04/09 | similar or equal to | U+2243 |
| 04/10 | capital theta | U+0398 |
| 04/11 | times, cross product | U+00D7/2A2F |
| 04/12 | capital lambda | U+039B |
| 04/13 | if and only if | U+21D4 |
| 04/14 | implies | U+21D2 |
| 04/15 | is identical to | U+2261 |
| 05/00 | capital pi, product | U+03A0 |
| 05/01 | capital psi | U+03A8 |
| 05/02 | - reserved - | |
| 05/03 | capital sigma, summation | U+03A3/2211 |
| 05/04 | - reserved - | |
| 05/05 | - reserved - | |
| 05/06 | radical | U+221A |
| 05/07 | capital omega, Ohm sign | U+03A9/2126 |
| 05/08 | capital xi | U+039E |
| 05/09 | capital upsilon | U+03A5 |
| 05/10 | is included in | U+2282 |
| 05/11 | includes | U+2283 |
| 05/12 | intersection | U+2229 |
| 05/13 | union | U+222A |
| 05/14 | logical and | U+2227 |
| 05/15 | logical or | U+2228 |
| 06/00 | logical not | U+00AC |
| 06/01 | small alpha | U+03B1 |
| 06/02 | small beta | U+03B2 |
| 06/03 | small chi | U+03C7 |
| 06/04 | small delta | U+03B4 |
| 06/05 | small epsilon | U+03B5 |
| 06/06 | small phi | U+03C6 |
| 06/07 | small gamma | U+03B3 |
| 06/08 | small eta | U+03B7 |
| 06/09 | small iota | U+03B9 |
| 06/10 | small theta | U+03B8 |
| 06/11 | small kappa | U+03BA |
| 06/12 | small lambda | U+03BB |
| 06/13 | - reserved - | |
| 06/14 | small nu | U+03BD |
| 06/15 | partial derivative | U+2202 |
| 07/00 | small pi | U+03C0 |
| 07/01 | small psi | U+03C8 |
| 07/02 | small rho | U+03C1 |
| 07/03 | small sigma | U+03C3 |
| 07/04 | small tau | U+03C4 |
| 07/05 | - reserved - | |
| 07/06 | function | U+0192 |
| 07/07 | small omega | U+03C9 |
| 07/08 | small xi | U+03BE |
| 07/09 | small upsilon | U+03C5 |
| 07/10 | small zeta | U+03B6 |
| 07/11 | left arrow | U+2190 |
| 07/12 | up arrow | U+2191 |
| 07/13 | right arrow | U+2192 |
| 07/14 | down arrow | U+2193 |

j4james commented 1 year ago

By the way, if you want to reproduce my symbol test pattern on your VT340, this is the script I used: https://gist.github.com/j4james/0983acc4f2d5286240100182736f161c

Be aware, though, that I tweak the color table somewhat to try and make the contrast clearer on the checkerboard pattern. I wanted to make it easy to see where the divisions were between the individual characters.

hackerb9 commented 1 year ago

Thanks! That's interesting that people who tried to get the characters into Unicode may not have known what they meant. I wonder if, now that there's a huge area above the BMP open for miscellaneous symbols (and emoji), if the Unicode Consortium would be more willing to reconsider adding in the proper summation characters.

Thanks also for the test pattern. Is that from a genuine or emulated VT340? The "grottiness" of the summation I mentioned earlier referred to the overly pixelated diagonal lines (compared to any other TCS symbol).

How did you create the test pattern? I had been using a text editor after sending a locking shift on the command line, but that doesn't work when trying to show characters from more than one character set, as one would in a real mathematical equation. Do you have any notion of how people back in the day created files with embedded escapes (ISO 2022)? Did their text editors just pass embedded escapes directly to the terminal?

j4james commented 1 year ago

> I wonder if, now that there's a huge area above the BMP open for miscellaneous symbols (and emoji), if the Unicode Consortium would be more willing to reconsider adding in the proper summation characters.

Maybe, but I don't think they really like the idea of component characters like this, unless there's a use case for something like data interchange, i.e. you've got documents stored in the DEC Technical character set that you want to convert to Unicode, and I don't think there's evidence of that. Frankly I'm surprised they even allowed the existing Math component characters.

I wouldn't say it's out of the question, but it'd probably require a lot of effort to work through the standard process. The Terminal Graphics proposal went through something like five different drafts, and took a year and a half to make it into the standard. Working with standards organizations can be a painful, thankless task, and the results are often disappointing.

> Is that from a genuine or emulated VT340? The "grottiness" of the summation I mentioned earlier referred to the overly pixelated diagonal lines

I was using Windows Terminal, which doesn't actually support the TCS set, but I generated an equivalent soft font from your VT340 screenshot. Actually that's possibly something worth including in the repo here, because it's a nice way to add TCS support to terminals that don't have that charset, but which do support soft fonts (see dectech.fnt).

And I saw what you meant about the "grottiness" when I was playing around with the font in my font editor. I was actually somewhat tempted to try smoothing those characters, and also fix some other minor issues, but in the end I thought it best to make it exactly match the VT340 for now, and maybe work on a higher resolution version at a later point in time.

> How did you create the test pattern?

Not sure why I didn't just give you the original python script I used to start with, but here it is: dectsc.py. Initially I was just loading the TCS set into G0 with an SCS sequence, writing out a chunk of hardcoded text, and then loading the ASCII set back afterwards (I realise now I'm just assuming G0 is mapped to GL, but that is typically the case).

It became a bit more complicated once I decided to add a couple of ASCII characters to the pattern (( ) [ ] { }), so I have to switch back to ASCII temporarily at one point, and it's a bit hacky.
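The G0 approach described above can be sketched in a few lines of Python. This is not the dectsc.py script itself, only an illustration under stated assumptions: that G0 is invoked into GL, that the terminal accepts `ESC ( >` as the SCS sequence for DEC Technical, and that `ESC ( B` restores ASCII. The helper names `wrap_tcs` and `mixed_line` are hypothetical.

```python
ESC = b"\x1b"
DESIGNATE_G0_DEC_TECHNICAL = ESC + b"(>"  # SCS: load DEC Technical into G0
DESIGNATE_G0_ASCII = ESC + b"(B"          # SCS: load ASCII back into G0


def wrap_tcs(payload: bytes) -> bytes:
    """Bracket `payload` so it is shown with DEC Technical glyphs.

    Assumes G0 is currently mapped to GL, as is typically the case.
    """
    if any(b < 0x20 or b > 0x7E for b in payload):
        raise ValueError("payload must be printable 7-bit characters")
    return DESIGNATE_G0_DEC_TECHNICAL + payload + DESIGNATE_G0_ASCII


def mixed_line(segments) -> bytes:
    """Join (is_tcs, data) pairs, switching character sets only where needed."""
    return b"".join(
        wrap_tcs(data) if is_tcs else data for is_tcs, data in segments
    )
```

Writing `mixed_line([(False, b"f("), (True, b"a"), (False, b")")])` to `sys.stdout.buffer` should display an ASCII "f(" and ")" around a small alpha, since 06/01 (the byte for "a") is small alpha in the TCS table. Switching G0 back and forth like this is the same temporary return to ASCII mentioned above; designating TCS into another G-set and using shifts would be an alternative, but that is not what is shown here.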

> Do you have any notion of how people back in the day created files with embedded escapes (ISO 2022)?

I'm not sure actually. You might find some examples in the DECUS archives, but I suspect a lot of the legacy software from that era would be proprietary commercial stuff so you may not find a lot of open source.

Edit: I should add that the ANSI standard originally intended the escape sequences to be usable by word processors, so in theory you could have a document format that could be dumped directly to a terminal or printer. Even if the document has an additional encapsulating format, using ANSI escape sequences for the basic markup would make it easier to render and print.

I'm not sure to what extent anyone took advantage of that though. I know there was the Open Document Architecture (ODA) standard, which uses ISO-6429/ECMA-48/ANSI-X3.64 internally, but apparently that wasn't widely adopted. It's probably most notable nowadays for inspiring the 24-bit SGR color sequence that many modern terminals support in some form or another.

hackerb9 commented 1 year ago

> > I wonder if, now that there's a huge area above the BMP open for miscellaneous symbols (and emoji), if the Unicode Consortium would be more willing to reconsider adding in the proper summation characters.

> Maybe, but I don't think they really like the idea of component characters like this, unless there's a use case for something like data interchange, i.e. you've got documents stored in the DEC Technical character set that you want to convert to Unicode, and I don't think there's evidence of that.

I bet one could find some PhD theses that were printed using TCS, but electronic documents, yeah, those are as scarce as hen's teeth. And, I'll be honest, I only would want the DEC summation sign in Unicode because character cell terminals are fun, not because there is any practical use. I think the Unicode Consortium was probably right to balk at adding the characters if the claim was that it would help with information interchange. If the DEC summation components ever do make it into Unicode, I think it will be based on the same reasoning as Klingon and most Emojis: the Unicode Consortium found the idea amusing.

> Edit: I should add that the ANSI standard originally intended the escape sequences to be usable by word processors, so in theory you could have a document format that could be dumped directly to a terminal or printer. Even if the document has an additional encapsulating format, using ANSI escape sequences for the basic markup would make it easier to render and print.

I didn't know that, but that fits well with DEC's claims of "ANSI compatible printing" and the documentation I've seen of printers that could swap character sets, just like a terminal.

As I think about this, I realize I may have been asking the wrong question. Back then, it might not have been a matter of finding a text editor that "allows" embedding escape sequences as much as whether editors prevented it like they do nowadays. After running,

stty -ctlecho

so that escape sequences I typed wouldn't be shown using caret notation, I was able to use ed to embed (and see) escape sequences and it worked better than expected. (Admittedly, a pretty low bar.)

j4james commented 7 months ago

I just wanted to note here that the upcoming version 16 of Unicode includes a couple more characters that could be useful in mapping the DEC Technical character set.

I didn't see anything that improves on the existing top left and bottom left summation glyphs, and we still have issues with the brackets/parentheses not aligning with the vertical connector, but we at least now have Unicode code points that vaguely resemble all the required glyphs. And considering how much junk they've been willing to add so far, maybe there is still a chance we'll get some dedicated code points for these glyphs one day.

hackerb9 commented 7 months ago

Awesome research, James. It constantly tickles me that the Unicode consortium, in attempting to put their foot down and say, “Unicode is not for that”, has only made people resort to homoglyphs to get their meaning across. (Mathematicians and YouTubers will write things like 10⁴⁰⸴⁰⁰⁰, superscript comma or no.)

hackerb9 commented 7 months ago

By the way, @j4james, I finally got around to adding your downlineloadable soft font for TCS that you based on my VT340 screenshots. Since you seem pretty handy at manipulating fonts from abstruse formats into something usable in modern times, you may be interested to check out the bitmap font that DEC included on a VMS "freeware" CD and suggested for use with DECTerm, their VT340-ish terminal emulator. It includes TCS in double-wide and double-high as well as "narrow" and "wide" (not sure what those mean) plus bold variants of everything: vwsvt0

hackerb9 commented 7 months ago

I don't know if Unicode cares, but it does seem that the missing TCS characters were used by a number of programs for both display and printing in the early to mid 1980s. There was a program called MEC MASS-11 that was reviewed in the Notices of the American Mathematical Society (“Mass-11 is for the person who wants a WYSIWYG processor for equations ... Large math symbols are built up out of smaller graphics pieces.”). Mass-11 could run on VAX/VMS, IBM PCs, and the DEC Rainbow.

There was another product called Spellbinder Scientific which the newsletter of the Lawrence-Berkeley National Laboratory gave rave reviews. The American Institute of Physics featured a picture of Spellbinder Scientific's TCS abilities in their inaugural issue of Computers in Physics.

image

j4james commented 7 months ago

> you may be interested to check out the bitmap font that DEC included on a VMS "freeware" CD and suggested for use with DECTerm

Yes, I'm very much interested in that. I'm working on a soft font editor at the moment, and one of the features I was considering was the ability to import other bitmap font formats. I haven't looked at the pcf file format yet, so I don't know how difficult that would be to support, but I'm keen to give it a try at some point.

> The American Institute of Physics featured a picture of Spellbinder Scientific's TCS abilities in their inaugural issue of Computers in Physics.

If anyone does want to submit a proposal to Unicode for the addition of the necessary DEC TCS characters, I think references like this can definitely help. It's actually worth having a look at the proposal for the latest set of legacy terminal glyphs that were added, because you can see the kind of thing they're expecting.

https://www.unicode.org/L2/L2021/21235r-terminals-supplement-noattach.pdf

You'll also notice that they included quite a large number of component characters in that proposal - things like chess pieces divided up into four quarters, and game sprites split into top and bottom parts - so it doesn't seem like the Unicode Consortium has a problem with that anymore.

But considering this was the work of the "Terminals Working Group", it's a little disappointing that nobody thought to suggest the DEC TCS characters, especially given all of the other obscure hardware they did include. But I suppose I can't complain when I haven't been bothered to make the suggestion myself.

hackerb9 commented 7 months ago

A soft font editor that runs on sixel terminals? I had actually just been looking around for something like that! Apparently the GIGI/VK100 had one included (for ReGIS).

Perhaps the Terminals Working Group just didn't know that TCS support is still lacking since there are glyphs that look vaguely similar. I'm in the same non-complaining boat with you, but I'll do what I can to document the need. The TCS page by Paul Flo Williams is a good start, though it appears to have been written before the first proposal by Frank da Cruz.

j4james commented 7 months ago

> A soft font editor that runs on sixel terminals?

I'm afraid not. It can edit soft fonts from any of the DEC terminals, but it'll only run on a VT525. It's heavily dependent on level 4 functionality like macros and rectangular area operations, and it also requires color. It's possible I might be able to adapt it to work on monochrome terminals like the VT510 and VT420, but the VT340 would be a step too far. It was originally only intended for personal use.

hackerb9 commented 7 months ago

If you don't mind sharing, I'd love to see your notes on each terminal and its unique font quirks. I'm not going to even attempt to do that, but I'm curious.

Maybe you will inspire me to write up my own, more limited version. I could make it doable by focusing on the VT3xx and VT2xx. I'd only handle frills if they are easy (rectangular pixel aspect ratio for 5x10, 6x10, 7x10; making a matching 132-column font) and skip things that might be tricky (fonts larger than the terminal's character cell size; specialized fonts for different Psgr values; handling non-DRCS fonts, like ReGIS and ROM).

I've got a script that massages the sixels that exist in soft fonts to sixel bitmaps for viewing on any VT340 emulator. Maybe I could quickly magnify the current character by setting a long aspect ratio and inserting ! to repeat each sixel on the screen. (In later terminals did DEC ever allow soft fonts to have repeats in them?) Modifying a bit in a sixel is just a boolean operation on the byte value, so I could... Oh, heck, seems like you have already inspired me. ☺
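To make the two ideas above concrete (flipping a pixel is a boolean operation on the byte, and magnification can use the `!` repeat introducer), here is a minimal Python sketch. It assumes the standard sixel encoding, where each data character is 0x3F plus a 6-bit value and bit 0 is the top pixel. The function names are hypothetical, and the vertical stretch via the aspect ratio setting is not shown.

```python
SIXEL_OFFSET = 0x3F  # '?' encodes a column of six unset pixels


def sixel_to_bits(ch: str) -> int:
    """Return the 6-bit pixel column encoded by one sixel character."""
    value = ord(ch) - SIXEL_OFFSET
    if not 0 <= value <= 0x3F:
        raise ValueError(f"not a sixel data character: {ch!r}")
    return value


def bits_to_sixel(bits: int) -> str:
    """Encode a 6-bit pixel column as a sixel character."""
    return chr(SIXEL_OFFSET + (bits & 0x3F))


def toggle_pixel(ch: str, row: int) -> str:
    """Flip one pixel in a sixel column (row 0 is the top, row 5 the bottom)."""
    if not 0 <= row <= 5:
        raise ValueError("row must be between 0 and 5")
    return bits_to_sixel(sixel_to_bits(ch) ^ (1 << row))


def magnify_row(sixels: str, factor: int) -> str:
    """Widen a row of sixels by repeating each column `factor` times.

    Uses the sixel repeat introducer: '!' followed by a count and the
    character to repeat.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return "".join(f"!{factor}{ch}" for ch in sixels)
```

For instance, `toggle_pixel("?", 0)` gives `"@"` (top pixel set), and `magnify_row("@~", 8)` gives `"!8@!8~"`. The repeat form is only for displaying the glyph as a sixel image; as far as this thread establishes, soft font data itself does not accept repeats.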

j4james commented 7 months ago

In addition to different terminals having different cell sizes, there are multiple variants for each terminal - usually 4, but potentially up to 12 (6 screen sizes, and full-cell and text-font variants for each of those). However, those variants should all have the same pixel aspect ratio, so that's at least easier to deal with.

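| Terminal | 80x24 Full | 80x24 Text | 132x24 Full | 132x24 Text | 80x36 Full | 80x36 Text | 132x36 Full | 132x36 Text | 80x48 Full | 80x48 Text | 132x48 Full | 132x48 Text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VT2x0 | | 7x10 | 6x10 | 5x10 | | | | | | | | |
| VT320 | 15x12 | 12x12 | 9x12 | 7x12 | | | | | | | | |
| VT340 | 10x20 | 8x20 | 6x20 | 5x20 | | | | | | | | |
| VT382 | 12x30 | 10x30 | 7x30 | 6x30 | | | | | | | | |
| VT420+ | 10x16 | 8x16 | 6x16 | 5x16 | 10x10 | 8x10 | 6x10 | 5x10 | 10x8 | 8x8 | 6x8 | 5x8 |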

The VT2x0 devices are a bit different, in that they don't include the height (Pcmh) and screen size (Pcss) parameters. The width parameter (Pcmw/Pcms) is a kind of index that covers width, height, and screen size: 2 = 5x10 (132-column text font), 3 = 6x10 (132-column full-cell font), and 4 = 7x10 (80-column text font).
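The VT2x0 index described above amounts to a small lookup table. A minimal Python sketch, using only the three values given here; the names `Vt2x0Font` and `decode_vt2x0_width` are hypothetical, and any other parameter value (including 0) is simply reported as unknown rather than guessed.

```python
from typing import NamedTuple, Optional


class Vt2x0Font(NamedTuple):
    width: int
    height: int
    columns: int  # screen width the font is meant for
    kind: str     # "text" or "full-cell"


# Width parameter values for VT2x0 devices, as listed above.
VT2X0_WIDTH_PARAM = {
    2: Vt2x0Font(5, 10, 132, "text"),
    3: Vt2x0Font(6, 10, 132, "full-cell"),
    4: Vt2x0Font(7, 10, 80, "text"),
}


def decode_vt2x0_width(param: int) -> Optional[Vt2x0Font]:
    """Return the font geometry for a VT2x0 width parameter, or None."""
    return VT2X0_WIDTH_PARAM.get(param)
```

So `decode_vt2x0_width(3)` describes a 6x10 full-cell font for 132-column mode, while `decode_vt2x0_width(0)` returns `None`, leaving the caller to apply whatever device default is appropriate.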

The VT2x0 devices also don't officially support full-cell fonts in 80-column mode, but if you include a pixel in the 8th column of a 7x10 font, that would apparently be duplicated across the padding columns to cover the full cell. That provided a way to generate simple full-cell glyphs like blocks and box characters. However, that only worked on the VT220 - the VT240 just treated the 8th column the same as any other (at least on MAME). Windows Terminal only supports the VT240 interpretation.

The VT382 terminals also have an additional complication. The documentation says that it supports heights of 10, 20, and 30, but it doesn't have different screen heights like later terminals, so that implies it might somehow stretch those smaller heights to fill the cell. And in general, I think all level 3+ terminals are supposed to support VT2x0 fonts, which also implies some kind of stretching would be required. But perhaps they're just centered in the cell (you can at least test how the VT340 handles this).

When it comes to loading existing fonts, though, the biggest complication is that many of them don't actually include the width/height parameters - they just set them to 0, which is supposed to imply the default size for the device. If you're wanting to support multiple devices, that means you have to try and guess the size based on the actual pixel content (amongst other things).
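One possible way to guess the size from the pixel content is sketched below in Python. This is a heuristic under stated assumptions, not a description of how any existing implementation does it: it assumes glyphs in the font data are separated by `;`, that each glyph's six-pixel rows are separated by `/`, and that the widest row reflects the cell width. The candidate sizes come from the cell-size table earlier in this comment; the names `measure` and `guess_cell_sizes` are hypothetical.

```python
# Cell sizes per terminal family, copied from the table earlier in this comment.
CELL_SIZES = {
    "VT2x0": [(7, 10), (6, 10), (5, 10)],
    "VT320": [(15, 12), (12, 12), (9, 12), (7, 12)],
    "VT340": [(10, 20), (8, 20), (6, 20), (5, 20)],
    "VT382": [(12, 30), (10, 30), (7, 30), (6, 30)],
    "VT420+": [(10, 16), (8, 16), (6, 16), (5, 16),
               (10, 10), (8, 10), (6, 10), (5, 10),
               (10, 8), (8, 8), (6, 8), (5, 8)],
}


def measure(glyph_data: str) -> tuple[int, int]:
    """Return (widest row in pixels, most six-pixel rows) over all glyphs."""
    width = rows = 0
    for glyph in glyph_data.split(";"):
        parts = glyph.split("/")
        rows = max(rows, len(parts))
        width = max(width, max(len(part) for part in parts))
    return width, rows


def guess_cell_sizes(glyph_data: str) -> list[tuple[str, int, int]]:
    """List (terminal, width, height) candidates consistent with the data.

    A height is accepted if it needs exactly the observed number of
    six-pixel rows. More than one candidate may be returned.
    """
    width, rows = measure(glyph_data)
    lowest, highest = 6 * (rows - 1), 6 * rows
    return [
        (terminal, w, h)
        for terminal, sizes in CELL_SIZES.items()
        for (w, h) in sizes
        if w == width and lowest < h <= highest
    ]
```

A glyph that is 10 sixels wide with four rows matches only the VT340's 10x20, but one that is 7 wide with two rows matches both the VT2x0's 7x10 and the VT320's 7x12, which is why other hints are still needed. Fonts that omit trailing blank columns would also defeat the width check.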

I should also be clear that the above info is to the best of my knowledge, and may not be perfect. The only device I've really tested on is the MAME VT240, but I tested the Windows Terminal implementation with loads of fonts found on the internet, targeting different devices and different screen sizes, and that's assured me that my understanding is likely correct for the most part.

j4james commented 7 months ago

> Maybe I could quickly magnify the current character by setting a long aspect ratio and inserting ! to repeat each sixel on the screen.

This is genius btw. I didn't think a sixel-based editor would be practical on the VT340, but this seems like it could work quite efficiently.

> In later terminals did DEC ever allow soft fonts to have repeats in them?

Not that I'm aware of, no. And for a typical text font, there's probably not a lot of use cases where there would be any benefit in having a repeat: maybe -, _, and =? So I don't think that would justify the additional complication to the protocol.