jsbien opened this issue 3 years ago:

U+0049 has a dot above. It seems to me a bug, but perhaps this is a feature?
Absolutely not a feature! I can't easily reproduce this. What software are you using? You don't have cv17 switched on, do you?
You are right: in XeLaTeX I have \addfontfeature{CharacterVariant={17:0}} (not intentional, just a forgotten left-over from some earlier experiments).
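For the record, here is a minimal sketch of how such a left-over behaves (assuming Junicode is installed; an \addfontfeature setting persists to the end of the current group, so a forgotten one silently affects everything that follows):

```latex
\documentclass{article}
\usepackage{fontspec}
\setmainfont{Junicode}
\begin{document}
Istanbul % plain capital I

% the forgotten declaration: stays in effect to the end of the group
\addfontfeature{CharacterVariant={17:0}}
Istanbul % capital I is now affected by cv17

% one way to switch the feature back off explicitly
\addfontfeature{RawFeature={-cv17}}
Istanbul
\end{document}
```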
Just curious, is this the same as U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE?
Yes. I included U+0130 in cv17 as a convenience, even though it also has a Unicode code point. The lowercase character on the same index in cv18 is dotless i U+0131.
I'm sorry for pressing the point, but what kind of convenience do you mean? Again just curious...
I simplified my thinking way too much. I wanted to make dotless i available via a cvNN feature because in a diplomatic edition it is likely to be the only form of i, in which case it is a convenience to be able to apply cv18[1] to the whole text rather than enter U+0131 over and over. I also wanted to offer the option of the character being searchable as i (it isn't always), and you get that with use of the feature. In Unicode dotless i U+0131 is case-paired with dotted I U+0130, so it made sense to me to put U+0130 on the same index in cv17 (the feature for uppercase I).
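Concretely, the convenience looks something like this in XeLaTeX (a sketch, assuming the cv18 indexing described above):

```latex
\documentclass{article}
\usepackage{fontspec}
% Select variant 1 of cv18 for the whole document, so every typed "i"
% renders dotless while the underlying character stays U+0069 --
% searchable as i, no need to enter U+0131 over and over.
\setmainfont{Junicode}[CharacterVariant={18:1}]
\begin{document}
minimus % the i's display dotless, but copy/search still sees "i"
\end{document}
```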
> In Unicode dotless i U+0131 is case-paired with dotted I U+0130
No it is not. The default lowercase of U+0130 is U+0069 (see here). On the flip side, the uppercase of U+0131 is U+0049. Going the other way from those is more complicated, because you have to check whether the language is Turkish or Azerbaijani, but in no case is a dotless lowercase i paired with a dotted uppercase I.
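If I remember the recent kernel changes correctly, the language-dependent part can be seen directly in LaTeX (a sketch; it needs a 2022-or-later LaTeX kernel, where \MakeUppercase and \MakeLowercase are Unicode- and language-aware):

```latex
\documentclass{article}
\usepackage[turkish,english]{babel} % english is the main language here
\usepackage{fontspec}
\setmainfont{Junicode}
\begin{document}
\MakeUppercase{i}  % default casing: should give I (U+0049)
\MakeLowercase{I}  % default casing: should give i (U+0069)

\selectlanguage{turkish}
\MakeUppercase{i}  % Turkish casing: should give İ (U+0130)
\MakeLowercase{I}  % Turkish casing: should give ı (U+0131)
\end{document}
```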
Right. My bad. Unicode assigning U+0130-1 to this pair is a kind of logical (not case) pairing: "Opposite of default with respect to the dot." And only for some languages. It might make sense to have language-specific variants of these two lookups, though I'm not sure (off the top of my head) how they should work.
Personally I don't know what the utility is in the character variant lookups, the language lookups are the only ones with any utility to me. Any character variants should always still respect the language and script settings first, then provide valid variants inside that scope.
> U+0130-1 to this pair is a kind of logical
Logical to what/who? They have no relation to each other in any language that I know of, except as "these are the things some other languages have that English ignores because we cross-wired our glyphs first". The fact that Unicode ever encoded the English uppercase I as "uppercase dotless i" instead of just "uppercase i", making languages that actually have both work around the mismatch constantly, is bananas. Whether or not it has a dot could have been a language thing; instead they scrambled the pairings.
For most fonts, the ones oriented towards modern languages, I think it's true that language lookups are the most valuable. A font should quietly and invisibly do the right thing for Turkish, Armenian, and any other languages it's designed to work with.
But Junicode is trying to rationalize a large and unruly collection of characters identified by the MUFI project as being of interest to medievalists. Most of this stuff is not labeled as applying to particular language systems (perhaps because, for the most part, it doesn't), and much of it applies not to particular languages but to particular scripts--by which I mean not, say, Latin vs. Greek but rather Gothic vs. Insular minuscule vs. Beneventan--distinctions for which neither Unicode nor OpenType makes any provision.
To make it all messier, MUFI assigns PUA code points to thousands of these characters, with the result that they mean nothing at all to the various apps that process and present text. A medieval text littered with these PUA characters is an accessibility nightmare. The situation with many canonical Unicode characters is not much better. For example, insular f U+A77C is not usually identified by software as a variant of f.
What Junicode's cvNN features (and a few others—especially hlig) try to do is associate these medieval characters with standard Unicode bases so that an editor who wants to display, for example, MUFI "LATIN SMALL LETTER H WITH RIGHT DESCENDER" (encoded at U+F23A) can represent it as a variant of h.
Which, conveniently enough, is exactly what it is.
Show me, @alerque, a way to do all this exclusively with "language lookups" and I will not only adopt your ideas but will also sing your praises till the last trumpet sounds.
I'm sure I've made mistakes, and some of those mistakes spring from my inadequate understanding of particular language systems. Should dotted cap I U+0130 (indicated by MUFI as useful to medievalists, which kind of pries it out of modern language systems) not be presented as a variant of I? Should it not be on the same index in cv17 as the dotless i in cv18 (not that there's any necessary relationship between characters on the same index in adjacent lookups—see here, where we thrashed a lot of this out)?
Is there a better way to organize this mess? I'm very glad to hear all ideas.
@psb1558 I'm knee-deep in last-minute typesetting woes for two Turkish book publishing projects, so take that into account. I don't have a proposal to fix Unicode's brain-dead mistake of yesteryear; if I did, I would have proposed it to the appropriate committee some time ago. Undoing the mistake is virtually impossible at this point; the breakage would be just horrific.
My complaint here is about not making things worse than they are. Mistakenly thinking the uppercase of ı is İ just makes things worse.
Honestly, though, I think representing characters that have their own Unicode code points as variants of ANSI-subset characters is misguided. Anybody doing work in this field should figure out how to enter Unicode code points. I'm interested in this font because it has a glyph for U+F23A, not because it swaps in that shape for an h with +cvXX. If medievalists have a problem entering their data, I'd rather see the entry problem solved than see it hacked on the display end by the shaper at the last second.
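For what it's worth, direct entry is not hard in XeLaTeX (a sketch, assuming Junicode supplies the MUFI glyph):

```latex
\documentclass{article}
\usepackage{fontspec}
\setmainfont{Junicode}
\begin{document}
\char"F23A\ % MUFI h with right descender, entered by code point
^^^^f23a    % the same character via XeTeX caret notation (lowercase hex)
\end{document}
```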
I will say that having dotless lowercase i as a character variant of U+0069 is important, and recommended Unicode practice, for languages like Irish, which use dotless i as a graphical variant of dotted i, not as a separate letter. The dotless lowercase i encoded at U+0131 should only be used when the dotlessness carries a meaningful semantic distinction (as in Turkish), not simply for glyphic variation.
The same goes for medievalist letters as above. Sometimes the distinctions are semantic (which is why they have been given codepoints) but sometimes they are not (and the determination of what counts as a “semantic” distinction might vary between intents and purposes). The dedicated codepoint should only be used when the semantic distinction is intentional.
It might still be best to put dotted capital I on a different character variant than dotless small i. In addition to Irish, some languages of Canada use a dotless i (still encoded at U+0069) to avoid confusion with accented i (í and ì), and users would still expect uppercase I to be undotted in those cases.
Yes to everything @mararus-sh just said.
Okay, I think things are coming into better focus now (they might have sooner, but this is a crazy week, full of grading and conferences with students). What I’m interested in is the representation of purely graphical variants while maintaining accessibility. Commenters here are interested in font makers (e.g. myself) not abusing Unicode by using encoded characters like dotless i U+0131 in unintended ways.
Which is precisely what we see in, for example, the MENOTA diplomatic edition of Völuspá, where the dotless i in l. 1 reflects a scribe’s stylistic choice and not a semantic difference. In the same line you see small cap n U+0274 from IPA Extensions used as a purely graphical variant of n, and in the rest of my sample various other letters: insular f and d, r rotunda.
This kind of thing is enabled by a continuing pattern of abuse within Unicode, where code points are routinely assigned to purely graphical variants on the thinnest of pretexts. The example that I know best is the series of insular letters in Latin Extended-D: if those code points are only used “correctly,” as prescribed in the proposal that got them included in Unicode, they will never be used at all.
And in a way, that’s fine with me. Junicode already makes certain moves that are about avoiding the abuse of code points. For example, the small cap n in line 1 of Völuspá, above, is in Junicode at its proper code point, for use by phoneticians. But for use as a graphical variant, Junicode recommends the use of pcap (hat tip to @kenmcd here, who pointed out a feature I had missed), which inserts an unencoded petite cap n that just happens to be the same in appearance as U+0274. This appears to search engines, screen readers, etc. as simply an n—much as a regular (smcp) small cap does.
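In XeLaTeX terms, the two routes look like this (a sketch; fontspec's Letters=PetiteCaps selects the pcap feature):

```latex
\documentclass{article}
\usepackage{fontspec}
\setmainfont{Junicode}
\begin{document}
% Phonetic use: the encoded small capital U+0274 -- a search for "n"
% will usually not match it.
\char"0274

% Graphical variant: pcap swaps in a petite-cap form, but the underlying
% character is still U+006E, so search engines and screen readers see "n".
{\addfontfeature{Letters=PetiteCaps}n}
\end{document}
```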
Junicode routinely does the same thing with MUFI’s PUA characters, when they’re either graphical variants or ligatures—that is, when they wouldn't qualify for a Unicode code point. If you type in the PUA code point, you get an encoded version, and if you use cvNN or hlig you get an unencoded version.
That kind of program could easily be expanded without expanding the file size by a lot... In fact, would it make sense to impose a rule that says, in essence, a cvNN feature must never swap one encoded Unicode character for another?
I would not care what Unicode says/writes about the intended usage, as their view may be obsolete. Just use common sense :-) Let me remind you that Unicode offered a solution, namely variation sequences, but it looks like nobody treats it seriously: http://www.unicode.org/mail-arch/unicode-ml/y2018-m07/0034.html. I think we just need a kind of private variation sequence, which in a sense is what you are doing with cv features.

Let me also mention that I consider the still-official Unicode model (input, representation, rendering) obsolete. I need to encode scans when I don't yet know what is a distinct character and what is a graphical variant (e.g. a 16th-century proposal for Polish spelling). So the workflow is: scan/rendering, some input, some representation, analysis, refined representation.
My (possibly poorly-informed) understanding of why Unicode is encoding all these variant forms now is that they are (possibly) useful in the context of corpora: identifying the frequencies of form A over form B in a work, dating the work, perhaps determining who wrote it, etc. (?), with the rule of thumb that if two forms co-occur in the same document, the difference is possibly worth encoding. But for ordinary reading or processing of a document, if form A and form B mean the same thing, of course you want them to be treated similarly (with respect to searching, etc.). So that's the tension, I think: between a very strict encoding model where every difference of form is significant, and a looser encoding model where variants should probably be encoded with cvNN.
* The other situation is when the reading or significance of a (possible) variant is not clear; then they will generally encode the variant, in case it actually was intended with a different meaning than the original.
Unicode (and Junicode) needs to support both approaches, so I think the status quo of having both a codepoint and a cvNN feature is generally good. But yes, I would say that cvNN should always maintain the identity of the underlying character to keep from confusing things.
** Regarding variation sequences, my understanding of Unicode history is that if a(n alphabetic) character is worthy of a variation sequence, it is usually just deemed worthy of a separate codepoint. Primarily, this is probably because variation selectors are generally ignorable, and with alphabetic characters, two variants not matching one another is generally considered a feature, not a bug. (It would be hard(er) to count the number of instances of ordinary small n U+006E versus small cap ɴ U+0274 in a document if they were both encoded using the U+006E codepoint. What makes it complicated is that in other situations, treating these two the same is exactly what you want.)
I don't have a lot of contact with Unicode, but from the glimpses I get into their process, they struggle with the (not always straightforward) question of whether a particular character has semantic content of its own or is a variant of another character, and they do reject the variants. (They are very aware that they are not the only game in town: there is OpenType for representing stylistic variants, and the TEI char element for representing unencoded characters in etexts.) That said, they seem to have accepted, e.g., r rotunda U+A75A-B, even though the proposal (which had my name on it even though I had little or nothing to do with it) claimed nothing more for it than that it was a variant form of r, because it was widely used well into the print era, and modern diplomatic editions sometimes reproduce it.
But with the insular characters, the proposers seem to have thought it necessary to make a case that some of them had a distinct phonetic value some of the time, and then the rest of the characters were pulled in on what appear to be largely systemic grounds. For example (from a proposal for a number of medieval characters):
> INSULAR D is used in a variety of phonetic contexts. In some instances, it is simply a variant form of d—and in such contexts, it may be appropriate to handle the difference with a font style. In other contexts, however, INSULAR D is clearly distinguished to represent the voiced dental spirant [ð].
The phonetic use of insular d is illustrated with a couple of books: a Cornish grammar published in 1790 and a Welsh dictionary published in 1931 and reprinted in 1988. Now I've already been intemperate enough in this thread to get myself in trouble, so I won't say much more, except that, despite what the proposal appears to suggest (that insular d should usually be handled with a "font style"), the code point is routinely used to represent a stylistic variant of d.
I don't think I'm qualified to say whether insular d should have been encoded on the basis of two old printed books—but this sort of thing creates a gray zone where some of what appear to most users to be stylistic variants get the Unicode treatment while others do not. Further, there is (to the naked eye, anyway) inconsistency in the treatment of these in software, so that if you search for "r" in a web page containing r rotunda U+A75B, the search will not find that particular variant of r, while if you search for "f," it will find insular f U+A77C (well, usually).
For printed texts this isn't so critical, but for etexts I'm very interested in supporting a system where stylistic variations (when an edition includes them) are signaled by markup around a plain(ish) character rather than by use of exotic code points (especially PUA code points), and the on-screen presentation of the text is a collaboration of script and font. This kind of system promotes accessibility, searchability, and visibility to search engines.
I'm afraid I don't know much about variation sequences. I suppose it was one of those good ideas that failed to achieve liftoff.
I'm sorry to go on at such length. I am obviously having some difficulty explaining myself concisely.
The silliest abuse of insular letters is on Wikisource, where Irish books in Gaelic type have been encoded using them. If you want to produce a version of these texts using an appropriate font (in epub or pdf) you first have to replace all these exotic characters.