[UFO4] Separating unicode from glif

justvanrossum commented 5 years ago

For UFO4 we should consider removing the <unicode> element from the glif format, in favor of a ufo-global character mapping file, say font.ufo/cmap.plist, which would map unicode values to glyph names.

typesupply commented 5 years ago

I think this is a good idea. This wasn't a problem in the single layer days, but it is now. Glyphs with the same name may have separate Unicode values across layers. There's probably a use case for that, but the general use case is to have one value per name. So, a universal mapping makes sense now.

Here's a quick sketch of how I think we could make this work without suddenly breaking lots of code:

The <unicode> element continues to be written into GLIF. The spec will say that it is for reference and backwards compatibility only. It will be a vestige similar to the <name> element.
A new cmap.plist will be added to the top level of the UFO. It will be a dict with format { name : [int, ...] } If this is present, a reader must ignore all unicode elements in GLIF files.
The UFO format version will be bumped to 3.1.

Thoughts?

justvanrossum commented 5 years ago

A new cmap.plist will be added to the top level of the UFO. It will be a dict with format { name : [int, ...] } If this is present, a reader must ignore all unicode elements in GLIF files.

Strongly disagree. It should be {unicodeInt: glyphName}, where unicodeInt could be encoded as a hex string (plists don't like ints as dict keys after all).

typesupply commented 5 years ago

Strongly disagree. It should be {unicodeInt: glyphName}, where unicodeInt could be encoded as a hex string (plists don't like ints as dict keys after all).

Oops, yeah, I had a brain fart. I was thinking about glyphs with multiple Unicode values. This is a much cleaner way to handle that.

justvanrossum commented 5 years ago

A main design goal also is to make it impossible to have multiple glyphs using the same code point.

justvanrossum commented 5 years ago

I see two options to format the keys in cmap.plist:

stringified int: str(codePoint), no zero-padding.
hexified int, uppercase letters, zero-padded to 4 digits for the BMP, 5 or 6 digits above the BMP: "%04X" % codePoint

I'm leaning towards the second option.

typesupply commented 5 years ago

I prefer the hexified int.

justvanrossum commented 5 years ago

The formatting should be strictly specified to avoid 01FFFF vs 1FFFF ambiguities.
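
One way to pin the key format down is to accept exactly one spelling per code point. This is only a sketch of the second option above: the helper names and the exact rule (uppercase, 4 digits in the BMP, no redundant leading zero above it) are assumptions, not spec text.

```python
import re

# Assumed canonical form: uppercase hex, exactly 4 digits in the BMP,
# 5 or 6 digits above it, never a redundant leading zero.
_CANONICAL_KEY = re.compile(r"[0-9A-F]{4}|[1-9A-F][0-9A-F]{4}|10[0-9A-F]{4}")


def formatCodePoint(codePoint):
    """Return the one allowed spelling of a code point (hypothetical helper)."""
    if not 0 <= codePoint <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (codePoint,))
    return "%04X" % codePoint


def parseCodePoint(key):
    """Parse a key, rejecting spellings such as '01FFFF', '1ffff' or '41'."""
    if _CANONICAL_KEY.fullmatch(key) is None:
        raise ValueError("non-canonical code point key: %r" % (key,))
    return int(key, 16)


print(formatCodePoint(0x41))     # 0041
print(formatCodePoint(0x1FFFF))  # 1FFFF
```

With a reader this strict, "01FFFF" and "1FFFF" can never both appear as valid keys for the same code point.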

typesupply commented 5 years ago

Here's the relevant part of the GLIF spec.

This is refreshing my memory on the development of the <unicode> element… We had a couple of things that we had to solve:

If a glyph has > 1 code point, how do we indicate the primary one? We handled this in GLIF by saying that the first appearance of a <unicode> defined the primary code point.

I don't know how we'd handle this in cmap.plist if the key is the code point.

We struggled (aka "were annoyed with") fonts with > 1 glyph mapped to the same code point. I think Verdana had this situation and we were worried about round tripping.

Should we worry about this now? It's such an odd edge case.

justvanrossum commented 5 years ago

It's not an edge case, and by mapping {codePoint: glyphName} all is good.

f["Omega"].unicodes = [0x2126, 0x03A9]  # OHM SIGN, GREEK CAPITAL LETTER OMEGA

vs

cmap = {
    0x2126: "Omega",
    0x03A9: "Omega",
}

justvanrossum commented 5 years ago

If you mean how to determine from a cmap which is the primary unicode, then yes, that can not be done unambiguously. The concept of "primary unicode value" is flawed, though, and not really needed.

typesupply commented 5 years ago

If you mean how to determine from a cmap which is the primary unicode, then yes, that can not be done unambiguously.

Yes. That's what I mean.

justvanrossum commented 5 years ago

Ok, that is then indeed a bw compat issue we can't easily solve.

justvanrossum commented 5 years ago

But again, that's only a problem if the concept "primary unicode value" has any value. I think it only becomes problematic in code that is too lazy to properly deal with glyph.unicodes and only deals with glyph.unicode. (I've done that many times myself, ha.)

typesupply commented 5 years ago

If we structured the plist as { name : [hex string, …]} we could preserve the "primary" indication by saying that the first is the primary. To avoid duplicate code points we could make a note that they are not allowed. We do that in other parts of the spec. Hm.
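
For comparison, here is roughly what a reader would have to do with the { name : [code point, …] } layout to get a code point lookup and to enforce the "no duplicates" note. It is a sketch only: invertUnicodeLists is a made-up name, and letting the alphabetically first glyph name win a clash is an arbitrary choice.

```python
def invertUnicodeLists(nameToUnicodes):
    """Flip {glyphName: [codePoint, ...]} into {codePoint: glyphName}.

    Also returns {codePoint: [glyphName, ...]} for every code point that
    more than one glyph claims, so the caller can report the conflict.
    """
    cmap = {}
    conflicts = {}
    for glyphName in sorted(nameToUnicodes):
        for codePoint in nameToUnicodes[glyphName]:
            if codePoint in cmap and cmap[codePoint] != glyphName:
                # Keep the first claimant, remember everyone involved.
                conflicts.setdefault(codePoint, [cmap[codePoint]]).append(glyphName)
            else:
                cmap[codePoint] = glyphName
    return cmap, conflicts


cmap, conflicts = invertUnicodeLists({"A": [0x41], "B": [0x41, 0x42]})
print(cmap)       # {65: 'A', 66: 'B'}
print(conflicts)  # {65: ['A', 'B']}
```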

justvanrossum commented 5 years ago

I really think that code not dealing with glyph.unicodes properly is broken, and that the order should not have semantic meaning.

moyogo commented 5 years ago

It would be nice if UVS (Unicode Variation Sequences) were supported (see https://docs.microsoft.com/en-us/typography/opentype/spec/cmap#format-14-unicode-variation-sequences or https://en.wikipedia.org/wiki/Variant_form_(Unicode)). These require sequences of two unicodes (base character and variation selector character) instead of one unicode at a time.

These are useful for CJK, Mongolian, mathematical symbols, emojis and other things.

typesupply commented 5 years ago

It would be nice if UVS (Unicode Variation Sequences) were supported

Do you have any suggestions for how to do this? I don't know much about these.

justvanrossum commented 5 years ago

@moyogo: It seems a format 14 cmap subtable is always used together with a regular cmap subtable. So I guess we would be talking about an additional mapping, next to cmap.plist. Perhaps uvs.plist.

Semantically and practically, I think a structure like this would be most appropriate:

uvs = {
    unicodeVariationSelector1: ({default1, default2, ...}, {nonDefault1: glyphName1, nonDefault2: glyphName2, ...}),
}

The first example in the spec would then look like this:

cmap = {
    0x82A6: "cid7961",
}

uvs = {
    0xE0100: (set(), {0x82A6: "cid1142"}),
    0xE0101: ({0x82A6}, {}),
}

The uvs.plist file could look like this (using array instead of set):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>E0100</key>
  <array>
    <array/>
    <dict>
      <key>82A6</key>
      <string>cid1142</string>
    </dict>
  </array>
  <key>E0101</key>
  <array>
    <array>
      <string>82A6</string>
    </array>
    <dict/>
  </array>
</dict>
</plist>

justvanrossum commented 5 years ago

Or, looking at the internals of the fonttools format 14 implementation, perhaps this is better:

uvs = {
    0xE0100: {0x82A6: "cid1142"},  # non-default
    0xE0101: {0x82A6: None},  # default, refer to cmap
}

Slightly nicer plist, too (too bad plist doesn't support None...):

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
  <dict>
    <key>E0100</key>
    <dict>
      <key>82A6</key>
      <string>cid1142</string>
    </dict>
    <key>E0101</key>
    <dict>
      <key>82A6</key>
      <string></string>
    </dict>
  </dict>
</plist>
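
Reading that second plist shape back into the Python structure above could look like the following. This is a sketch that assumes the empty-string convention for "default, refer to cmap"; uvsFromPlist is a made-up name and only the standard-library plistlib is used.

```python
import plistlib


def uvsFromPlist(data):
    """Convert plist bytes into {selector: {baseCodePoint: glyphName or None}}."""
    raw = plistlib.loads(data)
    uvs = {}
    for selectorHex, mapping in raw.items():
        uvs[int(selectorHex, 16)] = {
            # An empty string stands in for None (default variation).
            int(baseHex, 16): (glyphName or None)
            for baseHex, glyphName in mapping.items()
        }
    return uvs


data = plistlib.dumps({"E0100": {"82A6": "cid1142"}, "E0101": {"82A6": ""}})
print(uvsFromPlist(data))
```

The hex-string keys have to be converted on both read and write, which is the extra layer on top of plist that this layout implies.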

justvanrossum commented 5 years ago

The TTX dump writes a glyph name of "None" for a default variation. I like an empty string better, as nobody will stop you from naming a glyph "None" :)

moyogo commented 5 years ago

UVS are a mechanism to have glyph variants at the Unicode character level. For example, if <0030> is zero then <0030 FE00> is the standard sequence to get zero.slashed; no GSUB is involved, only cmap.

justvanrossum commented 5 years ago

Given a Unicode Variations Sequences subtable, converted to Python as in my last comment, the following code is my best guess of how the variation selection process works:

def isVariationSelector(c):
    return 0xFE00 <= c <= 0xFE0F or 0xE0100 <= c <= 0xE01EF

def glyphsFromText(text, cmap, uvs):
    text = [ord(c) for c in text]
    glyphs = []
    for i, c in enumerate(text):
        if isVariationSelector(c):
            if i > 0:
                glyphName = uvs.get(c, {}).get(text[i-1], None)
                if glyphName is not None:
                    glyphs[-1] = glyphName
        else:
            glyphs.append(cmap.get(c, ".notdef"))
    return glyphs

cmap = {
    0x82A6: "cid7961",
}

uvs = {
    0xE0100: {0x82A6: "cid1142"},  # non-default
    0xE0101: {0x82A6: None},  # default, refer to cmap
}

print(glyphsFromText("\u82A6\U000E0100", cmap, uvs))
print(glyphsFromText("\u82A6\U000E0101", cmap, uvs))

justvanrossum commented 5 years ago

Alright, more thoughts on how to add a cmap file to the UFO format, as well as support for Unicode Variation Sequences.

Cmap:

I'd like to map unicode code points to glyph names.
The plist format doesn't support non-strings as keys, so we'd have to convert the code points to strings first, and vice versa upon reading. So we need to do more parsing on top of plist, therefore choosing the plist format doesn't buy us as much.
How about we write the cmap as a text file, with two fields per line: the unicode value as hex and the glyph name, separated by whitespace. I bet this is faster to parse than plist.
So, this becomes font.ufo/cmap.txt.
Requirements:
- the file must be sorted by unicode value (as int, not lexicographic via the hex representation),
- unicode values must be unique
- no blank lines
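
The three requirements above are cheap to verify. A sketch of such a check, assuming whitespace-separated fields as described at this point; checkCmapLines and its message strings are made up for illustration:

```python
def checkCmapLines(lines):
    """Return a list of problems found in cmap.txt-style lines (a sketch)."""
    problems = []
    seen = set()
    highest = -1
    for lineNumber, line in enumerate(lines, 1):
        fields = line.split(None, 1)  # hex value, then the rest of the line
        if not fields:
            problems.append("line %d: blank line" % lineNumber)
            continue
        if len(fields) != 2:
            problems.append("line %d: expected two fields" % lineNumber)
            continue
        try:
            codePoint = int(fields[0], 16)
        except ValueError:
            problems.append("line %d: bad hex value" % lineNumber)
            continue
        if codePoint in seen:
            problems.append("line %d: duplicate unicode value" % lineNumber)
        elif codePoint < highest:
            # Compared as integers, so "10000" correctly sorts after "FFFF".
            problems.append("line %d: not sorted by unicode value" % lineNumber)
        seen.add(codePoint)
        highest = max(highest, codePoint)
    return problems


print(checkCmapLines(["0041 A\n", "FFFF x\n", "10000 y\n"]))  # []
```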

Unicode Variation Sequences:

The UVS data can be represented by a sequence of (unicodeValue, variationSelector, glyphName) tuples, where glyphName is optional. No glyph name means: this is the default variation, and the cmap should be used to find the glyph name for this code point.
Similarly to cmap, the plist format will not be that great to store this information.
So, just as with cmap, I suggest defining a very simple text format: one variation sequence per line, two or three fields per line, separated by whitespace: unicode value as hex, variation selector as hex, and optionally a glyph name.
This can be efficiently parsed and converted to a datastructure of choice.
Could be named font.ufo/uvs.txt
Requirements:
- unicode-value/variation-selector pairs must be unique
- lines must be sorted lexicographically by the unicode-value/variation-selector pair as integer values (not their hex representations)
- no blank lines
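
A reader for such a file could be as small as this. It is a sketch under the assumptions stated above (whitespace-separated fields, optional third field); uvsload is a made-up name chosen to sit next to a cmap loader.

```python
def uvsload(lines):
    """Parse uvs.txt-style lines into {(unicodeValue, variationSelector): glyphName}.

    glyphName is None for a default variation (look it up in the cmap).
    """
    uvs = {}
    previous = None
    for line in lines:
        fields = line.split(None, 2)
        if len(fields) not in (2, 3):
            raise ValueError("expected two or three fields: %r" % (line,))
        key = (int(fields[0], 16), int(fields[1], 16))
        # Tuples of ints compare element by element, which is the
        # "sorted by pair, as integers" rule; <= also rejects duplicates.
        if previous is not None and key <= previous:
            raise ValueError("pairs must be unique and sorted: %r" % (line,))
        uvs[key] = fields[2].rstrip("\n") if len(fields) == 3 else None
        previous = key
    return uvs


print(uvsload(["82A6 E0100 cid1142\n", "82A6 E0101\n"]))
```

The flat {(base, selector): name} dict is only one possible in-memory shape; regrouping it by selector is a short loop.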

anthrotype commented 5 years ago

I like the idea of simple space-separated text files. The format is simple enough that the parser can be a one-liner.

As you already noted, if this cmap.txt file maps from unicode values to glyph names, then the "primary" unicode value for a glyph can no longer be defined, so APIs like this _set_unicode one in defcon would have to be deprecated:

https://github.com/typesupply/defcon/blob/9c81776dd782f939142cd9e7c1c047feabea479b/Lib/defcon/objects/glyph.py#L234-L242

I agree the idea of a primary unicode value for a glyph is flawed, but if we really wished to keep it around, we could say this cmap.txt is not required to be sorted by unicode value, and the first mapping that appears for a given glyph name in this ordered cmap list is considered the "primary" unicode value for that glyph. I don't know if it's worth it, though.

justvanrossum commented 5 years ago

I think the long term consequence is that glyphs will eventually neither have a g.unicode nor a g.unicodes attribute at all. Both attributes would have to be deprecated.

To loosen the sort requirement is a nice idea if we indeed must hold on to the notion of "primary unicode", but can some other sorting requirement be invented that ensures a deterministic order? I'd hate it if various tools would output equivalent but differently sorted cmap files.

anthrotype commented 5 years ago

how about sorting cmap.txt by the glyph name instead of the unicode value, then within the group of mappings that share the same glyph name, the order is user-defined?

(note: i'm still leaning towards simplicity [sorting by unicode] and deprecating the notion of "primary" unicode value)
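
To illustrate the "sort by glyph name, user order within the group" idea: Python's sorted() is stable, so entries sharing a glyph name keep the order the user gave them, and the first one can be read as "primary". A sketch with made-up function names:

```python
def sortByGlyphName(entries):
    """Sort (codePoint, glyphName) pairs by name; order within a name is kept."""
    return sorted(entries, key=lambda entry: entry[1])


def primaryUnicodes(entries):
    """Map each glyph name to the first code point listed for it."""
    primary = {}
    for codePoint, glyphName in entries:
        primary.setdefault(glyphName, codePoint)
    return primary


entries = [(0x2126, "Omega"), (0x41, "A"), (0x03A9, "Omega")]
ordered = sortByGlyphName(entries)
print(ordered)                   # [(65, 'A'), (8486, 'Omega'), (937, 'Omega')]
print(primaryUnicodes(ordered))  # {'A': 65, 'Omega': 8486}
```

The output file is deterministic only up to the user-defined order inside each group, which is the trade-off being discussed.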

justvanrossum commented 5 years ago

how about sorting cmap.txt by the glyph name instead of the unicode value, then within the group of mappings that share the same glyph name, the order is user-defined?

That could work.

(note: i'm still leaning towards simplicity [sorting by unicode] and deprecating the notion of "primary" unicode value)

Yeah. I'm curious to hear others about this issue.

As I wrote before, I think any breakage can only come from code using g.unicode in the presence of multiple unicode values in g.unicodes. I think it's fair to claim that such code is already broken.

typesupply commented 5 years ago

Could someone open another issue for Unicode Variation Sequence support? I don't want the discussion of that to be hard to find in the future.

moyogo commented 5 years ago

Note that the UFO3 glyph name description has no restriction:

The name of the glyph. This must be at least one character long. Different font specifications, such as OpenType, often have their own glyph name restrictions. Authoring tools should not make assumptions about the validity of a glyph’s name for a particular font specification.

A glyph name including whitespace is valid.

justvanrossum commented 5 years ago

A glyph name including whitespace is valid.

Ahhh, there's the catch. Thanks for pointing that out.

We could work around that by saying cmap.txt and uvs.txt must be tab-separated. (Can we please say a tab character is not legal within a glyph name?)

typesupply commented 5 years ago

I'm still not sure why we're throwing out {name : [hex, ...]}. Is there a technical reason? I get that you guys don't like it, but why? It will work with plist and is backwards compatible with the "lazy," "flawed" and "broken" code that deals with glyph.unicode instead of glyph.unicodes. I'm not arguing for this or against that. I just want to understand the reasoning behind the opinions.

justvanrossum commented 5 years ago

semantically, {unicode: glyphName} matches exactly with what will end up in a font
speed: when I open a UFO I want to be able to get at the cmap easily and quickly. I need to know that ord("a") maps to "a" before I need to know that "a" is reachable from ord("a").
avoids accidentally using the same unicode for multiple glyphs, avoiding ambiguities

typesupply commented 5 years ago

Okay, thanks. This gives me much more info for comparing the options. I'm going to kick the tires now. Please don't yell at me…

semantically, {unicode: glyphName} matches exactly with what will end up in a font

Semantically, kerning.plist, groups.plist, etc. do not match what will end up in the font. 😉 We don't usually pay much attention to output formats.

speed: when I open a UFO I want to be able to get at the cmap easily and quickly. I need to know that ord("a") maps to "a" before I need to know that "a" is reachable from ord("a").

How big of a speed issue is this in the bigger picture? It will be a file that is read only once per UFO load so even if there's an extra step of creating a flipped dict after plist read it's only going to happen once. The storage format for kerning.plist is a good reference for concerns about complexity. Plist didn't support the structure we liked for kerning data ({("name", "name") : value}) but instead of inventing a new format and all of the necessary edge case handling we modified the structure of the kerning data so that it could be stored in plist and built a translation layer into ufoLib.
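
For reference, the kind of translation layer meant here is small. The sketch below converts between pair-keyed kerning and the nested { first : { second : value } } shape that kerning.plist stores; the function names are invented for this example and are not ufoLib's actual API.

```python
def kerningToNested(pairs):
    """{(first, second): value} -> {first: {second: value}} (plist friendly)."""
    nested = {}
    for (first, second), value in pairs.items():
        nested.setdefault(first, {})[second] = value
    return nested


def kerningFromNested(nested):
    """{first: {second: value}} -> {(first, second): value}."""
    return {
        (first, second): value
        for first, seconds in nested.items()
        for second, value in seconds.items()
    }


pairs = {("A", "V"): -40, ("A", "T"): -30, ("T", "o"): -20}
nested = kerningToNested(pairs)
print(nested)  # {'A': {'V': -40, 'T': -30}, 'T': {'o': -20}}
assert kerningFromNested(nested) == pairs
```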

avoids accidentally using the same unicode for multiple glyphs, avoiding ambiguities

I don't think this should be considered a pro or a con. There are many, many places in the UFO where duplicates or ambiguities may be introduced. We handle this at the spec level.

We could work around that by saying cmap.txt and uvs.txt must be tab-separated. (Can we please say a tab character is not legal within a glyph name?)

I don't want to change other parts of the spec just because it makes parsing a text file easier. Can there be an escape if there is a tab in a glyph name?

justvanrossum commented 5 years ago

semantically, {unicode: glyphName} matches exactly with what will end up in a font

Semantically, kerning.plist, groups.plist, etc. do not match what will end up in the font. 😉 We don't usually pay much attention to output formats.

Note how I wrote "a font" and not "an OpenType font". The way a cmap works for any kind of font engine is that it maps unicode values to glyphs, and never the other way around. It's such a fundamental thing.

speed: when I open a UFO I want to be able to get at the cmap easily and quickly. I need to know that ord("a") maps to "a" before I need to know that "a" is reachable from ord("a").

How big of a speed issue is this in the bigger picture?

This is my weakest argument, so let me give in on that one :)

avoids accidentally using the same unicode for multiple glyphs, avoiding ambiguities

I don't think this should be considered a pro or a con. There are many, many places in the UFO where duplicates or ambiguities may be introduced. We handle this at the spec level.

The fact that ambiguities exist elsewhere in the spec is no reason to not try and avoid it here, especially if the solution is so trivial and obviously correct. Inherent correctness is better than correctness-that-needs-to-be-verified.

We could work around that by saying cmap.txt and uvs.txt must be tab-separated. (Can we please say a tab character is not legal within a glyph name?)

I don't want to change other parts of the spec just because it makes parsing a text file easier. Can there be an escape if there is a tab in a glyph name?

That was half in jest. On the one hand we can easily make a tab separated text work even when tab chars can occur in glyph names, on the other hand I don't think it's all that reasonable to allow such invisibles to occur in glyph names. How about NUL characters? Return/newline? Anything < 0x20, really.
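
If the spec did exclude everything below 0x20, the check would be tiny. A sketch, with hasControlChars as a made-up name and the rule itself only a suggestion at this point:

```python
def hasControlChars(glyphName):
    """True if the name contains any character below U+0020 (tab, NUL, newline, ...)."""
    return any(ord(char) < 0x20 for char in glyphName)


print(hasControlChars("z z z z"))  # False: a plain space is U+0020
print(hasControlChars("a\tb"))     # True
```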

typesupply commented 5 years ago

The fact that ambiguities exist elsewhere in the spec is no reason to not try and avoid it here, especially if the solution is so trivial and obviously correct. Inherent correctness is better than correctness-that-needs-to-be-verified.

Ambiguities can be created in a plain text file:

0041 A
0041 B
0042 B

Any spec is going to have to deal with these issues.

Don't get me wrong. I love plain text files and my first thought when "we need a cmap" came up was 0041 A but then I started thinking about precedents, backwards compatibility, etc. I just want to make sure that we have a compelling reason to invent a wheel in this case. A custom format, no matter how simple, is going to introduce more code complexity. We're also talking about breaking existing code (probably not much, but > none). I don't want to be completely dismissive of the work that will introduce.

On the one hand we can easily make a tab separated text work even when tab chars can occur in glyph names, on the other hand I don't think it's all that reasonable to allow such invisibles to occur in glyph names. How about NUL characters? Return/newline? Anything < 0x20, really.

I thought there was something in the contents.plist spec about excluded characters, but it looks like it is only in the example name to file name algorithm. Yikes. The spec should be changed. I'll open an issue for that.

justvanrossum commented 5 years ago

Ambiguities can be created in a plain text file:

True, but we were arguing about mapping {unicode: glyphName} vs {glyph: [uni1, ...]}, no?

Even with a text file, it's easy to say (and verify) "the first column must be a unique value", but it's a lot harder if we spec it the other way around. Sure, not impossible, just less elegant and less logical. For years we've been thinking like "glyphs have unicode values". I'm arguing that it's time we should change our thinking towards "unicode code points map to glyphs", as that's a more realistic model of how fonts actually work.

Your point about custom formats is well taken. The data we're talking about here is quite flat, and apart from the dictionary aspect that guarantees keys to be unique, the nested plist structures don't buy us much. Sure, it can be made to work by encoding unicode keys as hex strings, but I'm arguing that that additional layer of encoding on top of plist reduces the benefit of the plist standard. But either way, let's first focus on the next point:

We're also talking about breaking existing code (probably not much, but > none). I don't want to be completely dismissive of the work that will introduce.

Yes. This is probably the most important question in this discussion: what breaks if we stop guaranteeing the order of g.unicodes?

typesupply commented 5 years ago

I'm arguing that it's time we should change our thinking towards "unicode code points map to glyphs", as that's a more realistic model of how fonts actually work.

That's a very good point.

We're also talking about breaking existing code (probably not much, but > none). I don't want to be completely dismissive of the work that will introduce.

Yes. This is probably the most important question in this discussion: what breaks if we stop guaranteeing the order of g.unicodes?

I don't know for sure, but I've been thinking about it. In my own work, I tend to use glyph.unicode as a shortcut because it's easier to deal with an int than a list containing a single int.

glyph.unicode = 41

is easier to write and to understand than:

glyph.unicodes = [41]

Ease of input aside, I looked through some of my code and it looks like I use the "primary Unicode" assumption mostly in interface stuff. (Here's a place in defcon that gets used for this.) The impact of a change to this behavior will only potentially apply to double mapped glyphs and even then it won't be a mission critical change. So, I can't speak for everyone, but I think the impact on my code will be minor.

A point that I've been waiting for someone to bring up is that the first item in the UFO Design Philosophy is "The data must be human readable and human editable." and 0041 A is a heck of a lot more human readable and editable than the plist description of the same thing. But, no one has brought it up so I'll drop my neutrality for a second and mention it.

I'd like to see what a Python reader and writer (that assumes that #80 will be put in place) would look like for the proposed format.

justvanrossum commented 5 years ago

Here's a super minimal dumper/loader. It assumes glyph names don't contain control chars.

from io import StringIO

def cmapdump(cmap, f):
    for uni, glyphName in sorted(cmap.items()):
        f.write("%04X\t%s\n" % (uni, glyphName))

def cmapload(f):
    cmap = {}
    for line in f:
        if line and line[-1] == "\n":
            line = line[:-1]
        uni, glyphName = line.split("\t", 1)
        uni = int(uni, 16)
        cmap[uni] = glyphName
    return cmap

cmap = {
    0x30: "zero",
    ord("a"): "a",
    ord("b"): "b",
    ord("z"): "z z z z",
    0x1e0000: "å ß é"
}

f = StringIO()
cmapdump(cmap, f)
tabSepData = f.getvalue()
print(tabSepData)
f.seek(0)
cmap2 = cmapload(f)
assert cmap == cmap2

benkiel commented 5 years ago

Could someone write this up and PR it? Would be good to look at wording to comment on, as I think the general consensus is that this should happen.

justvanrossum commented 5 years ago

I will try soon, unless someone beats me to it. I need to familiarize myself with the document structure, though. We also need to look at #78, #79 and #80. It will be UFO version 3.1, yes?

benkiel commented 5 years ago

Yes, I think that's a good set of things for 3.1.

schriftgestalt commented 5 years ago

I’m coming from the public.skipExportGlyphs discussion on the glyphsLib repo, and was pointed to #77.

You have a very long discussion about a very specific problem that is caused by a structural weakness of the file format. If that were solved properly, we would not need that big change in the first place. I think any information should be stored as close to all other related information as possible, and if something is changed, it should result in the fewest possible changes elsewhere in the data structure. So if a glyph is deleted, it shouldn’t leave info in too many places (cmap, kerning classes) (there is a weak point in my argument with components, I know).

There are more properties that have the same problem; the export state is one of them. You are thinking about changing the structure quite a bit, so why not allow discussion about the structure?

I suggested that before but if we are speaking about a new version I’ll try again.

I think there are several layers of information needed.

1) Font
    family name
    OpenType features
    masters
    glyphs
2) Masters
    metrics
    designspace coordinates
    guides
3) Glyph
    name
    unicode
    export state
    color label
    (kerning groups)
    layers
4) Layers
    outlines, components
    width
    color label
    guides

This solves quite a lot of the ambiguities that are in the current spec.

You were concerned by the overhead of producing a unicode to glyph mapping. The current structure has a much bigger overhead when producing a single glyph from a designspace. One needs to go through all layer folders in the .ufo and then go through all possible extra .ufos to find all intermediate masters (and again all their layers). So if you have a font with a bunch of extra layers and intermediate masters, you need to read the content of a couple thousand folders just to compile one glyph.

justvanrossum commented 5 years ago

To move the unicode field out of glif isn't a huge structural change.
If we were to design (something like) UFO today, would it be different? Quite likely.
Does it make sense to (pretty much) rewrite the UFO format from scratch at this point? I don't think so.

schriftgestalt commented 5 years ago

But without some serious changes we will be stuck.

justvanrossum commented 5 years ago

You were concerned by the overhead of producing a unicode to glyph mapping. The current structure has a much bigger overhead when producing a single glyph from a designspace. One needs to go through all layer folders in the .ufo and then go through all possible extra .ufos to find all intermediate masters (and again all their layers). So if you have a font with a bunch of extra layers and intermediate masters, you need to read the content of a couple thousand folders just to compile one glyph.

That is simply not true, unless you're exaggerating to new levels of hyperbole :) To get the data needed for one glyph you don't need to look up more glyph data items (files) than there are masters. It's simply O(N) for N masters. I don't think N will ever go into the thousands. Let alone that there will be thousands of folders involved.

To get a cmap-like data structure so I can typeset something (anything) I need to parse ALL glyphs from the default layer. And that's very expensive if the font is large. It's O(N) for N number of glyphs in the font.

One of the cool properties of the UFO format is that you can read most of it lazily. Unicode values being stored in the glyph data limits this ability, hence this proposal.
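
To make that cost concrete: with code points stored inside each .glif, building a cmap means parsing every glyph in the default layer. A rough sketch that works on in-memory GLIF strings rather than files; cmapFromGlifs is a made-up name, and letting the first glyph win a clash is just one possible policy.

```python
import xml.etree.ElementTree as ET


def cmapFromGlifs(glifSources):
    """Build {codePoint: glyphName} by parsing every GLIF document given."""
    cmap = {}
    for source in glifSources:
        glyph = ET.fromstring(source)  # a full XML parse per glyph
        name = glyph.get("name")
        for element in glyph.findall("unicode"):
            cmap.setdefault(int(element.get("hex"), 16), name)
    return cmap


glifs = [
    '<glyph name="A" format="2"><unicode hex="0041"/></glyph>',
    '<glyph name="Omega" format="2"><unicode hex="2126"/><unicode hex="03A9"/></glyph>',
]
print(cmapFromGlifs(glifs))  # {65: 'A', 8486: 'Omega', 937: 'Omega'}
```

With a separate cmap file the same dict comes from reading one small file, and the glyph files can stay unread until they are actually needed.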

schriftgestalt commented 5 years ago

I do not exaggerate. Take a very typical setup from a designer working in Glyphs, where each glyph has a main glyph and a background and maybe some extra layers (copies or brace/bracket layers). If that is stored in a ufo3, it ends up with a couple hundred .glif folders (most layers have individual names). All of those folders have to be parsed to collect all .glifs that belong to one glyph. And for a designspace with a bunch of masters, that multiplies.

justvanrossum commented 5 years ago

In the designspace it is specified which layers are used for which masters, so 99.9% of those layers are not needed to build a glyph, and don't need to be parsed.

schriftgestalt commented 5 years ago

How much of a use case is it to use a ufo to typeset something? The sfnt format is optimised for that. How often do you need a glyph from a .ufo by unicode lookup? Not during design time (the designer likes to see all of them) and not during production (where the glyphs are probably accessed by index or name).

justvanrossum commented 5 years ago

I need it all the time, otherwise I wouldn't have posted this proposal.

unified-font-object / ufo-spec

[UFO4] Separating unicode from glif #77