twardoch / opentype-layout

opentype-layout working group documents

2020-04-23 proposal for the SFNT `LABL` table: Human-readable glyph labels #1

Open twardoch opened 4 years ago

twardoch commented 4 years ago

I have created a draft proposal for the SFNT LABL table: Human-readable glyph labels.

Short story: PostScript names are not enough. There are various legitimate cases where a font vendor would like to include human-readable descriptions for glyphs. My LABL proposal is very lightweight. I’d like to suggest this issue as the place to discuss the proposal.

(Note: I’ve placed the proposal in my fork of the opentype-layout repo, and created a pull request. I’m not sure what the best method is.)

twardoch commented 4 years ago

@behdad @anthrotype @typesupply @LettError @moyogo @justvanrossum @miguelsousa @khaledhosny @madig @jenskutilek @mhosken @benkiel @davelab6 @typemytype @robmck-ms @schriftgestalt @josh-hadley @cjchapman @frankrolf @n7s @punchcutter @ebraminio @blueshade7 @garretrieger @rsheeter @dscorbett @n8willis @jfkthame @TiroTypeworks @Lorp

I have created a draft proposal for the SFNT LABL table: Human-readable glyph labels. Please kindly review and discuss at https://github.com/twardoch/opentype-layout/issues/1

simoncozens commented 4 years ago

Two random comments. First, a stylistic one about the proposal: it took me until the Discussion section to really grasp what the usefulness of this would be for the end user. Conceptually I get the idea that more metadata in the font is a good thing, but so what? When you gave the examples of the alternate asterisk and the map symbols, then I really got it. So I would highlight in the Rationale section what problems it solves: allowing clients to communicate the glyph names to the user so that they can more easily choose between similar glyphs (even this needs more motivating, because why do you need to tell them that the glyph is a banana when they can see it's a banana?), and allowing for font replacement while keeping semantics.

The second comment is that you appear to be solving two different problems at the same time. That's not necessarily a bad thing and can be a sign of efficiency, but it can also be a sign of competing goals. The first problem is glyph identification, as with the asterisk example. In a sense this is a "decoding" problem. But when you give the map example and say that clients can search through the LABL list to find a glyph that they're looking for, this is something else entirely - you're allowing font makers to specify glyph IDs based on an externally agreed, non-Unicode alternate encoding, with the names acting as "codepoints". So this is an encoding problem.

There's not a huge amount in it in terms of performance for most fonts, but when you're specifying how things are encoded, you don't want to have to search through a dictionary of glyphID -> codepoint until you hit upon the codepoint you're trying to display. You want to ideally have a map of codepoints to glyph IDs. So this LABL data structure is the opposite of what it should be for the job at hand; in fact, alternate encodings feel like they belong in the cmap table.

Like I said, that's not necessarily a performance issue, but to me it smells like evidence that these two problems - encoding and decoding - might be better solved in different ways.
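
To make the direction-of-lookup point concrete, here is a minimal Python sketch. The glyph IDs and labels are invented for illustration, and the plain dict only stands in for whatever a LABL-like table would store; none of it is taken from the proposal's actual layout.

```python
# Hypothetical LABL-style data: glyph ID -> labels (values invented for illustration).
labels_by_glyph = {
    12: ["asterisk"],
    340: ["asterisk.alt", "six-pointed asterisk"],
    512: ["campsite"],
}

def find_glyph_linear(label):
    """'Encoding' lookup done the slow way: scan every glyph's labels."""
    for glyph_id, labels in labels_by_glyph.items():
        if label in labels:
            return glyph_id
    return None

def build_reverse_index(table):
    """Invert the mapping once so later lookups are a single dict access."""
    index = {}
    for glyph_id, labels in table.items():
        for label in labels:
            index.setdefault(label, glyph_id)  # first glyph wins on duplicate labels
    return index

glyphs_by_label = build_reverse_index(labels_by_glyph)
print(find_glyph_linear("campsite"), glyphs_by_label["campsite"])
```

The "decoding" use case reads `labels_by_glyph` directly, while the "encoding" use case wants `glyphs_by_label`; a client can derive one from the other, but only at the cost of a full pass over the data.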

twardoch commented 4 years ago

Thanks. In a sense, one system is to go around Unicode. All the major icon fonts do it now: you get a CSS class with a name like fa-cart, and this is really your input method. The CSS class then generates a PUA codepoint via content-before, and then the PUA codepoint gets converted to a glyph via cmap. It's PUA, so in a sense it's not really Unicode.
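
A rough sketch of that chain in Python, in case it helps the discussion. Every value here is hypothetical (the code point and glyph ID are not taken from any real icon font), and the two dicts merely stand in for the stylesheet and the cmap table.

```python
# All values below are hypothetical; they are not taken from any real icon font.
CSS_CLASS_TO_PUA = {"fa-cart": 0xE001}  # what the stylesheet's content rule encodes
CMAP = {0xE001: 57}                     # what the font's cmap table encodes

def is_bmp_pua(codepoint):
    """True if the code point lies in the BMP Private Use Area (U+E000..U+F8FF)."""
    return 0xE000 <= codepoint <= 0xF8FF

def glyph_for_class(css_class):
    """Follow the chain: CSS class name -> PUA code point -> glyph ID."""
    codepoint = CSS_CLASS_TO_PUA[css_class]
    if not is_bmp_pua(codepoint):
        raise ValueError(f"U+{codepoint:04X} is not a BMP private-use code point")
    return CMAP[codepoint]

print(glyph_for_class("fa-cart"))
```

The human-readable name lives only in the first dict, that is, outside the font; the font itself knows nothing but the private-use code point.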

Actually putting this stuff into CSS may be the efficient mechanism for the web use, but those CSS files could be a cache. Those fa-cart labels should also have some place in the font — even if for the web font delivery, the LABL table could be scrapped, very much like the name table largely is.

Thanks, I'll improve some of my rationale points. You're right, the tech implementation should come later.

simoncozens commented 4 years ago

In your icon font example, if it were possible for a client to access a glyph by name, all this goes away, right? Your CSS asks for "fa-cart", you have a glyph in your font named "fa-cart", and some CSS magic asks for the glyph by name bypassing any kind of encoding.

twardoch commented 4 years ago

  1. I’ve updated the proposal so it now has a more extended intro.
  2. If you mean PostScript glyph names, yes. But only an extremely limited character set is permitted in PostScript glyph names — not even hyphen is allowed in shipping fonts, because glyph names are used directly in technologies like PostScript that have a colossal legacy. There can be only one PostScript glyph name per glyph, so there is no mechanism for aliases or synonyms.
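
For illustration, the restriction can be sketched as a small Python check. The rule encoded here follows Adobe's glyph naming recommendations as commonly stated (letters, digits, period and underscore only; no leading digit or period except for `.notdef`; at most 63 characters). Treat the exact limits as an assumption and check the specification before relying on them; the function name is hypothetical.

```python
import re

# Assumed rule: A-Z, a-z, 0-9, "." and "_" only; must not start with a digit
# or a period (".notdef" is the one exception); at most 63 characters.
_PS_GLYPH_NAME = re.compile(r"\.notdef|[A-Za-z_][A-Za-z0-9._]{0,62}")

def is_valid_ps_glyph_name(name):
    """Return True if `name` satisfies the assumed PostScript glyph-name rule."""
    return _PS_GLYPH_NAME.fullmatch(name) is not None

for candidate in ("fa_cart", "asterisk.alt", "fa-cart", "shopping cart", "1up"):
    print(candidate, is_valid_ps_glyph_name(candidate))
```

Under this rule a label such as `fa-cart` or `shopping cart` cannot be a production glyph name, which is the gap the separate labels are meant to fill.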

yarmola commented 4 years ago

@simoncozens : you don't want to have to search through a dictionary of glyphID -> codepoint until you hit upon the codepoint you're trying to display.

Was about to write exactly the same thing. Some method of quick-search for the glyph id should be embedded, as this thing should scale up to support many glyphs (making it different to the name table structure in this regard)

Lorp commented 4 years ago

Thanks for writing this up Adam, and for reminding me of our conversation about symbol vocabularies in Amsterdam. I was sorry not to see your 2013 proposal get any traction.

I have a couple of small recommendations for your proposal:

  1. A binary representation, while apparently “ready to go”, seems premature. I would prefer to see a TTX XML representation at this stage. XML formats can be more easily understood and revised by a wide range of stakeholders, and also more easily parsed by working prototypes (which I and perhaps others would enjoy building).

  2. Add an optional scale attribute for each glyph. Graphical representations of a molecule, an ant, a person, a stadium and a planet would benefit from knowing the physical length of a font unit.

I’m keen to explore the idea further. Font formats are excellent ways of representing many kinds of graphics, and keeping them largely restricted to text increasingly feels like arbitrary gatekeeping.

Coincidentally (or maybe not), only 3 days ago the Unicode Consortium closed comments on a Public Review Issue (background document, based on an earlier proposal by @macchiati) on whether QID Emoji Tag Sequences should be incorporated into Unicode. (QIDs identify entities in WikiData — which, by the way, would be just another vocabularyID in your proposal.) If I understand correctly, under this proposal fonts would encode QIDs as Unicode variation sequences using cmap format 14, as is currently used for emoji such as “woman pilot: medium skin tone” 👩🏽‍✈️. Compared with your proposal, this implements only a single vocabulary. It’s worth reading feedback here (notably that from @charlotte-buff), here (in yellow) and this from Mozilla.

yarmola commented 4 years ago

Compactness is also an issue: even if glyphID > language record(s) can be cached, having 8 bytes of extra data for every label record is unacceptable.

There are 2 approaches: "storage" will optimize for size and rely on the client caching the indexing data; "index" will optimize for access.

I'd vote for the former, which can be implemented as a set of batches of consecutive glyph IDs, with the name data organized as a zero-separated pack of names (2 consecutive zeroes mean "no name for this glyphID"). That makes it the most optimized for storage (especially if UTF-8 is used for encoding): only one glyph ID/count/offset record (8 bytes) per batch, one extra byte per label, and a chance to optimize by ignoring single-glyph "holes" in the data (one byte is smaller than the 8 bytes needed to define an extra batch). In the worst case it has the same overhead as your original structure (8 bytes for a single-glyph batch).

| Type | Name | Description |
|---|---|---|
| uint16 | glyphID | Glyph ID for the first glyph in the batch |
| uint16 | count | Number of glyphs in the batch |
| Offset32 | offset | Offset from start of storage area to the first label in the batch (in bytes). |

The "index" approach will require more thinking (but I don't think that direct access to such data is important for current or future clients).

(edited to change "batch record" structure from length to offset and include glyph count in a batch).
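
A minimal Python sketch of the "storage" layout described above, to show the round trip. The big-endian byte order follows general SFNT practice; the function names, the sample labels, and the choice to merge only single-glyph holes are assumptions for illustration, not part of any agreed format.

```python
import struct

# One batch record: uint16 glyphID, uint16 count, Offset32 offset (8 bytes, big-endian).
BATCH_FORMAT = ">HHI"

def pack_labels(labels):
    """Pack {glyph_id: label} into (batch_records, storage) byte strings."""
    batches, storage, prev = [], bytearray(), None
    for gid in sorted(labels):
        if prev is None or gid - prev > 2:
            batches.append([gid, 0, len(storage)])  # start a new batch
        elif gid - prev == 2:
            storage += b"\x00"                      # single-glyph hole: empty label
            batches[-1][1] += 1
        storage += labels[gid].encode("utf-8") + b"\x00"  # zero-terminated name
        batches[-1][1] += 1
        prev = gid
    records = b"".join(struct.pack(BATCH_FORMAT, *b) for b in batches)
    return records, bytes(storage)

def unpack_labels(records, storage):
    """Inverse of pack_labels; glyphs with an empty label are omitted."""
    labels = {}
    size = struct.calcsize(BATCH_FORMAT)
    for pos in range(0, len(records), size):
        first_gid, count, offset = struct.unpack_from(BATCH_FORMAT, records, pos)
        names = storage[offset:].split(b"\x00", count)[:count]
        for i, raw in enumerate(names):
            if raw:
                labels[first_gid + i] = raw.decode("utf-8")
    return labels

sample = {5: "cart", 6: "basket", 8: "banana", 40: "star"}
records, storage = pack_labels(sample)
print(len(records), len(storage), unpack_labels(records, storage) == sample)
```

In the sample, glyph 7 has no label and costs a single zero byte inside the first batch, whereas glyph 40 is far enough away that it starts a second 8-byte batch record.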

justvanrossum commented 4 years ago

> A binary representation, while apparently “ready to go”, seems premature. I would prefer to see a TTX XML representation at this stage.

I agree. Getting an understanding of the overall structure is way more important than the binary details. I would even go for JSON or YAML at this point.
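
As a purely hypothetical illustration of what such a text form might look like, here is a short Python sketch that builds and serializes a JSON document. Apart from `vocabularyID`, which is a term from the discussion, every key, value and nesting choice is an assumption made up for this example and is not the proposal's actual structure.

```python
import json

# Hypothetical structure: only "vocabularyID" is a term from the discussion;
# all other keys, values and the nesting are invented for illustration.
labl = {
    "version": 1,
    "vocabularies": [
        {
            "vocabularyID": "example-icons",
            "glyphs": {
                "cart": [
                    {"lang": "en", "label": "shopping cart"},
                    {"lang": "de", "label": "Einkaufswagen"},
                ],
                "asterisk.alt": [
                    {"lang": "en", "label": "six-pointed asterisk"},
                ],
            },
        }
    ],
}

text = json.dumps(labl, indent=2, ensure_ascii=False)
print(text)
```

A form like this can be edited by hand and round-tripped by a prototype long before anyone commits to byte offsets.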

yarmola commented 4 years ago

> Getting an understanding of the overall structure is way more important than the binary details.

Agreed. However, some "strategic" thinking about the binary implementation (like "optimize for access speed" vs "optimize for storage") can already be done.

behdad commented 4 years ago

Thanks Adam. Copying my comments from https://github.com/OpenType/opentype-layout/pull/18 here. I haven't checked your rewrite yet; will do.

===

Name table is already tight. Put it in its own table. And remove the platform encoding BS. Reference languages in ltag or whatever 🍎 table it is.

...skims...

Yeah, no. This is importing way too much data into the engines with little agreed benefit. Wiring that up to input methods is impractically hard.

Lorp commented 4 years ago

@Behdad why is it “importing way too much data into the engines”? The engines are free to ignore it and focus on delivering glyphs and metrics, and leave to applications the burden of turning strings into glyphIds. At MyFonts we always knew our symbol/icon fonts were massively underused because of the lack of a standard metadata scheme. BTW the LABL table can be dropped from the font in many use cases, though it would be useful for screenreaders.

@twardoch There are a couple of potentially confusing typos in the ‘Other vocabularyIDs’ section: the Wikipedia and Noun Project vocabularyIDs are not consistent.

@twardoch Could you please add WikiData as an entry in the ‘Other vocabularyIDs’ section? QIDs are more stable and powerful than Wikipedia page titles, and share the advantages of built-in translations.