brawer closed this issue 7 years ago
I like the second approach much better. Instead of having adjustment tables for everything new since Unicode 3.2.0, we could use the `unicodedata2` backport, which now works on both Python 2 and 3 and ships the latest Unicode 9.0.0 data: https://github.com/mikekap/unicodedata2
It already has pre-compiled wheels for Windows on PyPI, but I can easily make it ship Mac and Linux wheels too.
Many of our tools already do:
```python
try:
    import unicodedata2 as unicodedata
except ImportError:
    import unicodedata
```
This was done after my suggestion about a year ago. It’s great that `unicodedata2` is active; I find it quite essential.
In the longer term, I think it’d be useful to create a glyphWisdom repo that would unify a few projects by providing importers from various sources, editable user-overridable data sets ("local" vs. "global"), and, whenever possible, an optimized backend for fast access:

- Better access to Unicode data: a `unicodedata` module which is maintained and up-to-date with Unicode 9.0.0
- Fancy access to Unicode data
- Glyph naming
- Glyph sets
- CJK
- CLDR
There are some rules that try to find glyph info for glyphs not in GlyphData. That is mostly for glyphs with suffixes and ligatures.
The suggestions from Adam are very interesting, but might be a bit too much for what we need here?
Would an SQLite cache solve the speed issue?
How slow is slow? On my machine, I can parse the GlyphData.xml with cElementTree in a tenth of a second. Not so bad, I'd say.
```python
from __future__ import print_function
import timeit

setup = """\
from xml.etree import cElementTree as etree
"""
code = """\
tree = etree.parse("GlyphData.xml")
glyph_data = {}
for element in tree.getroot():
    attrs = dict(element.attrib)
    name = attrs.get('name')
    glyph_data[name] = attrs
"""
result = timeit.repeat(code, setup, repeat=10, number=1)
print("min:", min(result))
print("avg:", sum(result)/len(result))
```

returns:

```
min: 0.106187105179
avg: 0.11096932888
```
Having said that, I'm all for not having to duplicate data if it's available elsewhere (`unicodedata`) and if it's faster to load. So never mind.
And what about having fewer dependencies?
Just for the record: if we serialize that mega dictionary of dictionaries as a JSON file, loading it with the built-in `json` module takes about half the time of parsing the XML with `cElementTree` (about 0.05 vs 0.10 seconds).
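A rough sketch of that comparison (the tiny dict below is a stand-in for the real parsed GlyphData, and the file path is illustrative; absolute timings will differ by machine):

```python
# Serialize a stand-in glyph_data dict to JSON and time reloading it.
import json
import os
import tempfile
import timeit

glyph_data = {"A": {"name": "A", "unicode": "0041", "category": "Letter"}}
path = os.path.join(tempfile.gettempdir(), "glyphdata.json")
with open(path, "w") as f:
    json.dump(glyph_data, f)

# Time 100 loads; compare against the cElementTree numbers above.
elapsed = timeit.timeit(
    "json.load(open(%r))" % path, "import json", number=100
)
print("json load x100: %.4f s" % elapsed)
```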
> And what about having fewer dependencies?
I don't see that as an issue. It's easy to make pre-compiled wheels available for unicodedata2, and adding a dependency to glyphsLib is as easy as appending it to the list of `install_requires`.
Note that fontmake also relies on a few native C extension modules: compreffor and pyclipper (for booleanOperations); and we handle those just fine.
I’ve been playing with this for a bit. I think the key part is mapping from the Glyphs-internal glyph names to production names. Almost all production names follow the AGL specification and can be algorithmically mapped to a Unicode string. From that Unicode string, the other properties in GlyphData.xml can be derived.
There are exceptions where this approach doesn’t give the same results as GlyphData.xml, but these are just a handful, so they can be handled with exception tables. A couple of these exceptions seem to be bugs in GlyphData.xml (such as missing production names for fractions). Georg kindly said he’ll update his upstream data, which should fix most of the problems/exceptions.
In more detail, my proposal would be the following:
In the existing fonttools `agl` module, write a new function `toUnicode(glyphName)` that returns a Unicode string for an AGL-compliant glyph name. This function might be useful for other purposes too, and it is unrelated to Glyphs.app; hence the proposal of adding it to fonttools. The implementation has to handle non-BMP characters; my plan is to return UTF-16-encoded Unicode strings on narrow Python builds, since this seems better than crashing.
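As a rough illustration of the algorithmic part, here is a minimal sketch that handles only the `uniXXXX`/`uXXXXX` forms, suffixes, and ligature underscores, and assumes Python 3 (the real fontTools implementation would also consult the AGL name list and handle many more cases):

```python
import re

_UNI = re.compile(r"^uni((?:[0-9A-F]{4})+)$")
_U = re.compile(r"^u([0-9A-F]{4,6})$")

def to_unicode(glyph_name):
    # Strip suffixes ("a.alt" -> "a") and split ligature parts ("f_i").
    base = glyph_name.split(".", 1)[0]
    chars = []
    for part in base.split("_"):
        m = _UNI.match(part)
        if m:
            # "uni00410042" encodes a run of 4-digit BMP code points.
            digits = m.group(1)
            chars.extend(
                chr(int(digits[i:i + 4], 16))
                for i in range(0, len(digits), 4)
            )
            continue
        m = _U.match(part)
        if m:
            # "u1F600" encodes a single, possibly non-BMP, code point.
            chars.append(chr(int(m.group(1), 16)))
            continue
        return None  # would need the AGL name-to-character table here
    return "".join(chars)
```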
In glyphsLib, write a helper script that generates some Python dictionaries from GlyphData.xml and GlyphData_Ideographs.xml. This script only needs to be run when importing new versions of the data; it will not be necessary to run it when using glyphsLib. This is very comparable to the MetaTools in fonttools.
In glyphsLib, write a function `getProductionName(glyphName)` that returns a production name for a Glyphs.app-internal glyph name. The implementation will use the Python dictionaries generated by the above helper script. Once this part is done, https://github.com/googlei18n/glyphsLib/issues/12 can be finished.
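A hypothetical sketch of what that lookup could look like; the dictionary entries are illustrative stand-ins for the generated data, not the actual glyphsLib tables:

```python
# GLYPH_TO_PRODUCTION would be one of the dictionaries emitted by the
# helper script; the two entries below are illustrative stand-ins.
GLYPH_TO_PRODUCTION = {
    "Alpha": "uni0391",
    "dotlessi": "uni0131",
}

def get_production_name(glyph_name):
    # Names without a mapping pass through unchanged.
    return GLYPH_TO_PRODUCTION.get(glyph_name, glyph_name)

print(get_production_name("Alpha"))  # uni0391
```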
In glyphsLib, write a function `getGlyphData(glyphName)` that returns a named tuple for a Glyphs.app-internal glyph name. The tuple’s properties would correspond to the attributes in GlyphData.xml. For most glyphs, the implementation can derive the `unicode` property by calling `getProductionName()` and `fontTools.agl.toUnicode()`, but it will also consult an exception table generated by the above helper script. For most glyphs, the `category` property can be derived by calling `unicodedata` functions on the `unicode` string, but again we’ll need a small exception table generated by the helper script.
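Putting the pieces together, a hedged sketch of how such a function could be wired up; the tables and the AGL-decoding step are simplified stand-ins, not the actual glyphsLib implementation:

```python
import unicodedata
from collections import namedtuple

GlyphData = namedtuple("GlyphData", "name production_name unicode category")

# Illustrative stand-ins for the tables the helper script would generate.
PRODUCTION_NAMES = {"Alpha": "uni0391"}
CATEGORY_EXCEPTIONS = {}

def get_glyph_data(glyph_name):
    production = PRODUCTION_NAMES.get(glyph_name, glyph_name)
    # Simplified AGL decoding: handle only the plain uniXXXX form here;
    # the real code would call fontTools.agl.toUnicode().
    if production.startswith("uni") and len(production) == 7:
        u = chr(int(production[3:], 16))
    else:
        u = None
    # Exceptions first, then fall back to unicodedata.
    category = CATEGORY_EXCEPTIONS.get(glyph_name)
    if category is None and u is not None:
        category = unicodedata.category(u)  # e.g. "Lu" for uppercase letter
    return GlyphData(glyph_name, production, u, category)
```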
I’ve got an initial version working, but it needs some more polishing. If people are fine with the proposal, I’ll start sending pull requests soon.
This is great, thanks Sascha!
Don't know if it's what you're looking for, but as you probably know already, in fontTools.misc.py23 we have `unichr` and `byteord` functions that work like the Python 3 `chr`/`ord` built-ins, even on narrow Python 2 builds:
https://github.com/fonttools/fonttools/blob/1a9389653cfec45be04f9cc1c7b820fa4d9e6b8b/Lib/fontTools/misc/py23.py#L33-L95
I sent a pull request to https://github.com/mikekap/unicodedata2/pull/12 so that it also builds wheel packages for Mac and Linux.
They are already available on my fork at:
https://github.com/anthrotype/unicodedata2/releases/tag/9.0.0-2+wheels
When (and if) @mikekap uploads them to PyPI, we can have `unicodedata2` as a requirement in our pure-Python packages without our users needing a C compiler.
How do you think the unicodedata2 repo should work? By looking up the Unicode value instead of the name? That only works reliably for encoded glyphs. Just add any suffix and you need a proper name2info mapping.
As Sascha explained, we can generate a mapping from Glyphs-internal glyph names to production names; from the latter (if they follow AGL) we can derive the Unicode string, and from unicodedata the character properties; exceptions are stored for the rest that don't.
> write a new function `toUnicode(glyphName)` that returns a Unicode string for an AGL-compliant glyph name.
Why not just return an integer (code point) and avoid having to deal with wide/narrow build issues?
Because the unicodedata module functions expect Unicode strings.
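For example:

```python
# unicodedata functions operate on one-character Unicode strings, not on
# integer code points, which is why returning a string is convenient here.
import unicodedata

print(unicodedata.category("A"))          # Lu
print(unicodedata.category(chr(0x0391)))  # Lu (GREEK CAPITAL LETTER ALPHA)
```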
We now have `name`, `production_name`, `unicode`, `category`, and `subcategory`. As far as I can tell, these properties should be sufficient for converting Glyphs files to UFO. If we need additional properties (e.g. the descriptive name), it’d be easy to add them.
FWIW, it would be nice if we had the XML-parsing code as well, so that a custom GlyphData.xml was also supported. When doing that, I'm not sure if having the default use a separate codepath is a good idea.
Also, I believe GlyphData.xml changes over time as well. How are we supposed to deal with that?
> GlyphData.xml changes over time as well. How are we supposed to deal with that?
By periodically re-running `MetaTools/generate_glyphdata.py`, no?
> would be nice if we had the XML-parsing code as well, such that custom GlyphData.xml was also supported
The `generate_glyphdata.py` script does all that already; the only thing is that it is hard-coded to fetch the official GlyphData.xml from the schriftgestalt/GlyphsInfo repository.
But we could add a command-line option to pass the path to a local custom GlyphData.xml file as an alternative to downloading the official one.
Then users could pass their own custom `glyphdata_generated` module as the `data` argument to the `glyphdata.get_glyph` function.
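A minimal sketch of what that command-line option could look like (the option name is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser(description="Regenerate glyph data tables")
parser.add_argument(
    "--glyph-data",
    metavar="PATH",
    default=None,
    help="path to a local GlyphData.xml (default: download the official one)",
)

# Example invocation with a local file:
args = parser.parse_args(["--glyph-data", "MyGlyphData.xml"])
print(args.glyph_data)  # MyGlyphData.xml
```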
To convert Glyphs files to UFOs in exactly the same way as Glyphs.app, glyphsLib will need the two data files GlyphData.xml and GlyphData_Ideographs.xml from @schriftgestalt’s GlyphsInfo repository. Most of the content is derived from either the Adobe Glyph List (AGL) or from Unicode data files. However, the data in Glyphs has “some adjustments”. How should glyphsLib deal with this?
We could copy the XML files from upstream, and write some code in glyphsLib that parses the data files upon first use. This would be very simple to implement. On the minus side, this would make glyphsLib slow to use, because the XML files would have to be parsed. Also, we’d ship some 5MB of data that is mostly redundant because AGL is already part of fonttools, and the Unicode data is part of the core Python libraries.
We could write a tool that generates a few Python data structures from (GlyphData.xml, GlyphData_Ideographs.xml, AGL-in-fonttools, unicodedata). The generated Python file would only store the deltas, so it will likely be a rather small table. Some accessor code in glyphsLib would first consult this “adjustments table”; if it contains no entry, the code would fall back to fonttools (for AGL-derived data) or to unicodedata (for Unicode-derived data). A minor complication would be version skew: AGL is stable, but Unicode character classes change over time. Therefore, it might make sense to base things on unicodedata.ucd_3_2_0, at the cost of slightly bloated adjustment tables in glyphsLib.
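The “only store the deltas” idea can be sketched as follows; `predict()` stands in for the AGL/unicodedata derivation, and all names and data are illustrative:

```python
def compute_deltas(glyphs_data, predict):
    """Keep only the entries where the Glyphs data disagrees with what
    AGL/unicodedata would derive; predict() stands in for that derivation."""
    deltas = {}
    for name, props in glyphs_data.items():
        if predict(name) != props:
            deltas[name] = props
    return deltas

# Toy example: only "B" disagrees with the prediction, so only it is kept.
data = {"A": {"category": "Letter"}, "B": {"category": "Symbol"}}
deltas = compute_deltas(data, lambda name: {"category": "Letter"})
print(deltas)  # {'B': {'category': 'Symbol'}}
```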
Thoughts? I’m tending to go with the second approach, but wanted to get everyone’s feedback first.