googlefonts / glyphsLib

A bridge from Glyphs source files (.glyphs) to UFOs
Apache License 2.0

Add GlyphsData to glyphsLib #88

Closed brawer closed 7 years ago

brawer commented 7 years ago

To convert Glyphs files to UFOs in exactly the same way as Glyphs.app, glyphsLib will need the two data files GlyphData.xml and GlyphData_Ideographs.xml from @schriftgestalt’s GlyphsInfo repository. Most of the content is derived from either the Adobe Glyph List (AGL) or from Unicode data files. However, the data in Glyphs has “some adjustments”. How should glyphsLib deal with this?

Thoughts? I’m tending to go with the second approach, but wanted to get everyone’s feedback first.

anthrotype commented 7 years ago

I like the second approach much better. Instead of having adjustment tables for anything new since Unicode data 3.2.0, we could use the unicodedata2 backport, which now works on both Python 2 and 3 and ships the latest Unicode data, 9.0.0: https://github.com/mikekap/unicodedata2

anthrotype commented 7 years ago

it already has pre-compiled wheels for windows on PyPI, but I can easily make it ship mac + linux wheels too.

https://pypi.python.org/pypi/unicodedata2

twardoch commented 7 years ago

Many of our tools already do:

try:
    import unicodedata2 as unicodedata
except ImportError:
    import unicodedata

This was done after my suggestion about a year ago. It’s great that unicodedata2 is active, I find it quite essential.

In the longer term, I think it’d be useful to create a glyphWisdom repo that would unify a few projects by providing importers from various sources, editable user-overridable data sets ("local" vs. "global"), and, whenever possible, an optimized backend for fast access:

Better access to Unicode data

Fancy access to Unicode data

Glyph naming

Glyph sets

CJK

CLDR

schriftgestalt commented 7 years ago

There are some rules that try to find glyph info for glyphs not in GlyphData. That is mostly for glyphs with suffixes and ligatures.

The suggestions from Adam are very interesting but might be a bit too much for what we need here?

Would an SQLite cache solve the speed issue?
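For reference, a minimal sketch of what such an SQLite cache could look like, using only the standard-library sqlite3 module. The table layout, the function names build_cache and lookup, and the two sample rows are assumptions made for illustration; they are not taken from glyphsLib or Glyphs.app.

```python
import sqlite3


def build_cache(conn, glyphs):
    """Store (name, unicode, category) rows in a table keyed by glyph name."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS glyph ("
        "name TEXT PRIMARY KEY, unicode TEXT, category TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO glyph VALUES (?, ?, ?)", glyphs)
    conn.commit()


def lookup(conn, name):
    """Return the (name, unicode, category) row for a glyph, or None."""
    cursor = conn.execute(
        "SELECT name, unicode, category FROM glyph WHERE name = ?", (name,)
    )
    return cursor.fetchone()


# An in-memory database keeps the sketch self-contained; a real cache
# would use a file path so that the data survives between runs.
conn = sqlite3.connect(":memory:")
build_cache(conn, [("A", "0041", "Letter"), ("one", "0031", "Number")])
print(lookup(conn, "A"))
```

With a primary key on the name, each lookup is an index search and nothing has to be parsed up front, at the cost of keeping the cache file in sync with the XML.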

anthrotype commented 7 years ago

How slow is slow? On my machine, I can parse the GlyphData.xml with cElementTree in a tenth of a second. Not so bad, I'd say.

from __future__ import print_function
import timeit

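setup = """\
try:
    # Python 2 and Python 3 before 3.9
    from xml.etree import cElementTree as etree
except ImportError:
    # cElementTree was removed in Python 3.9
    from xml.etree import ElementTree as etree
"""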

code = """\
tree = etree.parse("GlyphData.xml")
glyph_data = {}
for element in tree.getroot():
    attrs = dict(element.attrib)
    name = attrs.get('name')
    glyph_data[name] = attrs
"""

result = timeit.repeat(code, setup, repeat=10, number=1)

print("min:", min(result))
print("avg:", sum(result)/len(result))

returns:

min: 0.106187105179
avg: 0.11096932888

anthrotype commented 7 years ago

Having said that, I'm all for not having to duplicate data if it's available elsewhere (unicodedata) and if it's faster to load. So never mind.

schriftgestalt commented 7 years ago

And what about having less dependencies?

anthrotype commented 7 years ago

Just for the record, if we serialize that mega dictionary of dictionaries as a JSON file, loading it with the built-in json module takes about half the time of parsing the XML with cElementTree (about 0.05 vs 0.10 seconds).
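A rough way to reproduce this kind of comparison without the real data file is sketched below. It writes a synthetic glyph table (the element name, attribute names, and glyph count are made up) as both XML and JSON into a temporary directory, then times loading each. The numbers will differ from those quoted above, since the real GlyphData.xml has more attributes per glyph.

```python
import json
import os
import tempfile
import timeit
import xml.etree.ElementTree as etree


def make_sample_files(directory, count=5000):
    """Write the same synthetic glyph table as XML and as JSON."""
    root = etree.Element("glyphData")
    for i in range(count):
        etree.SubElement(
            root, "glyph",
            name="glyph%05d" % i,
            unicode="%04X" % (0x4E00 + i),
            category="Letter",
        )
    xml_path = os.path.join(directory, "sample.xml")
    etree.ElementTree(root).write(xml_path)

    data = {el.get("name"): dict(el.attrib) for el in root}
    json_path = os.path.join(directory, "sample.json")
    with open(json_path, "w") as f:
        json.dump(data, f)
    return xml_path, json_path


def load_xml(path):
    """Parse the XML file into a dict of attribute dicts keyed by name."""
    return {el.get("name"): dict(el.attrib) for el in etree.parse(path).getroot()}


def load_json(path):
    """Load the pre-serialized dict of dicts."""
    with open(path) as f:
        return json.load(f)


tmpdir = tempfile.mkdtemp()
xml_path, json_path = make_sample_files(tmpdir)
xml_time = min(timeit.repeat(lambda: load_xml(xml_path), repeat=5, number=1))
json_time = min(timeit.repeat(lambda: load_json(json_path), repeat=5, number=1))
print("xml: ", xml_time)
print("json:", json_time)
```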

anthrotype commented 7 years ago

having less dependencies?

I don't see that as an issue. It's easy to make pre-compiled wheels available for unicodedata2, and adding a dependency to glyphsLib is as easy as appending it to the list of install_requires. Note that fontmake also relies on a few native C extension modules: compreffor and pyclipper (for booleanOperations); and we handle those just fine.

brawer commented 7 years ago

I’ve been playing with this for a bit. I think the key part is mapping from the Glyphs-internal glyph names to production names. Almost all production names follow the AGL specification and can be algorithmically mapped to a Unicode string. From that Unicode string, the other properties in GlyphData.xml can be derived.

There are exceptions where this approach doesn’t give the same results as in GlyphData.xml, but these are just a handful, so they can be handled with exception tables. A couple of these exceptions seem to be bugs in GlyphData.xml (such as missing production names for fractions). Georg kindly said he’ll update his upstream data, which should get most problems/exceptions fixed.
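To make the idea concrete, here is a simplified sketch of the algorithmic mapping from an AGL-style name to a Unicode string, with an exception table consulted first. The function name to_unicode, the tiny AGL_SAMPLE dictionary, and the EXCEPTIONS entry are all hypothetical stand-ins; the real Adobe Glyph List has thousands of entries, and this is not the code from the eventual pull requests.

```python
import re

# Tiny stand-in for the Adobe Glyph List (illustrative only).
AGL_SAMPLE = {"A": "A", "f": "f", "i": "i", "space": " "}

# Hypothetical exception table for names the algorithm gets wrong.
EXCEPTIONS = {"exampleOverride": "X"}

_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)")
_U = re.compile(r"u([0-9A-F]{4,6})")


def _is_scalar(cp):
    """True for valid code points that are not surrogates."""
    return cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def _component_to_unicode(component):
    """Map one underscore-separated component to a string ('' if unknown)."""
    if component in AGL_SAMPLE:
        return AGL_SAMPLE[component]
    m = _UNI.fullmatch(component)
    if m:
        digits = m.group(1)
        cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        if all(_is_scalar(cp) for cp in cps):
            return "".join(chr(cp) for cp in cps)
        return ""
    m = _U.fullmatch(component)
    if m:
        cp = int(m.group(1), 16)
        return chr(cp) if _is_scalar(cp) else ""
    return ""


def to_unicode(glyph_name):
    """Return the Unicode string for an AGL-style glyph name."""
    if glyph_name in EXCEPTIONS:
        return EXCEPTIONS[glyph_name]
    base = glyph_name.split(".", 1)[0]  # drop suffixes such as ".sc"
    return "".join(_component_to_unicode(c) for c in base.split("_"))


print(repr(to_unicode("uni0041")))
print(repr(to_unicode("f_i.liga")))
```

Suffixed names and ligatures fall out of the same two steps: strip everything after the first period, then map each underscore-separated component independently.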

In more detail, my proposal would be the following:

I’ve got an initial version working, but it needs some more polishing. If people are fine with the proposal, I’ll start sending pull requests soon.

anthrotype commented 7 years ago

This is great, thanks Sascha!

Don't know if it's what you're looking for, but as you probably know already, in fontTools.misc.py23 we have unichr and byteord functions that work like the Python 3 chr/ord built-ins, even on narrow Python 2 builds: https://github.com/fonttools/fonttools/blob/1a9389653cfec45be04f9cc1c7b820fa4d9e6b8b/Lib/fontTools/misc/py23.py#L33-L95

anthrotype commented 7 years ago

I sent a pull request to https://github.com/mikekap/unicodedata2/pull/12 so that it also builds wheel packages for Mac and Linux. They are already available on my fork at: https://github.com/anthrotype/unicodedata2/releases/tag/9.0.0-2+wheels. When (and if) @mikekap uploads them to PyPI, we can have unicodedata2 as a requirement in our pure-Python packages without our users needing a C compiler.

schriftgestalt commented 7 years ago

How do you think the unicodedata2 repo should work? By looking up the unicode instead of the name? That only works reliably for encoded glyphs. Just add any suffix and you need proper name2info mapping.

anthrotype commented 7 years ago

As Sascha explained, we can generate a mapping from Glyphs-internal glyph names to production names, and from the latter (if they follow AGL) derive the Unicode string, and from unicodedata derive the character properties; for the names that don't follow AGL, we store exceptions.
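The chain described here could be sketched roughly as follows. PRODUCTION_NAMES is a two-entry illustrative sample, get_glyph_info is a hypothetical name, and only the simple "uniXXXX" form of production name is handled. The General_Category returned by unicodedata would still need to be translated into Glyphs' own category and subcategory values, which this sketch does not attempt.

```python
import unicodedata

# Illustrative sample of a Glyphs-internal name -> production name table.
PRODUCTION_NAMES = {"A-cy": "uni0410", "alef-ar": "uni0627"}


def production_to_unicode(production_name):
    """Handle only the 'uniXXXX' form; return '' for anything else."""
    base = production_name.split(".", 1)[0]
    if base.startswith("uni") and len(base) == 7:
        try:
            return chr(int(base[3:], 16))
        except ValueError:
            return ""
    return ""


def get_glyph_info(glyph_name):
    """Derive production name, code point, and Unicode category for a name."""
    # Map the base name and carry any suffix over unchanged, so that
    # "A-cy.ss01" resolves through "A-cy".
    base, dot, suffix = glyph_name.partition(".")
    production = PRODUCTION_NAMES.get(base, base) + dot + suffix
    text = production_to_unicode(production)
    info = {
        "name": glyph_name,
        "production_name": production,
        "unicode": None,
        "general_category": None,
    }
    if len(text) == 1:
        info["unicode"] = "%04X" % ord(text)
        info["general_category"] = unicodedata.category(text)
    return info


print(get_glyph_info("A-cy"))
print(get_glyph_info("A-cy.ss01"))
```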

khaledhosny commented 7 years ago

write a new function toUnicode(glyphName) that returns a Unicode string for an AGL-compliant glyph name.

Why not just return an integer (code point) and avoid having to deal with wide/narrow build issues?

anthrotype commented 7 years ago

Because the unicodedata module functions expect Unicode strings.
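A small illustration of that constraint on Python 3, where chr covers the full code point range and the wide/narrow distinction no longer exists. The helper name category_for_code_point is made up for this example.

```python
import unicodedata


def category_for_code_point(code_point):
    """Convert an integer code point to a string before the lookup."""
    return unicodedata.category(chr(code_point))


print(category_for_code_point(0x0041))

# Passing the integer directly is rejected by the unicodedata module.
try:
    unicodedata.category(0x0041)
except TypeError as exc:
    print("TypeError:", exc)
```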

brawer commented 7 years ago

We now have name, production_name, unicode, category, and subcategory. As far as I can tell, these properties should be sufficient for converting Glyphs files to UFO. If we need additional properties (e.g. the descriptive name), it’d be easy to add them.

behdad commented 7 years ago

FWIW would be nice if we had the XML-parsing code as well, such that custom GlyphData.xml was also supported. When doing that, I'm not sure if keeping the default use on a separate codepath is a good idea.

Also, I believe GlyphData.xml changes over time as well. How are we supposed to deal with that?

anthrotype commented 7 years ago

GlyphData.xml changes over time as well. How are we supposed to deal with that?

by periodically re-running MetaTools/generate_glyphdata.py, no?

anthrotype commented 7 years ago

would be nice if we had the XML-parsing code as well, such that custom GlyphData.xml was also supported

The generate_glyphdata.py script does all that already; the only thing is that it is hard-coded to fetch the official GlyphData.xml from the schriftgestalt/GlyphsInfo repository. But we could add a command-line option to pass the path to a local custom GlyphData.xml file as an alternative to downloading the official one. Then users could pass their own custom glyphdata_generated module as the data argument to the glyphdata.get_glyph function.
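Such an option might look like the argparse sketch below. The flag name --glyphdata, the helper names, and the placeholder URL are assumptions for illustration; the actual generate_glyphdata.py script may be structured differently.

```python
import argparse

# Placeholder only; the real script knows where the official file lives.
DEFAULT_URL = "https://example.invalid/GlyphData.xml"


def parse_args(argv=None):
    """Parse command-line options for a hypothetical generator script."""
    parser = argparse.ArgumentParser(
        description="Generate a glyph data module from GlyphData.xml."
    )
    parser.add_argument(
        "--glyphdata",
        metavar="PATH",
        default=None,
        help="path to a local GlyphData.xml (default: download the official one)",
    )
    return parser.parse_args(argv)


def resolve_source(args):
    """Return ('file', path) for a local file, else ('url', DEFAULT_URL)."""
    if args.glyphdata:
        return ("file", args.glyphdata)
    return ("url", DEFAULT_URL)


# Explicit argument lists keep the demo independent of sys.argv.
print(resolve_source(parse_args(["--glyphdata", "MyGlyphData.xml"])))
print(resolve_source(parse_args([])))
```

Keeping the download as the default means existing invocations behave as before, while a custom file becomes an explicit opt-in.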