brawer closed this issue 7 years ago
I like the second approach much better. Instead of having adjustment tables for everything new since Unicode 3.2.0, we could use the `unicodedata2` backport, which now works on both Python 2 and 3 and ships the latest Unicode 9.0.0 data: https://github.com/mikekap/unicodedata2
It already has pre-compiled wheels for Windows on PyPI, but I can easily make it ship Mac and Linux wheels too.
Many of our tools already do:
```python
try:
    import unicodedata2 as unicodedata
except ImportError:
    import unicodedata
```
This was done after my suggestion about a year ago. It’s great that `unicodedata2` is active; I find it quite essential.
In the longer term, I think it’d be useful to create a glyphWisdom repo that would unify a few projects by providing importers from various sources, editable user-overridable data sets ("local" vs. "global"), and, whenever possible, an optimized backend for fast access:

- Better access to Unicode data: a `unicodedata` module which is maintained and up-to-date with Unicode 9.0.0
- Fancy access to Unicode data
- Glyph naming
- Glyph sets
- CJK
- CLDR
There are some rules that try to find glyph info for glyphs not in GlyphData. That is mostly for glyphs with suffixes and ligatures.
The suggestions from Adam are very interesting, but might be a bit too much for what we need here?
Would an SQLite cache solve the speed issue?
How slow is slow? On my machine, I can parse the GlyphData.xml with cElementTree in a tenth of a second. Not so bad, I'd say.
```python
from __future__ import print_function
import timeit

setup = """\
from xml.etree import cElementTree as etree
"""
code = """\
tree = etree.parse("GlyphData.xml")
glyph_data = {}
for element in tree.getroot():
    attrs = dict(element.attrib)
    name = attrs.get('name')
    glyph_data[name] = attrs
"""
result = timeit.repeat(code, setup, repeat=10, number=1)
print("min:", min(result))
print("avg:", sum(result)/len(result))
```

returns:

```
min: 0.106187105179
avg: 0.11096932888
```
Having said that, I'm all for not having to duplicate data if it's available elsewhere (`unicodedata`) and if it's faster to load. So never mind.
And what about having fewer dependencies?
Just for the record: if we serialize that mega dictionary of dictionaries as a JSON file, loading it with the built-in `json` module takes about half the time of parsing the XML with `cElementTree` (about 0.05 vs 0.10 seconds).
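A rough sketch of that comparison (the tiny dict below is a stand-in for the real parsed GlyphData, and the file path is illustrative; absolute timings will differ by machine):

```python
# Serialize a stand-in glyph_data dict to JSON and time reloading it.
import json
import os
import tempfile
import timeit

glyph_data = {"A": {"name": "A", "unicode": "0041", "category": "Letter"}}
path = os.path.join(tempfile.gettempdir(), "glyphdata.json")
with open(path, "w") as f:
    json.dump(glyph_data, f)

# Time 100 loads; compare against the cElementTree numbers above.
elapsed = timeit.timeit(
    "json.load(open(%r))" % path, "import json", number=100
)
print("json load x100: %.4f s" % elapsed)
```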
> And what about having fewer dependencies?
I don't see that as an issue. It's easy to make pre-compiled wheels available for unicodedata2, and adding a dependency to glyphsLib is as easy as appending it to the list of `install_requires`.
Note that fontmake also relies on a few native C extension modules: compreffor and pyclipper (for booleanOperations); and we handle those just fine.
I’ve been playing with this for a bit. I think the key part is mapping from the Glyphs-internal glyph names to production names. Almost all production names follow the AGL specification and can be algorithmically mapped to a Unicode string. From that Unicode string, the other properties in GlyphData.xml can be derived.
There are exceptions where this approach doesn’t give the same results as GlyphData.xml, but these are just a handful, so they can be handled with exception tables. A couple of these exceptions seem to be bugs in GlyphData.xml (such as missing production names for fractions). Georg kindly said he’ll update his upstream data, which should fix most of the problems/exceptions.
In more detail, my proposal would be the following:
In the existing fonttools `agl` module, write a new function `toUnicode(glyphName)` that returns a Unicode string for an AGL-compliant glyph name. This function might be useful for other purposes too, and it is unrelated to Glyphs.app; hence the proposal of adding it to fonttools. The implementation has to handle non-BMP characters; my plan is to return UTF-16-encoded Unicode strings on narrow Python builds, since this seems better than crashing.
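As a rough illustration of the algorithmic part, here is a minimal sketch that handles only the `uniXXXX`/`uXXXXX` forms, suffixes, and ligature underscores, and assumes Python 3 (the real fontTools implementation would also consult the AGL name list and handle many more cases):

```python
import re

_UNI = re.compile(r"^uni((?:[0-9A-F]{4})+)$")
_U = re.compile(r"^u([0-9A-F]{4,6})$")

def to_unicode(glyph_name):
    # Strip suffixes ("a.alt" -> "a") and split ligature parts ("f_i").
    base = glyph_name.split(".", 1)[0]
    chars = []
    for part in base.split("_"):
        m = _UNI.match(part)
        if m:
            # "uni00410042" encodes a run of 4-digit BMP code points.
            digits = m.group(1)
            chars.extend(
                chr(int(digits[i:i + 4], 16))
                for i in range(0, len(digits), 4)
            )
            continue
        m = _U.match(part)
        if m:
            # "u1F600" encodes a single, possibly non-BMP, code point.
            chars.append(chr(int(m.group(1), 16)))
            continue
        return None  # would need the AGL name-to-character table here
    return "".join(chars)
```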
In glyphsLib, write a helper script that generates some Python dictionaries from GlyphData.xml and GlyphData_Ideographs.xml. This script only needs to be run when importing new versions of the data; it will not be necessary to run it when using glyphsLib. This is very comparable to the MetaTools in fonttools.
In glyphsLib, write a function `getProductionName(glyphName)` that returns a production name for a Glyphs.app-internal glyph name. The implementation will use the Python dictionaries generated by the above helper script. Once this part is done, https://github.com/googlei18n/glyphsLib/issues/12 can be finished.
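A hypothetical sketch of what that lookup could look like; the dictionary entries are illustrative stand-ins for the generated data, not the actual glyphsLib tables:

```python
# GLYPH_TO_PRODUCTION would be one of the dictionaries emitted by the
# helper script; the two entries below are illustrative stand-ins.
GLYPH_TO_PRODUCTION = {
    "Alpha": "uni0391",
    "dotlessi": "uni0131",
}

def get_production_name(glyph_name):
    # Names without a mapping pass through unchanged.
    return GLYPH_TO_PRODUCTION.get(glyph_name, glyph_name)

print(get_production_name("Alpha"))  # uni0391
```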
In glyphsLib, write a function `getGlyphData(glyphName)` that returns a named tuple for a Glyphs.app-internal glyph name. The tuple’s properties would correspond to the attributes in GlyphData.xml. For most glyphs, the implementation can derive the `unicode` property by calling `getProductionName()` and `fontTools.agl.toUnicode()`, but it will also consult an exception table generated by the above helper script. For most glyphs, the `category` property can be derived by calling `unicodedata` functions on the `unicode` string, but again we’ll need a small exception table generated by the helper script.
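Putting the pieces together, a hedged sketch of how such a function could be wired up; the tables and the AGL-decoding step are simplified stand-ins, not the actual glyphsLib implementation:

```python
import unicodedata
from collections import namedtuple

GlyphData = namedtuple("GlyphData", "name production_name unicode category")

# Illustrative stand-ins for the tables the helper script would generate.
PRODUCTION_NAMES = {"Alpha": "uni0391"}
CATEGORY_EXCEPTIONS = {}

def get_glyph_data(glyph_name):
    production = PRODUCTION_NAMES.get(glyph_name, glyph_name)
    # Simplified AGL decoding: handle only the plain uniXXXX form here;
    # the real code would call fontTools.agl.toUnicode().
    if production.startswith("uni") and len(production) == 7:
        u = chr(int(production[3:], 16))
    else:
        u = None
    # Exceptions first, then fall back to unicodedata.
    category = CATEGORY_EXCEPTIONS.get(glyph_name)
    if category is None and u is not None:
        category = unicodedata.category(u)  # e.g. "Lu" for uppercase letter
    return GlyphData(glyph_name, production, u, category)
```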
I’ve got an initial version working, but it needs some more polishing. If people are fine with the proposal, I’ll start sending pull requests soon.
This is great, thanks Sascha!
Don't know if it's what you're looking for, but as you probably know already, in fontTools.misc.py23 we have `unichr` and `byteord` functions that work like the Python 3 `chr`/`ord` built-ins, even on narrow Python 2 builds:
https://github.com/fonttools/fonttools/blob/1a9389653cfec45be04f9cc1c7b820fa4d9e6b8b/Lib/fontTools/misc/py23.py#L33-L95
I sent a pull request to https://github.com/mikekap/unicodedata2/pull/12 so that it also builds wheel packages for Mac and Linux.
They are already available on my fork at:
https://github.com/anthrotype/unicodedata2/releases/tag/9.0.0-2+wheels
When (and if) @mikekap uploads them to PyPI, we can have `unicodedata2` as a requirement in our pure-Python packages without our users needing a C compiler.
How do you think the unicodedata2 repo should work? By looking up the Unicode value instead of the name? That only works reliably for encoded glyphs. Just add any suffix and you need a proper name2info mapping.
As Sascha explained, we can generate a mapping from Glyphs-internal glyph names to production names; from the latter (if they follow AGL) we can derive the Unicode string, and from unicodedata the character properties; exceptions are stored for the rest that don't.
> write a new function `toUnicode(glyphName)` that returns a Unicode string for an AGL-compliant glyph name.
Why not just return an integer (code point) and avoid having to deal with wide/narrow build issues?
Because the unicodedata module functions expect Unicode strings.
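For example:

```python
# unicodedata functions operate on one-character Unicode strings, not on
# integer code points, which is why returning a string is convenient here.
import unicodedata

print(unicodedata.category("A"))          # Lu
print(unicodedata.category(chr(0x0391)))  # Lu (GREEK CAPITAL LETTER ALPHA)
```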
We now have `name`, `production_name`, `unicode`, `category`, and `subcategory`. As far as I can tell, these properties should be sufficient for converting Glyphs files to UFO. If we need additional properties (e.g. the descriptive name), it’d be easy to add them.
FWIW, it would be nice if we had the XML-parsing code as well, so that a custom GlyphData.xml was also supported. When doing that, I'm not sure if having the default use a separate codepath is a good idea.
Also, I believe GlyphData.xml changes over time as well. How are we supposed to deal with that?
> GlyphData.xml changes over time as well. How are we supposed to deal with that?
By periodically re-running `MetaTools/generate_glyphdata.py`, no?
> would be nice if we had the XML-parsing code as well, such that custom GlyphData.xml was also supported
The `generate_glyphdata.py` script does all that already; the only thing is that it is hard-coded to fetch the official GlyphData.xml from the schriftgestalt/GlyphsInfo repository.
But we could add a command-line option to pass the path to a local custom GlyphData.xml file as an alternative to downloading the official one.
Then users could pass their own custom `glyphdata_generated` module as the `data` argument to the `glyphdata.get_glyph` function.
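A minimal sketch of what that command-line option could look like (the option name is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser(description="Regenerate glyph data tables")
parser.add_argument(
    "--glyph-data",
    metavar="PATH",
    default=None,
    help="path to a local GlyphData.xml (default: download the official one)",
)

# Example invocation with a local file:
args = parser.parse_args(["--glyph-data", "MyGlyphData.xml"])
print(args.glyph_data)  # MyGlyphData.xml
```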
To convert Glyphs files to UFOs in exactly the same way as Glyphs.app, glyphsLib will need the two data files GlyphData.xml and GlyphData_Ideographs.xml from @schriftgestalt’s GlyphsInfo repository. Most of the content is derived from either the Adobe Glyph List (AGL) or from Unicode data files. However, the data in Glyphs has “some adjustments”. How should glyphsLib deal with this?
We could copy the XML files from upstream, and write some code in glyphsLib that parses the data files upon first use. This would be very simple to implement. On the minus side, this would make glyphsLib slow to use, because the XML files would have to be parsed. Also, we’d ship some 5MB of data that is mostly redundant because AGL is already part of fonttools, and the Unicode data is part of the core Python libraries.
We could write a tool that generates a few Python data structures from (GlyphData.xml, GlyphData_Ideographs.xml, AGL-in-fonttools, unicodedata). The generated Python file would only store the deltas, so it will likely be a rather small table. Some accessor code in glyphsLib would first consult this “adjustments table”; if it contains no entry, the code would fall back to fonttools (for AGL-derived data) or to unicodedata (for Unicode-derived data). A minor complication would be version skew: AGL is stable, but Unicode character classes change over time. Therefore, it might make sense to base things on unicodedata.ucd_3_2_0, at the cost of slightly bloated adjustment tables in glyphsLib.
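The “only store the deltas” idea can be sketched as follows; `predict()` stands in for the AGL/unicodedata derivation, and all names and data are illustrative:

```python
def compute_deltas(glyphs_data, predict):
    """Keep only the entries where the Glyphs data disagrees with what
    AGL/unicodedata would derive; predict() stands in for that derivation."""
    deltas = {}
    for name, props in glyphs_data.items():
        if predict(name) != props:
            deltas[name] = props
    return deltas

# Toy example: only "B" disagrees with the prediction, so only it is kept.
data = {"A": {"category": "Letter"}, "B": {"category": "Symbol"}}
deltas = compute_deltas(data, lambda name: {"category": "Letter"})
print(deltas)  # {'B': {'category': 'Symbol'}}
```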
Thoughts? I’m tending to go with the second approach, but wanted to get everyone’s feedback first.