googlefonts / oxidize

Notes on moving tools and libraries to Rust.
Apache License 2.0

Examine the distribution of Fonts compile times with fontmake and define "slow compile" #25

Open chrissimpkins opened 2 years ago

chrissimpkins commented 2 years ago

Define a test environment and analyze the distribution of Fonts catalog compile times with fontmake. We'll use the distribution to analyze outliers and improve our understanding of the concept of "slow". Slow might mean more font formats to compile, more outlines to compile, more complex layout to compile, more breadth of design space to compile, etc.

Related #24

chrissimpkins commented 2 years ago

Thoughts about an appropriate environment to use for testing?

chrissimpkins commented 2 years ago

@simoncozens What is your general sense about the consistency of build workflows across the Fonts catalog? Is there a way that we might automate the builds using a standardized build approach?

simoncozens commented 2 years ago

Is there a way that we might automate the builds using a standardized build approach?

Well, this is why we set up the googlefonts-project-template and specifically the gftools-builder, so that there would be a standard approach for upstreams. Here's a quick and dirty count of how many repositories are (probably) using that structure.

rsheeter commented 2 years ago

I don't know that we have to nail down the environment, just to be sure to capture what it was when reporting any result.

Clear instructions on how to test build as much of Google Fonts as possible would be helpful. IIUC it's something like:

  1. get machine ready to build fonts
  2. clone google/fonts
  3. clone all the things you find via upstream.yaml files
  4. try to build those things
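Steps 2 and 3 above could be sketched roughly as follows. This is only a minimal Python sketch, not an existing tool: the function names are hypothetical, and it assumes each upstream.yaml carries a single top-level `repository_url:` line, which it reads with a regular expression rather than a full YAML parser.

```python
import re
from pathlib import Path

# Assumption: upstream.yaml stores the repo as a top-level
# "repository_url: <url>" line (no nesting, no multi-line values).
REPO_URL_RE = re.compile(r"^repository_url:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def extract_repository_url(upstream_yaml_text):
    """Return the repository_url value from upstream.yaml text, or None if missing/empty."""
    match = REPO_URL_RE.search(upstream_yaml_text)
    if match is None:
        return None
    # Drop optional YAML quoting; treat an empty quoted value as missing.
    return match.group(1).strip("'\"") or None


def find_repository_urls(fonts_checkout):
    """Map every upstream.yaml under a google/fonts checkout to its URL (or None)."""
    urls = {}
    for path in sorted(Path(fonts_checkout).rglob("upstream.yaml")):
        urls[str(path)] = extract_repository_url(path.read_text(encoding="utf-8"))
    return urls


def clone_commands(urls):
    """Build one `git clone` argument list per distinct, non-empty URL."""
    distinct = sorted({url for url in urls.values() if url})
    return [["git", "clone", "--depth", "1", url] for url in distinct]
```

Each command list can be handed to `subprocess.run`; entries whose value is None are the upstream.yaml files that have no repository_url to follow.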

davelab6 commented 2 years ago

The new https://googlefonts.github.io/gf-guide is probably the best place to put docs about how to set up a clean machine. Then template repo can link there.

@twardoch has a "font engineering toolbox" meta package or doc about all the kinds of things you want to have installed when jiving around here. Adam where's the best link for that?

rsheeter commented 2 years ago

Thanks Dave, that looks like the missing link. Following how to build links from gf-guide led me to https://googlefonts.github.io/gf-guide/tools.md which curiously spews raw markdown at me, maybe it's malformed in a way that makes GH angry and tired?

simoncozens commented 2 years ago

at a glance doesn't appear to specify how to setup a clean machine to the point where make build will work.

googlefonts-project-template is written in such a way (minimal dependencies, automatically installing any required Python modules into a virtual environment) that it shouldn't require any setup to work. But we should probably document that fact.

https://googlefonts.github.io/gf-guide/tools.md which curiously spews raw markdown at me,

Fixed it. (Or rather, fixed the link, which should go to the rendered version instead.)

simoncozens commented 2 years ago

One more thought: within the next few weeks I will be splitting the noto-source repository into per-project repositories. Each of these repos will have a standard build process, runnable both via CI and from a simple "make build" on the commandline. We will therefore have 158 standardised font projects, some large and some small, that we can use for benchmarking.

rsheeter commented 2 years ago

158 standardised font projects, some large and some small, that we can use for benchmarking

Nice!

How would one go about finding those Noto repos? Will they have upstream.yaml files in google/fonts or is some other mechanism required?

simoncozens commented 2 years ago

I’ll ensure there’s a JSON feed somewhere off notofonts.github.io

chrissimpkins commented 2 years ago

I spoke with Simon today. He plans to use the same set of build dependency versions across all Noto families and a consistent compile make target in each repository. Local execution of the compiles will be simple. I'll run the compile time tests on the Noto projects when the build workflow updates indicated in https://github.com/googlefonts/oxidize/issues/25#issuecomment-1150800349 are available.

rsheeter commented 2 years ago

Does noto build with gfbuilder or some other mechanism?

simoncozens commented 2 years ago

It will be building with an extension of gftools-builder; this is because Noto also has a range of artefacts (hinted and unhinted TTF, OTF and variable where possible, as well as builds that include subsets of Noto Sans/Serif).

rsheeter commented 2 years ago

What does "an extension" look like? - it's very useful to be able to checkout based on upstream.yaml, find the config files [aside: wish they used a common name or location, when I tried grabbing upstream repos they did not seem to], and point the common builder at them.

Can we enhance the common builder to support Noto so it's consistent?

simoncozens commented 2 years ago

Upstream repos should use config.yaml for builder config. Noto doesn't because there are multiple families in the same repo.

Of course there's nothing stopping you using gftools-builder on a Noto repo if you're not particularly interested in the full Noto range of outputs.

rsheeter commented 2 years ago

Can we enhance the common builder to support Noto so it's consistent? If Noto needs many families can it simply have many config files? - I see other upstream repos doing that. For example, from a quick attempt to grab all the upstreams in google/fonts:

386 upstream.yaml files exist in google/fonts
91 have no repository_url
 2 have invalid repo urls (clone fails); https://github.com/TypeNetwork/Alegreya and https://github.com/googlefonts/glory

If I checkout all the repository_urls that work I end up with 293 directories. In those directories 121 yaml files that "look like" (have sources and familyName) gfbuilder configs exist. config.yaml is a popular name but some repos, particularly the lexend ones, have lots:

# config.yaml is popular but not completely consistent
$ for s in $(python ls-config.py); do basename $s; done | sort | uniq -c
     10 Baloo2.yaml
     10 BalooBhai2.yaml
     10 BalooBhaijaan2.yaml
     10 BalooBhaina2.yaml
     10 BalooChettan2.yaml
     10 BalooDa2.yaml
     10 BalooPaaji2.yaml
     10 BalooTamma2.yaml
     10 BalooTammudu2.yaml
     10 BalooThambi2.yaml
     28 builder.yaml
      1 build.yaml
      2 config_mono.yaml
      2 configNegative.yaml
     58 config.yaml
      8 deca.yaml
      8 exa.yaml
      8 giga.yaml
      8 lexend-CI.yaml
      8 lexend.yaml
      8 mega.yaml
      8 peta.yaml
      8 tera.yaml
      8 zetta.yaml

# Lexend (several repos) does this layout
lexendpeta/sources/lexend-CI.yaml
lexendpeta/sources/lexend.yaml
lexendpeta/sources/giga.yaml
lexendpeta/sources/mega.yaml
lexendpeta/sources/peta.yaml
lexendpeta/sources/exa.yaml
lexendpeta/sources/tera.yaml
lexendpeta/sources/deca.yaml
lexendpeta/sources/zetta.yaml

# How many families are we covering?
$ awk -F / '{print $1}' build_config.txt | sort | uniq | wc -l
96

EDIT: filed https://github.com/google/fonts/issues/4772 requesting a curated list of interesting test cases. I'm out until mid-July for non-work reasons so I figure if we have that by end of July we're in good shape. Since we support Noto on fonts.google.com, if we want to include Noto compile scenarios I would ideally like to see them work consistently: follow upstream.yaml to repo, then do something consistent to build them. If the build process truly must differ for Noto then maybe upstream.yaml needs to advise me how to build them...?

EDIT2: config.yaml situation is better than I thought, it seems only sources is required (per https://github.com/googlefonts/gftools/blob/main/Lib/gftools/builder/__init__.py) and I had looked for both sources and familyName. Updated numbers above.
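The config discovery described above might look something like this. The ls-config.py script is not shown in this thread, so the following is a hypothetical stand-in rather than the actual script: it treats any .yaml file with a top-level `sources:` key as a probable gftools-builder config (the heuristic from EDIT2) and tallies basenames the way `sort | uniq -c` does.

```python
import re
from collections import Counter
from pathlib import Path

# Heuristic (an assumption, per EDIT2): a builder config has a top-level "sources:" key.
SOURCES_KEY_RE = re.compile(r"^sources[ \t]*:", re.MULTILINE)


def looks_like_builder_config(yaml_text):
    """True if the text has a top-level `sources:` key."""
    return SOURCES_KEY_RE.search(yaml_text) is not None


def find_builder_configs(checkout_root):
    """Return paths of .yaml files under checkout_root that look like builder configs."""
    configs = []
    for path in sorted(Path(checkout_root).rglob("*.yaml")):
        try:
            text = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError):
            continue  # unreadable or non-text file: skip rather than abort the scan
        if looks_like_builder_config(text):
            configs.append(path)
    return configs


def count_basenames(paths):
    """Tally config file names, like `basename | sort | uniq -c`."""
    return Counter(Path(p).name for p in paths)
```

A regex check is cruder than parsing the YAML, so it can report a file whose `sources:` key means something else; it is meant for a quick census, not for validation.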

simoncozens commented 2 years ago

Can we enhance the common builder to support Noto so it's consistent?

I don't think so. The two things Noto specifically needs are

  1. Two separate output artefacts for unhinted and hinted - this isn't a big deal, and I suppose could conceivably go into gftools-builder.
  2. The ability to merge in a subset of Noto Sans / Noto Serif / Noto Devanagari (common script Vedic extensions, urgh) and create another output font containing that subset. This means cloning the latest Noto CJK/Noto Deva repo as part of the build, and doing a UFO merge.

The second one is obviously the bigger deal, and is very Noto specific. I don't see it as something that is at all useful for other GF library fonts, since we already require them to have a GF Latin core glyph set as part of their sources. (The Noto builder inherits from gftools.builder, so it's not completely doing its own thing.)

(I will write up a doc explaining the thinking around the new Noto build/repo system soon. There is a design doc, but some of the ideas have developed over time. A lot of it is designed around the idea of how to make the tooling consistent and manageable when you have hundreds of repos; so for example, the point of using a separate notobuilder module is that it gives you a useful layer of abstraction where you can define your build dependencies and processes and library versions in one place, and if you decide to move to a new version of, say, glyphsLib, you can change the version in the notobuilder repo rather than having to go around every single individual script project repo.)

If Noto needs many families can it simply have many config files? - I see other upstream repos doing that.

That's how it works; I thought we were talking about discovery of the build process and I was just saying that the config file should be named config.yaml in almost all cases, but obviously if there's more than one family in the repo you have a separate config file for each, so discovery is a tiny bit harder.

However, Noto does follow the consistent googlefonts-project-template approach of having a Makefile with a "make build" target, so you should be able to clone any googlefonts-project-template / noto-project-template repo and run make build on it.

I've added something to produce that JSON file now, too, so once we've pulled the switch, you can build all Noto with

curl https://notofonts.github.io/noto.json | jq -r '.[] | .repo_url' | xargs -n1 git clone; for i in */; do (cd "$i" && make build); done
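For collecting timings rather than just building, the same loop could be written in Python. This is a sketch under assumptions: the feed layout is inferred only from the jq filter above (a top-level array, or an object whose values each carry `repo_url`), and the function names are hypothetical.

```python
import json
import subprocess
import time
from pathlib import Path


def repo_urls_from_feed(feed_text):
    """Extract repo_url values, mirroring the jq filter `.[] | .repo_url`."""
    data = json.loads(feed_text)
    # Like jq's `.[]`, accept either an array or an object of entries.
    entries = data.values() if isinstance(data, dict) else data
    return [entry["repo_url"] for entry in entries]


def timed_build(repo_dir, command=("make", "build")):
    """Run the build in repo_dir; return (wall-clock seconds, exit code)."""
    start = time.perf_counter()
    result = subprocess.run(list(command), cwd=repo_dir, capture_output=True)
    return time.perf_counter() - start, result.returncode


def build_all(parent_dir, command=("make", "build")):
    """Build every subdirectory of parent_dir, continuing past failures."""
    results = {}
    for repo_dir in sorted(p for p in Path(parent_dir).iterdir() if p.is_dir()):
        results[repo_dir.name] = timed_build(repo_dir, command)
    return results
```

Recording the exit code next to the time keeps failed builds from being mistaken for fast ones.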

khaledhosny commented 2 years ago

The second one is obviously the bigger deal, and is very Noto specific. I don't see it as something that is at all useful for other GF library fonts, since we already require them to have a GF Latin core glyph set as part of their sources.

Many of the fonts I build for GF have separate Latin sources (I almost never design Latin, so I use an existing Latin design and keep the sources separate for easy updates) and I know at least one other foundry that does the same, so such merge-sources-at-build-time should probably be generally available if it is going to be developed.

rsheeter commented 2 years ago

Interesting, I had not imagined merge to be in scope for a fast compiler once the "make lots of statics and merge for VF" step was gone.

@simoncozens noob question, where does it say to run make build? - for context I looked at:

  1. https://googlefonts.github.io/gf-guide/#production-compiling-your-fonts-for-gf, ok I do indeed want to Build the fonts
  2. https://googlefonts.github.io/gf-guide/build.html, which rambles a bit (I just want to build, why does it start by telling me about fontmake dependencies?) and then seemingly guides me to gftools builder in https://googlefonts.github.io/gf-guide/build.html#gftools-builder

EDIT: I don't seem to see very many Makefile, find . -name 'Makefile' | wc -l returns 24. @m4rc1e are there supposed to be Makefiles? EDIT2: 4 out of 5 make build I tried fail, filed https://github.com/google/fonts/issues/4773.

simoncozens commented 2 years ago

Interesting, I had not imagined merge to be in scope for a fast compiler once the "make lots of statics and merge for VF" step was gone.

Yeah, this is a different kind of merging. Rather than merging at the binary level, we're merging glyphs from some sources into the font at the UFO level, before the build happens. We do this because we need those glyphs to interact with layout.

Take the case of Vedic marks - if you have a Sharada font with above and below anchors for your marks, you want to add the Vedic marks from the Devanagari font. But if you do that with, say, pyftmerge, nothing generates mark-to-base rules to connect the Sharada anchors to the Vedic anchors. So the glyphs need to be added to the UFO sources before layout generation happens.

It's not something you would think of as "in scope for compilation", but it is something that needs to happen as part of the process of compiling Noto fonts.

@simoncozens noob question, where does it say to run make build? EDIT: I don't seem to see very many Makefile, find . -name 'Makefile' | wc -l returns 24.

The backstory is that there are kind of two levels of standardisation here. Once upon a time, upstreams could have whatever repo structure and whatever build scripts they wanted so long as they made fonts that worked.

The first level of standardisation was to gather up all the build scripts and work out what they had in common; out of that we put together the gftools-builder as a replacement for ad hoc build scripts.

The next level of standardisation was to provide a template repository structure so that designers could get their font projects up and running quickly (and with easy build steps and GitHub actions, could get on with designing fonts instead of messing about getting the builds working) and also that when onboarders came alongside the designers, they didn't have to spend time working out where everything was. So that's googlefonts-project-template. The Makefile is part of that repo structure, so any repos created out of googlefonts-project-template should be buildable with make build. We describe that at https://googlefonts.github.io/gf-guide/upstream.html - but we don't talk specifically about the Makefile there because one of the benefits of that repo that we want to sell to designers is that they never actually need to worry about interacting with the libre toolchain (sure they can carry on doing their proofs with Glyphs.app export or whatever) because everything is automatically done for them with GitHub actions and they can just pick up the build artefacts in the actions tab.

So the reason you may be seeing config.yamls but not Makefiles is that we may have moved some repos to standard build tools, but not yet to the standard repository structure, because retrofitting a repo structure is a bigger deal than changing your build scripts.

simoncozens commented 2 years ago

(Argh, hit wrong button while editing)

Possibly stupid question: Have we defined "compile"? I ask because the profile of different compilation jobs will look different.

For example, in gftools-builder, we essentially run fontmake three times, to create a variable font, static interpolated OTFs, and static interpolated TTFs. There's an obvious speedup there in re-using the by-products (interpolated UFOs) of the OTF run to generate the TTFs. Creating variable fonts has fewer shared by-products compared with compiling statics, because compiling a variable font doesn't require UFO interpolation (which seems, just by gutfeel, to be a pretty slow step).

I propose we test for two output profiles: just a variable TTF, and variable TTF + static OTF + static TTF.

behdad commented 2 years ago

I propose we test for two output profiles: just a variable TTF, and variable TTF + static OTF + static TTF.

In another issue I called those a narrow build, and a fat build. Another interesting case is an incremental build.

chrissimpkins commented 2 years ago

Possibly stupid question: Have we defined "compile"? I ask because the profile of different compilation jobs will look different.

Not a stupid question at all. That is very much the goal of the "define slow compile" part of this thread. We want to understand what the distribution of Python pipeline project compile times is (projects yet to be defined, https://github.com/googlefonts/oxidize/issues/24 is an attempt to collect information about experiences with "slow compiles" using fontmake) and then investigate why the tails of the distribution take the times that they do. My understanding from an earlier discussion was that we might begin with a broad, multi-script set of projects that include multiple compiled artifacts per project (Noto). This may be an interesting place to start because the question of why some projects compile "fast" and lie on the other end of the distribution might be informative too.

chrissimpkins commented 2 years ago

I propose we test for two output profiles: just a variable TTF, and variable TTF + static OTF + static TTF.

In another issue I called those a narrow build, and a fat build. Another interesting case is an incremental build.

Do we know if the full Noto catalog covers all of these cases?

simoncozens commented 2 years ago

This isn't a factor of the font but a factor of the build process. Noto is currently built using a custom process which only does what we're calling fat builds, because we need a range of artefacts to suit the needs of Android, GF, Linux distros, etc.

But I suggest for our data gathering we just use plain fontmake. If you run fontmake -o variable you get a narrow build and if you run fontmake -i -o ttf otf variable you get a fat build.
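The two invocations above can be generated per source file with something like the sketch below. The `-o` and `-i` arguments are the ones quoted in this comment; choosing `-g` for .glyphs sources and `-m` for .designspace sources is my reading of fontmake's command line, and the function name is hypothetical.

```python
from pathlib import Path

# Input flag per source type (assumed from fontmake's CLI: -g Glyphs, -m designspace).
SOURCE_FLAGS = {".glyphs": "-g", ".designspace": "-m"}


def fontmake_command(source, profile):
    """Return the argument list for a 'narrow' or 'fat' fontmake build of source."""
    suffix = Path(source).suffix
    if suffix not in SOURCE_FLAGS:
        raise ValueError(f"unsupported source type: {source}")
    base = ["fontmake", SOURCE_FLAGS[suffix], source]
    if profile == "narrow":
        # Variable font only.
        return base + ["-o", "variable"]
    if profile == "fat":
        # Interpolated static TTFs and OTFs plus the variable font.
        return base + ["-i", "-o", "ttf", "otf", "variable"]
    raise ValueError(f"unknown profile: {profile}")
```

Timing each list with `subprocess.run` and a wall clock would give one narrow and one fat measurement per source.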

Now I come to think of it, I remember @madig has already put together a framework to do this data gathering - he had a script which downloaded and built a range of fonts using a variety of compile tools / versions / etc. https://daltonmaag.github.io/pipeline-perf-tracker/results/

rsheeter commented 2 years ago

For this issue I had intended to focus on times for specific fontmake invocations, as @simoncozens suggests above.

chrissimpkins commented 2 years ago

For this issue I had intended to focus on times for specific fontmake invocations, as @simoncozens suggests above.

Source inputs: full Noto project catalog? Comparison groups: (1) full Noto catalog with narrow build process; (2) full Noto catalog with fat build process?

Or are you interested in a single distribution of compile times that includes both build processes on each set of project sources?

simoncozens commented 2 years ago

Here's the Stupidest Simplest Thing That Could Possibly Work: https://github.com/simoncozens/time-font-compilation/blob/main/.github/workflows/build.yml

Fat build disabled for now because of googlefonts/fontmake#912

Results coming in at e.g. https://github.com/simoncozens/time-font-compilation/actions/runs/2666743230

rsheeter commented 2 years ago

IIUC that basically says forget using upstream files or building a large set - something I had hoped would be possible due to upstream.yaml and standardized compilation - entirely and just pick a specific few? In effect it proposes an answer to https://github.com/google/fonts/issues/4772?

simoncozens commented 2 years ago

Fair enough; I was trying to make it work with CI where we only have six hours to get everything done. Of course if we run the tests locally, we can run for as long as we like. I'm out of the habit of thinking about building fonts locally. :-)

simoncozens commented 2 years ago

OK, here's the build stats for all of Noto: https://gist.github.com/simoncozens/c896bd0fae2ae353b2fad63ca425dadd

Note that we don't really have hour-long monsters like Roboto Serif or whatever. Most are single-master files with less than a couple of hundred glyphs.

simoncozens commented 2 years ago

[Image: noto]

chrissimpkins commented 2 years ago

Pretty convincingly linear except for the Urdu Nastaliq outlier.

IIUC those times are compiles to VF format. Is the time relationship against glyphs * masters also linear for static instance compiles? And what is the time / (glyphs * masters) slope for static instance compiles relative to VF format compiles?

(real time) / (glyphs * masters) seems like it could be a useful ratio for cross-project compile time performance optimization testing.
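As a rough illustration of that ratio, here is a minimal Python sketch. The record keys `real`, `glyphs`, and `masters` are assumptions about how the timing data is laid out, and the function names are hypothetical. Records missing a glyph or master count yield None instead of a misleading number.

```python
def seconds_per_glyph_master(real_seconds, glyphs, masters):
    """Return real time / (glyphs * masters), or None if either count is missing or zero."""
    if not glyphs or not masters:
        return None
    return real_seconds / (glyphs * masters)


def ratios_by_project(records):
    """Map project name -> ratio for records shaped like
    {"name": ..., "real": seconds, "glyphs": int, "masters": int} (assumed keys)."""
    return {
        record["name"]: seconds_per_glyph_master(
            record["real"], record.get("glyphs"), record.get("masters")
        )
        for record in records
    }
```

If compile time really is linear in glyphs * masters, this ratio should stay roughly constant across projects, and a project with a much larger value is a candidate outlier.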

behdad commented 2 years ago

OK, here's the build stats for all of Noto: https://gist.github.com/simoncozens/c896bd0fae2ae353b2fad63ca425dadd

I'm trying to graph this data as well, but a bunch of the entries don't have "masters" or "glyphs". They are also clustered in your graph at zero. E.g. "kufi-arabic".

behdad commented 2 years ago

Here's a log2-log2 view of the same: [Image: Figure_1]

simoncozens commented 2 years ago

Pretty convincingly linear except for the Urdu Nastaliq outlier.

Right, but that outlier means "All bets are off once your layout rules get very complicated."

rsheeter commented 2 years ago

This is great to see!

Forgive my density but how do I reproduce the timing collection? Is there a script or something to get https://gist.github.com/simoncozens/c896bd0fae2ae353b2fad63ca425dadd as measured on my machine?

simoncozens commented 2 years ago

Here's my terrible build script:

https://gist.github.com/simoncozens/173c9d35e28c1c6e43e58405d0c4695b

simoncozens commented 2 years ago

IIUC those times are compiles to VF format. Is the time relationship against glyphs * masters also linear for static instance compiles?

It'll probably be linear in glyphs * instances, but I'll run a build. This is where things do get slow in Noto because we have a number of fonts with 18 instances or so. Interpolating instance UFOs takes a while. (Finishing off triangulate would lead to an obvious win here.)

simoncozens commented 2 years ago

Here's the data and plot for static builds.

https://gist.github.com/simoncozens/4b859f672e047b51da1f127ba250eae6

[Image: noto-static]

> summary(df$real)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.36    0.74    2.05   29.90   12.56  693.24 

chrissimpkins commented 2 years ago

Time scale is minutes, correct? It takes over 11 hours to compile all Noto Sans LGC statics sequentially?

simoncozens commented 2 years ago

Seconds.

chrissimpkins commented 2 years ago

You are compiling ~50% of all Noto project static instances in under 2 sec? Or am I misunderstanding your summary stat table?

simoncozens commented 2 years ago

More than half of Noto fonts are single-instances with <200 glyphs. So yes, we get through most of them in almost no time. Noto is possibly a skewed data set. :-) But the more interesting data is in the bigger fonts, and we have, I think, shown a linear relationship.
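To reproduce that kind of summary for timings collected on another machine, a small sketch along these lines might do. It is a Python stand-in for R's summary(), not the script used for the numbers above; the function name is hypothetical, and the "inclusive" quantile method is chosen on the assumption that it matches R's default quartiles.

```python
import statistics


def summarize(times_in_seconds):
    """Return min, quartiles, mean, and max for a list of compile times."""
    data = sorted(times_in_seconds)
    # "inclusive" treats the data as the full population, interpolating between points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(data),
        "q3": q3,
        "max": data[-1],
    }
```

A mean far above the median, as in the table above (29.90 against 2.05), is the signature of a few very slow builds among many quick ones. Read in seconds, the 693.24 maximum is about 11.6 minutes.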