linebender / norad

Rust crate for working with Unified Font Object files
Apache License 2.0

Have a bunch of benchmarks #177

Open madig opened 2 years ago

madig commented 2 years ago

Every now and then I profile UFO loading (and writing). I think it would be nice to have a bunch of benchmarks (https://doc.rust-lang.org/cargo/commands/cargo-bench.html) ready to run. They could also serve as entry points for profilers.

I'd say we could import a copy of Noto Sans and maybe even a custom to-UFO translation of Noto Sans CJK to make norad work up a sweat. The CJK part still needs figuring out, until those sources are opened.

Scenarios:

We could also include the line-ending benches from https://github.com/linebender/norad/pull/172#issuecomment-906413066.

chrissimpkins commented 2 years ago

I really like this idea. Are the GHA runners reliable enough environments to run these benchmarks? If not, how would you standardize execution and reporting?

madig commented 2 years ago

I wasn't actually thinking about GHA; I'm not sure how reliable CI infrastructure is for benchmarking. This is more aimed at easily running benches on various machines to compare e.g. platform differences. Benchmarking on commits does sound enticing, though...

chrissimpkins commented 2 years ago

https://fast.vlang.io/ appears to use a free instance on AWS, possibly related to #175 too

cmyr commented 2 years ago

I wouldn't benchmark on CI infrastructure, and generally wouldn't want to benchmark on a virtual machine. I do think benchmarks are important, although I would prefer criterion to the built-in cargo bench.
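A minimal Criterion bench along those lines might look like the sketch below; it assumes a `benches/load.rs` target with `harness = false` in Cargo.toml, norad's `Font::load`, and the `testdata/MutatorSansLightWide.ufo` file from the repo's test data.

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Sketch of a Criterion benchmark for UFO loading. The test UFO path
// is an assumption based on norad's test data layout.
fn load_ufo(c: &mut Criterion) {
    c.bench_function("load MutatorSansLightWide", |b| {
        b.iter(|| norad::Font::load("testdata/MutatorSansLightWide.ufo").unwrap())
    });
}

criterion_group!(benches, load_ufo);
criterion_main!(benches);
```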

madig commented 2 years ago

Another idea: look for quadratic runtime by having a massive CJK UFO, loading incrementally more of it, and seeing whether the timings form a straight line or an upward curve. The same goes for comparison and other operations on loaded objects.
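A rough sketch of that measurement, assuming hypothetical UFOs of increasing size and norad's `Font::load` and `default_layer` accessors:

```rust
use std::time::Instant;

// Sketch: time loading progressively larger UFOs (placeholder paths)
// to see whether load time scales linearly with glyph count.
fn main() {
    for path in ["noto-10k.ufo", "noto-20k.ufo", "noto-40k.ufo"] {
        let start = Instant::now();
        let font = norad::Font::load(path).expect("load failed");
        println!(
            "{}: {} glyphs in {:?}",
            path,
            font.default_layer().len(),
            start.elapsed()
        );
    }
}
```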

chrissimpkins commented 2 years ago

> having a massive CJK UFO

I looked into Noto CJK sources. They are not available and won't be in the near term.

madig commented 2 years ago

I made a 60k glyph amalgamation of Noto at https://github.com/madig/noto-amalgamated. It's just the Regular for now, maybe I should do an amalgamation for all Designspace extremes? Need to think about what and how I want to benchmark.

BTW: I profiled the amalgamation script and was amazed to find that ~2 minutes of the 9-10 minute runtime are spent in ufoLib.filenames.userNameToFileName. What the hell.

madig commented 2 years ago

Looking at this :thinking: So, criterion is built such that if you want to compare rayon to no rayon, you run cargo criterion --features rayon instead of changing the benchmarks. That leaves the question of what to benchmark, and how.
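The Cargo.toml wiring for that could look roughly like this sketch; the feature name, optional dependency, and bench target are assumptions:

```toml
[features]
# Optional parallelism, toggled at bench time via --features rayon.
rayon = ["dep:rayon"]

[dependencies]
rayon = { version = "1", optional = true }

[dev-dependencies]
criterion = "0.3"

[[bench]]
name = "load"
harness = false
```

Then `cargo criterion --features rayon` and plain `cargo criterion` run the same bench code with and without the parallel path.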

I currently have Mutator Sans as a small UFO collection, a recent Noto Sans as a medium-size UFO collection (but with 15 masters), and one huge Noto Amalgamated. I know that plist loading influences parsing time: 15-25% of Mutator Sans glyphs have a lib, almost all glyphs in Noto Sans do, and 75% in Noto Amalgamated do. I had the idea of measuring with and without plists and so on, but maybe I should keep it real and take the three UFO families as they are, for now, until I have a clearer idea of what I want to benchmark and why.

So, maybe I'll make a new data repo with Mutator Sans, Noto Sans, and Noto Amalgamated (with maybe all points in the Designspace, amalgamated), hook that in as a git submodule, and test serial loading in each group plus parallel loading (launching one thread per UFO). Then I can bench with --features rayon and without.

Edit: I just saw that a Noto amalgamated by style name gives me a nice progression of glyph counts. I can bench that.
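The one-thread-per-UFO loading described above might be sketched like this, assuming `std::thread::scope` and norad's `Font::load` (the paths are placeholders for the data repo):

```rust
use std::thread;

// Sketch: load several UFOs in parallel, one scoped thread per UFO.
fn load_parallel(paths: &[&str]) -> Vec<norad::Font> {
    thread::scope(|s| {
        let handles: Vec<_> = paths
            .iter()
            .map(|path| s.spawn(move || norad::Font::load(path).unwrap()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```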

madig commented 2 years ago

Interestingly, there does seem to be some quadratic behavior going on without rayon? The X-axis is the number of glyphs (amalgamated Noto has a nice glyph-count progression), the Y-axis is load time in seconds. Not loading glyph libs halves loading time, but the graph keeps the same slope. Or am I reading the graph wrong?

[figure: load time in seconds vs. number of glyphs]

cmyr commented 2 years ago

I don't think the graph is especially clear; it isn't far from being a straight line, and there's always the possibility of measurement noise.

chrissimpkins commented 2 years ago

I came across this project from the Criterion developer, which claims to support benchmark tests on CI infrastructure:

https://github.com/bheisler/iai

- Precision: High-precision measurements allow you to reliably detect very small optimizations to your code
- Consistency: Iai can take accurate measurements even in virtualized CI environments
- Performance: Since Iai only executes a benchmark once, it is typically faster to run than statistical benchmarks
- Profiling: Iai generates a Cachegrind profile of your code while benchmarking, so you can use Cachegrind-compatible tools to analyze the results in detail

Valgrind-based, Linux only IIUC.
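An iai bench is set up much like a Criterion one; a minimal sketch, assuming a `benches/iai_load.rs` target with `harness = false` and the same test UFO as above:

```rust
// Sketch of an iai benchmark. iai runs the function under Cachegrind,
// so it is executed only once per measurement.
fn load_mutator_sans() -> norad::Font {
    norad::Font::load("testdata/MutatorSansLightWide.ufo").unwrap()
}

iai::main!(load_mutator_sans);
```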

chrissimpkins commented 2 years ago

Can confirm that iai works on the GH Actions Ubuntu runner with an apt install of valgrind, and the data appear to be relatively stable across runs. Cannot confirm accuracy, nor whether the data are useful for performance-improvement work (yet)... :)
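For reference, the CI steps could be as simple as this sketch (hypothetical workflow excerpt; iai needs Valgrind on the runner):

```yaml
- name: Install Valgrind
  run: sudo apt-get update && sudo apt-get install -y valgrind
- name: Run iai benches
  run: cargo bench --bench iai_load
```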