Open malbarbo opened 5 years ago
An option to make tectonic faster is to use a non-cryptography hash function, like https://github.com/Cyan4973/xxHash.
Yeah... But I am a bit surprised that it is twice as slow. I did make a flamegraph a long time ago where one can see where sha2 shows up: https://tectonic.newton.cx/t/profiling-tectonic/32?u=rekka
When testing at that time, tectonic was only a tiny bit slower: https://tectonic.newton.cx/t/profiling-tectonic/32/3?u=rekka
I should probably try again on the current version
Here's a rust implementation of xxHash, for reference: https://crates.io/crates/twox-hash
In my measurements long time ago, the difference is bigger for small documents since in that case a lot of time is spend in loading the format file, which is decompressed (gzip) and sha2 hashed. Surprisingly more time is spend in computing crc32 than sha2 :)
There is some low-hanging fruit in the io layer of tectonic for sure.
It is actually rather trivial to remove almost all sha256 computation since most of it is never needed, see pull request #453.
Tectonic also does not compress the format file anymore, so that is not an issue.
I'll try building a flamegraph and perhaps documenting how to profile. While converting a 76 page document using recent builds from master, I saw it take 90 seconds or more on macOS.
@efx You mean 90 s with only using local cache? Or does this include downloading files form the online bundle?
@efx Yeah, would like to see any profiling you can muster on that — we'll be slower, but that's a lot slower, I imagine.
@rekka I think in a couple cases it was both. I'll try to put together a reproducible sample to also compare on a Linux machine. Off the top of my head, trying out #431 saved 10-20 seconds.
I ran both on macOS from a newly compiled version (f8b1590). Latest master after the assets have been cached:
[nix-shell:~/rust/tectonic]$ time tectonic -C -c minimal -r 1 tests/xenia/paper.tex
real 0m1.108s
user 0m0.925s
sys 0m0.158s
The example where I see the bigger slowdown is autogenerated LaTeX from pandoc. It has a number of images and we are using a custom TTF font. My hunch is that bottlenecks the PDF generation as I observed 2-3x slowdowns in a 1 page document when using this font.
$ time tectonic -C -c minimal my-other-document.tex
warning: Object @table.0.1 already defined.
warning: Object @table.0.2 already defined.
warning: Object @table.0.1 already defined.
warning: Object @table.0.2 already defined.
warning: Object @table.0.1 already defined.
warning: Object @table.0.2 already defined.
warning: Object @table.0.3 already defined.
warning: Object @table.0.1 already defined.
warning: Object @table.0.2 already defined.
real 1m17.029s
user 1m16.134s
sys 0m0.788s
@efx Hm, I've never seen such a slow compilation. Even beamer presentations or figure heavy documents take only a few seconds to compile for me.
Can you remove -c minimal
if possible, to see how many times it runs the engine? And can you profile the run? One way is to use https://github.com/ferrous-systems/flamegraph. You can run it on my profile branch https://github.com/rekka/tectonic/tree/profile, just with (if I remember correctly)
cargo flamegraph --bin tectonic -- texfile.tex
This should generate flamegraph.svg
. I am not sure how this works on macOS, supposedly using dtrace
. I only used it on linux.
Interesting. I will see if I can get dtrace permissions / run on another machine tomorrow:
Finished release [optimized + debuginfo] target(s) in 3m 00s
dtrace: system integrity protection is on, some features will not be available
dtrace: failed to initialize dtrace: DTrace requires additional privileges
failed to sample program
I could not recreate the subsequent slowdowns for the long-document.tex
here: https://github.com/efx/tectonic/tree/topic-profile
I generated a flamegraph while converting my document on a Linux machine: https://github.com/efx/tectonic/blob/run-flamegraph/flamegraph.svg
And here is the full log output:
Thanks!
Unfortunately, in this case it doesn't seem to be tectonic's overhead. You just have a beast of a document. I'd guess lots of (png?) images. You also mentioned a large font. I do not know how that is handled in the pdf file.
Even thought there are 3 tex passes, they take together only 20% of the runtime. The main tectonic's overhead, sha256 digest computation is only 2% of the runtime. I would think that xetex will be about as slow as tectonic. (If you have a chance to compare with xetex run time, it would be helpful.)
About 80% is spent compressing pdf objects at the output stage and I believe that xetex has the same default behavior. Both xetex and tectonic use zlib's compression level 9: the slowest compression. There might be some discussion whether it is worth it, see Figure 1 at https://clearlinux.org/news-blogs/linux-os-data-compression-options-comparing-behavior.
From a brief reading of the xdvipdfmx backend, I'd guess that png images are decompressed and then compressed again as pdf objects. I tested with a disabled compression on a file of mine with a lot of pngs, and the compression ratio is above 50:1 on the pdf file so it might not be a good idea to disable the compression. :)
Thank you for diagnosing and explaining the problem.
Unfortunately, in this case it doesn't seem to be tectonic's overhead. You just have a beast of a document. I'd guess lots of (png?) images. You also mentioned a large font. I do not know how that is handled in the pdf file.
Bingo. It has 99 PNG files being inserted, alongside a logo on each page (74 pages). I will track down a copy sometime next week and see.
These images have been compressed too so it would be nice if there was a configuration option to disable compression of images while generating a PDF.
These images have been compressed too so it would be nice if there was a configuration option to disable compression of images while generating a PDF.
Currently, xetex (and pdflatex it seems) does decompress all png images in the following function: https://github.com/tectonic-typesetting/tectonic/blob/84cdc69b51ab5c52db0ab3a3a317d9deaf35a079/tectonic/dpx-pngimage.c#L154
From what I read about pdf and png formats, this is not necessary for non-interlaced pngs without an alpha channel (transparency) because in that case the png data stream and pdf image stream are completely compatible. Here's a code that takes advantage of this for reference: img2pdf.
There is also another advantage to copying the data directly, besides saving time, since it is possible to optimize png compression by tools like oxipng and the decompression/compression cycle breaks it.
It would be easy enough to test if the png format is compatible and just copy the data directly.
But there are some issues like gamma correction and color spaces and I am not too familiar with those.
@malbarbo Now when #453 is merged, could you try rerunning the test on your machine with the version on master
branch? Please note that -r
is the number of tex runs after the first, so your test was running tex twice (and also bibtex). Therefore run
time tectonic -r 0 paper.tex
Still runs bibtex but that's quick anyway. -c minimal
is unnecessary.
On tests/xenia/paper.tex
, I am getting the opposite in fact, tectonic is almost twice (!?) as fast as xelatex:
$ hyperfine 'tectonic -r 0 paper.tex' 'xelatex -interaction=batchmode paper.tex'
Benchmark #1: tectonic -r 0 paper.tex
Time (mean ± σ): 552.0 ms ± 29.5 ms [User: 451.6 ms, System: 92.9 ms]
Range (min … max): 520.9 ms … 630.0 ms 10 runs
Benchmark #2: xelatex -interaction=batchmode paper.tex
Time (mean ± σ): 925.4 ms ± 8.7 ms [User: 863.7 ms, System: 94.3 ms]
Range (min … max): 911.2 ms … 941.3 ms 10 runs
Summary
'tectonic -r 0 paper.tex' ran
1.68 ± 0.09 times faster than 'xelatex -interaction=batchmode paper.tex'
macbook pro Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz.
xelatex version
$ xelatex --version
XeTeX 3.14159265-2.6-0.999991 (TeX Live 2019)
kpathsea version 6.3.1
Copyright 2019 SIL International, Jonathan Kew and Khaled Hosny.
There is NO warranty. Redistribution of this software is
covered by the terms of both the XeTeX copyright and
the Lesser GNU General Public License.
For more information about these matters, see the file
named COPYING and the XeTeX source.
Primary author of XeTeX: Jonathan Kew.
Compiled with ICU version 63.1; using 63.1
Compiled with zlib version 1.2.11; using 1.2.11
Compiled with FreeType2 version 2.9.1; using 2.9.1
Compiled with Graphite2 version 1.3.13; using 1.3.13
Compiled with HarfBuzz version 2.3.1; using 2.3.1
Compiled with libpng version 1.6.36; using 1.6.36
Compiled with poppler version 0.68.0
Using Mac OS X Core Text and Cocoa frameworks
Just for fun with -r 1
. Still faster!
$ hyperfine 'tectonic -r 1 paper.tex' 'xelatex -interaction=batchmode paper.tex'
Benchmark #1: tectonic -r 1 paper.tex
Time (mean ± σ): 808.1 ms ± 10.2 ms [User: 661.3 ms, System: 139.0 ms]
Range (min … max): 794.3 ms … 824.1 ms 10 runs
Benchmark #2: xelatex -interaction=batchmode paper.tex
Time (mean ± σ): 934.8 ms ± 17.0 ms [User: 878.6 ms, System: 92.3 ms]
Range (min … max): 917.7 ms … 972.2 ms 10 runs
Summary
'tectonic -r 1 paper.tex' ran
1.16 ± 0.03 times faster than 'xelatex -interaction=batchmode paper.tex'
@rekka Thanks for working on this, you got really incredible results!
I noticed later that I should use -r 0
instead of -r 1
. Anyway, I got the following results from master:
> time tectonic -C -r 1 paper.tex
real 0m1,475s
user 0m1,268s
sys 0m0,208s
and
> time tectonic -C -r 0 paper.tex
real 0m0,950s
user 0m0,839s
sys 0m0,109s
and
$ hyperfine 'tectonic -r 0 paper.tex' 'xelatex -interaction=batchmode paper.tex'
Benchmark #1: tectonic -r 0 paper.tex
Time (mean ± σ): 946.0 ms ± 3.9 ms [User: 838.8 ms, System: 107.6 ms]
Range (min … max): 941.5 ms … 952.5 ms 10 runs
Benchmark #2: xelatex -interaction=batchmode paper.tex
Time (mean ± σ): 905.1 ms ± 6.5 ms [User: 863.4 ms, System: 160.5 ms]
Range (min … max): 895.2 ms … 917.5 ms 10 runs
Summary
'xelatex -interaction=batchmode paper.tex' ran
1.05 ± 0.01 times faster than 'tectonic -r 0 paper.tex'
Although I could not get tectonic to run as fast as xelatex, definitively https://github.com/tectonic-typesetting/tectonic/pull/453 improved the results.
Trying some files which I have performance problems using tectonic I found one extreme case that is still not solved by https://github.com/tectonic-typesetting/tectonic/pull/453: using beamer with metropolis theme. Tectonic takes 42s seconds to process the following file
\documentclass{beamer}
\usetheme{metropolis}
\begin{document}
\begin{frame}
Test
\end{frame}
\end{document}
And the culprit is the digest calculation:
83.17% tectonic tectonic [.] sha2::sha256_utils::compress256
3.53% tectonic tectonic [.] get_next
@malbarbo Thanks for checking. Glad to hear that for tests/xenia
the performance now matches xelatex on your machine.
OK, so just to check the difference between mac and linux, I ran the test again on Ubuntu 18.04 with Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz (hyperfine supports markdown tables!)
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
tectonic -r 0 paper.tex |
375.2 ± 5.0 | 369.7 | 383.1 | 3.4 |
xelatex -interaction=batchmode paper.tex |
110.2 ± 1.9 | 107.7 | 115.0 | 1.0 |
What is going on with xelatex?? So on my linux machine tectonic is significantly slower than xelatex. Tectonic speed improves because it's just a faster cpu, but why is xelatex so much faster than on my mac? I mean, it's an older version:
XeTeX 3.14159265-2.6-0.99998 (TeX Live 2017/Debian)
I am quite shocked by the performance for the metropolis theme. I use it for my presentations with Fira Sans fonts and on both mac and linux it performs just fine. This is the linux machine:
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
tectonic -r 0 metropolis-test.tex |
1.206 ± 0.027 | 1.177 | 1.253 | 1.9 |
xelatex -interaction=batchmode metropolis-test.tex |
0.641 ± 0.004 | 0.635 | 0.645 | 1.0 |
sha2 computation takes about 5% of the runtime.
Here's the flamegraph:
Is there something particular to your setup?
@malbarbo In case you want to try it, I replaced sha256 with xxHash for file change tracking on my branch https://github.com/rekka/tectonic/tree/feat-xxhash.
The Fira fonts was not installed in my system. Installing the fonts reduced the time to 5.3s (sha2::sha256_utils::compress256
took 6% of the time).
Using the feat-xxhash branch the time is reduced to 5s (twox_hash::sixty_four::XxHash64::write
took 0,19% of the time, so digest time calculation is minimal).
When using the feat-xxhash branch without Fira fonts installed I got 7.5s. In this case, twox_hash::sixty_four::XxHash64::write
took 9% of the time, so maybe there are more situations where we can avoid computing the digest.
All these numbers corresponds to running tectonic
without -r
arguments (tex run twice).
Thanks for checking; this is very useful.
It must be then reading a lot of font data if Fira is not installed. Hm.
Yes, we almost certainly do not need to compute the digest of system fonts. I'll try to look into disabling that too.
FWIW, here is some output from a 300 page extraction of source code comments, comparing running xelatex twice, with tectonic. This is a random output that was lying around i.e. possibly buggy work in progress TeX output in some unknown state, but renders speed I typically see... tex source
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
./foo.sh |
15.564 ± 0.084 | 15.337 | 15.633 | 1.0 |
tectonic ~/lean/mathlib/docs_out/mathlib.tex |
26.647 ± 0.066 | 26.583 | 26.775 | 1.7 |
Thanks for the report! It's curious that I can't really see any part of the flamegraph that would be caused by tectonic's overhead so it is a mystery to me why it performs so much worse than xelatex (on linux only as far as I can tell).
One thing I notice (at least on @rekka 's flamegraph branch), is that when building tectonic/*.c using cargo build -vv --release
is the use of -fno-omit-frame-pointer which at least accounts for some of the differences I see between the disassembly of xelatex's getxtoken and tectonics get_x_token.
I'm not sure if this is just due to the debug info settings in the flamegraph branch.
Add -fno-omit-frame-pointer
with debug info
As well as here:
Setup for profiling
On the master branch running cargo build -vv --release
doesn't cause -fno-omit-frame-pointer
, to be passed to cc
as such the results are almost reversed, perhaps that can be explained by xetex building with -O2 and tectonic compiling the c sources with -O3?
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
./foo.sh |
15.622 ± 0.093 | 15.414 | 15.759 | 1.6 |
tectonic ~/lean/mathlib/docs_out/mathlib.tex |
9.585 ± 0.030 | 9.554 | 9.634 | 1.0 |
@ratmice Good sleuthing!
@ratmice Oh, that's a good point. I wasn't sure how much the extra debug info needed for creating the flamegraph changes the timing. Thanks for testing it.
Could you post the xelatex version and system you're running it on? (xelatex --version
)
I observe that the run times are like this: xelatex on Ubuntu (texlive 2017) is much faster than tectonic, which is faster than xelatex on macOS (texlive 2019).
@rekka Fedora 29 x86-64, XeTeX 3.14159265-2.6-0.99999 (TeX Live 2018) here are the build logs for texlive-base, they are a bit long but a search for xetex-xetex should bring up the appropriate flags.
One thing worth mentioning is that on x86-64, perf (and thus flamegraph) is capable of running without -fno-omit-frame-pointer by building the frame pointer from the dwarf, and .eh_frame, so it is possible to profile with less overhead. I can even profile the system package of xetex, by installing the separate debuginfo files. In that case there is no overhead/code gen changes needed for profiling, perhaps profiling encounters less samples not sure.
If i get some time i'll try and build tectonic with -O2, and perhaps build flags the fedora package uses, to get a closer comparison,
but the versions probably differ, and the package is patched so it'd probably be a better comparison to just build from the original version xetex that tectonic derives from if we want to subtract out the difference.
I added an option to cc-rs pull req to allow us to configure whether frame pointers are omitted. I will try to remember to come up with a #[cfg(all(target_os = "linux", target_arch = "x86_64"))]
or other custom cfg, in the branch's build.rs once there is a released version of cc-rs which has this.
This can be observed when generating a pdf for
tests/xenia/paper.tex
.Using
tectonic
from master compiled withcargo install --path .
andxelatex
from Debian 9 in a i3-2330M 2.20GHz machine, I got the following resultsRunning
perf
show that at least 25% of the time is expended by sha2 crate:Besides using hash to check if the engine needs to be rerun, what are the other uses? Is the hash of cached files checked? (I skimmed the code but was unable to answer that).