tectonic-typesetting / tectonic

A modernized, complete, self-contained TeX/LaTeX engine, powered by XeTeX and TeXLive.
https://tectonic-typesetting.github.io/
Other
3.99k stars 162 forks source link

Tectonic is slower than xelatex #452

Open malbarbo opened 5 years ago

malbarbo commented 5 years ago

This can be observed when generating a pdf for tests/xenia/paper.tex.

Using tectonic from master compiled with cargo install --path . and xelatex from Debian 9 in a i3-2330M 2.20GHz machine, I got the following results

> time tectonic -C -c minimal -r 1 paper.tex
real    0m1,642s
user    0m1,530s
sys     0m0,112s

> time xelatex -interaction=batchmode paper.tex
real    0m0,842s
user    0m0,796s
sys     0m0,120s

Running perf show that at least 25% of the time is expended by sha2 crate:

# Overhead  Command   Shared Object             Symbol                                                                                                                                                               
# ........  ........  ........................  ...........................................................................................                                                                          
#                                                                                                                                                                                                                    
    23.78%  tectonic  tectonic                  [.] sha2::sha256_utils::compress256                                                                                                                                  
    10.66%  tectonic  tectonic                  [.] get_next                                                                                                                                                         
     6.03%  tectonic  tectonic                  [.] load_fmt_file                                                                                                                                                    
     3.61%  tectonic  tectonic                  [.] <std::io::buffered::BufReader<R> as std::io::Read>::read                                                                                                         
     3.55%  tectonic  tectonic                  [.] <tectonic::io::InputHandle as std::io::Read>::read                                                                                                               
     3.29%  tectonic  tectonic                  [.] sha2::sha256::Engine256::input                                                                                                                                   
     2.51%  tectonic  tectonic                  [.] macro_call                                                                                                                                                       
     2.17%  tectonic  tectonic                  [.] length                                                                                                                                                           
     1.76%  tectonic  [kernel.kallsyms]         [k] clear_page_rep                                                                                                                                                   
     1.71%  tectonic  tectonic                  [.] tectonic::io::InputHandle::getc                                                                                                                                  
     1.57%  tectonic  tectonic                  [.] flush_list                                                                                                                                                       
     1.50%  tectonic  tectonic                  [.] get_uni_c                                                                                                                                                        
     1.39%  tectonic  tectonic                  [.] tectonic::engines::input_getc                                                                                                                                    
     1.16%  tectonic  tectonic                  [.] id_lookup                                                                                                                                                        
     1.16%  tectonic  tectonic                  [.] ht_lookup_table

Besides using hash to check if the engine needs to be rerun, what are the other uses? Is the hash of cached files checked? (I skimmed the code but was unable to answer that).

malbarbo commented 5 years ago

An option to make tectonic faster is to use a non-cryptography hash function, like https://github.com/Cyan4973/xxHash.

rekka commented 5 years ago

Yeah... But I am a bit surprised that it is twice as slow. I did make a flamegraph a long time ago where one can see where sha2 shows up: https://tectonic.newton.cx/t/profiling-tectonic/32?u=rekka

When testing at that time, tectonic was only a tiny bit slower: https://tectonic.newton.cx/t/profiling-tectonic/32/3?u=rekka

I should probably try again on the current version

rekka commented 5 years ago

Here's a rust implementation of xxHash, for reference: https://crates.io/crates/twox-hash

In my measurements long time ago, the difference is bigger for small documents since in that case a lot of time is spend in loading the format file, which is decompressed (gzip) and sha2 hashed. Surprisingly more time is spend in computing crc32 than sha2 :)

There is some low-hanging fruit in the io layer of tectonic for sure.

rekka commented 5 years ago

It is actually rather trivial to remove almost all sha256 computation since most of it is never needed, see pull request #453.

Tectonic also does not compress the format file anymore, so that is not an issue.

ghost commented 5 years ago

I'll try building a flamegraph and perhaps documenting how to profile. While converting a 76 page document using recent builds from master, I saw it take 90 seconds or more on macOS.

rekka commented 5 years ago

@efx You mean 90 s with only using local cache? Or does this include downloading files form the online bundle?

pkgw commented 5 years ago

@efx Yeah, would like to see any profiling you can muster on that — we'll be slower, but that's a lot slower, I imagine.

ghost commented 5 years ago

@rekka I think in a couple cases it was both. I'll try to put together a reproducible sample to also compare on a Linux machine. Off the top of my head, trying out #431 saved 10-20 seconds.

ghost commented 5 years ago

I ran both on macOS from a newly compiled version (f8b1590). Latest master after the assets have been cached:

[nix-shell:~/rust/tectonic]$ time tectonic -C -c minimal -r 1 tests/xenia/paper.tex

real    0m1.108s
user    0m0.925s
sys 0m0.158s

The example where I see the bigger slowdown is autogenerated LaTeX from pandoc. It has a number of images and we are using a custom TTF font. My hunch is that bottlenecks the PDF generation as I observed 2-3x slowdowns in a 1 page document when using this font.

$ time tectonic -C -c minimal my-other-document.tex
warning: Object @table.0.1 already defined.
warning: Object @table.0.2 already defined.
warning: Object @table.0.1 already defined.
warning: Object @table.0.2 already defined.
warning: Object @table.0.1 already defined.
warning: Object @table.0.2 already defined.
warning: Object @table.0.3 already defined.
warning: Object @table.0.1 already defined.
warning: Object @table.0.2 already defined.

real    1m17.029s
user    1m16.134s
sys 0m0.788s
rekka commented 5 years ago

@efx Hm, I've never seen such a slow compilation. Even beamer presentations or figure heavy documents take only a few seconds to compile for me.

Can you remove -c minimal if possible, to see how many times it runs the engine? And can you profile the run? One way is to use https://github.com/ferrous-systems/flamegraph. You can run it on my profile branch https://github.com/rekka/tectonic/tree/profile, just with (if I remember correctly)

cargo flamegraph --bin tectonic -- texfile.tex

This should generate flamegraph.svg. I am not sure how this works on macOS, supposedly using dtrace. I only used it on linux.

ghost commented 5 years ago

Interesting. I will see if I can get dtrace permissions / run on another machine tomorrow:

    Finished release [optimized + debuginfo] target(s) in 3m 00s
dtrace: system integrity protection is on, some features will not be available
dtrace: failed to initialize dtrace: DTrace requires additional privileges
failed to sample program

I could not recreate the subsequent slowdowns for the long-document.tex here: https://github.com/efx/tectonic/tree/topic-profile

ghost commented 5 years ago

I generated a flamegraph while converting my document on a Linux machine: https://github.com/efx/tectonic/blob/run-flamegraph/flamegraph.svg

And here is the full log output:

``` [nix-shell:~/tectonic]$ sh run.sh warning: tectonic/dpx-mpost.c: In function ‘mps_do_page’: warning: tectonic/dpx-mpost.c:1500:3: warning: ignoring return value of ‘fread’, declared with attribute warn_unused_result [-Wunused-result] warning: fread(buffer, sizeof(char), size, image_file); warning: ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ warning: tectonic/dpx-truetype.c: In function ‘pdf_font_open_truetype’: warning: tectonic/dpx-truetype.c:131:32: warning: unknown option after ‘#pragma GCC diagnostic’ kind [-Wpragmas] warning: #pragma GCC diagnostic ignored "-Wstringop-truncation" warning: ^~~~~~~~~~~~~~~~~~~~~~~ Finished release [optimized + debuginfo] target(s) in 0.08s WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted, check /proc/sys/kernel/kptr_restrict. Samples in kernel functions may not be resolved if a suitable vmlinux file is not found in the buildid cache or in the vmlinux path. Samples in kernel modules won't be resolved at all. If some relocation was applied (e.g. kexec) symbols may be misresolved even with a suitable vmlinux or kallsyms file. Couldn't record kernel reference relocation symbol Symbol resolution may be skewed if relocation was used (e.g. kexec). Check /proc/kallsyms permission or run as root. note: this is a BETA release; ask questions and report bugs at https://tectonic.newton.cx/ note: connecting to https://archive.org/services/purl/net/pkgwpub/tectonic-default note: resolved to https://tectonic.newton.cx/bundles/tlextras-2018.1r0/bundle.tar note: downloading index https://tectonic.newton.cx/bundles/tlextras-2018.1r0/bundle.tar.index.gz note: downloading SHA256SUM Running TeX ... note: warnings were issued by the TeX engine; use --print and/or --keep-logs for details. Rerunning TeX because "my-document.toc" changed ... Rerunning TeX because "my-document.toc" changed ... Running xdvipdfmx ... warning: Object @table.0.1 already defined. warning: Object @table.0.2 already defined. warning: Object @table.0.1 already defined. warning: Object @table.0.2 already defined. warning: Object @table.0.1 already defined. warning: Object @table.0.2 already defined. warning: Object @table.0.3 already defined. warning: Object @table.0.1 already defined. warning: Object @table.0.2 already defined. Writing my-document.pdf (27901321 bytes) Skipped writing 2 intermediate files (use --keep-intermediates to keep them) [ perf record: Woken up 279 times to write data ] [kernel.kallsyms] with build id ddbc92abc9ec4a443486b9bc0d90ef295ae22310 not found, continuing without symbols [ perf record: Captured and wrote 70.043 MB perf.data (8693 samples) ] writing flamegraph to "flamegraph.svg" ```
rekka commented 5 years ago

Thanks!

Unfortunately, in this case it doesn't seem to be tectonic's overhead. You just have a beast of a document. I'd guess lots of (png?) images. You also mentioned a large font. I do not know how that is handled in the pdf file.

Even thought there are 3 tex passes, they take together only 20% of the runtime. The main tectonic's overhead, sha256 digest computation is only 2% of the runtime. I would think that xetex will be about as slow as tectonic. (If you have a chance to compare with xetex run time, it would be helpful.)

About 80% is spent compressing pdf objects at the output stage and I believe that xetex has the same default behavior. Both xetex and tectonic use zlib's compression level 9: the slowest compression. There might be some discussion whether it is worth it, see Figure 1 at https://clearlinux.org/news-blogs/linux-os-data-compression-options-comparing-behavior.

From a brief reading of the xdvipdfmx backend, I'd guess that png images are decompressed and then compressed again as pdf objects. I tested with a disabled compression on a file of mine with a lot of pngs, and the compression ratio is above 50:1 on the pdf file so it might not be a good idea to disable the compression. :)

ghost commented 5 years ago

Thank you for diagnosing and explaining the problem.

Unfortunately, in this case it doesn't seem to be tectonic's overhead. You just have a beast of a document. I'd guess lots of (png?) images. You also mentioned a large font. I do not know how that is handled in the pdf file.

Bingo. It has 99 PNG files being inserted, alongside a logo on each page (74 pages). I will track down a copy sometime next week and see.

These images have been compressed too so it would be nice if there was a configuration option to disable compression of images while generating a PDF.

rekka commented 5 years ago

These images have been compressed too so it would be nice if there was a configuration option to disable compression of images while generating a PDF.

Currently, xetex (and pdflatex it seems) does decompress all png images in the following function: https://github.com/tectonic-typesetting/tectonic/blob/84cdc69b51ab5c52db0ab3a3a317d9deaf35a079/tectonic/dpx-pngimage.c#L154

From what I read about pdf and png formats, this is not necessary for non-interlaced pngs without an alpha channel (transparency) because in that case the png data stream and pdf image stream are completely compatible. Here's a code that takes advantage of this for reference: img2pdf.

There is also another advantage to copying the data directly, besides saving time, since it is possible to optimize png compression by tools like oxipng and the decompression/compression cycle breaks it.

It would be easy enough to test if the png format is compatible and just copy the data directly.

But there are some issues like gamma correction and color spaces and I am not too familiar with those.

rekka commented 5 years ago

@malbarbo Now when #453 is merged, could you try rerunning the test on your machine with the version on master branch? Please note that -r is the number of tex runs after the first, so your test was running tex twice (and also bibtex). Therefore run

time tectonic -r 0 paper.tex

Still runs bibtex but that's quick anyway. -c minimal is unnecessary.

On tests/xenia/paper.tex, I am getting the opposite in fact, tectonic is almost twice (!?) as fast as xelatex:

$ hyperfine 'tectonic -r 0 paper.tex' 'xelatex -interaction=batchmode paper.tex'
Benchmark #1: tectonic -r 0 paper.tex
  Time (mean ± σ):     552.0 ms ±  29.5 ms    [User: 451.6 ms, System: 92.9 ms]
  Range (min … max):   520.9 ms … 630.0 ms    10 runs

Benchmark #2: xelatex -interaction=batchmode paper.tex
  Time (mean ± σ):     925.4 ms ±   8.7 ms    [User: 863.7 ms, System: 94.3 ms]
  Range (min … max):   911.2 ms … 941.3 ms    10 runs

Summary
  'tectonic -r 0 paper.tex' ran
    1.68 ± 0.09 times faster than 'xelatex -interaction=batchmode paper.tex'

macbook pro Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz.

xelatex version

$ xelatex --version
XeTeX 3.14159265-2.6-0.999991 (TeX Live 2019)
kpathsea version 6.3.1
Copyright 2019 SIL International, Jonathan Kew and Khaled Hosny.
There is NO warranty.  Redistribution of this software is
covered by the terms of both the XeTeX copyright and
the Lesser GNU General Public License.
For more information about these matters, see the file
named COPYING and the XeTeX source.
Primary author of XeTeX: Jonathan Kew.
Compiled with ICU version 63.1; using 63.1
Compiled with zlib version 1.2.11; using 1.2.11
Compiled with FreeType2 version 2.9.1; using 2.9.1
Compiled with Graphite2 version 1.3.13; using 1.3.13
Compiled with HarfBuzz version 2.3.1; using 2.3.1
Compiled with libpng version 1.6.36; using 1.6.36
Compiled with poppler version 0.68.0
Using Mac OS X Core Text and Cocoa frameworks

Just for fun with -r 1. Still faster!

$ hyperfine 'tectonic -r 1 paper.tex' 'xelatex -interaction=batchmode paper.tex'
Benchmark #1: tectonic -r 1 paper.tex
  Time (mean ± σ):     808.1 ms ±  10.2 ms    [User: 661.3 ms, System: 139.0 ms]
  Range (min … max):   794.3 ms … 824.1 ms    10 runs

Benchmark #2: xelatex -interaction=batchmode paper.tex
  Time (mean ± σ):     934.8 ms ±  17.0 ms    [User: 878.6 ms, System: 92.3 ms]
  Range (min … max):   917.7 ms … 972.2 ms    10 runs

Summary
  'tectonic -r 1 paper.tex' ran
    1.16 ± 0.03 times faster than 'xelatex -interaction=batchmode paper.tex'
malbarbo commented 5 years ago

@rekka Thanks for working on this, you got really incredible results!

I noticed later that I should use -r 0 instead of -r 1. Anyway, I got the following results from master:

> time tectonic -C -r 1 paper.tex
real    0m1,475s
user    0m1,268s
sys     0m0,208s

and

> time tectonic -C -r 0 paper.tex
real    0m0,950s
user    0m0,839s
sys     0m0,109s

and

$ hyperfine 'tectonic -r 0 paper.tex' 'xelatex -interaction=batchmode paper.tex'
Benchmark #1: tectonic -r 0 paper.tex
  Time (mean ± σ):     946.0 ms ±   3.9 ms    [User: 838.8 ms, System: 107.6 ms]
  Range (min … max):   941.5 ms … 952.5 ms    10 runs

Benchmark #2: xelatex -interaction=batchmode paper.tex
  Time (mean ± σ):     905.1 ms ±   6.5 ms    [User: 863.4 ms, System: 160.5 ms]
  Range (min … max):   895.2 ms … 917.5 ms    10 runs

Summary
  'xelatex -interaction=batchmode paper.tex' ran
    1.05 ± 0.01 times faster than 'tectonic -r 0 paper.tex'

Although I could not get tectonic to run as fast as xelatex, definitively https://github.com/tectonic-typesetting/tectonic/pull/453 improved the results.

Trying some files which I have performance problems using tectonic I found one extreme case that is still not solved by https://github.com/tectonic-typesetting/tectonic/pull/453: using beamer with metropolis theme. Tectonic takes 42s seconds to process the following file

\documentclass{beamer}
\usetheme{metropolis}

\begin{document}
\begin{frame}
    Test
\end{frame}
\end{document}

And the culprit is the digest calculation:

   83.17%  tectonic         tectonic                  [.] sha2::sha256_utils::compress256
    3.53%  tectonic         tectonic                  [.] get_next

flamegraph

rekka commented 5 years ago

@malbarbo Thanks for checking. Glad to hear that for tests/xenia the performance now matches xelatex on your machine.

OK, so just to check the difference between mac and linux, I ran the test again on Ubuntu 18.04 with Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz (hyperfine supports markdown tables!)

Command Mean [ms] Min [ms] Max [ms] Relative
tectonic -r 0 paper.tex 375.2 ± 5.0 369.7 383.1 3.4
xelatex -interaction=batchmode paper.tex 110.2 ± 1.9 107.7 115.0 1.0

What is going on with xelatex?? So on my linux machine tectonic is significantly slower than xelatex. Tectonic speed improves because it's just a faster cpu, but why is xelatex so much faster than on my mac? I mean, it's an older version:

XeTeX 3.14159265-2.6-0.99998 (TeX Live 2017/Debian)

I am quite shocked by the performance for the metropolis theme. I use it for my presentations with Fira Sans fonts and on both mac and linux it performs just fine. This is the linux machine:

Command Mean [s] Min [s] Max [s] Relative
tectonic -r 0 metropolis-test.tex 1.206 ± 0.027 1.177 1.253 1.9
xelatex -interaction=batchmode metropolis-test.tex 0.641 ± 0.004 0.635 0.645 1.0

sha2 computation takes about 5% of the runtime.

Here's the flamegraph:

flamegraph

Is there something particular to your setup?

rekka commented 5 years ago

@malbarbo In case you want to try it, I replaced sha256 with xxHash for file change tracking on my branch https://github.com/rekka/tectonic/tree/feat-xxhash.

malbarbo commented 5 years ago

The Fira fonts was not installed in my system. Installing the fonts reduced the time to 5.3s (sha2::sha256_utils::compress256 took 6% of the time).

Using the feat-xxhash branch the time is reduced to 5s (twox_hash::sixty_four::XxHash64::write took 0,19% of the time, so digest time calculation is minimal).

When using the feat-xxhash branch without Fira fonts installed I got 7.5s. In this case, twox_hash::sixty_four::XxHash64::write took 9% of the time, so maybe there are more situations where we can avoid computing the digest.

All these numbers corresponds to running tectonic without -r arguments (tex run twice).

rekka commented 5 years ago

Thanks for checking; this is very useful.

It must be then reading a lot of font data if Fira is not installed. Hm.

Yes, we almost certainly do not need to compute the digest of system fonts. I'll try to look into disabling that too.

ratmice commented 5 years ago

FWIW, here is some output from a 300 page extraction of source code comments, comparing running xelatex twice, with tectonic. This is a random output that was lying around i.e. possibly buggy work in progress TeX output in some unknown state, but renders speed I typically see... tex source

Command Mean [s] Min [s] Max [s] Relative
./foo.sh 15.564 ± 0.084 15.337 15.633 1.0
tectonic ~/lean/mathlib/docs_out/mathlib.tex 26.647 ± 0.066 26.583 26.775 1.7

flamegraph

rekka commented 5 years ago

Thanks for the report! It's curious that I can't really see any part of the flamegraph that would be caused by tectonic's overhead so it is a mystery to me why it performs so much worse than xelatex (on linux only as far as I can tell).

ratmice commented 5 years ago

One thing I notice (at least on @rekka 's flamegraph branch), is that when building tectonic/*.c using cargo build -vv --release is the use of -fno-omit-frame-pointer which at least accounts for some of the differences I see between the disassembly of xelatex's getxtoken and tectonics get_x_token.

I'm not sure if this is just due to the debug info settings in the flamegraph branch. Add -fno-omit-frame-pointer with debug info As well as here: Setup for profiling

On the master branch running cargo build -vv --release doesn't cause -fno-omit-frame-pointer, to be passed to cc as such the results are almost reversed, perhaps that can be explained by xetex building with -O2 and tectonic compiling the c sources with -O3?

Command Mean [s] Min [s] Max [s] Relative
./foo.sh 15.622 ± 0.093 15.414 15.759 1.6
tectonic ~/lean/mathlib/docs_out/mathlib.tex 9.585 ± 0.030 9.554 9.634 1.0
pkgw commented 5 years ago

@ratmice Good sleuthing!

rekka commented 5 years ago

@ratmice Oh, that's a good point. I wasn't sure how much the extra debug info needed for creating the flamegraph changes the timing. Thanks for testing it.

Could you post the xelatex version and system you're running it on? (xelatex --version)

I observe that the run times are like this: xelatex on Ubuntu (texlive 2017) is much faster than tectonic, which is faster than xelatex on macOS (texlive 2019).

ratmice commented 5 years ago

@rekka Fedora 29 x86-64, XeTeX 3.14159265-2.6-0.99999 (TeX Live 2018) here are the build logs for texlive-base, they are a bit long but a search for xetex-xetex should bring up the appropriate flags.

One thing worth mentioning is that on x86-64, perf (and thus flamegraph) is capable of running without -fno-omit-frame-pointer by building the frame pointer from the dwarf, and .eh_frame, so it is possible to profile with less overhead. I can even profile the system package of xetex, by installing the separate debuginfo files. In that case there is no overhead/code gen changes needed for profiling, perhaps profiling encounters less samples not sure.

If i get some time i'll try and build tectonic with -O2, and perhaps build flags the fedora package uses, to get a closer comparison,

but the versions probably differ, and the package is patched so it'd probably be a better comparison to just build from the original version xetex that tectonic derives from if we want to subtract out the difference.

ratmice commented 5 years ago

I added an option to cc-rs pull req to allow us to configure whether frame pointers are omitted. I will try to remember to come up with a #[cfg(all(target_os = "linux", target_arch = "x86_64"))] or other custom cfg, in the branch's build.rs once there is a released version of cc-rs which has this.