vega / vl-convert

Utilities for converting Vega-Lite specs from the command line and Python
BSD 3-Clause "New" or "Revised" License

PDF support #91

Closed jonmmease closed 10 months ago

jonmmease commented 10 months ago

The last major feature lacking from vl-convert that is possible with altair_saver and the Vega CLI is PDF export support. The primary motivating use case for PDF export is to support embedding charts in LaTeX in vector format (LaTeX doesn't support SVG as far as I understand).

Related work

altair_saver accomplishes this using a system web browser and selenium. This approach is not an option for vl-convert because it introduces an external system dependency on a web browser, and we're working hard to keep vl-convert a standalone binary without dependencies on system applications or libraries. We also want to keep the core of vl-convert from depending on Python so that it's available as a standalone CLI binary and is easy to wrap in other languages.

The vl2pdf CLI accomplishes this using the node canvas createPDFStream method. Node canvas is not compatible with Deno, so this isn't an option for vl-convert either.

Background

vl-convert uses the Deno JavaScript runtime to export Vega/Vega-Lite charts to SVG images. The resulting SVG images are optionally converted to PNG using the resvg Rust library. resvg depends on an interesting library named usvg. usvg is an SVG simplifier. It converts all SVG constructs into a couple of primitives, the main one being SVG paths. This simplified SVG is then processed by resvg to render it to a PNG image.

svg2pdf

There is a really compelling Rust library called svg2pdf for converting SVG images to PDF documents. svg2pdf happens to rely on the same usvg library for performing SVG simplification before applying its own logic to convert the simplified SVG to PDF. This works great, and the project has an impressive test suite that uses pdfium to convert the resulting PDFs to PNG images for comparison with baselines.

One notable feature of how this process works is that usvg is capable of converting text into geometric paths (and both resvg and svg2pdf take advantage of this). In the case of svg2pdf, this means that the exact geometry of the source font is preserved in the output document without the need to embed the font into the resulting PDF document. The downside of this is that the PDF viewer no longer recognizes text as text, so it's not possible to select and copy it. This also means that it's not possible for screen readers to locate the text. This is why I've been hesitant to incorporate svg2pdf up to this point.

Way forward

I spent some time this weekend digging into svg2pdf to try to identify the best path forward for adding text selection support to the resulting PDF documents.

The most obvious approach might be to stop the conversion of text to geometric paths and instead use PDF text. I tried this in a branch of svg2pdf, and it wasn't too difficult, but the issue is that for all but the "base fonts" (more on those later) this approach requires embedding the font into the resulting PDF document. This is a really complex process that's not abstracted by the pdf-writer library that svg2pdf depends on. This approach can also significantly increase the PDF document size unless the even more complex process of font subsetting is performed (font subsetting is the process of pruning a font to only include the characters that are actually used in the document).
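To make the subsetting idea concrete, here is a minimal Python sketch of its first step only: working out which characters each font actually needs. It assumes each SVG `<text>` element carries its own `font-family` attribute and ignores inherited or CSS-based styling. The function name `used_characters` is hypothetical, and pruning the font file itself is not shown.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, Set

SVG_NS = "{http://www.w3.org/2000/svg}"


def used_characters(svg_source: str) -> Dict[str, Set[str]]:
    """Return, per font family, the set of characters drawn with that font.

    Assumption: font-family is set directly on each <text> element.
    Elements without the attribute are grouped under "sans-serif".
    """
    root = ET.fromstring(svg_source)
    used = defaultdict(set)
    for node in root.iter(SVG_NS + "text"):
        family = node.get("font-family", "sans-serif")
        # itertext() also collects text nested in <tspan> children.
        used[family].update("".join(node.itertext()))
    return dict(used)
```

A subsetter would then keep only the glyphs for these characters, which is why a chart with a handful of labels needs only a small fraction of a full font.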

After grokking the complexity of the above approach, I went for a walk and came up with an alternative. The PDF format specifies that all PDF readers must include built-in support for 14 "base fonts", including Helvetica, Courier, and Times in regular, italic (oblique), bold, and bold italic variants. Suppose we render all text as paths the way svg2pdf currently does, and then overlay the paths with the same text using a base font with opacity 0. This is similar to what OCR software does, where it overlays a scanned image with invisible text that can be searched and selected.
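As a rough illustration of the overlay idea, the following Python sketch writes a one-page PDF by hand in which a string is drawn in the built-in Helvetica font with fill opacity 0 (an `ExtGState` with `/ca 0`). This is not how svg2pdf or vl-convert does it; the function name `build_overlay_pdf`, the page size, and the object layout are assumptions made for the example, and the glyph paths that would sit underneath the text are omitted.

```python
def build_overlay_pdf(text: str, x: float, y: float, size: float) -> bytes:
    """Build a one-page PDF containing `text` in Helvetica at fill opacity 0.

    Assumption: `text` is ASCII. A real implementation would draw the glyph
    paths first and then place this invisible text on top of them.
    """
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # /GS0 gs selects the zero-opacity graphics state; BT ... ET draws the text.
    content = f"/GS0 gs BT /F1 {size} Tf {x} {y} Td ({escaped}) Tj ET".encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 4 0 R "
        b"/Resources << /Font << /F1 5 0 R >> /ExtGState << /GS0 6 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        # A base font is referenced by name only; no font file is embedded.
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        # /ca is the non-stroking (fill) alpha; 0 makes the text invisible.
        b"<< /Type /ExtGState /ca 0 >>",
    ]
    out = bytearray(b"%PDF-1.4\n")  # transparency requires PDF 1.4 or later
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

Writing the result of `build_overlay_pdf("Sales (2023)", 72, 720, 12)` to a file should give a page that looks blank but whose text can still be selected and searched in a viewer. OCR tools more commonly use text rendering mode 3 (`3 Tr`) for the same effect.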

By happy coincidence, Helvetica is the default font used by vl-convert (with fallbacks to Arial and Liberation Sans, which have near-identical sizing). So in the majority case the overlaid text will be Helvetica and will exactly match the text drawn as paths. When a chart uses a font that is not one of the PDF base fonts, we'll have the task of choosing the "best" base font (and font size) to overlay on top of the text drawn in the custom font. I have some ideas on how to do this, and I'm pretty confident that we'll be able to do a good enough job here. And even if the overlaid text doesn't line up exactly, it's still a lot better than nothing, as the text remains selectable and copyable, and screen readers can still find it.
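One possible heuristic for that choice, sketched in Python, is shown below. This is purely an assumption for illustration, not the approach the author had in mind: pick the base family from hints in the font name, then scale the overlay font size so the overlay is as wide as the text drawn as paths. The names `pick_base_font` and `fit_font_size` and the hint lists are hypothetical.

```python
MONO_HINTS = ("mono", "courier", "consolas")
SERIF_HINTS = ("serif", "times", "georgia", "garamond")


def pick_base_font(font_family: str) -> str:
    """Map a CSS-style font-family list to a PDF base font name.

    Names are checked in order; the first one that gives a hint wins.
    Unrecognised families fall back to Helvetica.
    """
    for raw in font_family.split(","):
        name = raw.strip().strip("'\"").lower()
        if any(hint in name for hint in MONO_HINTS):
            return "Courier"
        if "sans" in name:  # checked before SERIF_HINTS so "sans-serif" is not serif
            return "Helvetica"
        if any(hint in name for hint in SERIF_HINTS):
            return "Times-Roman"
    return "Helvetica"


def fit_font_size(path_width: float, unit_width: float) -> float:
    """Font size at which the overlay text is as wide as the drawn paths.

    path_width: width of the text rendered as paths in the custom font.
    unit_width: width of the same string in the base font at font size 1.
    """
    if unit_width <= 0:
        raise ValueError("unit_width must be positive")
    return path_width / unit_width
```

For example, `pick_base_font("Georgia, serif")` selects Times-Roman, and a label whose paths are 60 units wide while the base font renders it 5 units wide at size 1 would be overlaid at size 12.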

This approach avoids the complexity and file size increase of font embedding. After some experimentation, I found that it's also possible to do this today without any changes needed in svg2pdf (we will need to wait for svg2pdf to be updated to usvg 0.35.0, but that's in progress).

cc some folks who use Vega in LaTeX in case you have feedback on this approach: @domoritz @arvind @kanitw @jheer @joelostblom @nicolaskruchten

joelostblom commented 10 months ago

I don't have much technical input on this, but your suggested approach sounds feasible and like an effective way to solve the issues you highlighted. I agree that PDF support would be a nice feature, simplifying things not only for LaTeX users but also for those unfamiliar with the SVG format.

Is svg2pdf a heavy dependency? It seems to be 7MB when I checked the repo size and vl-convert is only 6MB (but I'm not sure how big a full installation of vl-convert is currently).

jonmmease commented 10 months ago

> Is svg2pdf a heavy dependency?

In terms of the binary itself, vl-convert is ~55MB uncompressed and ~22MB compressed as a wheel. The release-compiled svg2pdf lib looks like it's around 2MB uncompressed. And vl-convert already includes some of the dependencies of svg2pdf, so I don't think the increase will be a very large percentage (at most about 2MB on top of ~55MB, i.e. under 4%).

jonmmease commented 10 months ago

Update

I ended up working out how to do proper font embedding by following the logic in the typst project, which also uses the pdf-writer crate that svg2pdf relies on. PR is almost ready in https://github.com/vega/vl-convert/pull/97.