Embed text rather than converting text to paths

jonmmease commented 1 year ago

Hi, it looks like usvg 0.28.0 recently added the ability to avoid converting text to paths during the simplification process: https://github.com/RazrFalcon/resvg/blob/master/CHANGELOG.md.

I was hoping this would open the possibility of having svg2pdf do the same, by embedding text in the resulting PDF rather than converting text to paths.

For context, I'd be interested in using svg2pdf to add PDF support to the VlConvert Rust library (which already supports SVG and uses resvg for PNG support). But having true text embedded in the resulting PDF is important for accessibility. Thanks!

reknih commented 1 year ago

Hey!

Accessibility in PDFs is an important issue and drawing paths instead of texts is not optimal. However, I do not think svg2pdf will implement text embeds in PDF for a while: We created this crate for Typst, a scientific typesetting app. Because embedding fonts in PDFs and rendering text is complex and Typst has its own text stack, we wanted to eventually enable svg2pdf to tell the caller: "You can find the graphic in this XObject and you should superimpose this text..." That way, we can make use of our existing implementations and keep the code DRY.

Now I realize that this is unsatisfactory. For that reason, I want to propose an alternative solution: Would enclosing text in a marked content sequence and setting the text content as /ActualText suffice for your use case?

If not, I am not against considering a PR that uses ttf-parser and rustybuzz to render the text into the PDF but I cannot prioritize building this feature for a while.
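
For illustration, the marked-content idea could look roughly like the sketch below. It is not svg2pdf code: `pdf_text_string` and `wrap_with_actual_text` are hypothetical helper names, and the sketch assumes the path operators for the text have already been generated as a content stream fragment. It wraps that fragment in a `/Span` marked-content sequence (`BDC` ... `EMC`) whose property dictionary carries the original string as `/ActualText`.

```rust
/// Encode a Rust string as a PDF hex string in UTF-16BE with a leading
/// byte order mark (FE FF), one of the encodings PDF accepts for text strings.
/// Hypothetical helper, not part of svg2pdf.
fn pdf_text_string(text: &str) -> String {
    let mut out = String::from("<FEFF");
    for unit in text.encode_utf16() {
        out.push_str(&format!("{:04X}", unit));
    }
    out.push('>');
    out
}

/// Wrap already-generated path operators in a marked-content sequence
/// that carries the original text as /ActualText.
/// Hypothetical helper, not part of svg2pdf.
fn wrap_with_actual_text(path_ops: &str, text: &str) -> String {
    format!(
        "/Span <</ActualText {}>> BDC\n{}\nEMC\n",
        pdf_text_string(text),
        path_ops.trim_end()
    )
}
```

Calling `wrap_with_actual_text("0 0 m 10 0 l S", "Hi")` yields the path operators enclosed between `/Span <</ActualText <FEFF00480069>>> BDC` and `EMC`. How a given reader or screen reader uses the annotation is up to that tool.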

jonmmease commented 1 year ago

Hi @reknih, thanks for the thoughtful response! I totally appreciate that full text embedding would be a complex endeavor.

This proposal sounds very interesting, though I'll be honest that I'm not familiar enough with PDF internals to understand all of the implications. Would text be selectable that way? Would it be similar to a document that is scanned with OCR, where you see the scanned image (or paths in our case) but can still select the text?

IMO, the most important requirements for the data visualization use case are that the text can be read by a screen reader and is selectable, so that it's possible to copy and paste from the document.

Thanks!

LaurenzV commented 1 year ago

@jonmmease I might be wrong about this, but I believe /ActualText is only a way of "annotating" certain elements with a corresponding text, similar to the alt attribute in <img> in HTML.

From the reference:

> Text that is an exact replacement for the structure element and its children. This replacement text (which should apply to as small a piece of content as possible) is useful when extracting the document’s contents in support of accessibility to users with disabilities or for other purposes (see Section 10.8.3, “Replacement Text”).

So I don't think it will make text selectable, but it will at least provide a way for screen readers to know what the text content is.

jonmmease commented 1 year ago

Thanks @LaurenzV, that makes sense.

So, I just did an experiment that was interesting (at least to me). Here is a PDF that I generated with svg2pdf from an SVG: stacked_bar_h.pdf

I ran this through OCR at https://avepdf.com/pdf-ocr and ended up with this PDF: stacked_bar_h_OCR.pdf

Everything looks the same but the text is now selectable. This may be naive, but I was wondering how hard it would be to directly generate a PDF similar to this, where of course we know what the text is rather than needing OCR.

LaurenzV commented 1 year ago

I'm not sure, honestly. @reknih has much better insight into how PDFs work than I do, so maybe he can share his thoughts on how difficult that is. In any case, the first step would be to upgrade the library to version 0.28 of usvg. This is what I'm currently trying to do, and it's not trivial because the library introduced some breaking changes to its data structures. I'm also only doing this on the side for fun, so unfortunately I can't promise when, or whether, I will get it done.

Could you send the corresponding SVG file for that PDF, though? It would be good to use as a test case if I do take a stab at implementing this.

jonmmease commented 1 year ago

Sure! Here's the original SVG:

stacked_bar_h

reknih commented 1 year ago

> This proposal sounds very interesting, though I'll be honest that I'm not familiar enough with PDF internals to understand all of the implications. Would text be selectable that way? Would it be similar to a document that is scanned with OCR, where you see the scanned image (or paths in our case) but can still select the text?

Depending on the implementation of the reader, the text would likely be selectable as a whole rather than as single glyphs, with the possibility to copy and paste all of it, but not fragments. The text would be accessible to screen readers. /ActualText is intended to mark what text some graphics actually contain, instead of just describing visuals (i.e. for spiders and users with visual impairment, like the alt tag in HTML).

On a full text implementation: usvg has its own text stack and so does Typst. The difference is that usvg is not aware of PDF. Hence there are two problems: laying out text and embedding the fonts in the PDF. Both would be non-trivial.

reknih commented 1 year ago

> I'm not sure, honestly. @reknih has much better insight into how PDFs work than I do, so maybe he can share his thoughts on how difficult that is. In any case, the first step would be to upgrade the library to version 0.28 of usvg. This is what I'm currently trying to do, and it's not trivial because the library introduced some breaking changes to its data structures. I'm also only doing this on the side for fun, so unfortunately I can't promise when, or whether, I will get it done.

We will eventually update to the current version of usvg. The problem is that usvg and Typst both depend on rustybuzz, with the latter depending on an API that was dropped in an update of the library. Because usvg updated its dependency and we want to only ship one version of rustybuzz with Typst, we either need to reimplement the API on our side or talk to RazrFalcon, the maintainer of both crates.

jonmmease commented 11 months ago

First off, congrats on the release of typst! It's a really impressive piece of technology and it's awesome that you chose to open source it.

I was thinking about this (SVG) -> (PDF w/ Selectable Text) problem again today and wanted to check in with you all on what your current thinking is for tackling the problem down the road (assuming it's still somewhere on the roadmap).

I know you mentioned earlier that svg2pdf's functionality might be moved into typst in the future in order to take advantage of typst's text layout functionality and I was wondering if that's still the direction you'd like to go vs adding text embedding support to svg2pdf.

Thanks!

laurmaedje commented 11 months ago

Generally, that is still what I think would be best. But Typst's intermediate representation (frames) would need quite a bit of extension to cover SVG, so I'm not sure when it can happen.

Another option would be to handle only text specifically in Typst, but that doesn't help other consumers of svg2pdf (well, moving it fully into Typst wouldn't either ...) and it would also only help in the case of "simple" text that's not affected by complicated transforms, clip paths, etc.

jonmmease commented 11 months ago

Thanks @laurmaedje, that makes sense.

Since svg2pdf is very close to what I need for my application (a standalone conversion library that can be embedded in another Rust library), I started playing with updating it to use PDF text. I was able to get the basics of something working after a couple of hours with things hardcoded to a built-in font (Helvetica). But after a bit of research I see what you mean regarding the complexity of embedding custom fonts into the PDF document. I also don't know if I'd want the file size overhead of font embedding for this use case.

An idea I'm considering is to continue rendering text as paths with usvg (as svg2pdf works today), and then overlay invisible text, using one of the built-in fonts, on top of the path text. This would make the text appear selectable without requiring custom font embedding. When the SVG uses a PDF built-in font, the text selection would line up exactly with the path-rendered text. When using a custom font, there would be some kind of heuristic to choose the closest built-in font and font size to overlay on top.

For my application, a font with equivalent sizing to Helvetica is the default. And text strings are typically pretty short, so even for custom text I think a heuristic like this could do a pretty good job of making text selection line up. And even if they don't line up exactly, it's still a big improvement as it would be possible for screen readers to find the text and for viewers to copy the text.
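
To make the overlay idea concrete, here is a minimal sketch of the content stream operators involved. It is an illustration, not code from svg2pdf or VlConvert: all function names are hypothetical, the font resource name (`F1`) is assumed to be registered on the page, and the text is assumed to be plain ASCII. Text rendering mode 3 (`3 Tr`) draws nothing but keeps the text selectable, and horizontal scaling (`Tz`) is one possible heuristic for stretching a built-in font so its run matches the width of the path-rendered text.

```rust
/// Escape the characters that are special inside a PDF literal string.
/// Hypothetical helper; assumes ASCII input.
fn escape_pdf_literal(text: &str) -> String {
    let mut out = String::new();
    for c in text.chars() {
        if c == '(' || c == ')' || c == '\\' {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Horizontal scaling (in percent) that makes a run of `builtin_width`
/// occupy `target_width`. Falls back to 100% for degenerate input.
fn horizontal_scale_percent(target_width: f32, builtin_width: f32) -> f32 {
    if builtin_width <= 0.0 {
        100.0
    } else {
        100.0 * target_width / builtin_width
    }
}

/// Emit operators for one invisible text run at position (x, y).
/// `font_res` is the resource name of a built-in font, e.g. "F1".
fn invisible_text_ops(
    font_res: &str,
    size: f32,
    x: f32,
    y: f32,
    hscale_percent: f32,
    text: &str,
) -> String {
    format!(
        "BT\n/{} {:.2} Tf\n3 Tr\n{:.2} Tz\n1 0 0 1 {:.2} {:.2} Tm\n({}) Tj\nET\n",
        font_res,
        size,
        hscale_percent,
        x,
        y,
        escape_pdf_literal(text)
    )
}
```

The widths passed to `horizontal_scale_percent` would have to come from somewhere else: the target width from the path-rendered text's bounding box and the built-in width from the font's metrics. Neither lookup is shown here.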

I don't know if this scheme makes sense for svg2pdf in general, but I wonder if you would be open to me adding an extension point to svg2pdf so that I can start trying this out without forking the project. In particular, I think what I would want is the ability to run custom logic on the top-level content stream after the top-level XObject is invoked. This might be as simple as adding an optional callback function to the Options struct that's passed to convert_tree. The callback would get called with a mutable reference to content after this line:

https://github.com/typst/svg2pdf/blob/e873c3e2c2ddb5ec3977d73c85f4648521964ffa/src/lib.rs#L179C5-L179C12

Then my callback function could traverse a copy of the usvg Tree that hasn't had convert_text called on it and add invisible text to the stream. There would also need to be a way to tell the underlying writer to register the necessary built-in fonts, so this might be another entry in the Options struct.

Let me know what you think!
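
A rough sketch of the shape such an extension point might take is below. Everything in it is hypothetical: this `Options` struct is a stand-in and not svg2pdf's actual struct, and the `builtin_fonts` and `after_xobject` fields, as well as both functions, are invented purely to illustrate the proposal.

```rust
/// Hypothetical stand-in for an options struct; NOT svg2pdf's real `Options`.
#[derive(Default)]
struct Options {
    /// Names of built-in fonts to register as page resources (hypothetical).
    builtin_fonts: Vec<String>,
    /// Hook run on the top-level content stream right after the
    /// top-level XObject has been invoked (hypothetical).
    after_xobject: Option<Box<dyn Fn(&mut Vec<u8>)>>,
}

/// Build the top-level content stream: invoke the XObject, then give the
/// caller a chance to append its own operators.
fn build_top_level_content(xobject_name: &str, options: &Options) -> Vec<u8> {
    let mut content = Vec::new();
    content.extend_from_slice(format!("/{} Do\n", xobject_name).as_bytes());
    if let Some(hook) = &options.after_xobject {
        hook(&mut content);
    }
    content
}

/// Build a /Font resource entry for the requested built-in (Type1) fonts,
/// naming them F1, F2, ... in order.
fn font_resources(options: &Options) -> String {
    let mut out = String::from("/Font <<");
    for (i, name) in options.builtin_fonts.iter().enumerate() {
        out.push_str(&format!(
            " /F{} << /Type /Font /Subtype /Type1 /BaseFont /{} >>",
            i + 1,
            name
        ));
    }
    out.push_str(" >>");
    out
}
```

With no hook set, `build_top_level_content("S1", &Options::default())` produces just `/S1 Do`, so the hook would be purely opt-in for existing callers.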

laurmaedje commented 11 months ago

I might be mistaken, but couldn't you do this without changes to svg2pdf by using the convert_tree_into API? Then, you control the main PDF content stream yourself and can reference svg2pdf's output as an XObject. You'd then overlay whatever you want on top of that XObject.
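
As a sketch of what that could look like on the caller's side: the page content stream places the XObject with a `cm` transform inside `q`/`Q`, invokes it with `Do`, and then appends the overlay operators. The function name is hypothetical and the svg2pdf call itself is not shown. The sketch assumes the XObject has already been written and registered under a resource name such as `S1`, and that the caller knows which placement matrix its bounding box requires.

```rust
/// Compose a page content stream that draws a previously registered
/// XObject and then appends caller-supplied overlay operators.
/// `placement` is the PDF transformation matrix [a b c d e f].
/// Hypothetical helper, not part of svg2pdf.
fn page_content_with_overlay(
    xobject_name: &str,
    placement: [f32; 6],
    overlay_ops: &str,
) -> String {
    let [a, b, c, d, e, f] = placement;
    format!(
        "q\n{:.2} {:.2} {:.2} {:.2} {:.2} {:.2} cm\n/{} Do\nQ\n{}",
        a, b, c, d, e, f, xobject_name, overlay_ops
    )
}
```

Restoring the graphics state with `Q` before the overlay means the overlay operators are interpreted in page coordinates, independent of the scale applied to the graphic.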

jonmmease commented 11 months ago

> I might be mistaken, but couldn't you do this without changes to svg2pdf by using the convert_tree_into API?

Oh, that's a good idea! I hadn't thought of that approach. I'll give it a try.

LaurenzV commented 11 months ago

Just FYI, I still have a pending draft PR which updates the usvg version and also contains a whole bunch of other improvements. It is pretty much done, I just need to test everything one more time and once I'm done with that, I'll mark the PR as ready for review and also give a description of what changed. I'll hopefully get to doing this at the start of September.

Just letting you know so in case you plan to make bigger changes to the API of svg2pdf, it would maybe be nice to do this once that update is merged to avoid merge conflicts. 😄

jonmmease commented 11 months ago

Thanks for the heads up @LaurenzV, and for doing the usvg update! VlConvert is on usvg 0.35 as well, so that update will be important for us. Based on a quick experiment, I think the convert_tree_into approach @laurmaedje suggested will work out really well for the time being, so as long as that API is still available after your updates I think I'm all set.

jonmmease commented 10 months ago

I'm making good progress using this approach to render everything except text with svg2pdf as an XObject using convert_tree_into and then overlaying it with PDF text with embedded fonts following typst's approach.

Would it be possible to cut a release of svg2pdf to crates.io now that #37 has been merged? Thanks!

laurmaedje commented 10 months ago

We can make a release.

@LaurenzV anything needed before a release from your side?

LaurenzV commented 10 months ago

No, I'm good to go. 👍 At some point we should probably start adding a changelog, but I think it's fine for now.

laurmaedje commented 10 months ago

Release is out now!

jonmmease commented 10 months ago

Thanks so much!

jonmmease commented 10 months ago

In case you're interested in what this is for, here's the PR I'm using svg2pdf for: https://github.com/vega/vl-convert/pull/97

VlConvert is a project that inputs Vega-Lite chart specifications and converts them to static images. These declarative chart JSON specs are often created by the Python library Vega-Altair, but they can also be written as JSON directly in an editor like this. With this PR, vl-convert will be able to output PDF files 🎉

Tangentially, it would be pretty neat if typst were able to directly embed Vega-Lite some day!

typst / svg2pdf

Embed text rather than converting text to paths #21