jgm / pandoc

Universal markup converter
https://pandoc.org

Feature request: Integrate a native PDF renderer #6861

Closed ad-si closed 2 months ago

ad-si commented 3 years ago

This is the one big thing I'm still missing from Pandoc: an easy, cross-platform way to generate PDFs without having to rely on any external dependencies. I understand that this would be a massive undertaking, but I think even a simple implementation that only supports printing some simple text or graphics would already be really helpful.

Rasterific (https://github.com/Twinside/Rasterific) looks like it could be a good library to achieve something like this.

jgm commented 3 years ago

Well, even with this library, it's a pretty massive undertaking you're talking about -- manually laying out text in a PDF. Not to mention math layout and the complexities that brings.

alerque commented 3 years ago

This is out of scope.

PDF is a different case than every other format Pandoc handles. Needing external dependencies to handle it makes perfect sense. From another perspective, all document formats that Pandoc handles require external dependencies to render.

Even markdown requires some form of text editor that handles things like line wrap (layout and typesetting), or conversion to another format for rendering.

Why should PDF be any different? The only difference is expecting a pre-rendered output with layout and typesetting work done already. Just because the final viewing step is separated from the layout and typesetting steps doesn't mean it should get special treatment. Pandoc is not a layout engine and does not do typesetting. It is a document format conversion tool. Trying to make it do layout and typesetting would be wildly out of scope, out of character, and frankly just not that feasible.

If you want lightweight PDF renderers that do layout and typesetting there are lots to choose from. They all have different strengths and weaknesses because this is a huge job with lots of decisions to make that are not part of the document content. Take it from someone who writes layout and typesetting tools, this is not something that should be shoehorned into Pandoc.

ad-si commented 3 years ago

Ok, I see your point. But what about outputting PostScript then? You would need Ghostscript or similar to render it. So it'd be more similar to the languages you enumerated.

My usecase would be converting from Markdown to PDF. So just some headings and text blocks. Think contracts, letters, text only ebooks, …. I'd be happy with even the most basic implementation.

alerque commented 3 years ago

PDF files are basically just PostScript with some fancy trappings. The same argument would apply to PostScript: in order to generate it you would have to convert raw document content (the Pandoc AST) to a rendered form that has all the physical shape (layout and typesetting) done. This requires things like canvas size, fonts, text shaping, line breaking, styling, and so on and so forth. None of these things are the purview of a document conversion tool.

My usecase would be converting from Markdown to PDF. So just some headings and text blocks. Think contracts, letters, text only ebooks, …. I'd be happy with even the most basic implementation.

Okay, so use a lightweight layout engine. I don't think you realize how complex the "simple" cases you are talking about can be, but there are a number of options for doing page layout and typesetting whether from Markdown directly or from one of many formats that Pandoc converts to.

ad-si commented 3 years ago

Ok, I thought PS might also have some more high level constructs.

there are a number of options for doing page layout and typesetting whether from Markdown directly or from one of many formats that Pandoc converts to.

I think I tried out most of them by now, and all of them have some considerable issues. I guess pdfroff is probably the most lightweight and robust solution at the moment.

jgm commented 3 years ago

I think the most promising Haskell library for this purpose is HPDF, which includes some functions that fill boxes with text. https://hackage.haskell.org/package/HPDF-1.5.1/docs/Graphics-PDF-Documentation.html I've thought about this before. The problem is that even something as simple as handling a two-page document, where we'll need to split the text of a paragraph into two boxes, is still pretty complex.

jgm commented 3 years ago

Well, maybe fillContainer from HPDF can be used to fill to the end of page and return a new container and the remaining text. I might have to try it out.

jgm commented 3 years ago

I fooled around a bit and got some text laid out with this:

{-# LANGUAGE OverloadedStrings #-}
module Main where
import Graphics.PDF
import qualified Data.Text as T
import Data.List (intersperse)
import Debug.Trace

main :: IO ()
main = do
  let rect = PDFRect 0 0 600 400
  Just timesRoman <- mkStdFont Times_Roman
  runPdf "test.pdf" standardDocInfo rect $ do
    theDoc timesRoman

theDoc :: AnyFont -> PDF ()
theDoc font = do
  page1 <- addPage Nothing
  drawWithPage page1 $ drawing font

drawing :: AnyFont -> Draw ()
drawing font = do
  let black = Rgb 0 0 0
  let white = Rgb 1 1 1
  let hsty = Font (PDFFont font 26) white black
  let hrect = Rectangle (100 :+ 320) (500 :+ 360)
  displayFormattedText hrect NormalParagraph hsty $ heading
  let psty = Font (PDFFont font 16) white black
  let prect = Rectangle (100 :+ 100) (500 :+ 300)
  let vboxes = getBoxes NormalParagraph psty para
  -- Vertical-mode settings used when stacking lines into a container
  let verstate =
          VerState { baselineskip = (12, 0.17, 0.0)
                   , lineskip = (3.0, 0.33, 0.0)
                   , lineskiplimit = 2
                   , currentParagraphStyle = NormalParagraph }
  -- Fill the first container; vboxes' holds whatever did not fit
  let (dr, newc, vboxes') = fillContainer
         verstate
         (mkContainer 50 300 100 100 1)
         vboxes
  dr
  trace (show $ containerContentHeight newc) (return ())
  -- Pour the leftover boxes into a second container, positioned using
  -- the content height of the first one
  let (dr', _, _) = fillContainer
         verstate
         (mkContainer 50 (300 - containerContentHeight newc - 20) 200 100 1)
         vboxes'
  dr'
  -- displayFormattedText prect NormalParagraph psty $ para

-- Heading: the first two words of the sample text, separated by glue
heading :: TM StandardParagraphStyle StandardStyle ()
heading = do
  paragraph $ do
    startPara
    sequence_ $ intersperse (glue 5 2 2)
                (map txt $ take 2 $ T.words lorem)
    endPara

-- Body paragraph: fully justified sample text
para :: TM StandardParagraphStyle StandardStyle ()
para = do
  setJustification FullJustification
  setBaseLineSkip 20 1 1
  paragraph $ do
    startPara
    sequence_ $ intersperse (glue 4 4 4 >> txt " ") (map txt $ T.words lorem)
    endPara

lorem :: T.Text
lorem = "Nisi cömmodo arcu, vitae cursus neque ante sed elit. Sed sit amet erat. Phasellus luctus cursus risus. Phasellus ac felis. Proin nec eros quis ipsum pellentesque congue. Curabitur et diam sed odio accumsan cursus.  Pellentesque ultricies. Quisque aliquam. Sed nisi velit, consectetuer eget, dictum ac, molestie a, magna. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Curabitur consequat leo et dui.  Aenean ligula mi, dignissim ut, imperdiet tristique, interdum a, dolor."

This shows how you can fill a rectangle as much as possible, and get a list of the remaining vboxes to fill another rectangle (which is what you need to do at a page break).
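
For readers without HPDF at hand, the "fill as much as fits, hand back the rest" idea can be modelled in plain Haskell. This is only a rough sketch, not HPDF's implementation: the names Line, fillBox and paginate are hypothetical, and a line is reduced to nothing but its height.

```haskell
-- Hypothetical stand-in for a laid-out line: only its height matters here.
newtype Line = Line { lineHeight :: Double }
  deriving (Show, Eq)

-- Place lines into a box of the given height until the next line would
-- overflow. Returns the placed lines and the leftovers, in that order.
fillBox :: Double -> [Line] -> ([Line], [Line])
fillBox capacity = go 0 []
  where
    go _ placed [] = (reverse placed, [])
    go used placed (l : ls)
      | used + lineHeight l <= capacity = go (used + lineHeight l) (l : placed) ls
      | otherwise = (reverse placed, l : ls)

-- Repeatedly fill page-sized boxes until nothing remains.
paginate :: Double -> [Line] -> [[Line]]
paginate _ [] = []
paginate capacity ls =
  case fillBox capacity ls of
    -- A line taller than the page gets a page of its own, so we always progress.
    ([], l : rest) -> [l] : paginate capacity rest
    (page, rest)   -> page : paginate capacity rest
```

Each call to fillBox plays the role of one fillContainer call in the experiment above: the first component is what gets drawn, the second is what carries over to the next page.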

cagix commented 3 years ago

feels to me like re-inventing tex ... starts easy and limited and in the end we get a pandoc-latex ...

Delanii commented 3 years ago

I think it is great that there is Haskell development in this area. It is good to have options. Thank you for pointing the Haskell library out. I am still pretty new to TeX, and even though it is a great tool, probably still unparalleled, it has some sore spots that most probably won't be solved in the near future (a few of which I have read about: grid typesetting, dealing with whitespace "rivers," dealing with repeated words at the end or beginning of a line, or creating a different line-breaking or page-building algorithm). Not that I understand any of this deeply. But maybe a new tool could take a different look at these issues, someday ... However, for the pandoc project this also looks out of scope to me. Maybe, if someday there were a big enough "library" to just integrate in pandoc ... ?

mb21 commented 3 years ago

I agree it's out of scope for pandoc – better to leave this concern to a separate program – and we support already quite a few pdf-engines.

And automatic layout and typesetting is indeed a very difficult problem (kerning, widows, orphans, hyphenation using language dictionaries, etc.), which is part of the reason TeX is still in use (of all the open source engines, it still produces the best typographic output).

While pandoc happily supplies the semantic markup to those programs, people will always want to send layout instructions along as well. That's where a custom LaTeX template or CSS comes in. Personally, I feel CSS is a much nicer way to declaratively instruct a pdf engine on layout customizations – but browser vendors don't care about pages and care more about not doing too many passes (CSS flex-box takes 2 passes to layout, CSS grid 3) than optimal typography, and the other open source implementations are currently all still somewhat lacking.

Anyway, guess that's not the OP's use-case either. So yes, what's wrong with pdfroff if you just want "some PDF" and don't care much how it looks.

jgm commented 3 years ago

I think it could be in-scope, potentially. I can see the advantages to being able to render PDF without external tools. Note that HPDF uses the hyphenation library, which implements the Knuth-Liang hyphenation algorithm, so its output is not too bad. It offers full control over kerning and glue, like low-level tex. If we did want to go in this direction, it would probably be worth creating a library that handles some of the lower-level details.

One worry is that the original creator of HPDF hasn't done anything on the project since 2016. Someone else has taken it over and seems to be maintaining it now, so maybe that's okay (though I note they've disabled issues and PRs on the repository, not a great sign). But one might worry about depending on it.

In my experimentation, the main stumbling block I see is with fonts. Using the built-in Times Roman, Helvetica, and Courier (which probably only support the latin1 glyphs) is too limiting. I tried loading a Type 1 font with the included functions, but have had no success yet. This also requires file paths to .pfb and .afm files; we'd need something higher level that gets system fonts on all the major platforms.

ad-si commented 3 years ago

feels to me like re-inventing tex ... starts easy and limited and in the end we get a pandoc-latex ...

You make it sound like this would be bad. It's probably the best thing that could happen to tex 😛.

cagix commented 3 years ago

feels to me like re-inventing tex ... starts easy and limited and in the end we get a pandoc-latex ...

You make it sound like this would be bad. It's probably the best thing that could happen to tex 😛.

well, don't get me wrong: i would very much welcome a pdf generator embedded in pandoc, especially since it would eliminate the need to install other tools.

but in reality, this is quite an effort to do something that other (already existing) tools simply do better. my bet would be that this would start small with just a few features, but it would soon get attention and requests to do this or that and to support package xyz ... in the end, the quality of the generated pdfs will inevitably be compared to latex or other tools.

i would rather stick to the good old unix tradition: a tool should do just one task and do it well. then it can be combined with other tools to achieve something bigger ... and using docker there is no need to maintain a latex installation anymore ...

jgm commented 3 years ago

Well, I'm currently stuck on fonts. If HPDF allowed loading of TrueType fonts, then I think there'd be potential here. I can see the advantages of something that doesn't require external tools and is configurable in a simpler way than LaTeX. And my tinkering self likes the idea of controlling the whole typesetting from top to bottom.

However, I can't even get type 1 fonts working, so I'm stuck. I don't know how hard it would be to add truetype to HPDF; maybe someone would like to take that on.

ickc commented 3 years ago

You make it sound like this would be bad. It's probably the best thing that could happen to tex 😛.

One example of reinventing LaTeX is https://github.com/sile-typesetter/sile. Years ago someone mentioned wanting to develop a pandoc writer that targets this language and then converts to PDF.

tarleb commented 3 years ago

SILE writer PR draft by @alerque: #6088

mb21 commented 3 years ago

Interesting point from the SILE Manual:

At this point, the parts of TeX that people actually use are

  1. the box-and-glue model,
  2. the hyphenation algorithm, and
  3. the line-breaking algorithm.

Though looking at this screenshot, seems like SILE's typographic output quality is still somewhat lacking...

[image: screenshot of SILE output]

alerque commented 3 years ago

@ickc / @tarleb Yup yup. I didn't mention it in the discussion above because it's not a candidate for a built-in typesetter in a Haskell environment, but when I say typesetting is more complicated than people think my opinion is based on considerable experience in trying to make it simpler!

@mb21 Fair point, but allow me to point out that the line space glitch is a known bug (see SILE issues №560 and №860) related to floating figures. That, and its corollary dropcaps (see SILE issue №394), are very sticky issues we haven't shaved the rough edges off yet. The "three things people actually use" (and much more besides) work pretty well and there are several publishing companies using it exclusively for book publishing workflows (including drop-caps, but with extra care!). It's also been used for Unicode proposals and other tricky stuff.

jgm commented 3 years ago

At this point, the parts of TeX that people actually use are

  • the box-and-glue model,
  • the hyphenation algorithm, and
  • the line-breaking algorithm.

I believe that HPDF has all of these things. The one thing it doesn't have is reasonable font handling.
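
To make the box-and-glue model concrete, here is a small sketch of how stretchable glue lets a line be justified to a target width. It is a simplified illustration in the TeX style, not HPDF's code: the Item type and the function names are hypothetical, and the three glue fields are assumed to mean natural width, stretch and shrink.

```haskell
-- Items on a line in a TeX-style box-and-glue model (hypothetical types).
data Item
  = Box Double                 -- fixed-width material, e.g. a word
  | Glue Double Double Double  -- natural width, stretch, shrink
  deriving (Show, Eq)

-- Width of the line if no glue is stretched or shrunk.
naturalWidth :: [Item] -> Double
naturalWidth = sum . map w
  where
    w (Box x) = x
    w (Glue n _ _) = n

-- Ratio by which glue must stretch (positive) or shrink (negative) so the
-- line reaches the target width. Nothing if no glue can absorb the difference.
adjustmentRatio :: Double -> [Item] -> Maybe Double
adjustmentRatio target items
  | diff == 0 = Just 0
  | diff > 0, stretch > 0 = Just (diff / stretch)
  | diff < 0, shrink > 0 = Just (diff / shrink)
  | otherwise = Nothing
  where
    diff = target - naturalWidth items
    stretch = sum [s | Glue _ s _ <- items]
    shrink = sum [k | Glue _ _ k <- items]

-- Final width of each item once the ratio is applied.
setWidths :: Double -> [Item] -> [Double]
setWidths r = map w
  where
    w (Box x) = x
    w (Glue n s k)
      | r >= 0 = n + r * s
      | otherwise = n + r * k
```

For three 30-unit words separated by Glue 4 4 4, the natural width is 98; setting the line to 100 gives a ratio of 0.25, so each space grows from 4 to 5. A line breaker in this style would score candidate breaks by how extreme that ratio is.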

jarnosz commented 3 years ago

Is a port of TeX really necessary? What about a simpler, text2pdf routine, with only paragraph and page breaking, without the hyphenation and box&glue, leaving the text left flushed? Otherwise, I fear it would be easier to implement pandoc in LuaTeX.
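
A minimal sketch of what such a text2pdf-style core might look like: greedy, flush-left line breaking followed by splitting into pages. The function names are hypothetical, and width is counted in characters purely for illustration; a real renderer would have to measure words with font metrics.

```haskell
-- Greedy, flush-left line breaking: put as many words on each line as fit.
-- A word longer than the width is placed on a line of its own.
wrapWords :: Int -> String -> [String]
wrapWords width = map unwords . go [] 0 . words
  where
    go [] _ [] = []
    go cur _ [] = [reverse cur]
    go cur len (w : ws)
      | null cur = go [w] (length w) ws
      | len + 1 + length w <= width = go (w : cur) (len + 1 + length w) ws
      | otherwise = reverse cur : go [w] (length w) ws

-- Split lines into pages of at most n lines each.
pages :: Int -> [String] -> [[String]]
pages n ls
  | n <= 0 || null ls = []
  | otherwise = let (p, rest) = splitAt n ls in p : pages n rest
```

Even this leaves out everything that makes the problem hard in practice: fonts, headings, widows and orphans, tables and footnotes.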

ickc commented 3 years ago

I think the only concrete option mentioned above is to use HPDF. It is not a full port, which, as you said, is unnecessary.

Another thing to mention is that the ability to cross-compile to JS/WebAssembly is nice to have. Currently people have been able to cross-compile to WebAssembly, and in the past people were able to compile to JavaScript when pandoc had fewer non-Haskell dependencies. One issue here is tracking non-Haskell dependencies.

jgm commented 3 years ago

Box and glue and hyphenation are already handled pretty well by HPDF. As noted above, the thing HPDF doesn't handle well is fonts. That's what blocks progress.

EDIT: To expand on this: If you just use the latin1 character set and you don't mind using the standard fonts, HPDF is okay. But that's just not enough for pandoc. We need to support multilingual content and math characters.

mb21 commented 3 years ago

btw. depending on your use-case, you can also just export to HTML and then do print-to-PDF in a browser. That's basically what the "print" button in the PanWriter app's preview does (because it's an electron app, it ships with a browser).

But yes, if for some reason you need this functionality bundled into the pandoc binary, then integrating HPDF into pandoc would probably be the best way...

nathanlesage commented 3 years ago

Hey, just been directed here while talking to Albert about bundled PDF support, and I'd like to also weigh in on the possibility of having a very simple built-in PDF converter. The "print"-button in Electron-apps is a subpar alternative, and standalone libraries for PDF-generation are actually really sparse. My use-case would be, since we're now bundling Pandoc with Zettlr, to offer a boiled-down PDF generation option for people who don't want to install anything additionally. And I think RStudio might also benefit from this …? Not sure though as I'm on Python …

However, I see the issues with font handling (don't know Haskell, but Rust should be comparable from the pure mechanics of font-file handling) and that it's super difficult to get this working. So I can fully understand that this does not have priority, but if that comes one day, I'd be super happy! :)

And, btw, to those fearing that this could open the gates to a flood of feature requests, one could wrap the "new box, new image, new page" configuration options into a small API that can be controlled using Lua scripts …? This way everyone could implement their own logic if they so wish without putting any burden here!

mb21 commented 3 years ago

The "print"-button in Electron-apps is a subpar alternative

Could you elaborate on that? You think the quality of the PDF is lacking (not sure if we could hope for anything better by rolling our own), or is it because of the UX of the print dialog, or...?

nathanlesage commented 3 years ago

The "print"-button in Electron-apps is a subpar alternative

Could you elaborate on that? You think the quality of the PDF is lacking (not sure if we could hope for anything better by rolling our own), or is it because of the UX of the print dialog, or...?

It's the UX - if you want to produce a PDF you expect to press a button and be done. With Electron's print abilities, however, this requires opening a separate window, and overall I think that "printing" is not completely the same as rendering. It does work, but it's, as I said, somehow subpar.

dalai4git commented 3 years ago

I would love to have a native PDF writer in pandoc. In addition to the issue with the truetype fonts, I would like to point out a few more things that will need to be considered:

Of course some of that customization can be left out, provided using filters or using external programs, but this way the advantage of a built-in implementation is slowly lost.

nathanlesage commented 3 years ago

The specification of the content of the title page. A basic title, author, date will be sufficient for some, but many use cases require more customization. Similarly for the content of the headers and footers.

I think here we enter the realm which the integrated PDF writer should (in my opinion) not fulfill. Headers and Footers, yes, a basic title page, yes, but not much above that. For that, there are already a lot of good external writers out here.

Remember: This issue is about adding something so that in many simple situations you don't need a full TeX installation.

I guess, in general we should have a list of stuff that should be included and what won't be included, and which could be defined as just too much, because any additional feature here will add to the maintenance costs of the team, which we shouldn't overstrain. I think @jgm and @tarleb et al. should set this (because they have a better overview over what's possible and what not).

including vector-based such as EPS/SVG

EPS is a dead and gruesomely complicated format (source: my mother is a graphics designer, nobody uses that anymore)

Math?

Guessing from my own experience, I'd say that would be a requirement for A LOT of people. Luckily, given good fonts, that can be done using fonts only (if I remember correctly, Pandoc already does this if MathJax/KaTeX is not available …?)

dalai4git commented 3 years ago

I think here we enter the realm which the integrated PDF writer should (in my opinion) not fulfill. Headers and Footers, yes, a basic title page, yes, but not much above that. For that, there are already a lot of good external writers out here.

Sure, but even basic styling and content adjustment is non trivial. If the CSS route is taken, the converter will need to be able to parse it and do something meaningful with it. For every thing that is not supported, someone will open a bug about it. I am not familiar with paged media, but can I specify that I want the page number on the header and a date on the footer? If not, where would that happen? Or that I want Chapter instead of Section (or Kapitel or Chapitre)

EPS is a dead and gruesomely complicated format (source: my mother is a graphics designer, nobody uses that anymore)

Maybe nobody uses it in graphics design, but in other areas it could be the format used by some older specialized software. Even if EPS is out of scope, SVG is also non trivial to parse and embed in a PDF. Asciidoctor-pdf leverages a 3kloc library for rendering SVGs into PDFs, but can't use the SVGs from draw.io directly without some minor adjustments.

There are also other things I forgot to mention. The converter will need to be able to layout tables with all their complications, e.g. borders, column widths, multirow or multicolumn cells, page breaks in the middle of a table, page breaks in the middle of a tall row, etc. Footnotes seem to be also non-trivial, see pagedjs or asciidoctor-pdf. Some will also expect generated table of contents or lists of tables and figures.

Remember: This issue is about adding something so that in many simple situations you don't need a full TeX installation.

In my opinion "simple situations" is difficult to define. For physical sciences math is part of simple, for other disciplines it is footnotes, for legal it could be support for 2 columns and for business customizable headers and footers or generation of table of contents. I don't want to be negative, but this is a big undertaking even if the scope is limited.

Anyway, maybe the maintainers have a better idea on how to move this forward. I just wanted to give some of my experience of generating PDFs using asciidoctor as a user.

nathanlesage commented 3 years ago

Sure, but even basic styling and content adjustment is non trivial.

Yes, but: If we go down the CSS route it could be extremely easy IF (!) and only if there is some Blink-port or something similar. If there isn't and we would need to implement CSS parsing ourselves (rather: The pandoc team because I'm dumb when it comes to Haskell) then it's gruesome, and I would advise against it.

For every thing that is not supported, someone will open a bug about it.

Except, you clearly put that into the docs. Then you can easily close as "Out of scope; use LaTeX" or similar.

Maybe nobody uses it in graphics design, but in other areas it could be the format used by some older specialized software.

Exactly, but we should not support outdated formats (just as Pandoc, for instance, is not built in 32 bit, even though it would be trivial to implement) to not foster dependence on those.

Even if EPS is out of scope, SVG is also non trivial to parse and embed in a PDF.

Absolutely, but one format less is one format less to implement.

In my opinion "simple situations" is difficult to define.

I totally agree. Which is why I explicitly stated that the Pandoc team should simply make a decision and then be done with it. After all, if such a bundled PDF writer comes, it's a nice thing of the Pandoc guys, not something that would be absolutely necessary. I think they will be able to make a good decision. In the end, many people would be happy even with a very simple, lossy, PDF generation at first. We shouldn't push boundaries too much here.

But, alas, I'm not able to code anything for Pandoc, so I'll leave my two-cent-comments with this and keep looking forward to a simple integrated PDF writer :)

jgm commented 3 years ago

If you want to style with CSS, just use the wkhtmltopdf backend. I don't see a point in reproducing its functionality in pandoc. (That would be a tremendous amount of code.)

If we ever do add a native PDF renderer, it's going to be somewhat basic and limited. But I think it's a nonstarter unless HPDF gets better font support, and since HPDF seems to be a more or less dead project now, I don't see much hope of that. It is also an important point that laying out footnotes and tables is nontrivial.

jgm commented 3 years ago

I note there's a font-related change in the latest version of HPDF: https://hackage.haskell.org/package/HPDF-1.5.2/changelog Maybe this fixes the issue I was having before -- worth checking. [answer is no]

ad-si commented 2 months ago

For future reference:

My preferred way of rendering PDFs is now --pdf-engine=typst. Typst is easy to install, fast, generates nice PDFs, and is actively maintained.
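
For anyone scripting this from Haskell, a small sketch of calling pandoc with one of the external engines mentioned in this thread. The type and function names are hypothetical, the file names are placeholders, and it assumes both pandoc and the chosen engine are installed and on the PATH.

```haskell
import System.Process (callProcess)

-- The external PDF engines mentioned in this thread (hypothetical type).
data PdfEngine = Typst | Wkhtmltopdf | Pdfroff
  deriving (Show, Eq)

-- Value passed to pandoc's --pdf-engine option.
engineName :: PdfEngine -> String
engineName Typst = "typst"
engineName Wkhtmltopdf = "wkhtmltopdf"
engineName Pdfroff = "pdfroff"

-- Command-line arguments for converting an input file to a PDF.
pandocArgs :: PdfEngine -> FilePath -> FilePath -> [String]
pandocArgs engine input output =
  [input, "-o", output, "--pdf-engine=" ++ engineName engine]

-- Run pandoc; throws an exception if pandoc exits with a failure code.
renderPdf :: PdfEngine -> FilePath -> FilePath -> IO ()
renderPdf engine input output =
  callProcess "pandoc" (pandocArgs engine input output)
```

For example, renderPdf Typst "letter.md" "letter.pdf" runs the equivalent of pandoc letter.md -o letter.pdf --pdf-engine=typst.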