manubot / rootstock

Clone me to create your Manubot manuscript
https://manubot.github.io/rootstock/

Use gotenberg for HTML to PDF conversion #396

Open dhimmel opened 3 years ago

dhimmel commented 3 years ago

Originally mentioned by @agitter at https://github.com/manubot/rootstock/issues/393#issuecomment-733068778, gotenberg is a:

Docker-powered stateless API for converting HTML, Markdown and Office documents to PDF

Since we're looking at replacing athenapdf with pagedjs-cli in https://github.com/manubot/rootstock/issues/394, it also makes sense to evaluate gotenberg.


dhimmel commented 3 years ago

One challenge is that the Docker image is large: 844 MB for thecodingmachine/gotenberg:6.3.1, compared with 291 MB for arachnysdocker/athenapdf:2.16.0.

dhimmel commented 3 years ago

Conversion from a URL

First, run the Docker container:

docker run --rm --publish 3000:3000 thecodingmachine/gotenberg:6.3

Second, make an API call to export the manuscript:

curl --request POST \
    --url http://localhost:3000/convert/url \
    --header 'Content-Type: multipart/form-data' \
    --form remoteURL=https://manubot.github.io/rootstock/v/97b294802ffcd39071b6e5b8ab59f60faf4be118/ \
    --output output/gotenberg.pdf

The result, gotenberg.pdf, looks good (similar to athenapdf).
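For repeated testing, the two steps above can be wrapped in a small helper. This is a minimal sketch, not part of rootstock: the GOTENBERG_URL variable and both function names are made up for illustration, and it assumes the Gotenberg 6 container from the docker command above is already listening on port 3000.

```shell
# Sketch only: GOTENBERG_URL and both function names are illustrative assumptions.
# Expects the Gotenberg 6 container from the docker command above to be running.
GOTENBERG_URL="${GOTENBERG_URL:-http://localhost:3000}"

# Print the URL-conversion endpoint, tolerating a trailing slash on the base URL.
gotenberg_endpoint() {
  printf '%s/convert/url\n' "${GOTENBERG_URL%/}"
}

# Usage: gotenberg_convert_url <remote-url> <output.pdf>
gotenberg_convert_url() {
  # Create the output directory so curl can write the file.
  mkdir -p "$(dirname "$2")"
  # --fail makes curl exit non-zero on an HTTP error instead of
  # saving the error response under a .pdf name.
  curl --fail --silent --show-error \
    --request POST \
    --url "$(gotenberg_endpoint)" \
    --form "remoteURL=$1" \
    --output "$2"
}
```

After sourcing the snippet, gotenberg_convert_url takes the manuscript URL as its first argument and a path such as output/gotenberg.pdf as its second. The main difference from the raw curl call is --fail, which makes a failed conversion visible in the exit status.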

castedo commented 1 year ago

@dhimmel It looks like manubot has settled on using WeasyPrint for HTML -> PDF conversion. Is this correct?

In my current manubot-like (but not manubot) workflow, I use pandoc to generate JATS XML from Markdown, and then I generate HTML and PDF from the JATS XML as an independent stage. I'm starting to think generating both HTML and PDF from the same JATS XML is a mistake. I'm now considering doing just JATS XML -> HTML -> PDF using WeasyPrint.

Any advice?

(It's a long explanation why I'm not doing markdown directly to HTML).

agitter commented 1 year ago

I'm revisiting this after @vincerubinetti pointed out that athenapdf has been archived in https://github.com/manubot/rootstock/issues/254#issuecomment-1569088082

It may be time to look more seriously into pagedjs-cli versus gotenberg as an athenapdf replacement. Based on @dhimmel's old comment above, it looks like gotenberg worked in initial testing. The latest gotenberg image, 7.8.3, is now somewhat smaller at 644 MB.
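A note for anyone re-running the earlier test against a 7.x image: the 6.x curl call probably needs adjusting, since Gotenberg 7 appears to have changed the image name, the route, and the form field. The sketch below assumes the gotenberg/gotenberg:7.8.3 image, the /forms/chromium/convert/url route, and the url form field; all three should be checked against the Gotenberg 7 documentation. The GOTENBERG7_URL variable and the function names are hypothetical.

```shell
# Assumption: the container was started with something like
#   docker run --rm --publish 3000:3000 gotenberg/gotenberg:7.8.3
GOTENBERG7_URL="${GOTENBERG7_URL:-http://localhost:3000}"

# Print the Gotenberg 7 Chromium URL-conversion endpoint (assumed route).
gotenberg7_endpoint() {
  printf '%s/forms/chromium/convert/url\n' "${GOTENBERG7_URL%/}"
}

# Usage: gotenberg7_convert_url <remote-url> <output.pdf>
gotenberg7_convert_url() {
  # Create the output directory so curl can write the file.
  mkdir -p "$(dirname "$2")"
  # The form field is assumed to be "url" in 7.x (it was "remoteURL" in 6.x).
  curl --fail --silent --show-error \
    --request POST \
    --url "$(gotenberg7_endpoint)" \
    --form "url=$1" \
    --output "$2"
}
```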

castedo commented 1 year ago

FWIW, I've gone pretty far down the WeasyPrint path and gotten good results, in large part because I'm careful to use fairly old HTML/CSS features. An example is the PDF link on this page: https://popgen.es/H5NOlCVM9P5Vv4LbeuwJsaME8kM/1.1/ The PDF is generated by WeasyPrint from a subset of the webpage content.

I have decoupled much of the HTML/CSS implementation from the above example into a separate project: https://gitlab.com/castedo/printstrap/ to help others do similarly with WeasyPrint.

In particular you might be interested in the article.html example on the article branch: https://gitlab.com/castedo/printstrap/-/blob/article/example/article.html
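For anyone who wants to try the WeasyPrint route on a local HTML file, a minimal sketch follows. It assumes the weasyprint command-line tool is installed (for example via pip install weasyprint). The function names are made up for illustration and are not part of printstrap.

```shell
# Sketch only: function names are illustrative; assumes the weasyprint CLI is on PATH.

# Derive the PDF path from an HTML path: example/article.html -> example/article.pdf
pdf_path_for() {
  printf '%s.pdf\n' "${1%.html}"
}

# Usage: html_to_pdf <input.html> [output.pdf]
# Without a second argument, the PDF is written next to the HTML file.
html_to_pdf() {
  weasyprint "$1" "${2:-$(pdf_path_for "$1")}"
}
```

Calling html_to_pdf example/article.html would write example/article.pdf. WeasyPrint does not execute JavaScript, so the HTML must already contain its final content.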

castedo commented 1 year ago

Also, a quick clarification: the article.html example in the article branch is actually much more advanced than the live example I give above on popgen.es today. The article.html example is a 2-column format, kind of like eLife articles, but is fully responsive, with the PDF corresponding directly to the HTML content at a particular screen width.

castedo commented 1 year ago

This discussion might be helpful in evaluating Chromium vs not:

https://github.com/singlesourcepub/community/discussions/49

I've partly gone down the WeasyPrint path because I hesitate to rely on Chromium. I consider it an open question whether Chromium is the right tool for specialized HTML -> PDF conversion where the HTML is highly constrained and not really a full web page of a website.

vincerubinetti commented 1 year ago

"I consider it an open question whether Chromium is the right tool for specialized HTML -> PDF conversion where the HTML is highly constrained and not really a full web page of a website."

I think Chromium is probably necessary. We sometimes need to rely on "newer" CSS properties, like overflow-wrap and word-break, which are not supported in WeasyPrint. More importantly, we sometimes need to rely on JavaScript execution, such as the way the attributes plugin merges table cells together. You could argue that we should find ways to do things statically at build time as much as possible, without JavaScript, but it would be a significant effort.

vincerubinetti commented 1 year ago

Maybe we should also emphasize somewhere in the docs that, as a last resort, one can manually print to PDF from the HTML version in any major browser.
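A scripted variant of the same fallback is to drive a locally installed Chromium-based browser in headless mode, which also runs the page's JavaScript before printing. The sketch below is an assumption-laden illustration, not something rootstock ships. The function names are hypothetical, and the list of candidate binary names is a guess that varies by platform.

```shell
# Sketch only: function names and the candidate binary list are assumptions.

# Print the first command among the arguments that exists on PATH.
first_available() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Usage: browser_print_pdf <url-or-file> <output.pdf>
browser_print_pdf() {
  browser="$(first_available chromium chromium-browser google-chrome)" || {
    echo "no Chromium-based browser found on PATH" >&2
    return 1
  }
  # --headless and --print-to-pdf are Chromium command-line switches.
  "$browser" --headless --print-to-pdf="$2" "$1"
}
```

The output can differ from a manual print, for example in margins and in the default header and footer, so it is worth comparing the two once.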