Open dhimmel opened 4 years ago
One challenge is that the Docker image is large: 844 MB for thecodingmachine/gotenberg:6.3.1. This compares to 291 MB for arachnysdocker/athenapdf:2.16.0
First, run the Docker image:
docker run --rm --publish 3000:3000 thecodingmachine/gotenberg:6.3
Second, make an API call to export the manuscript:
curl --request POST \
--url http://localhost:3000/convert/url \
--header 'Content-Type: multipart/form-data' \
--form remoteURL=https://manubot.github.io/rootstock/v/97b294802ffcd39071b6e5b8ab59f60faf4be118/ \
--output output/gotenberg.pdf
Result at gotenberg.pdf looks good (similar to athenapdf).
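For repeated testing, the two steps above could be wrapped in a small helper. This is only a sketch: the function name `gotenberg_convert_url` and the `GOTENBERG_URL` and `DRY_RUN` variables are made up for illustration and are not part of Gotenberg. It assumes a Gotenberg 6.x container is already listening on port 3000, as started above. The explicit `Content-Type` header is left out because curl's `--form` sets it.

```shell
# Hypothetical helper around the Gotenberg 6.x /convert/url endpoint used above.
# GOTENBERG_URL and DRY_RUN are illustrative names, not Gotenberg settings.
GOTENBERG_URL="${GOTENBERG_URL:-http://localhost:3000}"

gotenberg_convert_url() {
  remote_url="$1"   # page to render, e.g. a Manubot HTML build
  output="$2"       # path of the PDF to write
  # Build the curl command as positional parameters so it can be printed or run.
  set -- curl --fail --silent --show-error \
    --request POST \
    --url "${GOTENBERG_URL}/convert/url" \
    --form "remoteURL=${remote_url}" \
    --output "${output}"
  if [ -n "${DRY_RUN:-}" ]; then
    # Dry run: print the command instead of contacting the server.
    printf '%s\n' "$*"
  else
    mkdir -p "$(dirname "${output}")"
    "$@"
  fi
}
```

With the container running, `gotenberg_convert_url https://manubot.github.io/rootstock/v/97b294802ffcd39071b6e5b8ab59f60faf4be118/ output/gotenberg.pdf` should reproduce the export above. Setting `DRY_RUN=1` first prints the curl command without sending it.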
@dhimmel It looks like manubot has settled on using WeasyPrint for HTML -> PDF conversion. Is this correct?
In my current manubot-like workflow (but not manubot) I use pandoc to generate JATS XML from markdown and then I generate HTML and PDF from JATS XML as an independent stage. I'm starting to think generating both HTML and PDF from the same JATS XML is a mistake. I'm now considering doing just JATS XML -> HTML -> PDF using WeasyPrint.
Any advice?
(It's a long explanation why I'm not doing markdown directly to HTML).
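The JATS XML -> HTML -> PDF chain being considered above could look roughly like the sketch below. It makes assumptions: the comment does not say how the HTML is generated from JATS, so pandoc's JATS reader stands in for that step here, and the names `jats_to_pdf`, `run_step`, and `DRY_RUN` are made up for illustration. The `weasyprint` command-line tool takes an input HTML file and an output PDF path.

```shell
# Hypothetical sketch of a JATS XML -> HTML -> PDF chain.
# Assumption: pandoc does the JATS -> HTML step; the workflow described
# above may use a different tool for that stage.

run_step() {
  # Print the command when DRY_RUN is set, otherwise execute it.
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

jats_to_pdf() {
  jats_xml="$1"
  base="${jats_xml%.xml}"   # article.xml -> article
  # Stage 1: JATS XML -> standalone HTML
  run_step pandoc --from jats --to html5 --standalone \
    --output "${base}.html" "${jats_xml}" &&
  # Stage 2: HTML -> PDF, so the PDF is derived from the same HTML readers see
  run_step weasyprint "${base}.html" "${base}.pdf"
}
```

Running `jats_to_pdf article.xml` would write `article.html` and `article.pdf` next to the input, provided pandoc and WeasyPrint are installed.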
I'm revisiting this after @vincerubinetti pointed out that athenapdf has been archived in https://github.com/manubot/rootstock/issues/254#issuecomment-1569088082
It may be time to look more seriously into pagedjs-cli versus gotenberg as an athenapdf replacement. Based on @dhimmel's old comment above, it looks like gotenberg worked in initial testing. The latest gotenberg image 7.8.3 is now somewhat smaller at 644MB.
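To compare image sizes locally, something like the following sketch could be used. The helper names `bytes_to_mb` and `image_size_mb` are made up for illustration. `docker image inspect` reports the size of a locally pulled image in bytes, which may differ from the compressed size shown on a registry page, so the numbers need not match the figures quoted in this thread.

```shell
# Hypothetical helpers for comparing Docker image sizes.

bytes_to_mb() {
  # Integer megabytes (10^6 bytes), rounded down.
  echo $(( $1 / 1000000 ))
}

image_size_mb() {
  # The image must already be pulled; .Size is the local size in bytes.
  size_bytes=$(docker image inspect --format '{{.Size}}' "$1") || return 1
  bytes_to_mb "${size_bytes}"
}
```

For example, after `docker pull thecodingmachine/gotenberg:6.3.1`, running `image_size_mb thecodingmachine/gotenberg:6.3.1` prints that image's local size in MB.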
FWIW, I've gone pretty far down the WeasyPrint path and gotten good results, in large part because I'm careful to use fairly old HTML/CSS features. An example is the PDF link on this page: https://popgen.es/H5NOlCVM9P5Vv4LbeuwJsaME8kM/1.1/ The PDF is generated by WeasyPrint from a subset of the webpage content.
I have decoupled much of the HTML/CSS implementation from the above example into a separate project: https://gitlab.com/castedo/printstrap/ to help others do similarly with WeasyPrint.
In particular you might be interested in the article.html example on the article branch: https://gitlab.com/castedo/printstrap/-/blob/article/example/article.html
Also quick clarification: the article.html example in the article branch is actually much more advanced than the live example I give above on popgen.es today. The article.html example is a 2-column format kind of like eLife articles but is fully responsive with the PDF corresponding directly to the HTML content at a particular screen width.
This discussion might be helpful in evaluating Chromium vs not:
https://github.com/singlesourcepub/community/discussions/49
I've partly gone down the WeasyPrint path because I hesitate to rely on Chromium. I consider it an open question whether Chromium is the right tool for specialized HTML -> PDF conversion where the HTML is highly constrained and not really a full web page of a website.
I consider it an open question whether Chromium is the right tool for specialized HTML -> PDF conversion where the HTML is highly constrained and not really a full web page of a website.
I think Chromium is probably necessary. We sometimes need to rely on "newer" CSS properties, like overflow-wrap and word-break, which are not supported in WeasyPrint. More importantly, we sometimes need to rely on JavaScript execution, such as the way the attributes plugin merges table cells together. You could argue that we should find ways to do things statically at build time as much as possible, without JavaScript, but it would be a significant effort.
Maybe we should also emphasize somewhere in the docs that, as a last resort, one can manually print to PDF from the HTML version in any major browser.
Originally mentioned by @agitter at https://github.com/manubot/rootstock/issues/393#issuecomment-733068778, gotenberg is a:
Since we're looking at replacing athenapdf with pagedjs-cli in https://github.com/manubot/rootstock/issues/394, it also makes sense to evaluate gotenberg.
Links: