Open snowch opened 4 years ago
I am also interested in this functionality, but having reviewed the code and the documentation for Puppeteer's page.pdf(), I find it quite limited.
https://pptr.dev/#?product=Puppeteer&version=v5.3.1&show=api-pagepdfoptions
One thing I noticed, for example, is that it does not recognize and convert links, and the documentation for page.pdf() does not indicate that there is an option to do so. I also do not see a way to do this with pdf-lib, as it is too low level. In other words, inter-document links are not currently converted (quite frankly, I am not sure whether this is because of how Docusaurus builds the URLs or because of pdf-lib). It may be possible to address this by cleaning the HTML hrefs before feeding the page into page.pdf().
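To make the href-cleaning idea concrete, here is a rough sketch in Python (just for illustration; the real thing would live in the Node code). It assumes that all docs pages are concatenated into one HTML document before printing, that every page lives under a "/docs/" prefix, and that each section in the combined HTML carries an id derived from its path. The helper names and the id scheme are my own assumptions, not anything Docusaurus, Puppeteer, or pdf-lib provides.

```python
import re
from urllib.parse import urlsplit

# Matches double-quoted href attributes only; a real implementation
# would probably use an HTML parser instead of a regex.
HREF_RE = re.compile(r'href="([^"]*)"')


def to_internal_anchor(href: str, base_path: str = "/docs/") -> str:
    """Map an inter-document href to an in-document anchor (hypothetical scheme)."""
    parts = urlsplit(href)
    # Leave external links, mailto: links, pure "#fragment" links, and
    # anything outside the docs tree untouched.
    if parts.scheme or parts.netloc or not parts.path.startswith(base_path):
        return href
    # "/docs/api/config" -> "api-config"
    slug = parts.path[len(base_path):].strip("/").replace("/", "-")
    if parts.fragment:
        slug = f"{slug}-{parts.fragment}" if slug else parts.fragment
    return f"#{slug}"


def rewrite_links(html: str, base_path: str = "/docs/") -> str:
    """Rewrite every href in the combined HTML before it is printed to PDF."""
    return HREF_RE.sub(
        lambda m: f'href="{to_internal_anchor(m.group(1), base_path)}"', html
    )


if __name__ == "__main__":
    sample = '<a href="/docs/intro">Intro</a> <a href="https://example.com/x">Ext</a>'
    print(rewrite_links(sample))
```

Whether Chromium then turns those "#..." links into clickable links inside the PDF is something I have not verified; the sketch only covers the HTML clean-up step.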
The DocFx project also has the ability to generate PDFs. That project uses wkhtmltopdf, which seems to be a higher-level abstraction that can convert the links to internal links and also includes an out-of-the-box ability to generate a table of contents as well as an outline.
The documentation is here: https://wkhtmltopdf.org/usage/wkhtmltopdf.txt
Table Of Contents:
A table of contents can be added to the document by adding a toc object to the
command line. For example:
wkhtmltopdf toc https://qt-project.org/doc/qt-4.8/qstring.html qstring.pdf
The table of contents is generated based on the H tags in the input documents.
First a XML document is generated, then it is converted to HTML using XSLT.
The generated XML document can be viewed by dumping it to a file using the
--dump-outline switch. For example:
wkhtmltopdf --dump-outline toc.xml https://qt-project.org/doc/qt-4.8/qstring.html qstring.pdf
The XSLT document can be specified using the --xsl-style-sheet switch. For
example:
wkhtmltopdf toc --xsl-style-sheet my.xsl https://qt-project.org/doc/qt-4.8/qstring.html qstring.pdf
The --dump-default-toc-xsl switch can be used to dump the default XSLT style
sheet to stdout. This is a good start for writing your own style sheet
wkhtmltopdf --dump-default-toc-xsl
The XML document is in the namespace "http://wkhtmltopdf.org/outline" it has a
root node called "outline" which contains a number of "item" nodes. An item
can contain any number of item. These are the outline subsections to the
section the item represents. A item node has the following attributes:
* "title" the name of the section.
* "page" the page number the section occurs on.
* "link" a URL that links to the section.
* "backLink" the name of the anchor the section will link back to.
The remaining TOC options only affect the default style sheet so they will not
work when specifying a custom style sheet.
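For anyone wanting to post-process the dumped outline, here is a minimal sketch that reads the XML described above and flattens it into (depth, title, page) rows. The namespace, element names, and attribute names come from the quoted documentation; the sample XML is made up for illustration and is not real wkhtmltopdf output, and the function name is my own.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the wkhtmltopdf documentation quoted above.
NS = {"o": "http://wkhtmltopdf.org/outline"}

# Made-up sample in the documented shape (not real tool output).
SAMPLE = """<outline xmlns="http://wkhtmltopdf.org/outline">
  <item title="Intro" page="1" link="#a" backLink="a-back">
    <item title="Install" page="2" link="#b" backLink="b-back"/>
  </item>
  <item title="Usage" page="3" link="#c" backLink="c-back"/>
</outline>"""


def flatten_outline(xml_text: str) -> list:
    """Return (depth, title, page) tuples in document order."""
    root = ET.fromstring(xml_text)
    rows = []

    def walk(node, depth):
        # Items nest: children are the subsections of their parent item.
        for item in node.findall("o:item", NS):
            rows.append((depth, item.get("title", ""), int(item.get("page", "0"))))
            walk(item, depth + 1)

    walk(root, 0)
    return rows


if __name__ == "__main__":
    for depth, title, page in flatten_outline(SAMPLE):
        print("  " * depth + f"{title} ... {page}")
```

In practice the input would be the file written by --dump-outline rather than an inline string.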
To me, it seems like adding the ability to generate a ToC and outline (left navigation) would be tantamount to rewriting this library quite substantially.
Does it make sense to do so? Or perhaps fork the project or start a new project?
I found this repo, which seems to support generating a TOC (though I didn't check it in depth):
https://github.com/simologos/papersaurus
Puppeteer does not yet support the generation of TOCs. See this feature request and this Chromium bug. Therefore this package generates a PDF, then parses it again to update the page numbers in the TOC. Therefore the pdfFooterParser...
This approach looks decent to me.
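As I understand the README, the idea is a two-pass render: print once, find out where each heading ended up, then fill the page numbers into the TOC. A minimal sketch of the lookup step is below. It is not papersaurus's actual code: it assumes the text of each page of the first-pass PDF has already been extracted into a list of strings by some PDF parser, that headings can be found by plain substring match, and that the TOC occupies the first `toc_pages` pages. All names here are hypothetical.

```python
def resolve_toc_pages(headings, page_texts, toc_pages=1):
    """Return (heading, page) pairs; page is 1-based, or None if not found.

    headings:   TOC entry titles, in order.
    page_texts: extracted text of each page from the first-pass PDF.
    toc_pages:  leading pages to skip, since the TOC itself repeats every heading.
    """
    resolved = []
    for heading in headings:
        page = None
        for number, text in enumerate(page_texts, start=1):
            if number <= toc_pages:
                continue  # do not match the heading inside the TOC itself
            if heading in text:
                page = number  # first page after the TOC that contains it
                break
        resolved.append((heading, page))
    return resolved


if __name__ == "__main__":
    pages = ["Contents\nIntro\nUsage", "Intro\nSome text", "more text", "Usage\nDetails"]
    print(resolve_toc_pages(["Intro", "Usage"], pages))
```

A substring match will misfire when a heading's text also appears in body copy on an earlier page, so a real implementation would need something more robust, such as unique markers per heading.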
Does this generate the table of contents? It appears not, but this would be really good to have.