useblocks / sphinx-simplepdf

A simple PDF builder for Sphinx documentations
https://sphinx-simplepdf.readthedocs.io
MIT License
32 stars 14 forks source link

[#60] Fix html anchors #61

Closed gitsandi closed 1 year ago

gitsandi commented 1 year ago

For the very first rst file toc entry, Sphinx doesn't add an anchor, but instead uses file's title. Because of that, some links in Table of contents of PDF document are not correct, resulting in wrong page numbers. Fix replaces all such anchors with the correct ones, after the creation of HTML document.

danwos commented 1 year ago

Thanks for the PR. 👍 Looks really clean and compact :)

Can you mention the fix in docs/changelog.rst? Feel also free to add yourself to the AUTHORS file.

kreuzberger commented 11 months ago

Currently i ran into problems with this fix due to Chapters/Sections containing ReST substitutions. Eg.

===========================
|doc_type_name| |project|
============================

Will lead to ids "doc-type-name-project", but the "parsing" of the index html leads to "substituted" text. E.g. link to "Help TrafficLight" if doc_type_name is substitued by "Help" etc.

I think the linking is no problem, the problem is that the id generated and refered is the id of a span element and the heading is on a new page due to the page-break-before css property in sphinx-simplepdf main.css

So using substitutions in headers should be usable cause they are a sphinx feature and the toctree fix should not be applied, maybe the css should be improved.

kreuzberger commented 11 months ago

The toctree fix also breaks documents / ReST files with "sections" not made in pure ReST, e.g if the section comes from "custom directives".

E.g. in a test procedure document, i add sections via my own directives.

The document itself has no title / section in ReST. So the first title tocfix does generate a an invalid link

kreuzberger commented 11 months ago

During investigation of this issue i found it somewhere difficult to handle properly the page-breaks for h1/h2 for all purposes (cover yes/no, sidebar yes/no) for the first occuring headings of this levels in the body.

Therefore i would suggest to enumerate them to handle the page-breaks properly in the css.

diff --git a/sphinx_simplepdf/builders/simplepdf.py b/sphinx_simplepdf/builders/simplepdf.py
index 7ac4427..d4502d5 100644
--- a/sphinx_simplepdf/builders/simplepdf.py
+++ b/sphinx_simplepdf/builders/simplepdf.py
@@ -7,8 +7,6 @@ import weasyprint
 import sass

 from bs4 import BeautifulSoup
-from docutils.nodes import make_id
-

 from sphinx import __version__
 from sphinx.application import Sphinx
@@ -170,8 +168,16 @@ class SimplePdfBuilder(SingleFileHTMLBuilder):
             links = sidebar.find_all("a", class_="reference internal")
             for link in links:
                 link["href"] = link["href"].replace(f"{self.app.config.root_doc}.html", "")
-                if link["href"].startswith("#document-"):
-                    link["href"] = "#" + make_id(link.text)
+
+        for heading_tag in ['h1', 'h2']:
+            logger.debug(f"search heading {heading_tag}")
+            headings = soup.find_all(heading_tag,  class_="")
+            number = 0
+            for heading in headings:
+                logger.debug(f"found heading {heading.attrs}")
+                if not heading.has_attr("id"):
+                    heading.attrs["id"]=f"{heading_tag}-{number}"
+                    number = number + 1

         return soup.prettify(formatter="html")

This would ensure to properly identify the headings in css and handle the prage-breaks properly

As an example my custom css for page page handling

/*break before body after toc to ensure toc page fix */
div.body {
    break-before: always;
}

/* do not repeat title in body, already in cover */
div.body h1{
  display: none;
}

/*no additional page breaks for first h1 in body */
#h1-0 {
    page-break-before: avoid;
    break-before:avoid;
}

/*no additional page breaks for first h2 in body */
#h2-0 {
    page-break-before: avoid;
    break-before:avoid;
}

css selectors like h2:first-of-type and others (nth-child) didnt work for me.

Maybe the handling could implemented in an other ways (class instead of id, stop enumeration after first element etc).

danwos commented 11 months ago

Can you create a new issue for this? It's hard to follow a discussion in an already merged PR :)

I'm not sure if I like the idea of enumerating the headings, as it is not a fixed, reproducible ID. If a new heading gets added, all the following numbers will change and the CSS is no longer valid.

But I don't have a good idea for a different solution. Maybe a hash value could be calculated based on the heading text and then set as ID. But this is also not really a fixed ID.