TOC Tree Fix and Page Break Handling for document handling

kreuzberger commented 1 year ago

The fix from PR #61 breaks documents under the following conditions:

The heading in the rst file has substitutions
The heading in the rst file comes from a directive.

So the fix breaks all documents where the anchor and the heading text do not match.

The fix was intended to solve the issue that the first toctree entry points to the file itself, not to the first section in the file. From Sphinx's point of view this is fine, because a file is not required to have a heading but should still appear in the TOC.

In my opinion the fix could also be done by breaking the document after the body element, or at least at the toctree-wrapper div.

However, under some circumstances (cover page yes/no, sidebar yes/no) this could lead to additional blank pages due to the h1/h2 page breaks (from simplepdf's main.css).

Therefore the following solution works for me:

1) Break the document BEFORE the first heading, e.g. at <div class="body">.
2) Disable page breaks on the first occurrence of h1/h2 in the document body.

For step 2, I was not able to identify the first h1/h2 properly, e.g. with h2:first-of-type or other selectors. Therefore I suggest adding a unique id to the first h1/h2 elements in the body.

This is for handling the document page-breaks after toc/cover only.

Further h1/h2 elements are not, and should not be, covered. Page breaking for them should be handled generally (as in main.css) or individually by adding page breaks in the rst document itself.

kreuzberger commented 1 year ago

The following code shows the enumeration of the headings and their use in the CSS. It is not necessary to enumerate ALL headings; in my opinion the first one should be sufficient. Alternatively, maybe somebody has a working CSS solution whose selectors properly select the first element in the body with weasyprint.

diff --git a/sphinx_simplepdf/builders/simplepdf.py b/sphinx_simplepdf/builders/simplepdf.py
index 7ac4427..2ff5c8f 100644
--- a/sphinx_simplepdf/builders/simplepdf.py
+++ b/sphinx_simplepdf/builders/simplepdf.py
@@ -7,8 +7,6 @@ import weasyprint
 import sass

 from bs4 import BeautifulSoup
-from docutils.nodes import make_id
-

 from sphinx import __version__
 from sphinx.application import Sphinx
@@ -170,8 +168,13 @@ class SimplePdfBuilder(SingleFileHTMLBuilder):
             links = sidebar.find_all("a", class_="reference internal")
             for link in links:
                 link["href"] = link["href"].replace(f"{self.app.config.root_doc}.html", "")
-                if link["href"].startswith("#document-"):
-                    link["href"] = "#" + make_id(link.text)
+
+        for heading_tag in ['h1', 'h2']:
+            logger.debug(f"search heading {heading_tag}")
+            heading = soup.find(heading_tag,  class_="")
+            logger.debug(f"found heading {heading.attrs}")
+            if not heading.has_attr("id"):
+                heading.attrs["id"]=f"{heading_tag}-0"

         return soup.prettify(formatter="html")

This would make it possible to identify the headings properly in CSS and to handle the page breaks correctly.

As an example, my custom CSS for page-break handling:

/*break before body after toc to ensure toc page fix */
div.body {
    break-before: always;
}

/* do not repeat title in body, already in cover */
div.body h1{
  display: none;
}

/*no additional page breaks for first h1 in body */
#h1-0 {
    page-break-before: avoid;
    break-before:avoid;
}

/*no additional page breaks for first h2 in body */
#h2-0 {
    page-break-before: avoid;
    break-before:avoid;
}
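
One thing to watch in the loop from the diff above: if a document has no h1 or no h2, the lookup finds nothing and the following attribute access would fail. The sketch below is a minimal, stdlib-only illustration of the same idea with that case guarded. It uses `xml.etree.ElementTree` on well-formed markup instead of BeautifulSoup, and the function name `tag_first_headings` and the sample markup are assumptions made for this example, not code from the builder.

```python
import xml.etree.ElementTree as ET


def tag_first_headings(root, levels=("h1", "h2")):
    """Give the first heading of each level a generic id such as "h1-0"."""
    for level in levels:
        # iter() walks the tree in document order; next() stops at the first hit
        heading = next(root.iter(level), None)
        if heading is None:
            continue  # no heading of this level in the document
        if "id" not in heading.attrib:
            heading.set("id", f"{level}-0")
    return root


# Hypothetical, simplified body markup used only for this illustration
html = (
    '<div class="body">'
    '<section id="intro"><h1>Title</h1>'
    '<section id="usage"><h2>Usage</h2></section>'
    '<section id="notes"><h2>Notes</h2></section>'
    '</section></div>'
)
body = tag_first_headings(ET.fromstring(html))
first_h1 = body.find(".//h1")
h2_ids = [h.get("id") for h in body.iter("h2")]

# A document without any headings passes through unchanged
no_headings = tag_first_headings(ET.fromstring("<div><p>text</p></div>"))
```

Only the first heading of each level receives an id, so rules like `#h1-0` and `#h2-0` stay independent of the heading text.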

kreuzberger commented 1 year ago

Reply to @danwos's comment https://github.com/useblocks/sphinx-simplepdf/pull/61#issuecomment-1655080405, to keep everything in this new issue:

The main point for me here is to have identical layouts in CSS files for documents of the same type, e.g. all of my "Online" help files (help.css) or my specification documents (specification.css). I would not like to rely on the content, and therefore a generic id for the first heading in the body would help me to address this correctly.

If I required individual, content-based identifiers, I would use CSS selectors such as section.ids > h2 or something like that. The sections already have content-dependent ids.

As I stated above, this could maybe also be done via other methods, but all of my attempts with the selectors first-of-type, nth, nth-child on h2, or nested with the divs and sections, weren't successful.

More importantly, #61 should be "reverted", because it breaks the TOCs and produces weasyprint warnings due to unresolved anchors. But reverting it without any other changes would bring back the original problem.
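
To make the "unresolved anchors" problem easier to see before weasyprint reports it, here is a small stdlib-only sketch that lists in-document links whose target id does not exist in the generated HTML. The class and function names are made up for this example, and the ids in the sample markup are hypothetical; it is meant as a debugging aid, not as part of the builder.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect every id and every in-document link target (href="#...")."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("id"):
            self.ids.add(attributes["id"])
        href = attributes.get("href") or ""
        if tag == "a" and href.startswith("#") and len(href) > 1:
            self.targets.append(href[1:])


def unresolved_anchors(html):
    """Return link targets that have no element with a matching id."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return [target for target in collector.targets if target not in collector.ids]


# Hypothetical markup: the first link was rewritten from the visible heading
# text, while the section keeps a different id, so the link cannot resolve.
sample = (
    '<div class="toctree-wrapper">'
    '<a class="reference internal" href="#release-1-2-3">Release 1.2.3</a>'
    '<a class="reference internal" href="#usage">Usage</a>'
    '</div>'
    '<section id="release-version"><h2>Release 1.2.3</h2></section>'
    '<section id="usage"><h2>Usage</h2></section>'
)
missing = unresolved_anchors(sample)
```

Running such a check on the single-file HTML before PDF generation would show which TOC entries point nowhere.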

danwos commented 1 year ago

Good point. Maybe instead of using the id, we could set classes for:

first
last
even/odd

kreuzberger commented 1 year ago

id or classes, I don't mind. I am no HTML expert who could start a discussion about which is better or semantically correct.

Headings with even/odd may help for two-page layouts. So OK, I would implement it as classes.
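
A minimal sketch of what the class-based variant could look like, assuming well-formed markup and the stdlib `xml.etree.ElementTree` instead of BeautifulSoup. The function name `add_position_classes` is hypothetical, and counting positions from 1 (so the first heading is "odd", as with CSS `:nth-child`) is an assumption of this example, not something decided in this thread.

```python
import xml.etree.ElementTree as ET


def add_position_classes(root, levels=("h1", "h2")):
    """Add first/last and even/odd classes to the headings of each level."""
    for level in levels:
        headings = list(root.iter(level))  # document order
        for position, heading in enumerate(headings, start=1):
            classes = heading.get("class", "").split()  # keep existing classes
            if position == 1:
                classes.append("first")
            if position == len(headings):
                classes.append("last")
            classes.append("odd" if position % 2 else "even")
            heading.set("class", " ".join(classes))
    return root


# Hypothetical body markup used only for this illustration
html = (
    '<div class="body">'
    '<h1>Title</h1>'
    '<h2>A</h2><h2 class="special">B</h2><h2>C</h2>'
    '</div>'
)
body = add_position_classes(ET.fromstring(html))
h1_classes = [h.get("class") for h in body.iter("h1")]
h2_classes = [h.get("class") for h in body.iter("h2")]
```

With classes like these, a stylesheet could target `h2.first` for the page-break rule instead of a generated id.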

kreuzberger commented 1 year ago

While merging those several PRs, something went wrong during conflict resolution. Two lines from the original fix are missing. I will open a new PR to include them again.

kreuzberger commented 1 year ago

The issue can now be closed. Possible improvements / wishes: extend this to more than the h1 and h2 levels. Maybe this should not be "hardcoded" but configurable, or simply applied to ALL headings. Waiting for a feature request :smile:

danwos commented 1 year ago

OK, I'll close this issue. Feel free to reopen it if feature requests show up :) And thanks for the implementation.

useblocks / sphinx-simplepdf

TOC Tree Fix and Page Break Handling for document handling #82