sphinx-doc / sphinx

The Sphinx documentation generator
https://www.sphinx-doc.org/
Other
6.59k stars 2.12k forks source link

Ensure deterministic resolution of toctree #12888

Closed khanxmetu closed 1 month ago

khanxmetu commented 2 months ago

Subject: Ensure deterministic toctree generation

Feature or Bugfix

Purpose

Detail

Relates

khanxmetu commented 1 month ago

This PR only resolves the issue of non-determinism, however users may still have their own expectations of what parent should be chosen which is not trivial to solve. I think we should warn the user when a document is included in multiple toctrees (different files). What do you think?

Edit: I have added the warning in the recent commit but feel free to discard.

khanxmetu commented 1 month ago
  • Whether tree depth might be better than lexicographic order as a sort key (@khanxmetu has also noted this as a possibility in an updated comment)

This seems a better approach than solely relying upon lexicographic order. However note that sorting and lexicographic order is still required to break the tie in case of equal path depths. Besides that, are you proposing for minimum or maximum path depth to be chosen?

  • A question really: it puzzles me slightly that we have nondeterminism here and not in the breadcrumb navigation trail (e.g. System Emulation > QEMU System Emulator Targets > PowerPC System emulator > pSeries family boards (pseries) > ... in the header of the referenced qemu page). Does that component gather the navtree differently? (I should, and will eventually, check)

Thank you for bringing this up. I found two important issues:

  1. It's true that the navtree header is different than what is shown in the sidebar but the problem in this specific case of qemu is not related to determinism. I've opened a new issue for that: #12926

  2. You're right that navtree header is gathered differently than the global toctree and this needs to be considered here. The parents in the navtree header come from BuildEnvironment.collect_relations() which uses _traverse_toctree() generator. Here an in order tree traversal takes place and the relations: parent, prev, next are generated in one pass. The parent obtained from in order traversal is the one that is discovered first in traversal and doesn't depend on lexicograph order or path depth. We also need to make sure this is consistent with our desired behavior of toctree generation. (In case of qemu’s specs/ppc-spapr-numa, navtree header happens to be generated consistently with toctree ancestors by chance).

I propose that shortest depth path should be chosen. We can add a BFS traversal function similar to _traverse_toctree to find the parents for all nodes satisfying shortest path. I would also suggest that we precompute desired parent mapping and use the same implementation for _get_toctree_ancestors.

jayaddison commented 1 month ago
  • Whether tree depth might be better than lexicographic order as a sort key (@khanxmetu has also noted this as a possibility in an updated comment)

This seems a better approach than solely relying upon lexicographic order. However note that sorting and lexicographic order is still required to break the tie in case of equal path depths. Besides that, are you proposing for minimum or maximum path depth to be chosen?

Thanks @khanxmetu, that makes sense that a tiebreaker will still be required if we use a path-depth approach.

To answer whether I'm suggesting to include path-depth in the sorting: initially I say no, let's continue to use solely lexicographic sorting -- because whatever method we choose, I think some projects/pages will still emit sidebar navigation menus that seem unexpected to some people, because the origin of the ambiguity is in the source files.

Even so, I think additional discussion of the navtree is worthwhile:

  1. You're right that navtree header is gathered differently than the global toctree and this needs to be considered here. The parents in the navtree header come from BuildEnvironment.collect_relations() which uses _traverse_toctree() generator. Here an in order tree traversal takes place and the relations: parent, prev, next are generated in one pass. The parent obtained from in order traversal is the one that is discovered first in traversal and doesn't depend on lexicograph order or path depth. We also need to make sure this is consistent with our desired behavior of toctree generation. (In case of qemu’s specs/ppc-spapr-numa, navtree header happens to be generated consistently with toctree ancestors by chance).

I wouldn't worry too much about ensuring that the navtree is always consistent with the sidebar toctree; as you've noticed in #12926, table-of-contents displays can be customized by themes and layout; my sense is that it could be difficult to get them to correspond precisely, and also that additional tree traversals might introduce unpredictable build performance changes (especially for large projects).

To resolve the bug we can focus on making the build output stable (ensuring that it stays the same for two or more subsequent builds) - and your changeset here already does that.

If it seems easy to re-use logic between _traverse_toctree and _get_toctree_ancestors, -- or alternatively to rename them to make them more distinct -- then that could be a nice subsequent cleanup as a separate pull request, but I don't think that's required initially.

AA-Turner commented 1 month ago

Our docs are failing... should we consider "INFO" for this or is "WARNING" appropriate? This is just to fix determinism so I think "WARNING" is a bit heavy...

/home/runner/work/sphinx/sphinx/doc/usage/configuration.rst: WARNING: document is referenced in multiple toctrees: ['usage/index', 'index'], selecting: usage/index <- usage/configuration [toc.multiple_toc_parents]
  /home/runner/work/sphinx/sphinx/doc/usage/extensions/index.rst: WARNING: document is referenced in multiple toctrees: ['usage/index', 'index'], selecting: usage/index <- usage/extensions/index [toc.multiple_toc_parents]
  /home/runner/work/sphinx/sphinx/doc/usage/restructuredtext/index.rst: WARNING: document is referenced in multiple toctrees: ['usage/index', 'index'], selecting: usage/index <- usage/restructuredtext/index [toc.multiple_toc_parents]
jayaddison commented 1 month ago

Our docs are failing... should we consider "INFO" for this or is "WARNING" appropriate? This is just to fix determinism so I think "WARNING" is a bit heavy...

/home/runner/work/sphinx/sphinx/doc/usage/configuration.rst: WARNING: document is referenced in multiple toctrees: ['usage/index', 'index'], selecting: usage/index <- usage/configuration [toc.multiple_toc_parents]
  /home/runner/work/sphinx/sphinx/doc/usage/extensions/index.rst: WARNING: document is referenced in multiple toctrees: ['usage/index', 'index'], selecting: usage/index <- usage/extensions/index [toc.multiple_toc_parents]
  /home/runner/work/sphinx/sphinx/doc/usage/restructuredtext/index.rst: WARNING: document is referenced in multiple toctrees: ['usage/index', 'index'], selecting: usage/index <- usage/restructuredtext/index [toc.multiple_toc_parents]

These seem to be false-positives in the case of the Sphinx documentation: yes there are multiple entries, but in each of these three examples the ancestor path always resolves to the same value (in other words: index + usage/configuration == index/usage + configuration).

So I'd probably agree with downgrading to info-level messaging for now.

AA-Turner commented 1 month ago

false-positives

Is there a simple way to avoid these? False positives are pretty frustrating for users.

A

jayaddison commented 1 month ago

Maybe; if so I think it may require some kind of relative-path-resolution approach (I'm experimenting with it at the moment).

jayaddison commented 1 month ago

I've a fairly terrible work-in-progress patch:

diff --git a/sphinx/environment/__init__.py b/sphinx/environment/__init__.py
index 28e53aa7d..bec8fef75 100644
--- a/sphinx/environment/__init__.py
+++ b/sphinx/environment/__init__.py
@@ -789,19 +789,24 @@ def _traverse_toctree(

 def _check_toc_parents(toctree_includes: dict[str, list[str]]) -> None:
-    toc_parents: dict[str, list[str]] = {}
+    toc_parents: dict[str, set[str]] = {}
     for parent, children in toctree_includes.items():
         for child in children:
-            toc_parents.setdefault(child, []).append(parent)
+            base = path.commonprefix([parent, child])
+            if base and base.endswith('/'):
+                parent = parent.removeprefix(base)
+                child = child.removeprefix(base)
+            toc_parents.setdefault(child, set()).add(f"{parent}/{child}")

     for doc, parents in sorted(toc_parents.items()):
         if len(parents) > 1:
+            candidates = sorted(parent.removesuffix(f"/{doc}") for parent in parents)
             logger.info(
                 __(
                     'document is referenced in multiple toctrees: %s, selecting: %s <- %s'
                 ),
-                parents,
-                max(parents),
+                candidates,
+                max(candidates),
                 doc,
                 location=doc,
                 type='toc',

However I'm yet to convince myself that it's correct, and it's definitely not clean code.

jayaddison commented 1 month ago

No; I think it's incorrect due to the absence of the base component from the combined {parent}/{child} path.

jayaddison commented 1 month ago

I think this is more like it, but supporting unit test coverage is required:

diff --git a/sphinx/environment/__init__.py b/sphinx/environment/__init__.py
index 28e53aa7d..9780ef396 100644
--- a/sphinx/environment/__init__.py
+++ b/sphinx/environment/__init__.py
@@ -789,19 +789,27 @@ def _traverse_toctree(

 def _check_toc_parents(toctree_includes: dict[str, list[str]]) -> None:
-    toc_parents: dict[str, list[str]] = {}
+    toc_parents: dict[str, set[str]] = {}
     for parent, children in toctree_includes.items():
         for child in children:
-            toc_parents.setdefault(child, []).append(parent)
+            # Remove duplicate ancestry if it exists
+            base = path.commonprefix([parent, child])
+            if base and base.endswith('/'):
+                parent = parent.removeprefix(base)
+                child = child.removeprefix(base)
+            # Record de-duplicated toctree routes to each document
+            absolute_path = "/".join(filter(None, [base, parent]))
+            toc_parents.setdefault(child, set()).add(absolute_path)

     for doc, parents in sorted(toc_parents.items()):
         if len(parents) > 1:
+            candidates = sorted(parent for parent in parents)
             logger.info(
                 __(
                     'document is referenced in multiple toctrees: %s, selecting: %s <- %s'
                 ),
-                parents,
-                max(parents),
+                candidates,
+                max(candidates),
                 doc,
                 location=doc,
                 type='toc',
jayaddison commented 1 month ago

Possibly also wrong, in particular due to incorrect in-place modification of the child variable.

jayaddison commented 1 month ago

I'll try to write some unit test coverage within the next day or so to help implement more-precise messaging _check_toc_parents under various conditions. If I don't get around to that, then I'd suggest we remove the relevant messages since they can be misleading.

khanxmetu commented 1 month ago

These seem to be false-positives in the case of the Sphinx documentation: yes there are multiple entries, but in each of these three examples the ancestor path always resolves to the same value (in other words: index + usage/configuration == index/usage + configuration).

@jayaddison I don't quite understand how are the warnings false-positive.

I don't think it's best to think of ancestor path as a single string representing path to root because the underlying implementation does not have any sense of directories and is merely a list of docnames/docpaths entities as strings each pointing to the next child in chain.
Also maybe you missed index on RHS in your example, I think it should appear on both sides since it's the root. From my understanding the comparison should be as follows:

['index', 'usage/configuration'] != ['index', 'usage/index', 'usage/configuration']. 

A graphical representation similar to what I had in tests:

        'index'
        /     \
       /       \
      /         \
'usage/index'    \
       \          \
        \          \
     'usage/configuration'

There exists an ambiguity as to what parent should be chosen by 'usage/configuration'. In serial builds or the patch in this pr, lexicographically greatest parent i.e 'usage/index' is chosen due to which "Configuration" section is expanded under "Using Sphinx" in the sidebar.

Had it been that 'index' > 'usage/index' (for example let's say we had similar structure but for appendix: 'index' > 'appendix/index'), in sidebar the contents would've have been expanded at the root level section instead.

jayaddison commented 1 month ago

Ok, thanks @khanxmetu. What you say makes sense. I suppose the particular scenario I'm considering is cases where apparently-ambiguous toctree references all in fact map to a single path.

In other words: a/b/c and a/c may be less ambiguous in some sense than a/b/c and a/d/c -- the latter contains two genuinely distinct navigation routes, while the first contains a single route but allows for a jump over the intermediate step.

jayaddison commented 1 month ago

Also maybe you missed index on RHS in your example, I think it should appear on both sides since it's the root. From my understanding the comparison should be as follows:

['index', 'usage/configuration'] != ['index', 'usage/index', 'usage/configuration']. 

It's certainly possible that I made a mistake here; but I think that by implicitly expanding an index path for any non-index document (e.g. usage/configuration has an implicit parent of usage/index) that it's feasible to resolve these.

jayaddison commented 1 month ago

My apologies: reading the code, it appears that, apart from the root document, docpaths ending in index are purely used by convention to provide something like a familiar directory structure -- the term index does not have a special meaning. As a result, I don't think we can (or should) infer their implicit existence.

Given that, I think the existing info-level messaging here is fine -- it does not indicate a false-positive, since the ambiguity that @khanxmetu mentions is genuine:

There exists an ambiguity as to what parent should be chosen by 'usage/configuration'. In serial builds or the patch in this pr, lexicographically greatest parent i.e 'usage/index' is chosen due to which "Configuration" section is expanded under "Using Sphinx" in the sidebar.

I'll re-approve the PR to indicate that.

jayaddison commented 1 month ago

@AA-Turner it's taken me a longwinded route to get there, but I believe that the ambiguity messages about the Sphinx documentation are genuine, in that they make it unclear (in machine terms, if not in human terms) about how the corresponding sidebar toctree should be expanded.

I still feel that it might be possible somehow to devise a more advanced algorithm that can resolve a subset of the multi-toctree reference cases unambiguously, however I don't have 100% confidence that it's feasible in a performant way (mostly I don't feel I have a complete mental model of the problem space yet), nor do I think we should do it during this PR even if it is.

So in short: I'm comfortable with merging the pull request.

timhoffm commented 1 month ago

Semi-OT: from https://github.com/sphinx-doc/sphinx/issues/6714#issuecomment-2351757525

The main issue here is the assumption of _get_toctree_ancestors that the toctree is indeed a tree (meaning each node/doc has a unique parent)

Is a non-tree toc arrangement really desirable? Or is it just a structure that happens to be allowed through the way .. toctree is designed and implemented?

I mean the name "toctree" already implies a tree. And from a user prespective, I find a clear hierarchial TOC helpful to understand the structure of docs. For my docs, I'd rather place a topic in one place and cross-reference from another place, rather than including it fully in two locations. So, while it's nice to make this deterministic, the ambiguity persists: I go into a subtopic from one location and find myself in a completely different toc path. I would consider that a bug for my documentation. Wouldn't it be better (or at least an option) to warn on multiple inclusions of the same document?

khanxmetu commented 1 month ago

@timhoffm The existing implementation doesn't fully account for non-tree tocs and I don't think if there's any good way to resolve multiple parents/references in navigation bar without jumping locations in toc as you mentioned.

This PR has two purposes in case of multiple parents:

  1. Ensure determinism to achieve reproducible parallel builds.
  2. Warn the user, though it has been downgraded to info-level output currently due to less severity.

The workflow you mentioned is ideal for local toc (in-document, see the issue with qemu-docs for example where toctree directive is (ab)used for spapr-numa: https://github.com/qemu/qemu/blob/master/docs/system/ppc/pseries.rst?plain=1#L144). However, in some cases it could be desirable to have a secondary reference in global toctree (side-bar navigation) as is the case with Sphinx's documentation where "Configuration" link is also put on the root-level in the sidebar for convenience.

If this issue seems significant, perhaps, in the future we could have primary(to be used as toc parent) and secondary(not to be used as toc parent) toc entries in the case of conflicts.