sphinx-doc / sphinx

The Sphinx documentation generator
https://www.sphinx-doc.org/

Document how `sphinx`'s change detection works #11556

Open dbitouze opened 1 year ago

dbitouze commented 1 year ago

Continuous integration with Sphinx-doc, as described here, works well.

Unfortunately, on gitlab.com, it is Docker based and clones each repository fresh when it starts running continuous integration. So even if a single source file is modified, all the corresponding HTML pages of all the .rst source files are rebuilt (although the cache claims to be restored and, indeed, so is the doctree directory). This isn't a problem if there are only a few source files, but it becomes unusable if there are a lot (more than 1,200 in my real-life use case: the rebuild takes more than 15 minutes and a lot of resources are consumed unnecessarily).

This problem may be due to the fact that Git, unlike other version control systems, does not preserve the original timestamp of committed files. So relying on git-restore-mtime should be a solution. But this is not the case, as you can see with the following sandbox repository:

https://gitlab.com/denisbitouze/minimal-sphinx-minimal/

where the commit changes (only) the source test.rst file but triggers also the rebuild of the index.html file corresponding to the index.rst source file that hasn't been changed.
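The mtime-restoration idea can be sketched in Python (a hypothetical, simplified stand-in for what git-restore-mtime does; the real tool batches a single git log pass instead of running one git call per file):

```python
import os
import subprocess

def restore_mtimes(repo_dir, files):
    # Hypothetical, simplified equivalent of git-restore-mtime:
    # set each file's mtime to the committer time of the last commit
    # that touched it, so mtime-based change detection no longer sees
    # a fresh clone as "everything modified".
    for relpath in files:
        out = subprocess.run(
            ['git', '-C', repo_dir, 'log', '-1', '--format=%ct', '--', relpath],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        if out:
            ts = int(out)
            os.utime(os.path.join(repo_dir, relpath), (ts, ts))
```

This is only a sketch; in CI one would simply install and run the real git-restore-mtime tool as discussed below.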

I've had a look at the code but can't work out how sphinx change detection works. Would it be possible to document this? It would be very useful, especially nowadays when CI/CD is becoming more and more popular and useful.

picnixz commented 1 year ago

The reason why the HTML was rebuilt is that the HTML file did not exist anymore. _build/html is the HTML output directory, but the latter is only kept as an artifact and not cached (at least that's how I see it; I haven't checked that this is really the case).

Concerning your documentation request, to my understanding, this is what happens:

Let's illustrate this by an example. Assume that we have the following RST files:

index.rst:

  .. toctree::
     :maxdepth: 1

     bar.rst

foo.rst:

  The Foo
  -------
  foo

bar.rst:

  The Bar
  -------
  and the foo:

  .. include:: foo.rst
dbitouze commented 1 year ago

The reason why the HTML was rebuilt is that the HTML file did not exist anymore. _build/html is the HTML output directory, but the latter is only kept as an artifact and not cached (at least that's how I see it; I haven't checked that this is really the case).

Do you see how this could be worked around? I tried to replace _build/doctrees by _build/html in:

https://gitlab.com/denisbitouze/minimal-sphinx-minimal/-/blob/main/.gitlab-ci.yml#L4-L6

but with no success. Do you advise for the _build/html content to be:

  1. downloaded, before the Sphinx build, from,
  2. and uploaded, after the Sphinx build, to,

somewhere in the cloud?

Concerning your documentation request, to my understanding, this is what happens:

[...]

Many thanks for this detailed explanation! Unfortunately, I don't see how I could use it in the context of the CI/CD.

picnixz commented 1 year ago

Actually, I think you can just add _build/html as a cached path as well:

  cache:
    paths:
    - _build/doctrees
    - _build/html

Unfortunately, I don't see how I could use it in the context of the CI/CD.

One way to do it is to have a custom extension and an event handler for the env-get-outdated event if you want to hack into things. But for your specific CI/CD, I think you simply need to cache the _build directory (you could actually cache the entire directory).

dbitouze commented 1 year ago

Actually, I think you can just add _build/html as a cached path as well:

  cache:
    paths:
    - _build/doctrees
    - _build/html

Unfortunately, it doesn't work. After this addition and a single change of (only) test.rst, the index.html file is rebuilt as well.

Unfortunately, I don't see how I could use it in the context of the CI/CD.

One way to do it is to have a custom extension and an event handler for the env-get-outdated event if you want to hack into things.

Well, I'm afraid this is far beyond my scope :$

But for your specific CI/CD, I think you simply need to cache the _build directory (you could actually cache the entire directory).

Unfortunately, it doesn't work either.

picnixz commented 1 year ago

OK, I've looked a bit more. The reason why the index is rebuilt is that it contains a toctree directive: files containing such a directive (maybe not all of them) are automatically rebuilt.

https://github.com/sphinx-doc/sphinx/blob/8a990db49eb4fc19850f6d2964fe949884a6e303/sphinx/builders/__init__.py#L550-L555

When you are "included" as a file in a toctree, you are marked as a dependency of that file (I agree that this is not clearly stated). So you won't be able to escape the fact that the index is rebuilt every time.

dbitouze commented 1 year ago

The reason why the index is rebuilt is that it contains a toctree directive: files containing such a directive (maybe not all of them) are automatically rebuilt.

The problem concerns other files as well: I have (only) added another test-bis.rst source file, but the test.html file is rebuilt as well.

picnixz commented 1 year ago

I cannot reproduce this locally. I'd advise you to check it locally by the way. Looking at the traceback, we see:

[build target] targetname '/builds/denisbitouze/minimal-sphinx-minimal/_build/html/test.html'(2023-08-07 09:45:27+00:00), template(2023-08-07 10:27:49.543082+00:00), docname 'test'(2023-08-07 09:28:33+00:00)

Here (up to a timestamp shift):

The "real" source time is actually set to the template timestamp because it's the largest. Since 10:27 was after 09:45, it causes test to be rebuilt. And I think I know the cause of it:

dbitouze commented 1 year ago

I cannot reproduce this locally. I'd advise you to check it locally by the way.

Indeed, locally, everything works as expected. That's the point: I worked a lot on migrating our FAQ from DokuWiki to Sphinx precisely because the latter is very nice and works pretty well locally: if a single (.rst or .md) source file (among more than 1,200) is modified, only the corresponding .html file is rebuilt. But, AFAIU, because GitLab relies on Docker (which relies on a git fetch), that's not the case when CI/CD is involved.

I'll have a look at the other part of your answer later. Many thanks!

dbitouze commented 1 year ago

Here (up to a timestamp shift):

  • the target is test.html and has been modified at 09:45.

  • the template timestamp indicates the latest HTML file time of modification due to a change in the HTML template. Here it appears that it's at 10:27.

  • the source is test and was modified at 09:28.

The "real" source time is actually set to the template timestamp because it's the largest. Since 10:27 was after 09:45, it causes test to be rebuilt.
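The comparison quoted above boils down to something like this (a simplified sketch of the described behaviour, not the actual Sphinx code):

```python
def is_outdated(target_mtime, source_mtime, template_mtime):
    # The effective source time is the newest of the document source
    # and the HTML template, so a template reinstalled with a fresh
    # mtime forces a rebuild even when the .rst source itself is
    # older than the already-built target.
    return max(source_mtime, template_mtime) > target_mtime

# With the (relative) times from the log line above: target 09:45,
# source 09:28, template 10:27 -> rebuild.
```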

Looks very interesting!

And I think I know the cause of it:

  • When you install Sphinx, you also install Jinja templates. We have a bunch of bundled templates as well. But because you install them, you end up having different timestamps for those library files (I think). This is the reason why the template always seems to be up to date. So you need to actually cache the pip dependencies as well (or at least some of them). I think if you look at the timestamps of the files ending with .html in the /path/to/site-packages/alabaster and /path/to/site-packages/sphinx/themes/basic folders, you'll find those template timestamps.

I don't understand where I'm supposed to have a look at these files ending with .html. I found them locally:

$ ls /home/bitouze/.local/lib/python3.8/site-packages/sphinx/themes/basic/*html
/home/bitouze/.local/lib/python3.8/site-packages/sphinx/themes/basic/defindex.html
/home/bitouze/.local/lib/python3.8/site-packages/sphinx/themes/basic/domainindex.html
/home/bitouze/.local/lib/python3.8/site-packages/sphinx/themes/basic/genindex.html
/home/bitouze/.local/lib/python3.8/site-packages/sphinx/themes/basic/genindex-single.html
/home/bitouze/.local/lib/python3.8/site-packages/sphinx/themes/basic/genindex-split.html
/home/bitouze/.local/lib/python3.8/site-packages/sphinx/themes/basic/globaltoc.html
/home/bitouze/.local/lib/python3.8/site-packages/sphinx/themes/basic/layout.html
/home/bitouze/.local/lib/python3.8/site-packages/sphinx/themes/basic/localtoc.html
/home/bitouze/.local/lib/python3.8/site-packages/sphinx/themes/basic/page.html
/home/bitouze/.local/lib/python3.8/site-packages/sphinx/themes/basic/relations.html
/home/bitouze/.local/lib/python3.8/site-packages/sphinx/themes/basic/searchbox.html
/home/bitouze/.local/lib/python3.8/site-packages/sphinx/themes/basic/searchfield.html
/home/bitouze/.local/lib/python3.8/site-packages/sphinx/themes/basic/search.html
/home/bitouze/.local/lib/python3.8/site-packages/sphinx/themes/basic/sourcelink.html
$ ls /home/bitouze/.local/lib/python3.8/site-packages/alabaster/*.html
/home/bitouze/.local/lib/python3.8/site-packages/alabaster/about.html
/home/bitouze/.local/lib/python3.8/site-packages/alabaster/donate.html
/home/bitouze/.local/lib/python3.8/site-packages/alabaster/layout.html
/home/bitouze/.local/lib/python3.8/site-packages/alabaster/navigation.html
/home/bitouze/.local/lib/python3.8/site-packages/alabaster/relations.html

Moreover, I must admit I don't know how to “cache the pip dependencies as well”.

And, in my real use case, I use other extensions and another theme:

extensions = [
    'sphinx_comments',
    'sphinx.ext.todo',
    'sphinx.ext.mathjax',
    'sphinx.ext.extlinks',
    'sphinx_design',
    'sphinxext.opengraph',
    'sphinx.ext.intersphinx',
    'myst_parser',
]

html_theme = 'furo'

picnixz commented 1 year ago

In the GitLab configuration, you have pip install sphinx. Since pip dependencies are not cached, you end up with a fresh sphinx package and, in particular, the files that you found locally are newer (they are always recreated). For caching pip dependencies, you need to play with PIP_CACHE_DIR as well as GitLab's cache system. You should look at the official pip/GitLab documentation for that, or ask for help on Stack Overflow, because I think this is no longer an issue on our side.

Also, technically speaking, a CI/CD job should actually run the whole flow and not in an incremental manner by default. As such, I don't think we need to change our workflow example (or it would be a low priority task).

dbitouze commented 1 year ago

In the GitLab configuration, you have pip install sphinx. Since pip dependencies are not cached, you end up with a fresh sphinx package and, in particular, the files that you found locally are newer (they are always recreated). For caching pip dependencies, you need to play with PIP_CACHE_DIR as well as GitLab's cache system. You should look at the official pip/GitLab documentation for that, or ask for help on Stack Overflow, because I think this is no longer an issue on our side.

OK, I'll try this: many thanks again!

Also, technically speaking, a CI/CD job should actually run the whole flow and not in an incremental manner by default.

Why, if the incremental approach (as seen locally) does all and only what is needed? And, as I said, running the whole flow unnecessarily consumes time and resources.

As such, I don't think we need to change our workflow example (or it would be a low priority task).

It would be very, very nice for the tutorial to cover both the full-rebuild and the incremental approaches.

dbitouze commented 1 year ago

A suggestion has been made to me elsewhere concerning the fact that the date that seems to justify the "out of date" status is not that of the source file (the .rst) but that of the template (no doubt the default one installed with Sphinx).

This involves creating a Python virtual environment in _build, activating it, installing Sphinx, and adding this virtual environment to the cache (like _build/html). In practical terms, this would mean replacing the lines:

  - pip3 install -U pip
  - pip3 install -U sphinx
  - apt-get update
  - apt-get install git-restore-mtime -y

with:

  - apt-get update
  - apt-get install git-restore-mtime -y
  - python3 -m venv _build/venv
  - source _build/venv/bin/activate
  - pip install sphinx

and add the _build/venv directory to the cache.

With these modifications, the trigger for rebuild is:

writing output... [build target] did not in env: 'test-bis'
[build target] did not in env: 'test'
[build target] did not in env: 'index'
building [html]: targets for 3 source files that are out of date

We admit that we don't really understand these three messages (with “did not in env”).

According to sphinx's sources, it's in https://www.sphinx-doc.org/en/master/_modules/sphinx/builders/html.html:

for docname in self.env.found_docs:
    if docname not in self.env.all_docs:
        logger.debug('[build target] did not in env: %r', docname)

But we confess we don't know what self.env.found_docs and self.env.all_docs are...

In any case, this message isn't explicit enough to give us a clue...

picnixz commented 1 year ago

Why, if the incremental approach (as seen locally) does all and only what is needed? And, as I said, running the whole flow unnecessarily consumes time and resources.

OK, I shouldn't have phrased it like that. What I meant is that you are running a "fresh environment" (Docker) every time, so it is correct to assume that everything should be generated as if it were the first time. In order to make it incremental, the environment itself must be configured differently (which is what we are trying to achieve now).

We admit that we don't really understand these three messages (with “did not in env”).

Thank you for making me remember this. Actually, it should be "did not exist" and I forgot to fix the typo. Now,

So if you find a document that was never read once, this means you need to generate the corresponding file accordingly.

and add the _build/venv directory to the cache.

I wouldn't cache the whole venv directory. It's better to only cache the pip packages. Also, I would cache them in another directory and add it to the exclude_patterns configuration variable. You should definitely read this section.
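For the record, keeping cache directories out of Sphinx's source scan is a one-line conf.py setting (the directory names below are illustrative, not taken from the project's actual configuration):

```python
# conf.py -- keep build output and cached packages out of the set of
# files Sphinx scans for sources (directory names are examples).
exclude_patterns = ['_build', '.cache', 'Thumbs.db', '.DS_Store']
```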

dbitouze commented 1 year ago

Why, if the incremental approach (as seen locally) does all and only what is needed? And, as I said, running the whole flow unnecessarily consumes time and resources.

OK, I shouldn't have phrased it like that. What I meant is that you are running a "fresh environment" (Docker) every time, so it is correct to assume that everything should be generated as if it were the first time.

Do you mean that what is explained here:

Caching in GitLab CI/CD

A cache is one or more files a job downloads and saves. Subsequent jobs that use the same cache don’t have to download the files again, so they execute more quickly.

To learn how to define the cache in your .gitlab-ci.yml file, see the cache reference.

couldn't apply to the source files of a Sphinx website?

In order to make it incremental, the environment itself must be configured differently (which is what we are trying to achieve now).

OK.

We admit that we don't really understand these three messages (with “did not in env”).

Thank you for making me remember this. Actually, it should be "did not exist" and I forgot to fix the typo. Now,

  • found_docs: all documents that were found (on the filesystem) for this build

  • all_docs: mapping from documents that were read to their time of reading.

So if you find a document that was never read once, this means you need to generate the corresponding file accordingly.
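In other words (with hypothetical docnames), the quoted check amounts to:

```python
# Sketch of the check quoted earlier from sphinx/builders/html.py:
# found_docs = docnames discovered on disk for this build,
# all_docs   = docnames already read, mapped to their read time.
found_docs = {'index', 'test', 'test-bis'}   # hypothetical docnames
all_docs = {'index': 1691400000.0}           # only 'index' was read before

# Any found document that was never read must be (re)built:
never_read = sorted(d for d in found_docs if d not in all_docs)
print(never_read)  # -> ['test', 'test-bis']
```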

But does this apply to source documents or generated documents?

and add the _build/venv directory to the cache.

I wouldn't cache the whole venv directory. It's better to only cache the pip packages. Also, I would cache them in another directory

Is .cache/pip an advised other directory?

and add it to the exclude_patterns configuration variable.

AFAICS, the exclude_patterns configuration variable applies only to source files, so I don't see the connection with pip packages.

You should definitely read this section.

I read this section and tried to apply what it advises. But, same punishment: test.html rebuilt even with only test-bis.rst changed.

picnixz commented 1 year ago

Do you mean that what is explained here couldn't apply to the source files of a Sphinx website?

At least it cannot be applied by default. The sources are properly cached, but because you are installing Sphinx every time, the timestamps of the Sphinx HTML themes (the template that was detected as newer) are refreshed.

AFAICS, the exclude_patterns configuration variable applies only to source files, so I don't see the connection with pip packages.

Yes my bad. I just wanted to exclude everything from being read in case of some unexpected behaviour.

Is .cache/pip an advised other directory?

Probably? I would just say "not the same directory as _build". Execute pip cache dir to know whether this is the expected cache for pip. If not, you can specify it using --cache-dir or via the PIP_DOWNLOAD_CACHE environment variable.

I read this section and tried to apply what it advises.

The problem is this: pip install sphinx. You are never telling pip to use its cache. It's as if you were reinstalling the Sphinx package from scratch:

$ pip install sphinx
Collecting sphinx
  Downloading sphinx-7.1.2-py3-none-any.whl (3.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 30.6 MB/s eta 0:00:00

You'll see that your .cache/pip directory does not contain the previous Sphinx version, so it's not good!

dbitouze commented 1 year ago

Do you mean that what is explained here couldn't apply to the source files of a Sphinx website?

At least it cannot be applied by default. The sources are properly cached, but because you are installing Sphinx every time, the timestamps of the Sphinx HTML themes (the template that was detected as newer) are refreshed.

OK.

AFAICS, the exclude_patterns configuration variable applies only to source files, so I don't see the connection with pip packages.

Yes my bad. I just wanted to exclude everything from being read in case of some unexpected behaviour.

Sorry, I don't understand what you mean here.

Is .cache/pip an advised other directory?

Probably? I would just say "not the same directory as _build". Execute pip cache dir to know whether this is the expected cache for pip. If not, you can specify it using --cache-dir or via the PIP_DOWNLOAD_CACHE environment variable.

This seems to be the case:

$ pip cache dir
/builds/denisbitouze/minimal-sphinx-minimal/.cache/pip

I read this section and tried to apply what it advises.

The problem is this: pip install sphinx. You are never telling pip to use its cache.

You're certainly right but I don't see how to do so.

It's as if you were reinstalling the Sphinx package from scratch:

$ pip install sphinx
Collecting sphinx
  Downloading sphinx-7.1.2-py3-none-any.whl (3.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 30.6 MB/s eta 0:00:00

You'll see that your .cache/pip directory does not contain the previous Sphinx version, so it's not good!

OK but all the content of this directory is of the following cryptic form:

535126 68 -rw------- 1 root root 67571 Aug 10 10:55 .cache/pip/http/5/b/d/8/9/5bd894eeb3dfe1c8aaee1daecdfb74bbb314293813a730238621f077

Moreover:

$ pip cache list setuptools
No locally built wheels cached.

I also tried to install from local packages but no success...

picnixz commented 1 year ago

Sorry, I don't understand what you mean here.

It's ok. I was wrong before so just forget what I said.

OK but all the content of this directory is of the following cryptic form:

Oh, OK, my bad. I thought we would have a more readable structure, but apparently not.

Not entirely sure, but the environment variable PIP_CACHE_DIR may not be correctly named. In the latest pip version, I think it should be PIP_DOWNLOAD_CACHE instead (so maybe GitLab is the culprit here). Anyway, when you are testing things, I think you need to always wipe the cache first, test the flow once, commit a modified file, retest the flow, and check.

dbitouze commented 1 year ago

Not entirely sure, but the environment variable PIP_CACHE_DIR may not be correctly named. In the latest pip version, I think it should be PIP_DOWNLOAD_CACHE instead (so maybe gitlab is the culprit here).

I tried with PIP_DOWNLOAD_CACHE: "$CI_PROJECT_DIR/.cache/pip" instead of PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip", but with no success. I read that PIP_DOWNLOAD_CACHE is deprecated.

Anyway, when you are testing things, I think you need to always wipe the cache before, test the flow once, commit a modified file, retest the flow and check.

That's what I applied but with no success.

Once again, I tried to install from local packages. This time, I was able to install from the .cache/pip directory but still all the .html files are written. You told me earlier:

You are never telling pip to use its cache.

Is “install from local packages” the right way of doing so?

dbitouze commented 1 year ago

Well, about the Sphinx template timestamp, isn't it hopeless to rely on the pip cache since, in any case, we run pip install sphinx?

Wouldn't it be possible to rely, instead of just a Python Docker image (e.g. python:3.9.17-bookworm), on a Docker image which already contains an installed Sphinx (if at all possible)?

dbitouze commented 1 year ago

Another possibility was suggested to me: to rely on a script that would compile only the minimal set of documents as needed.

Such a script would be like the following (which I couldn't test because I don't know how to deal with the $GL_API_ACCESS_TOKEN):

#!/usr/bin/env bash

# Script to compile only the minimal set of documents as needed. Based on the
# assumption that the artifacts of previous compilations are available, so that
# it suffices to actually build these incrementally. Rebuild everything if the
# templates change.

# Abort execution on error.
set -e

# As we need a private access token with more privileges than CI_JOB_TOKEN, we
# need to get a valid token from the environment.
if [ -z "$GL_API_ACCESS_TOKEN" ]; then
  echo "Invalid GitLab API access token." >&2
  exit 1
fi

# Get the Git SHA1 hash of the latest pipeline on master that succeeded (i.e.
# finished before the one we are running in). This is a poor man's JSON parser
# which extracts only the `sha` field of the first object of the JSON array
# which is the wanted one due to the sorting option. A SHA1 hash is always
# 40 characters in length which is sanity-checked below.
gitsha="$(curl --header "PRIVATE-TOKEN: $GL_API_ACCESS_TOKEN" "https://gitlab.com/api/v4/projects/$CI_PROJECT_ID/pipelines?ref=$CI_DEFAULT_BRANCH&sort=desc&status=success" | grep -o -E -m1 '"sha":"([^"]*)"' | head -1 | cut -c 8-47)"
if [[ "${#gitsha}" != 40 ]]; then
  echo "SHA '$gitsha' of commit hash is not a valid SHA1 sum" >&2
  exit 1
fi

# Determine all files which have been changed from `gitsha` (exclusive) to
# `$CI_COMMIT_SHA` (inclusive). We only want the name relative to the
# repository's root.
changed_files=$(git diff-tree --no-commit-id --name-only -r "$gitsha".."$CI_COMMIT_SHA")

# Check whether to compile all files because one of the main dependencies
# changed. Otherwise, only the needed files will be compiled.
compile_all=false
for file in $changed_files; do
  if [[ $file == "conf.py" ]]; then
    compile_all=true
    break
  fi
done

if [ "$compile_all" = true ]; then
    make html
else
    for file in $changed_files; do
        # CLEAN_UP_CHANGED_FILES_SO_THAT_ONLY_VALID_SPHINX_INPUT_FILES_ARE_IN_THE_ARRAY
        if [[ ${file##*.} == "rst" ]]; then
            # RUN_SPHINX_BUILD
            # Pass the path relative to the source root; stripping the
            # directory with basename would break files in subdirectories.
            sphinx-build -d _build/doctrees . _build/html "$file"
        fi
    done
fi

wait
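The script's decision logic can be condensed into a small pure function (a sketch mirroring the shell above; the name plan_build is invented for illustration):

```python
def plan_build(changed_files):
    # Mirror of the shell logic above: a change to conf.py forces a
    # full "make html"; otherwise only the changed .rst sources are
    # rebuilt individually with sphinx-build.
    if 'conf.py' in changed_files:
        return 'all', []
    return 'partial', [f for f in changed_files if f.endswith('.rst')]
```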

Even if I could get this script to work, I wonder whether running sphinx-build only on modified files would be sufficient: for example, if the title of a .rst source file is modified, shouldn't the whole table of contents be rebuilt as well?

dbitouze commented 1 year ago

Well, about the Sphinx template timestamp, isn't it hopeless to rely on the pip cache since, in any case, we run pip install sphinx?

Wouldn't it be possible to rely, instead of just a Python Docker image (e.g. python:3.9.17-bookworm), on a Docker image which already contains an installed Sphinx (if at all possible)?

Hooray! That, together with the other advice given here, does the trick! With mgasphinx/sphinx-html instead of python:3.9.17-bookworm as the container image, only index.html and the .html files corresponding to the changed .rst source files are rebuilt!

picnixz commented 1 year ago

Ah yes, I forgot that we could actually use a Docker image for Sphinx itself (I don't use Docker much). So should I understand that the original configuration would be OK, but you'd only change the Docker image?

dbitouze commented 1 year ago

So should I understand that the original configuration would be ok, but you'd only change the docker image?

AFAICS, it is necessary to additionally rely on git-restore-mtime. A minimal working .gitlab-ci.yml file seems to be the following:

image: mgasphinx/sphinx-html # Could be another Sphinx Docker image but this one provides a very
                             # recent Sphinx (currently v. 7.1.2) and nice additional themes

pages:
  cache:
    paths:
    - _build/html
  stage: deploy
  script:
  - apt-get update
  - apt-get install git-restore-mtime -y
  # The following command restores the modified timestamps from commits
  - /usr/lib/git-core/git-restore-mtime
  - sphinx-build . _build/html -vv
  after_script:
  - cp -rf _build/html public
  artifacts:
    paths:
    - public
  only:
  - main

picnixz commented 1 year ago

Thank you! I'll update the doc in the following days with perhaps another docker image.

dbitouze commented 1 year ago

Thank you!

You're welcome! Thank you very much for your very detailed answers and your invaluable help!

I'll update the doc in the following days with perhaps another docker image.

What's wrong with mgasphinx/sphinx-html?

picnixz commented 1 year ago

What's wrong with mgasphinx/sphinx-html?

Actually, we have an "official" Docker image, but it's not updated very often (it's only Sphinx 5.2 currently). I'll create an issue for that (so that we could have a nightly build for every release; not sure it's actually easy to do). Alternatively, we could add the mgasphinx repository to sphinxcontrib if they are willing.

@AA-Turner Any thoughts on that? Or do you want to update our official Docker image every release?