mkdocs / mkdocs

Project documentation with Markdown.
https://www.mkdocs.org
BSD 2-Clause "Simplified" License

Improve performance and authoring experience of `mkdocs serve` #3695

Open squidfunk opened 2 weeks ago

squidfunk commented 2 weeks ago

[!NOTE] As the maintainer of Material for MkDocs, I'd like to open a discussion on how we can collaborate to enhance MkDocs. This initiative is inspired by Tom Christie's recent reflections on the future development of MkDocs. I believe that through collective efforts, we can identify and implement improvements that will benefit our users significantly.

The mkdocs serve command provides a powerful write-build-check-repeat loop that is integral for documentation projects, setting MkDocs apart from many static site generators that lack live preview functionality. This feature greatly enhances the efficiency and accuracy of developing and refining documentation, allowing for immediate feedback and iterative improvements.

Startup time of mkdocs serve

Unfortunately, there are significant issues with the mkdocs serve command, particularly when working with large documentation projects that consist of thousands of pages. Currently, mkdocs serve requires a full build of the documentation before it becomes interactive. This process can take an extensive amount of time, ranging from 30 to 40 minutes for large projects. This delay significantly impedes the ability to use mkdocs serve effectively for previewing changes.

A preview is crucial because Material for MkDocs integrates with Python Markdown Extensions, a powerful set of Markdown extensions for technical writing that adds features like content tabs via Tabbed and enhanced indent detection through SuperFences. Unfortunately, editor support for these syntaxes is limited, if not non-existent, so authors must rely on mkdocs serve to preview changes. Given the current build times on large projects, authors cannot efficiently make and review changes and are essentially working 'blind'. Performance is, in fact, one of the main criticisms of MkDocs.

Problems with the --dirtyreload flag

The --dirtyreload flag in MkDocs offers a partial solution to speed up the re-build process during a documentation project's development by not rebuilding the entire site with each change. However, this flag only affects subsequent builds and does not improve the initial build time. Moreover, it introduces issues such as incorrect navigation and incomplete metadata, which can disrupt the functionality of plugins, like the blog plugin that struggles to correctly update archive and category indexes under --dirtyreload. Consequently, plugins must be designed to specifically work around these limitations, complicating their development and integration.

Conclusion

To significantly enhance the editing experience with MkDocs and reduce the environmental impact by saving thousands of build minutes daily, we need to focus on two critical improvements:

  1. Reducing the initial preview load time: The time it takes from starting the live server to when the preview is first available needs to be substantially decreased. This change would make MkDocs more usable, especially for large projects.

  2. Speeding up live preview updates: After making edits, the time to see these changes in the preview should be minimized. This improvement will support a more efficient and iterative documentation process.

Potential strategies to achieve these improvements include implementing more sophisticated caching mechanisms and exploring the possibility of parallelizing the build process. These changes would address both the initial and subsequent build times, making mkdocs serve a more robust tool for documentation development.
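
As a rough illustration of the caching idea, here is a minimal sketch that keys rendered output on a hash of each page's Markdown source, so an unchanged page is not rendered again. `RenderCache` and `fake_render` are hypothetical names for this example and are not part of MkDocs or Material for MkDocs.

```python
import hashlib
from typing import Callable, Dict


class RenderCache:
    """Hypothetical per-page cache keyed by a hash of the Markdown source."""

    def __init__(self, render: Callable[[str], str]) -> None:
        self._render = render
        self._store: Dict[str, str] = {}  # content hash -> rendered HTML
        self.misses = 0

    def get(self, source: str) -> str:
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Only sources that were never seen before trigger a render.
            self.misses += 1
            self._store[key] = self._render(source)
        return self._store[key]


def fake_render(source: str) -> str:
    # Stand-in for the expensive Markdown and template step (assumption).
    return "<p>" + source + "</p>"


cache = RenderCache(fake_render)
pages = {"index.md": "Hello", "about.md": "About us"}
first = {path: cache.get(text) for path, text in pages.items()}
pages["about.md"] = "About the team"  # simulate one edit
second = {path: cache.get(text) for path, text in pages.items()}
print(cache.misses)
```

A cache like this only helps for output that depends on the page's own source. Anything that depends on other pages, such as navigation, would need a broader cache key.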

pawamoy commented 2 weeks ago

Are there public examples of large repositories that take up to 30 minutes to build? I tried locally with 10K dummy files and ran out of memory before the site was built :sweat_smile: With 1K files, the template rendering seems to be the most costly.

squidfunk commented 2 weeks ago

Users have mentioned this on multiple occasions, but I'm having a hard time finding it due to GitHub's rather mediocre issue search. Here's what I could gather from a quick search:

The fact is that 30 minutes is a worst-case scenario. Even a repeated build that takes 1 minute is too slow to be useful, and --dirtyreload isn't a workable solution due to the problems stated, especially for plugin authors. Build time also doesn't depend solely on the number of pages, but on the plugins used. Thus, we should start discussing how plugins and the core could better work together to employ caching and reduce build time.

Running out of memory is another problem that should be fixed, as already discussed in https://github.com/mkdocs/mkdocs/issues/2669

squidfunk commented 2 weeks ago

This is a project with 3,400 files and a very limited set of plugins, i.e., search, minify and social: https://github.com/openfabr/terraform-provider-cdk-docs

IMHO, not many plugins, and the social plugin which I wrote employs caching, which means repeated builds are much cheaper due to leveraging cached images. I've built the project on my machine, an M2 MacBook Pro:

First build

INFO    -  Cleaning site directory
INFO    -  Building documentation to directory: .../terraform-provider-cdk-docs/site
...
INFO    -  Documentation built in 537.92 seconds

Repeated build (social plugin cached)

INFO    -  Cleaning site directory
INFO    -  Building documentation to directory: .../terraform-provider-cdk-docs/site
...
INFO    -  Documentation built in 487.02 seconds

It's infeasible to make edits on this project without --dirtyreload, which, as mentioned, produces incorrect results. On top of that, the author has to wait for more than 8 minutes until the live reload server becomes available. Add a few more plugins and a few hundred more pages and you're up to 20 minutes.

kamilkrzyskow commented 2 weeks ago

I tested the repository mentioned above on my Ryzen 3600 Windows 10 PC with mkdocs==1.6.0 and mkdocs-material==9.5.20.

First build:

$ mkdocs build
INFO    -  Cleaning site directory
INFO    -  Building documentation to directory: C:\MyFiles\_git\removable\performance-test\site
... A lot of warnings about absolute paths etc., which could also impact performance due to printing to Terminal
INFO    -  Documentation built in 1335.42 seconds

Repeated build: I do not dare to run it again 😅

I used my performance_debug hook. Debug YAML result: performance_debug_first.yml.txt. More info about the categories can be found in the gist Python file; most should be self-explanatory, though the number of files could have generated quite a bit of noise 🤔

  PLUGINS_PER_EVENTS:
    on_post_page|mkdocs_minify_plugin.plugin.MinifyPlugin: 958.97267 # The main culprit of the long build time
    on_page_context|material.plugins.search.plugin.SearchPlugin: 23.14142 # Expected given the amount of files
    on_config|material.plugins.social.plugin.SocialPlugin: 10.61701 # on_config not expected being affected by amount of files, is it always this slow?
    on_page_markdown|material.plugins.social.plugin.SocialPlugin: 1.20333 # magic of concurrency
    ...
    on_post_build|material.plugins.social.plugin.SocialPlugin: 0.00389 # magic of concurrency
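
For readers who want to collect similar numbers, here is a minimal sketch of how cumulative time per `event|plugin` key could be measured. It is not the performance_debug hook referenced above; `timed`, `timings` and `slow_minify` are hypothetical names, and the wrapped callback is a stand-in rather than a real plugin.

```python
import time
from collections import defaultdict
from typing import Any, Callable, DefaultDict

# Cumulative wall time in seconds, keyed like the YAML above: "event|plugin".
timings: DefaultDict[str, float] = defaultdict(float)


def timed(event: str, plugin: str, func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a callback so every call adds its duration to timings."""
    key = f"{event}|{plugin}"

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even if the callback raises.
            timings[key] += time.perf_counter() - start

    return wrapper


def slow_minify(html: str) -> str:
    # Hypothetical stand-in for an expensive per-page callback.
    time.sleep(0.01)
    return " ".join(html.split())


on_post_page = timed("on_post_page", "MinifyPlugin", slow_minify)
results = [on_post_page("<p>  page   %d </p>" % i) for i in range(3)]

# Print the slowest entries first.
for key, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{key}: {seconds:.5f}")
```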

Currently, mkdocs serve invokes the same build as mkdocs build, so the benchmark results apply there too. The main issue is the minify plugin. A much cheaper (performance-wise) minification, of sorts, could be achieved using Jinja2 Environment settings, which I mentioned here. Another approach would be proper enforcement of whitespace management inside the template files, via the {%- ... -%} tags. Perhaps another minify plugin needs to be released that uses some C/Rust libraries to handle the minification process 🤔

  MARKDOWN_PER_CLASSES:
    pymdownx.superfences.SuperFencesBlockPreprocessor: 11.09541
    markdown.treeprocessors.InlineProcessor: 10.98559

I'm surprised those markdown values are so low, as last time I checked with GMC (~190 files) the same classes had ~6 seconds each. Perhaps the complexity of the Markdown or the amount of Code Blocks has a bigger impact than I thought. But still 3k files vs 200 files and only a x2 time increase seems odd hmm

Template rendering took ~270 seconds:

  TEMPLATE_ROOTS:
    main.html|sum: 267.45968

this time gets repeated each re-serve without --dirtyreload


Caching for later builds with mkdocs serve won't help much, as it immediately turns off the prospective user. Also, rendering the whole docs in the background with concurrency seems like a waste of resources when I only want to check one web page.

So I would like to see some sort of on-demand loading, serve would only process index.html and later only load pages when navigating to them. This of course breaks the last on_post_build event, as plugins expect all files to be present in the site directory, so invoking it after only a few pages were built could lead to issues. Other events are more agile IMO

I guess this would require a fork in the mkdocs serve and mkdocs build event loops? Rather risky, but would allow for more control maybe? Just a first off-the-top-of-my-head idea ✌️
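
The on-demand idea can be sketched as a store that renders a page the first time it is requested and re-renders it only after its source changes. This is an illustrative sketch under stated assumptions: `LazySite` and its methods are hypothetical and do not correspond to any MkDocs API.

```python
from typing import Callable, Dict, List


class LazySite:
    """Hypothetical on-demand site: a page is rendered when first requested."""

    def __init__(self, sources: Dict[str, str], render: Callable[[str], str]) -> None:
        self._sources = dict(sources)
        self._render = render
        self._built: Dict[str, str] = {}
        self.render_log: List[str] = []  # every render that actually happened

    def get(self, url: str) -> str:
        if url not in self._sources:
            raise KeyError(url)
        if url not in self._built:
            self.render_log.append(url)
            self._built[url] = self._render(self._sources[url])
        return self._built[url]

    def update(self, url: str, source: str) -> None:
        # A file changed: store the new source and drop the stale HTML.
        self._sources[url] = source
        self._built.pop(url, None)

    def all_built(self) -> bool:
        # A hook that expects the whole site, like on_post_build,
        # could only run safely once this returns True.
        return len(self._built) == len(self._sources)


site = LazySite(
    {"index.html": "# Home", "guide.html": "# Guide"},
    lambda md: f"<main>{md}</main>",
)
home = site.get("index.html")
home_again = site.get("index.html")  # served from memory, no second render
site.update("index.html", "# Home v2")
home_v2 = site.get("index.html")
```

The `all_built` check makes the on_post_build problem concrete: with lazy rendering, the condition that hook relies on may never become true during a preview session.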

pawamoy commented 2 weeks ago

Perhaps some sort of another minify plugin needs to be released which uses some C/Rust libraries to handle the minification process 🤔

Like this one https://github.com/monosans/mkdocs-minify-html-plugin? Could you build once with it and see if you just spared 950 seconds or so :stuck_out_tongue:? It only minifies HTML files though apparently (but still CSS and JS within them).

pawamoy commented 2 weeks ago

Also, solid work @kamilkrzyskow :+1: Thanks for making and sharing all this!

squidfunk commented 1 week ago

Ah, nice, I didn't know about the minify-html plugin! I'll check it out and probably switch to it. Offloading pure string processing to Rust makes a lot of sense.

waylan commented 1 week ago

Caching for later builds with mkdocs serve won't help much, as it immediately turns off the prospective user. Also, rendering the whole docs in the background with concurrency seems like a waste of resources when I only want to check one web page.

So I would like to see some sort of on-demand loading, serve would only process index.html and later only load pages when navigating to them. This of course breaks the last on_post_build event, as plugins expect all files to be present in the site directory, so invoking it after only a few pages were built could lead to issues. Other events are more agile IMO

The issue is that the site navigation requires the entire pages collection to be available for the one page to be rendered. This is where caching and/or concurrency would likely be helpful. For that matter, the pages don't need to all be fully rendered, but they all do need to be read and processed to a certain extent to determine the page title, etc for the nav.

And then there are those scenarios where a page's content consists of the pages collection (either by means of a plugin or as a static template). In that case, to render that page (even if the nav is excluded), the entire pages collection is needed.

Ultimately, it has been the above two issues which have thus far prevented a better solution from being developed. Work out a way to address those and then we may have a workable solution.
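
To illustrate "read and processed to a certain extent", here is a deliberately naive sketch that derives a nav title for every page from its first level-1 heading, falling back to the file name, without rendering anything. The helper names are hypothetical. The sketch ignores front matter, nav configuration and headings inside code blocks, so it is not how MkDocs resolves titles.

```python
import re
from pathlib import PurePosixPath
from typing import Dict

# First ATX level-1 heading on its own line, e.g. "# Title".
H1 = re.compile(r"^#[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def quick_title(markdown_text: str, fallback: str) -> str:
    """Return the first level-1 heading, or the fallback if there is none."""
    match = H1.search(markdown_text)
    return match.group(1) if match else fallback


def build_nav(sources: Dict[str, str]) -> Dict[str, str]:
    """Map each source path to a title using a cheap scan, not a full render."""
    return {
        path: quick_title(text, PurePosixPath(path).stem)
        for path, text in sources.items()
    }


sources = {
    "index.md": "# Welcome\n\nText",
    "api/client.md": "No heading here",
    "guide.md": "Intro\n\n# Guide  \n## Sub",
}
nav = build_nav(sources)
print(nav)
```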

pawamoy commented 1 week ago

Quick thought: what if plugins informed MkDocs whether each one of their hooks could be executed concurrently, or only sequentially? I'm imagining some utilities to build a "pipeline" of things to run depending on whether they support concurrency or not.

Quick flowchart which doesn't make sense but illustrates the idea:

flowchart TD
    p1f["plugin1.on_files"]
    p2f["plugin2.on_files"]
    p3f["plugin3.on_files"]
    p1n["plugin1.on_nav"]
    p2n["plugin2.on_nav"]
    p3n["plugin3.on_nav"]
    p1pm["plugin1.on_page_markdown"]
    p2pm["plugin2.on_page_markdown"]
    p3pm["plugin3.on_page_markdown"]
    start --> p1f & p2f
    p1f & p2f  --> p3f
    p3f --> p1n & p2n & p3n
    p1n & p2n & p3n --> p1pm
    p1pm --> p2pm & p3pm

EDIT: hmm I suppose there's another possible layer of concurrency on files/pages themselves. The transformation pipeline would likely be quite complex. I'm sick and have fever today so please be indulgent :joy:
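
The pipeline idea can be sketched in a few lines, assuming each hook carries a flag saying whether it may run concurrently. `Hook`, `concurrent_safe`, `plan_stages` and `run_stages` are hypothetical names; MkDocs plugins do not declare such a flag today.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hook:
    name: str
    run: Callable[[], str]
    concurrent_safe: bool  # hypothetical flag a plugin would declare


def plan_stages(hooks: List[Hook]) -> List[List[Hook]]:
    """Group consecutive concurrency-safe hooks; unsafe hooks run alone."""
    stages: List[List[Hook]] = []
    for hook in hooks:
        can_join = (
            hook.concurrent_safe
            and bool(stages)
            and all(h.concurrent_safe for h in stages[-1])
        )
        if can_join:
            stages[-1].append(hook)
        else:
            stages.append([hook])
    return stages


def run_stages(stages: List[List[Hook]]) -> List[str]:
    results: List[str] = []
    with ThreadPoolExecutor() as pool:
        for stage in stages:
            # Hooks inside a stage may overlap; stages themselves run in order.
            results.extend(pool.map(lambda h: h.run(), stage))
    return results


def make_hook(name: str, safe: bool) -> Hook:
    return Hook(name, lambda: name + " done", safe)


hooks = [
    make_hook("plugin1.on_files", True),
    make_hook("plugin2.on_files", True),
    make_hook("plugin3.on_files", False),
    make_hook("plugin1.on_nav", True),
]
stages = plan_stages(hooks)
stage_names = [[h.name for h in stage] for stage in stages]
results = run_stages(stages)
```

This sketch treats hooks as independent callables. Hooks that pass their result on to the next plugin would need more than a flag to run in parallel, which is part of why the real pipeline would likely be complex.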

humitos commented 1 week ago

Quick thought: what if plugins informed MkDocs whether each one of their hooks could be executed concurrently, or only sequentially?

This is exactly what Sphinx does. Each extension defines if it's safe for parallel reading and/or parallel writing. See https://www.sphinx-doc.org/en/master/extdev/index.html#extension-metadata

I haven't checked how it works internally, but it's probably something to explore a little more and see if there are some ideas that can be reused.

dr-br commented 1 week ago

I would like to be able to use parallel build. It has been stated in #1900 that the benefit is not so high. However, I have lots of jupyter-notebooks to convert (the execute step consumes most of the time). I ended up executing all notebooks concurrently in advance.

tomchristie commented 1 day ago

The issue is that the site navigation requires the entire pages collection to be available for the one page to be rendered. This is where caching and/or concurrency would likely be helpful. For that matter, the pages don't need to all be fully rendered, but they all do need to be read and processed to a certain extent to determine the page title, etc for the nav.

Okay, so I've been working on this and I've got enough to demo now...

https://github.com/mkdocs/sketch/tree/main

That's a work-in-progress of "how could mkdocs look" that properly deals with this issue.

Specifically, the mkdocs serve command doesn't require a site build at all*

I needed to do a bit of poking to make this work with the terraform example above (since it doesn't include a nav config), though once I'd done that, serve startup time was under a second.

There's other aspects that I'm looking to address as part of that work, just getting things into shape so that I've got a coherent body of work to start sharing here.

* search indexes aren't in there just yet. yes they would require a full-site build, but we can use HTML rel=preload links to prompt them in the background, and likely also have per-page caching.

pawamoy commented 9 hours ago

Nice work @tomchristie!

  • search indexes aren't in there just yet. yes they would require a full-site build, but we can use HTML rel=preload links to prompt them in the background, and likely also have per-page caching.

In the case of mkdocstrings and its cross-reference ability, rel=preload wouldn't be enough. To statically resolve a cross-reference, we must wait for all pages to have been built. The only way to make cross-references work when serving pages on the fly (without building everything) would be to inject some JavaScript magic :thinking: Like, the plugin would store query-able state in the server, which the client could continuously request until all needed pages were loaded with rel=preload and the unresolved references on the current page could be resolved :thinking: And since we don't know which pages are needed to resolve a reference, all pages would have to be pre-loaded anyway :thinking: (or, if not all, maybe most pages, with a priority order or something).
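
The dependency on all pages can be shown with a small two-pass sketch: the first pass collects anchors from whatever pages have been built, and the second rewrites references against that index. The `[text][identifier]` and `{#identifier}` syntax and all function names are simplified assumptions for this example, not mkdocstrings' actual implementation.

```python
import re
from typing import Dict, List, Tuple

REF = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")  # [text][identifier]
ANCHOR = re.compile(r"\{#([^}]+)\}")            # {#identifier}


def collect_anchors(pages: Dict[str, str]) -> Dict[str, str]:
    """Pass 1: map every identifier to the URL of the page that defines it."""
    index: Dict[str, str] = {}
    for url, text in pages.items():
        for ident in ANCHOR.findall(text):
            index[ident] = f"{url}#{ident}"
    return index


def resolve(text: str, index: Dict[str, str]) -> Tuple[str, List[str]]:
    """Pass 2: rewrite known references and report the ones still missing."""
    unresolved: List[str] = []

    def replace(match: "re.Match[str]") -> str:
        label, ident = match.group(1), match.group(2)
        if ident in index:
            return f'<a href="{index[ident]}">{label}</a>'
        unresolved.append(ident)
        return match.group(0)  # leave the reference untouched

    return REF.sub(replace, text), unresolved


pages = {
    "api.html": "Client {#pkg.Client}",
    "guide.html": "See [the client][pkg.Client] and [the server][pkg.Server].",
}

# Only the requested page has been built: nothing can be resolved yet.
partial_index = collect_anchors({"guide.html": pages["guide.html"]})
_, missing_partial = resolve(pages["guide.html"], partial_index)

# Once every page has been seen, only truly unknown targets remain.
full_index = collect_anchors(pages)
html, missing_full = resolve(pages["guide.html"], full_index)
```

With only the current page built, every reference is unresolved. The list of missing identifiers is the kind of state a server could expose while the remaining pages load.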

squidfunk commented 8 hours ago

This looks really promising! Really excited to see how this will work with more complex setups. I guess there are still things to be worked out (haven't checked the implementation), but it's a great start! 👏