sphinx-doc / sphinx

The Sphinx documentation generator
https://www.sphinx-doc.org/
Other
6.25k stars 2.04k forks source link

Race conditions when running multiple Sphinx instances using the same doctree #2946

Open mgorny opened 7 years ago

mgorny commented 7 years ago

Projects using Sphinx (and Makefiles) often define multiple targets for different doc builders. For example, the LLVM project defines a target for manpages and for HTML docs. Since those targets are independent, make is allowed to run them in parallel (and does that). Since they use the same source tree, they use the same doctree root as well.

Lately, I've noticed that the two parallel Sphinx instances 'fight' over the doctree. Basically, it seems that they both ignore each other and create the doctrees independently, i.e. the same files are first written by one Sphinx instance, and afterwards overwritten by the second. While normally this doesn't cause problems (except for being awfully inefficient), I've actually seen a race condition causing major failure (sorry, don't have any logs for it). Basically, one Sphinx instance has opened one of the files for reading while the other was writing it.

We've attempted to find a good solution on the build system side but could fine none satisfactory. If we switched to separate doctrees, that would be a waste of a good caching opportunity. Getting two independent targets serialized is not trivial in CMake — and LLVM people aren't happy with me adding fake dependencies to enforce serialization. But even if we did that, there's still the general problem that other people will be unaware of the issue and will be hitting it in the future.

Therefore, I think it would be best to fix it on Sphinx side. The simplest solution would be to lock some file in the doctree directory. Considering the specific design, the environment pickle should be the best file to lock. However, that would require redesigning the application to keep the file open rather than closing it immediately after reading/writing (which would also be beneficial to avoiding race conditions).

mgorny commented 7 years ago

Well, I've tried to give it a shot… and I can't find a sane way of doing it with the current design.

The big problem is, the use of pickles is done quite inconsistently. The pickled environment is first read in Sphinx._init_env() which is called by Sphinx.__init__(), i.e. pretty early. However, it is not updated until Builder.build(), and there it is pickled afterwards. So to get the results I need, I would have to lock the environment from Sphinx._init_env() to Builder.build().

To do that, I would have to keep the environment file open during that period. That's quite doable — I was thinking of opening it directly in Sphinx._init_env(), and keeping the open file in BuildEnvironment. This would have the extra advantage of not having to repeatedly pass the path around. In this case, Sphinx._init_env() would open the file r+ (i.e. open R/W or create), and lock it. BuildEnvironment.frompickle() would just read the open file (if non-empty), and BuildEnvironment.topickle() would rewind and write it. Then it could be unlocked and closed.

The problem with all that is that Python doesn't do RAII properly, so there's no clean way of ensuring that the file will be closed on exception without redesigning how Sphinx is used. Probably the whole class would have to be turned into context manager, and used via with like files are usually opened. Otherwise, I'm afraid that some corner case may result in the file being kept open (and locked), and deadlocking some other process.

tk0miya commented 7 years ago

Agreed. As you said, sphinx is not designed to be able to invoke parallelly. Is there any reason not to use -j option of sphinx-build? I think the option will help you.

About locking, period from BuildEnvironment.frompickle() to BuildEnvironment.topickle() is very long. I wonder does it really helps parallelism. The second sphinx process waits until the first one finished reading all documents.

mgorny commented 7 years ago

Agreed. As you said, sphinx is not designed to be able to invoke parallelly. Is there any reason not to use -j option of sphinx-build? I think the option will help you.

The problem is that we are actually using two different builders (manpage and HTML), so -j doesn't really help here. Even if Sphinx had an option to use two different builders in one run, that wouldn't integrate well with build systems where it is desired to have separate targets.

About locking, period from BuildEnvironment.frompickle() to BuildEnvironment.topickle() is very long. I wonder does it really helps parallelism. The second sphinx process waits until the first one finished reading all documents.

Well, lack of parallelism is not that much of an issue as the risk of race condition resulting in malformed documents or a crash.

zenhack commented 6 years ago

fwiw, this affects https://github.com/pazz/alot as well. I maintain the aur package and am currently just passing -j1 to make. -j1 isn't a big deal for alot, because it's only using make for the docs, but it took a little bit of work to track down what was going on, and the output of a user's build log suggests reporting a bug if for nothing else better error messages:

https://ci.virtapi.org/job/Arch_Package_alot/35/console

ehabkost commented 4 years ago

This is affecting the QEMU project as well. See the build failure at: https://app.shippable.com/github/ehabkost/qemu/runs/24/8/console