readthedocs / readthedocs-build

Work in Progress builder

refactoring rtd-build #6

Closed gregmuellegger closed 8 years ago

gregmuellegger commented 9 years ago

I basically rewrote the complete rtd-build CLI. Here is the current backlog of open tasks:

Everything that has been implemented is well covered by tests. Tests can be run with tox; the instructions in the README have been updated.

To run a quick sample of the builder, do:

cd integration_tests/minimal_project/
rtd-build

This will output the rendered docs in _readthedocs_build/docs/html/.

Currently we support the following config options in readthedocs.yml:

name: docs  # required, arbitrary
type: sphinx  # required, only sphinx is supported at the moment
base: docs-en/  # the base directory of the source files
gregmuellegger commented 9 years ago

I have a conceptual problem with allowing multiple configurations. The problem is that we need a different output directory for every config, so when a repository contains two readthedocs.yml files, the projects configured in them have to be rendered into different directories.

To make this possible, the current implementation requires the name option to be set in the config. This name will be used as part of the output path, so a project configured with name: mydocs will render into _readthedocs_build/mydocs/html/.

The problem we will then have on readthedocs.org is deciding which name to choose for serving the project's docs. We could make docs the default name and require at least one configuration with this name per repository. However, what is the conceptual point of having multiple configs when we can eventually serve only one of them on rtfd.org?

/cc @agjohnson @ericholscher

ericholscher commented 9 years ago

The idea with multiple configs is that you can choose which one to run -- we wouldn't support outputting multiple sets of docs for each run. You could have a setting on RTD.org that points at the specific YAML file you want to build. The main thing to support is repos that have multiple sets of docs in them, not one set of docs splintered into multiple readthedocs.yml files.

ericholscher commented 9 years ago

I'm curious how you're planning to integrate this into RTD? My understanding is that we were going to have readthedocs-build work downstream of the VCS & virtualenv code, and not handle that set of code. With the current implementation, we're going to have to totally replace the RTD.org builder code to support this, or else we'll have two sets of code running concurrently to handle all of the VCS/virtualenv/etc. work.

agjohnson commented 9 years ago

I agree, this seems to be moving a lot more than we wanted into rtd-build. Before we tackle a refactor this large, I'd rather we defined exactly what the scope of rtd-build would be and how it makes sense to incorporate it back into RTD.

gregmuellegger commented 9 years ago

OK, sorry if I missed the general direction. I wrote up the spec in #5 in order to clear things up and got positive feedback, which is why I thought I was on the right track with this :)

The spec says at the top:

Creating a build works like this:

- change into the root of the project that you want to build the documentation
  for
- run ``rtd-build``

``rtd-build`` will then perform the following actions:

- it searches for all ``readthedocs.yml`` files below the current directory
  and merges all found files into a list of configurations
- it iterates over all configurations (order is not guaranteed) and performs
  the following actions for each:

  - create a fresh virtualenv
  - ``cd`` into the base directory of the documentation
  - ``pip install ...`` whatever is configured in the config
  - ``python setup.py install`` if configured (from the path specified in
    ``python.setup_path``).
  - run ``sphinx-build``

and that's what I pursued with the refactor. Let's talk about how we should proceed there.

I'll try to describe the plan for integrating with readthedocs.org with the steps that should be performed from zero to hosted docs:

All rtd-build steps can happen in an isolated Docker environment in order to encapsulate the commands and to prevent any malicious code from breaking out. The parts of the build system that are left in readthedocs.org do not execute any files or config from the user repository.

ericholscher commented 9 years ago

That all sounds great. I think we should delete the VCS code from this repo then. Looking more closely at this PR, I see it was mostly touched in rename operations, but seeing it in the changeset made me wary.

One large concern that I have is making sure that we can capture output for every command properly, and display them to users. Historically, our build output has been quite awful, and the work Anthony did with https://github.com/rtfd/readthedocs.org/pull/1563 fixes a lot of that. I want to make sure that we don't lose that visibility with this work.

The other one is that Docker doesn't have a great way to handle long-running commands and IPC between the parent and child commands. We've solved this by wrapping every command in a specific docker environment, instead of a long-running set of commands. So we need to make sure that for RTD use, we can get access to the specific commands via library calls, and can run those in a specific environment.

Along the lines of not repeating past mistakes, I think that the "system site packages" option is a mistake in our design, and should be removed. We should also look at whether there are other options and approaches that should be considered in this work. One of the other big ones that I'd love to solve is the build process depending on layout.html, which breaks a common Sphinx paradigm: https://github.com/rtfd/readthedocs.org/issues/152 -- It's a tricky issue in general, but if we can find another way to force content into the pages, we should. I think doing this in an extension that the builder forces, where we iterate over every doctree and add nodes directly into the docutils doctree, is likely the "purest" solution.
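For illustration, a minimal sketch of what such a forced extension could look like (the event hook and the injected node are assumptions, not a settled design):

from docutils import nodes

def inject_rtd_nodes(app, doctree, docname):
    # Append a raw HTML node to every resolved document, without
    # relying on the theme's layout.html being extensible.
    doctree.append(nodes.raw('', '<div class="rtd-injected"></div>', format='html'))

def setup(app):
    # The builder would force-load this extension for every project.
    app.connect('doctree-resolved', inject_rtd_nodes)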

ericholscher commented 9 years ago

Similar things apply to the doc_builder subsystem. If we aren't using that in this refactor, it should be removed as well. These are the main things I copied over from RTD; they have changed significantly there, and I was worried about them living on in parallel.

Looking at the scope of this work, it seems quite reasonable -- but I want to make sure we are progressing properly and without a bunch of old baggage. We have also talked a lot about whether we want to shell out to sphinx-build or run sphinx by importing it as a library in the builder, which we should also discuss the merits of before we progress too far down a particular path.

gregmuellegger commented 9 years ago

command output

Yes, the changed UI looks very nice and is so much more helpful. The plan for nice command output would have been that the rtd-build tool prints the commands it runs under the hood along with their output, prefixed with the command name, like:

Reading readthedocs.yml

virtualenv --python=/usr/bin/python2.7
virtualenv: New python executable in tmp-72051fbc4e8daff/bin/python2
virtualenv: Also creating executable in tmp-72051fbc4e8daff/bin/python
virtualenv: Installing setuptools, pip, wheel...done.

sphinx-build -b html docs/ _readthedocs_build/html/
sphinx-build: Running Sphinx v1.3.1
sphinx-build: loading translations [en]... done
sphinx-build: loading pickled environment... done
sphinx-build: loading intersphinx inventory from http://python.readthedocs.org/en/latest/objects.inv...
sphinx-build: loading intersphinx inventory from http://sphinx.readthedocs.org/en/latest/objects.inv...
sphinx-build: loading intersphinx inventory from http://django.readthedocs.org/en/latest/objects.inv...
sphinx-build: building [mo]: targets for 0 po files that are out of date
sphinx-build: building [html]: targets for 11 source files that are out of date
sphinx-build: updating environment: 2 added, 12 changed, 1 removed
sphinx-build: reading sources...

But Anthony's UI is of course much nicer. We could make the rtd-build output available as JSON, so that it prints a JSON object to stdout, like:

{"commands": [{"command": "virtualenv --python=/usr/bin/python2.7", "exit_code": 1, "output": "New python exe..."}, ...]}

This could then be used on the rtfd.org build page to show the nice UI. The only drawback I see here is that we cannot display the output in real time, since we need the commands to finish before we can print valid JSON.

docker & IPC

What is the difference between one long-running build command and a few nearly-as-long-running commands? The pip install and sphinx-build commands will probably take the longest. What is the benefit of executing them separately?

I'm not sure yet how we could make the individual subcommands exposable to a caller of rtd-build. That doesn't feel very elegant, so I would like to know more about what the actual problem is before exploring possible solutions.

injecting custom HTML into the build

I would suggest a solution that is completely unrelated to Sphinx: iterate over all generated .html files in the build directory after sphinx has finished, parse the files with BeautifulSoup, inject the script at the bottom of the body, add the canonical URL to the head, etc. This seems easy to implement and very robust. Plus it will be build-system agnostic and work with Sphinx, MkDocs, AsciiDoc, etc.
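A rough sketch of that post-processing step, assuming BeautifulSoup is available; the function name and arguments are made up for illustration, and it assumes each page has a head and body:

import os
from bs4 import BeautifulSoup

def inject_rtd_assets(build_dir, canonical_base, script_url):
    # Walk the finished build and post-process every HTML page,
    # regardless of which tool generated it.
    for root, _dirs, files in os.walk(build_dir):
        for name in files:
            if not name.endswith('.html'):
                continue
            path = os.path.join(root, name)
            with open(path) as fh:
                soup = BeautifulSoup(fh.read(), 'html.parser')
            # Canonical URL into the <head>.
            rel_path = os.path.relpath(path, build_dir)
            soup.head.append(soup.new_tag(
                'link', rel='canonical', href=canonical_base + rel_path))
            # RTD script at the bottom of the <body>.
            soup.body.append(soup.new_tag('script', src=script_url))
            with open(path, 'w') as fh:
                fh.write(str(soup))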

This step could either live in rtd-build or in readthedocs.org itself as an after-build step. I would recommend leaving this logic in readthedocs.org as this is the repository that will also contain the JS files that get injected.

legacy vcs code

The doc_builder and vcs_support directories are only there for reference so far. They are not used anywhere in the new rtd-build. I will delete them right away to avoid any further confusion :)

gregmuellegger commented 9 years ago

Regarding running sphinx by importing it: I'm a clear -1 on using Sphinx as a library. We should run it via its command line interface. The main reason is that every user is using Sphinx via the CLI. If we use it as a library, we introduce a lot of potential for using it differently than our users do, which results in compatibility issues. For example, we might miss changes to how Sphinx calls its internal APIs when the CLI client is executed.

ericholscher commented 9 years ago

command output

The main issue with capturing command output that way is that it precludes the ability to do any kind of real-time output in the future. If we run each command separately, we can stream the output from that specific command, whereas returning it as JSON means that we effectively buffer all output until the end of the run.

docker & IPC

Same general issue here. It isn't a technical issue of having a long-running process, it's that we want to minimize the logic that is running inside of docker. With rtd-build abstracting a lot of that, it makes it less of an issue.

Injecting HTML

True. It feels like a dirty solution, but I guess it is the only way to make sure it works everywhere -- however there are still subsets of information that require the Jinja template context, and must be rendered within the actual build process. Things like the page name, specifically. So we can't totally work around this, though I guess we could derive it from the outputted filenames.

Sphinx import

Sure, running sphinx-build on every build makes sense, which is why we've historically done it. The main issue there is that we're running it 5 times on every project import, which is a huge waste of resources. For JSON & HTML output, for example, we read the files from the FS, parse them, transform them, and then render output, even though the first 3 steps are pretty much identical. The goal with doing it in Python was to have greater control over the build process so that we can make it more efficient.
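For example, a rough sketch with Sphinx's Python API, where two output formats share one read/parse pass through a common doctree directory (paths are placeholders, not the actual builder code):

from sphinx.application import Sphinx

common = dict(srcdir='docs', confdir='docs', doctreedir='_build/doctrees')

# The first build reads and parses all sources.
Sphinx(outdir='_build/html', buildername='html', **common).build()

# The second build reuses the pickled environment and doctrees from the
# shared doctree directory instead of parsing everything again.
Sphinx(outdir='_build/json', buildername='json', **common).build()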

agjohnson commented 9 years ago

Here are some of my thoughts on all of this. I feel like we are turning a lightweight module into a kitchen sink module here, so it might be worth stepping back first to see what we're building.

In its two forms, this module could be either:

Module A

This module is more strictly to ease the effort of building documentation, both locally and in our environment. It doesn't need to know how to build in docker build environments, because that is an RTD implementation detail -- though knowing how to build in docker containers is not necessarily a drawback. This tool also doesn't need to know how to report build commands or interact with RTD at all. It simply knows how to build documentation through sphinx/etc, and eases this process. It is the most lightweight wrapper we can make that pushes interaction with sphinx into a tool that provides end users with more reproducible builds.

It would execute sphinx/etc as a module -- this is an important next step in cleaning up how the sphinx build is implemented. It would shorten build times, as @ericholscher mentioned, an issue we've been running into more often with sphinx-autoapi. Controlling sphinx as a module also gives us the ability to control sphinx internals and inject magic into the process, making things like extension usage more magical. A readthedocs.yaml file would configure pieces of projects inside RTD, but also provide options for sphinx-build and conf.py overrides, to make local builds easier to replicate.

RTD would execute this as a CLI command as part of the build process, treating output and error handling as it normally would.

Virtualenv support could be built in, or could just as easily be provided by manual tooling. All we would be trying to enforce by including virtualenvs in this tool is fresh virtualenvs -- a use case that a tool like tox solves far better.

Module B

This work seems to be leaning towards moving all build functionality into a separate module. This could provide a purer virtualenv for building locally, and theoretically reduce support burden around virtualenvs and python package issues. It would require duplicating and refactoring a lot more code, however. Additionally, there will be some forking of builder code and a duplication of effort until we can unify the two systems again.

There would be benefits to moving building code out of RTD, but I'm not sure they outweigh the work that would have to go in at this point, or if this implementation would be any more useful to users than the implementation above.

Implementation

So, it's worth discussing the following points where RTD and this module would interact:

My thoughts on implementation details:

Anyways, just my thoughts, happy to have input on all of this.

gregmuellegger commented 9 years ago

using sphinx python api

I'm sold, the performance argument was not obvious to me. +1 on this.

two scenarios described by @agjohnson

I want to summarize the "Module A" and "Module B" sections in my own words to see if I understand it correctly:

Module A: rtd-build is solely a wrapper around sphinx, injecting RTD-specific sphinx behaviour. It only uses the parts of readthedocs.yml which directly influence how sphinx is called internally. All other configs in readthedocs.yml (like pip requirements) would be handled on rtfd.org.

Module B, rtd-build is a replication of the pipeline we currently have on rtfd.org, including VCS logic, virtualenv handling, running commands inside docker.

My understanding of rtd-build is somewhere in between those scenarios. It should be a function that turns the project source files into built documentation artifacts. In other words: rtdbuild(sourcefiles) -> html

It therefore abstracts away what happens in between, but nothing more. It's the only tool that knows about the readthedocs.yml config; rtfd.org should not know about this config. However, the project handling -- doing the VCS checkouts, executing the code in a container, storing the command output, etc. -- is something that rtd-build does not know about.

So in order to determine the logic that needs to go into rtd-build, we need to look at what it takes to convert source files into a build. And these are basically the steps that every user currently performs by hand in the project if they want to build the docs locally:

In order to fulfill these things, we need virtualenv handling inside rtd-build, the logic to run sphinx, and logic to place the artifacts in a predefined location.

So my vision is also a very lightweight wrapper around Sphinx plus virtualenv handling. The only reason virtualenv is in there is that we do need to support autodoc. Once autoapi, which doesn't use dynamic imports, takes over, we can completely drop virtualenv handling and are left with what @agjohnson suggested in Module A.
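As a hand-wavy sketch of that narrow scope (shelling out here purely to keep the example short; paths and options are placeholders, not the actual implementation):

import subprocess

def rtdbuild(source_dir, requirements=None,
             output_dir='_readthedocs_build/html'):
    # Fresh virtualenv with sphinx plus the project's requirements.
    subprocess.check_call(['virtualenv', 'venv'])
    subprocess.check_call(['venv/bin/pip', 'install', 'sphinx'])
    if requirements:
        subprocess.check_call(['venv/bin/pip', 'install', '-r', requirements])
    # Build the docs into a predefined place.
    subprocess.check_call(['venv/bin/sphinx-build', '-b', 'html',
                           source_dir, output_dir])
    return output_dir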

IPC

So if I'm getting this right, by IPC you mean that the command output is transferred from inside docker to a monitoring system outside. I don't understand yet what the drawback of a longer-running docker command is. You say that you lose all output of the failing command. That would mean that you only see the output of succeeding commands, and the benefit of splitting it up into multiple commands is that you will know which command has failed, but not how (as you don't have the output).

Regardless of my (probably weird and incorrect) understanding of this I have a suggestion for monitoring the command output. @agjohnson said:

Proper IPC with sockets or something like ØMQ would be preferable, but would be more complicated than separate commands.

If we do something like this, it should definitely live outside of rtd-build. It could be implemented as a wrapper command, so that the actual command that is run inside docker is:

rtd-track-output rtd-build

Then rtd-track-output will record stdout/stderr of rtd-build and redirect it to ZeroMQ, a socket, a logging server, etc.
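A minimal sketch of that hypothetical wrapper, assuming pyzmq; the endpoint and transport are placeholders:

import subprocess
import sys

import zmq

def main():
    command = sys.argv[1:]  # e.g. ['rtd-build']
    socket = zmq.Context().socket(zmq.PUSH)
    socket.connect('tcp://log-collector:5555')  # placeholder endpoint

    proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    for line in iter(proc.stdout.readline, b''):
        sys.stdout.write(line.decode('utf-8', 'replace'))  # keep local output
        socket.send(line)  # ship every line out of the container
    sys.exit(proc.wait())

if __name__ == '__main__':
    main()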

Sorry if I'm a bit of a novice on the docker side. I always thought docker run ... would write to stdout and that we could record this output from outside docker however we like.

gregmuellegger commented 9 years ago

Reading your comments about IPC, I wonder if I'm understanding correctly how docker would be used here.

I assume the following things for use with rtd-build:

Is this correct?

agjohnson commented 9 years ago

Regarding the Docker usage workflow, your assumption is roughly correct. This used to execute a long-running command and report back via JSON over STDOUT, but now it runs a noop command that dies after X seconds, and commands are executed in background processes using Docker's exec_create et al.

Relying on a long-running command definitely means the container needs socket communication to constantly ship updates. When I say we don't get failure, I mean that a failure at the container level -- if the command in the container is OOM-killed, for instance -- will result in no output. So we can't get around using sockets. The current execution scheme makes this easy because Docker's API abstracts that away and gives us metadata on the inner command failures without having to prepare the container for it.
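Roughly, that background-execution scheme looks like this with docker-py's low-level API (container name and command are placeholders):

import sys

import docker

client = docker.APIClient()  # docker.Client in older docker-py releases
exec_id = client.exec_create('build-container',
                             'rtd-build --step=build-html')['Id']

# Stream output while the command runs inside the already-running container.
for chunk in client.exec_start(exec_id, stream=True):
    sys.stdout.write(chunk.decode('utf-8', 'replace'))

# Docker reports the inner command's exit status without any socket setup.
exit_code = client.exec_inspect(exec_id)['ExitCode']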

We can use STDIN/STDOUT, but this is a lazy socket, and one that is fragile. We would have to handle streaming and (unbuffered?) serialized socket data, not to mention killing all spurious output to ensure nothing will inject into STDOUT and ruin our output. This introduces complexity to debugging and setup. We could add socket communication to this, but now we are building host and client socket communication on top of this, not to mention orchestrating docker to use unix sockets or opening up tcp ports.

I am -1 on long running commands inside containers not because it doesn't work, but because it bloats the docker container and the container's responsibilities for seemingly no benefit over background executed commands. All this work above only gives us an approximation of feature parity with what we get with background processes and the Docker API.

In terms of what this tool is, I think what you describe is mostly in line with what I picture. I need to think more on this one. If we make this into a thin wrapper around sphinx, with no knowledge or care of docker/its environment, there are some places where the integration gets cumbersome. I still think this is the direction we need to go, but I need to come back to this and map out those few points.

I still don't think virtualenvs need to be part of the tool, however. We can certainly still manage dependencies and even simplify dependency installation, but I feel like virtualenv management is bloat for this tool that users might not be expecting. Also, though in our ideal world no one is executing code to inspect docstrings, I don't think we'll ever get 100% away from executing code or having virtualenvs, so this will likely always live somewhere.

Anyways, I'll come back to this shortly.

gregmuellegger commented 9 years ago

It has been a while since we talked, but as far as I can see from the last posts, we have three main points to discuss: 1. whether we include virtualenv logic inside rtd-build, 2. how we monitor the build logs, 3. whether the build can be split into multiple smaller commands.

1. virtualenv logic

I would like to keep the virtualenv logic inside rtd-build. Otherwise it's going to be really difficult to reproduce builds on the user's computer that fail on readthedocs.org. I also talked to @ericholscher about this, and he seems to be in line with my view that it's not a big burden to manage a virtualenv with rtd-build. The code will be quite minimal, and I don't see a risk that it will become a maintainability problem, as the domain of the code is very narrow.

2. monitoring output

As you describe it, it seems to be necessary to communicate with a process outside of docker for the stdout monitoring. As I expressed in a previous comment, I prefer having a command that can wrap the rtd-build calls, like: rtd-track-output rtd-build. But if you see the stdout between rtd-build and the wrapping rtd-track-output as unreliable, then there is probably no way to keep the socket-logging logic outside of rtd-build.

In that case I'm ok with putting the monitoring logic into rtd-build itself.

3. multiple small commands

@ericholscher explained to me in person again why it's preferable to use multiple small commands for the build instead of one command. He mainly mentioned that it's otherwise not practical to limit resources for the individual steps of the build. That makes sense to me and I want to support having multiple commands.

Therefore I propose that running the rtd-build command still creates the full build from start to finish. However, passing the --step option makes it possible to run only the small part of the build that should happen next.

For example:

rtd-build

should be equivalent to

rtd-build --step=virtualenv-setup  # performs `virtualenv` and `pip install sphinx ...`
rtd-build --step=requirements-install  # performs pip install -r requirements.txt
rtd-build --step=package-install  # performs python setup.py install
rtd-build --step=build-html  # builds HTML docs with sphinx

# builds pdf docs with sphinx, will be a noop if no pdf builds are configured in readthedocs.yml
rtd-build --step=build-pdf

# builds epub docs with sphinx, will be a noop if no epub builds are configured in readthedocs.yml
rtd-build --step=build-epub
rtd-build --step=build-search-json  # builds json with sphinx

That way the user can produce a build locally in the simplest way possible with rtd-build, but can also reproduce the exact procedure from the build server by issuing the steps manually, in the correct order, on the command line.
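Hypothetically, the dispatch inside rtd-build could stay very small (step names taken from above; the actual build functions are omitted):

import argparse

STEPS = ['virtualenv-setup', 'requirements-install', 'package-install',
         'build-html', 'build-pdf', 'build-epub', 'build-search-json']

def run_step(name):
    print('running step: {0}'.format(name))  # real build logic would go here

def main():
    parser = argparse.ArgumentParser(prog='rtd-build')
    parser.add_argument('--step', choices=STEPS)
    args = parser.parse_args()
    # No --step given: run the full build from start to finish.
    for step in ([args.step] if args.step else STEPS):
        run_step(step)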

agjohnson commented 9 years ago

So my main worry here is that we're building too much into this tool for the sake of refactoring. We're focusing more on what this tool means for our implementation than on what it means for authors. This tool was supposed to be a tool for authoring, but I don't feel like we're giving authors a reason to use it.

So what problems do our users have?

There are probably some more obvious questions there.

Either way, my thoughts on this are that if it isn't of use to authors, then we shouldn't be complicating this tool with it. I feel the issues around dependencies and reproducible builds can't be solved by virtualenv and dependency management. Without the need for virtualenv management, there is no need to make this tool execute commands via subprocess/docker. I think communicating back to RTD is also superfluous, and building any sort of IPC between docker would be solidly out of scope.

This greatly reduces the scope of the tool, and hopefully allows us to focus more on the author experience, rather than refactoring.

In my mind, this tool:

I think exposing the standard Makefile targets as positional arguments to rtd-build would be a good idea. So building on top of your idea:

rtd-build html
rtd-build epub
rtd-build pdf

Perhaps this would feel very natural for authors?

agjohnson commented 9 years ago

Also, it's worth mentioning that there are many better ways for authors to test in a pristine environment -- tox is one, CI tooling is another. We're not going to depend on these tools, but maybe our answer for Python developers that need to test their docs in a pristine environment is to use one of these methods.

There are also some users that we'll want to guide towards using CI tooling anyway, which aligns with our idea of eventually being able to ship built documentation with this tool.

gregmuellegger commented 9 years ago

@agjohnson OK, this all sounds reasonable now.

The virtualenv handling is then something that is configured in the YAML file but is not handled by the rtd-build tool. That will require us to parse the YAML file in the build system on readthedocs.org, but we can of course use the rtd-build source as a library for this. No big deal here, we just need to keep it in mind.
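For example, something along these lines; the import path and load() signature are assumptions about the rtd-build code, not a documented API:

# Hypothetical use of rtd-build's config parsing as a library from the
# readthedocs.org build system; module path and signature are assumed.
from readthedocs_build.config import load

for config in load('/path/to/checkout'):
    # readthedocs.org would read e.g. the python/requirements settings here
    # and set up the virtualenv itself, outside of rtd-build.
    print(config['name'], config.get('python'))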

ericholscher commented 8 years ago

Going to go ahead and merge this, as it's way more up to date and useful than the current state of the repo. Will re-open this as a ticket with TODOs.