jfischoff closed this issue 3 years ago
Could this work be avoided by shifting how we interact with Haddock as we discussed for a bit?
The direct serving of tarballs of documentation as generated by Haddock is too rigid for what we'd really like to be able to do. Just getting a parsed AST back, so the content can be rendered in a template of our choosing, would be more ideal. This implies that we need not build any docs, only run the parser on the source in the tarball and validate that.
What you are referring to are the results of the doc build, which ideally would not be HTML but some intermediate format from Haddock.
But regardless of what is produced, we will still need a plan to coordinate the work and make sure that only one version of the docs is uploaded.
If Haddock is modified to produce an intermediate format, then we can have non-trusted clients, i.e. anyone, build the docs. At that point in the future, it will be more important that only a small number of the docs that are built are uploaded, since we will have more builders and thus more bandwidth coming from them.
@jfischoff what I'm proposing wouldn't require a dedicated doc builder or asynchronous process at all, it would happen on upload. It wouldn't require building the package at all.
@bitemyapp Won't Template Haskell stand in the way of that goal?
@bitemyapp Ah, I see. I also proposed this, and here are some reasons it is not ideal:
The other reason to use doc builders is that we will have package builders online that can do the work for us, so we might as well use them.
@bitemyapp I don't think we can avoid doing more than merely parsing. Beyond stuff like TH and CPP, you also need to be able to perform name resolution (to find out where something re-exported really comes from) as well as type inference when top-level type signatures are incomplete. Also, some packages rely on non-trivial cabal configure steps to generate code and header files, which then may affect the resulting code.
That is unfortunate, but it makes sense. This plan should be useful regardless, then, even if we can extract information via Haddock post-build later.
Also note that this plan covers coordinating builder clients that give us general build results, not just docs. So even if in the future we can get Haddock to produce data and have all clients upload that, this system is still useful for coordinating a pool of builder clients (think checking whether things build, and collecting test results, with various GHC/HP versions across a range of platforms).
Doc Builder Plan Notes
An attempt to record some conversations with @dcoutts
The idea is to ask all the (trusted) clients to build docs when building the packages. The clients upload a build report after the build has finished, with multiple components, one of which is a doc build report. Failure to receive a report before a timeout will be recorded as a failure.
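To make the report/timeout rule concrete, here is a minimal sketch of what a per-build report might look like. All type and field names here are my assumptions for illustration, not the actual Hackage server types:

```haskell
-- Hypothetical shape of the report a build client uploads.
data DocBuildOutcome = DocsOk | DocsFailed String
  deriving (Show, Eq)

data BuildReport = BuildReport
  { brPackage  :: String          -- e.g. "lens-4.19.2"
  , brPlatform :: String          -- e.g. "ghc-7.6-windows-x86_64"
  , brDocs     :: DocBuildOutcome -- the doc build component of the report
  } deriving (Show, Eq)

-- A report that never arrives before the timeout counts as a failure.
outcomeOrTimeout :: Maybe BuildReport -> DocBuildOutcome
outcomeOrTimeout Nothing  = DocsFailed "timed out waiting for report"
outcomeOrTimeout (Just r) = brDocs r
```

The point of folding the timeout into the same `DocBuildOutcome` type is that the rest of the server logic never has to treat "no report" as a special case.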
The server will maintain a table for build results, which could include the doc result. Let's call it the Build Results table. It is possible this information is already stored in another table.
The server has a list of which platforms it prefers for the canonical copy of the docs for a package. We will call this the Preference List; it will be stored in acid-state as well, and there will be CRUD operations for it via a REST API.
Preference List
["ghc-7.6-windows-x86_64", ...]
The server will maintain a table in acid-state that has the docs it is looking for from the clients. The Doc Want table.
The table will have two columns
When a package is queued to build, the highest-precedence entry from the Preference List is added to the Doc Want table.
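The selection step above could be expressed as a small pure function. This is only a sketch, and it assumes two things not stated in the plan: that the Preference List is ordered most- to least-preferred, and that we should skip preferences no queued builder can satisfy:

```haskell
import Data.List (find)

type Platform = String

-- Pick the entry to add to the Doc Want table: the first platform in
-- the Preference List that the package is actually queued to build on.
-- (The filter against queued platforms is my assumption.)
preferredDocPlatform :: [Platform]   -- Preference List, in order
                     -> [Platform]   -- platforms the package is queued on
                     -> Maybe Platform
preferredDocPlatform prefs queued = find (`elem` queued) prefs
```

Returning `Maybe Platform` makes the "no suitable builder" case explicit instead of leaving a dangling Doc Want entry.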
When a client reports a failure:
It is possible that all three steps must be part of an atomic transaction, to keep the server from being in an invalid state.
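One way to get that atomicity with acid-state is to express the whole failure-handling update as a single pure transition over the server state, which acid-state then persists as one transaction. This is only a sketch; the table shapes and the fall-back-to-next-preference step are my assumptions about what the steps might be:

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

type Key = (String, String)            -- (package, platform), assumed shape

data ServerState = ServerState
  { buildResults :: Map.Map Key Bool   -- Build Results table (True = success)
  , docWant      :: Set.Set Key        -- Doc Want table
  } deriving (Show, Eq)

-- All updates happen inside one pure transition, so readers can never
-- observe a half-updated state.
onDocBuildFailure :: Key -> Maybe String -> ServerState -> ServerState
onDocBuildFailure key@(pkg, _) fallback st = ServerState
  { buildResults = Map.insert key False (buildResults st)   -- record failure
  , docWant      = maybe id (\p -> Set.insert (pkg, p)) fallback
                     (Set.delete key (docWant st))          -- retarget want
  }
```

Because the transition is a pure function, it is trivially all-or-nothing under acid-state, which is exactly the property the note above asks for.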
Periodically, the clients will poll the Doc Want table via a GET and upload the instances of the docs that they (the build clients) have and that the server wants.
When the docs have been successfully uploaded, they are removed from the Doc Want table and success is returned to the client.
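The poll/upload cycle boils down to two set operations, sketched below. The key shape and function names are assumptions for illustration:

```haskell
import qualified Data.Set as Set

type DocKey = (String, String)  -- (package, platform), assumed shape

-- Client side: after fetching the Doc Want table via GET, upload the
-- intersection of what the server wants and what this client has built.
docsToUpload :: Set.Set DocKey -> Set.Set DocKey -> Set.Set DocKey
docsToUpload wanted have = wanted `Set.intersection` have

-- Server side: a successful upload removes the entry from Doc Want.
acknowledgeUpload :: DocKey -> Set.Set DocKey -> Set.Set DocKey
acknowledgeUpload = Set.delete
```

Note that this pull model keeps the clients stateless with respect to the server: they only ever act on the wanted set they just fetched.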
There are optimizations that can be made, but for the first pass it is important that it is correct, durable, and resilient to failure.
One area I am worried about is the possibility that the server will have an entry in the Doc Want table for a build that has failed.
I think some more thought can help prevent this, or at least reduce its probability, but it is also worth thinking about a system that can detect this invalid state, potentially in production.
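One cheap detector for that invalid state would be a periodic sweep that cross-checks the two tables. A sketch, reusing the assumed table shapes from above:

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

type Key = (String, String)  -- (package, platform), assumed shape

-- Entries in the Doc Want table whose build is already recorded as a
-- failure can never be satisfied; a periodic sweep could flag them
-- (or re-queue them on another platform).
staleWants :: Map.Map Key Bool   -- Build Results table (True = success)
           -> Set.Set Key        -- Doc Want table
           -> Set.Set Key
staleWants results = Set.filter (\k -> Map.lookup k results == Just False)
```

Running this as a low-frequency cron-style check would catch the invalid state in production even if a bug in the transaction logic lets it slip through.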