
Doc Builder Plan Notes #282

Closed jfischoff closed 3 years ago

jfischoff commented 9 years ago

Doc Builder Plan Notes

An attempt to record some conversations with @dcoutts

The idea is to ask all the (trusted) clients to build docs when building the packages. After a build finishes, the client uploads a build report with multiple components, one of which is a doc build report. Failure to receive a report within a timeout will be recorded as a failure.

The server will maintain a table of build results, which would include the doc result. Let's call it the Build Results table. It is possible this information is already stored in another table.

| Id | Package Identifier | Platform | Build Result | Tests Result | Doc Result |
| --- | --- | --- | --- | --- | --- |
| 1 | text-1.0 | ghc-7.6-windows-x86_64 | Success | Success | Fail |
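
As a rough sketch, those rows might be represented with types like the following; all names here are hypothetical, and the server may already have an equivalent structure:

```haskell
module BuildResults where

import Data.Map (Map)

type PackageId = String  -- e.g. "text-1.0"
type Platform  = String  -- e.g. "ghc-7.6-windows-x86_64"

data Outcome = Success | Fail
  deriving (Eq, Show)

-- One row of the Build Results table.
data BuildResult = BuildResult
  { buildOutcome :: Outcome
  , testsOutcome :: Outcome
  , docsOutcome  :: Outcome
  } deriving (Eq, Show)

-- Keyed by (package, platform), so a fresh report from a client
-- replaces any earlier one for the same combination.
type BuildResults = Map (PackageId, Platform) BuildResult
```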

The server has an ordered list of the platforms it prefers for the canonical copy of a package's docs. We will call this the Preference List; it will also be stored in acid-state, and there will be CRUD operations for it via a REST API.

Preference List: `["ghc-7.6-windows-x86_64", ...]`
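
A minimal acid-state sketch of the Preference List and the read/write halves of its CRUD operations (all names hypothetical; the REST handlers would invoke these via acid-state's `update` and `query`):

```haskell
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE TypeFamilies #-}
module PreferenceList where

import Control.Monad.Reader (ask)
import Control.Monad.State (put)
import Data.Acid (Query, Update, makeAcidic)
import Data.SafeCopy (base, deriveSafeCopy)

-- The ordered list of platforms preferred for canonical docs.
newtype PreferenceList = PreferenceList [String]

$(deriveSafeCopy 0 'base ''PreferenceList)

-- Replace the whole list (the "update" half of CRUD).
setPreferences :: [String] -> Update PreferenceList ()
setPreferences = put . PreferenceList

-- Read the list back (the "read" half of CRUD).
getPreferences :: Query PreferenceList [String]
getPreferences = do
  PreferenceList ps <- ask
  return ps

$(makeAcidic ''PreferenceList ['setPreferences, 'getPreferences])
```

With `makeAcidic`, the REST layer could then run `update st (SetPreferences ps)` and `query st GetPreferences` against the opened state.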

The server will maintain a table in acid-state listing the docs it wants from the clients: the Doc Want table.

The table will have two columns:

| Package Identifier | Platform |
| --- | --- |
| text-1.0 | ghc-7.6-windows-x86_64 |
| ... | ... |

When a package is queued to build, an entry for the highest-precedence platform from the Preference List is added to the Doc Want table.
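
Since each row is just a (package, platform) pair, a set suffices; here is a sketch of that representation together with the queueing step, under hypothetical names:

```haskell
module DocWant where

import Data.Set (Set)
import qualified Data.Set as Set

type PackageId = String
type Platform  = String

-- Each row is a (package, platform) pair the server still wants docs for.
type DocWantTable = Set (PackageId, Platform)

-- When a package is queued to build, record that we want its docs
-- on the highest-precedence platform from the Preference List.
queueDocWant :: [Platform] -> PackageId -> DocWantTable -> DocWantTable
queueDocWant (preferred : _) pkg table = Set.insert (pkg, preferred) table
queueDocWant []              _   table = table  -- empty Preference List
```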

When a client reports a failure:

  1. First the failure is logged in Build Results table.
  2. If it exists in the Doc Want table, it is deleted.
  3. The next platform that is not already logged as a failure in the Build Results table is added to the Doc Want table.

It is possible that all three steps must be part of an atomic transaction, to keep the server from entering an invalid state.
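
A sketch of what that could look like as a single acid-state `Update`, which the server serialises so readers never observe a half-applied state (the state layout and names are assumptions):

```haskell
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE TypeFamilies #-}
module DocFailure where

import Control.Monad.State (get, put)
import Data.Acid (Update, makeAcidic)
import Data.List (find)
import Data.Map (Map)
import qualified Data.Map as Map
import Data.SafeCopy (base, deriveSafeCopy)
import Data.Set (Set)
import qualified Data.Set as Set

type PackageId = String
type Platform  = String

data Outcome = Success | Fail
  deriving (Eq, Show)

data ServerState = ServerState
  { docResults  :: Map (PackageId, Platform) Outcome  -- doc column of Build Results
  , docWant     :: Set (PackageId, Platform)          -- the Doc Want table
  , preferences :: [Platform]                         -- ordered Preference List
  }

$(deriveSafeCopy 0 'base ''Outcome)
$(deriveSafeCopy 0 'base ''ServerState)

-- All three failure-handling steps in one transaction, so they
-- commit together or not at all.
reportDocFailure :: PackageId -> Platform -> Update ServerState ()
reportDocFailure pkg plat = do
  st <- get
  let results' = Map.insert (pkg, plat) Fail (docResults st)  -- step 1: log failure
      want'    = Set.delete (pkg, plat) (docWant st)          -- step 2: drop the entry
      -- step 3: first preferred platform with no recorded doc failure
      next     = find (\p -> Map.lookup (pkg, p) results' /= Just Fail)
                      (preferences st)
      want''   = maybe want' (\p -> Set.insert (pkg, p) want') next
  put st { docResults = results', docWant = want'' }

$(makeAcidic ''ServerState ['reportDocFailure])
```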

Periodically, the clients will poll the Doc Want table via a GET and upload any docs they (the build clients) have built that the server wants.

When the docs have been successfully uploaded, they are removed from the Doc Want table and success is returned to the client.
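
The client side of that cycle might look roughly like this; the `/doc-want` endpoint, its one-pair-per-line wire format, and `uploadDocs` are all illustrative assumptions, not the actual hackage-server API:

```haskell
module DocClient where

import Control.Concurrent (threadDelay)
import Control.Monad (forM_, forever)
import qualified Data.ByteString.Lazy.Char8 as L8
import Network.HTTP.Client
  (Manager, defaultManagerSettings, httpLbs, newManager,
   parseRequest, responseBody)

-- Poll the (hypothetical) Doc Want endpoint, then upload whichever
-- wanted docs this client has already built.
pollLoop :: Manager -> IO ()
pollLoop mgr = forever $ do
  req  <- parseRequest "http://hackage.haskell.org/doc-want"  -- assumed endpoint
  resp <- httpLbs req mgr
  -- Assumed wire format: one "package platform" pair per line.
  let wanted = map words (lines (L8.unpack (responseBody resp)))
  forM_ wanted $ \entry -> case entry of
    [pkg, plat] -> uploadDocs mgr pkg plat
    _           -> return ()  -- skip malformed lines
  threadDelay (10 * 60 * 1000000)  -- wait ten minutes between polls

-- Placeholder for the real upload: PUT the doc tarball for this
-- (package, platform) if we built it; on success the server removes
-- the corresponding Doc Want entry.
uploadDocs :: Manager -> String -> String -> IO ()
uploadDocs _ pkg plat =
  putStrLn ("would upload docs for " ++ pkg ++ " on " ++ plat)

main :: IO ()
main = newManager defaultManagerSettings >>= pollLoop
```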

There are optimizations that can be made, but for the first pass it's important that the system is correct, durable, and resilient to failure.

One area I am worried about is the possibility that the server will hold an entry in the Doc Want table for a build that has already failed.

I think some more thought can help prevent this, or at least reduce its probability, but it is also worth designing a way to detect this invalid state in production.

bitemyapp commented 9 years ago

Could this work be avoided by shifting how we interact with Haddock as we discussed for a bit?

Directly serving the documentation tarballs as generated by Haddock is too rigid for what we'd really like to do; getting a parsed AST back, so the content can be rendered in a template of our choosing, would be more ideal. This implies that we would not need to build docs at all, only run the parser on the source in the tarball and validate it.

jfischoff commented 9 years ago

What you are referring to is the output format of the docs, which ideally would not be HTML but some intermediate format from Haddock.

But regardless of what is produced, we will still need a plan to coordinate the work and make sure that only one version of the docs is uploaded.

If Haddock is modified to produce an intermediate format, then we can have non-trusted builders, i.e. anyone, build the docs. At that point, it will be even more important that only a small number of the built docs are uploaded, since we will have more builders and thus more upload bandwidth coming from them.

bitemyapp commented 9 years ago

@jfischoff what I'm proposing wouldn't require a dedicated doc builder or asynchronous process at all; it would happen on upload, and it wouldn't require building the package.

lambda-fairy commented 9 years ago

@bitemyapp Won't Template Haskell stand in the way of that goal?

jfischoff commented 9 years ago

@bitemyapp Ah, I see. I also proposed this, and here are some reasons it is not ideal:

  1. The docs uploaded might not match the package
  2. The docs uploaded might not be built on the platform we want to serve
  3. We want to know whether the docs build on different platforms. I could even see us ultimately serving multiple versions of the docs.

The other reason to use doc builders is that we will have package builders online that can do the work for us, so we might as well use them.

hvr commented 9 years ago

@bitemyapp I don't think we can avoid doing more than merely parsing: beyond stuff like TH and CPP, you also need to perform name resolution (to find out where something reexported really comes from), as well as type inference when top-level type signatures are missing. Also, some packages rely on non-trivial cabal configure steps to generate code and header files, which may affect the resulting code.
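
To illustrate, here is a small hypothetical module (using lens's `makeLenses` as the TH example) where parsing alone cannot produce the docs: the reexport needs name resolution, the splice must actually run, and the unsigned binding needs type inference:

```haskell
{-# LANGUAGE TemplateHaskell #-}
module Example (Map, origin, point) where

import Control.Lens (makeLenses)
import Data.Map (Map)  -- reexported above: name resolution is needed
                       -- to document where Map really comes from

data Point = Point { _px :: Int, _py :: Int }

-- The px/py lenses only exist after this splice runs, so a pure
-- parser never sees their definitions or types.
makeLenses ''Point

origin :: Point
origin = Point 0 0

-- No top-level signature: the type shown in the docs must be inferred.
point = Point 1 1
```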

bitemyapp commented 9 years ago

That is unfortunate, but it makes sense. This plan should be useful regardless, then, even if we can later extract information via Haddock post-build.

dcoutts commented 9 years ago

Also note that this plan covers coordinating builder clients that give us general build results, not just docs. So even if, in the future, we can get Haddock to produce data and have all clients upload it, this system is still useful for coordinating a pool of builder clients (think checking whether things build, and collecting test results, with various ghc/hp versions across a range of platforms).