RFC: Registry UI

janosdebugs commented 2 months ago

Abstract

The registry needs a user interface to display documentation to users and make it searchable. This RFC will detail the technical implementation of the registry UI for the purposes of making a decision.

Design considerations

During the preliminary discussions, we have outlined the following criteria:

The registry must only expose properly licensed files. Providers that do not exhibit an OSI-approved license should not have their documentation exposed.
The UI should be updated alongside the registry itself. If a new version of a provider or module is added to the registry, the documentation should be updated too.
The UI should feel almost instantaneous, so that users prefer it for reading documentation.
The UI should be indexed by search engines.
The UI should require very little maintenance.
The UI must make sure that the markdown parser does not let through HTML tags that could be dangerous (e.g. script tags). It may be preferable to not allow HTML tags at all.

Background research

To evaluate the size of the registry, I have downloaded about half of all providers present in the OpenTofu registry. The docs folders amounted to roughly 6 GB of data and 850k files. By extrapolation, this would mean that the registry UI would be roughly 10-15 GB in size and contain 1.5-2M files if hosted statically, assuming one markdown file corresponds to one HTML file. Packing all files related to a provider/version is not feasible because some providers (AWS, Azure, etc) contain hundreds of files and would negatively impact loading times.

Front-end technology

When it comes to front-end technology, we have five options:

1. Static files with minimal JS for search functionality. This is the simplest solution and the least likely to break. It also allows for heavy optimization for loading times.
2. Server-side rendering using Cloudflare Workers and caching. This is more complex and hard to test, but saves us from having to pre-generate HTML files and potentially reupload the entire registry UI if we change the layout.
3. SPA loading from GitHub. The raw.githubusercontent.com domain has CORS headers set, so this would be feasible. We would only need to index the repositories.
4. SPA that loads markdown files from an R2 bucket. This would require us to upload the markdown files and keep them in sync.
5. XML data files with XSLT style sheets. It's a remnant from a bygone era, but browsers support it and Google indexes it. This would allow us to change the layout as we please without having to recreate all HTML files but still enjoy the benefits of static files. It would cause one additional request to load the XSLT file.

Given the performance and search indexing requirements above, solution 1 seems to be the most appropriate, although solution 3 is definitely intriguing because it would require very little maintenance.

Front-end optimization

HTTP/2 promised faster front-end loading times through Server Push, but Server Push has several problems and never gained wide adoption. It has since been removed from browsers. The alternatives to Server Push require extra round trips.

It is worth noting that external resources, such as linked CSS files, may cause caching issues across such a vast number of HTML files. CSS files should therefore either be inlined or strictly URL-versioned.
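As an illustration of strict URL versioning, here is a minimal sketch that derives the CSS file name from a hash of its content, so any change to the stylesheet yields a new URL and stale caches are never hit. The `versionedName` helper and the `name.<hash>.ext` scheme are assumptions made for this example, not something the RFC prescribes.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path"
	"strings"
)

// versionedName returns a file name that embeds a short content hash,
// e.g. "style.css" becomes "style.<12 hex chars>.css".
// The naming scheme is hypothetical; any stable content-derived name works.
func versionedName(name string, content []byte) string {
	sum := sha256.Sum256(content)
	ext := path.Ext(name)
	base := strings.TrimSuffix(name, ext)
	return fmt.Sprintf("%s.%s%s", base, hex.EncodeToString(sum[:])[:12], ext)
}

func main() {
	css := []byte("body { font-family: sans-serif; }")
	// The generated HTML files would link to this name instead of "style.css".
	fmt.Println(versionedName("style.css", css))
}
```

The trade-off is that every HTML file referencing the stylesheet has to be regenerated when the hash changes, which is the same cost inlining incurs.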

One of the alternatives that we should make use of is inlining critical resources, such as CSS files. Since the registry UI can be fairly minimalistic in its appearance, inlining resources is feasible. Incidentally, this technique also bypasses caching issues with linked resources.
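A minimal sketch of inlining with Go's standard `html/template` package follows. The layout, the `page` struct, and the `renderPage` helper are hypothetical; the sketch assumes the stylesheet is trusted and that the body HTML has already been rendered and sanitized elsewhere.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
)

// pageTemplate is a hypothetical minimal layout that carries the stylesheet
// in a <style> element instead of linking to an external CSS file.
var pageTemplate = template.Must(template.New("page").Parse(
	`<!DOCTYPE html><html><head><title>{{.Title}}</title>` +
		`<style>{{.CSS}}</style></head><body>{{.Body}}</body></html>`))

type page struct {
	Title string        // plain text, escaped by html/template
	CSS   template.CSS  // trusted stylesheet, emitted as-is
	Body  template.HTML // pre-rendered, already sanitized HTML, emitted as-is
}

// renderPage produces a single self-contained HTML document.
func renderPage(title, css, body string) (string, error) {
	var buf bytes.Buffer
	p := page{Title: title, CSS: template.CSS(css), Body: template.HTML(body)}
	if err := pageTemplate.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderPage("aws_instance", "body{margin:0}", "<h1>aws_instance</h1>")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Because `template.CSS` and `template.HTML` bypass escaping, they should only ever wrap content the build process controls or has sanitized.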

Build process

In order to build the HTML files, we should create a Go library that does the following:

Perform a sparse clone of the given registry. It should also provide the option to use an already existing clone for bulk updates.
Sparse checkout the LICENSE/LICENSE.txt/LICENSE.md files, as well as the docs/ and/or website/docs folders.
Perform license detection. I have used go-license-detector in the past and it seemed satisfactory.
Render the markdown. Hugo uses Goldmark because it's extensible in how it parses and renders HTML. Hugo also uses Chroma for pre-generated syntax highlighting, which is beneficial for performance. For safety, we may want to re-parse the HTML and sanitize for disallowed tags and attributes. We should also make sure we add CSP meta tags to prevent some classes of attacks.
Embed the rendered HTML into our template.
Upload the resulting HTML files to an R2 bucket, cleaning out any files that should no longer be there. We can do this externally, or build it into the application.
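To make the checkout step concrete, here is a small sketch of the path selection it describes: keep the license file candidates and anything under docs/ or website/docs/, and ignore the rest of the repository. The `wanted` helper is hypothetical, and a real implementation might express the same rule as git sparse-checkout patterns rather than Go code.

```go
package main

import (
	"fmt"
	"strings"
)

// wanted reports whether a repository-relative path (using "/" separators)
// belongs to the set the build needs: a top-level license file, or a file
// under docs/ or website/docs/.
func wanted(p string) bool {
	switch p {
	case "LICENSE", "LICENSE.txt", "LICENSE.md":
		return true
	}
	return strings.HasPrefix(p, "docs/") || strings.HasPrefix(p, "website/docs/")
}

func main() {
	// Example paths, made up for illustration.
	paths := []string{
		"LICENSE",
		"main.go",
		"docs/index.md",
		"website/docs/r/instance.html.markdown",
		"internal/docs/notes.md",
	}
	for _, p := range paths {
		fmt.Printf("%-40s %v\n", p, wanted(p))
	}
}
```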

For development purposes, we may want to provide a separate binary that lets provider authors start this in their local development version.

Costs

Based on the R2 pricing, the initial upload and any full refresh will cost us 5-10 USD. Storage is billed at $0.015 per GB per month, which comes to roughly 0.15-0.23 USD per month for 10-15 GB.
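The estimate can be reproduced with a short calculation. The sketch below assumes R2's published rates of $4.50 per million Class A (write) operations and $0.015 per GB-month of storage, one write operation per uploaded file, and no free-tier allowance. The Class A rate and the one-write-per-file assumption are not stated in this RFC, and the function names are made up for the example.

```go
package main

import "fmt"

const (
	// Assumed R2 list prices; check the current pricing page before relying on them.
	classAPerMillionUSD  = 4.50  // per million Class A (write) operations
	storagePerGBMonthUSD = 0.015 // per GB stored per month
)

// uploadCostUSD estimates the cost of uploading the given number of files,
// assuming one Class A operation per file.
func uploadCostUSD(files int) float64 {
	return float64(files) / 1e6 * classAPerMillionUSD
}

// storageCostUSD estimates the monthly cost of storing the given number of GB.
func storageCostUSD(gb float64) float64 {
	return gb * storagePerGBMonthUSD
}

func main() {
	// File counts and sizes are the extrapolated ranges from the background research.
	fmt.Printf("full upload: %.2f to %.2f USD\n", uploadCostUSD(1_500_000), uploadCostUSD(2_000_000))
	fmt.Printf("storage:     %.2f to %.2f USD per month\n", storageCostUSD(10), storageCostUSD(15))
}
```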

cube2222 commented 2 months ago

I'm personally not a huge fan of pre-generating all pages, as then each change we do to the frontend will:

Take a while to deploy, as it has to download 10 GB worth of markdown files, regenerate them, and then upload them.
The above will cost us 10 USD for all the operations involved, every time we make a change.

I would instead heavily suggest going with either

javascript frontend rendering based on markdown files fetched from our API
rendering a fully-static page dynamically via cloudflare workers, then caching it heavily

Slightly relevant blog post: https://blog.cloudflare.com/serverless-rendering-with-cloudflare-workers/

Though I personally believe a simple javascript frontend that fetches the markdown files and renders them is the most boring solution.

Static generation also heavily limits our flexibility for the future, like wanting to add new dynamically fetched components, which I strongly dislike.

DicsyDel commented 2 months ago

To consider: Documentation is not 1 version per module/provider, it needs to be available for each module / provider version. So, it's much, much more than 10GB.

janosdebugs commented 2 months ago

@DicsyDel I downloaded all versions for the experiment.

matteoredaelli commented 2 weeks ago

OK for a web UI, but a command-line option for searching providers would also be useful, like:

tofu providers search postgres

Matteo

janosdebugs commented 2 weeks ago

@matteoredaelli thank you for your input. Since the CLI search is a separate effort, please open a separate issue for it.

flickerfly commented 4 days ago

Is the UI intended to be a separate app that uses an API the registry provides, or will it be built into the registry so that it cannot be separated from it?

janosdebugs commented 4 days ago

@flickerfly the current UI in development indexes the registry data in the registry repo and uploads it to a separate R2 bucket, which is consumed by a React frontend. Why do you ask?

flickerfly commented 4 days ago

I'm interested in the registry as a tool to enable air-gapped and scripted installs/updates of infrastructure. In that case, I wouldn't need the UI and would consider it a source of potential additional vulnerabilities that I'd rather not track. Keeping them as isolated capabilities would enhance my registry experience in these disconnected spaces.