RFC: Clientside-rendered, 3-Component Registry UI

Yantrio commented 2 months ago

Abstract

The goal of this RFC is to propose the development of a comprehensive Registry UI for OpenTofu.

The Registry UI will serve as a central hub for discovering, managing, and utilizing OpenTofu providers and modules. By implementing this Registry UI, we aim to create a user-friendly, secure, and maintainable platform that will support both the OpenTofu community, fostering greater collaboration and innovation.

Summary

This proposal consists of three primary components: the Doc Store, the API, and the Frontend. Separating these components allows for more focused development and maintenance, improving the overall efficiency and quality of each part.

Doc Store:

Used to host static documentation files that have been scraped
A simple R2 bucket serving static files directly

Registry Backend:

A server-side application responsible for managing the OpenTofu registry data
Provides APIs for the Registry UI to interact with
Provides links to the content in the DocStore for the Frontend to render

Registry Frontend:

A web-based interface enabling users to discover, manage, and utilize OpenTofu providers and modules
Offers a user-friendly experience with intuitive navigation and search capabilities (powered by the Registry Backend)
Displays data provided by the Registry Backend + Doc Store
Designed to be extremely lightweight, relying on the backend for extensive processing and data management
- It should just be allowing for searching and rendering

This separation ensures that each component can be developed and optimized independently, providing a robust development approach and maintenance lifecycle, whilst still granting flexibility to change one of the components if needed, without having to re-work the entire stack.

'The Doc Store'

The Doc Store will host static documentation files for providers and modules. This static documentation will be fetched from the GitHub repositories of the providers and modules, and be normalized into a format that simplifies things. This is needed because there are currently multiple formats of documentation that are accepted.

Background: How documentation is handled today

Documentation for providers and modules is stored alongside the code in the GitHub repository. This means that for each version of provider or module that is shipped, there is a corresponding set of documentation that goes along with this.

Documentation is all written in a Markdown format that can be rendered to HTML.

This documentation can be stored in multiple locations in a repo. As far as I can see there are 2 main setups for this. To simplify things, I will call these the 'Current' format and the 'Legacy' format from here on out.

Current Format:

all contained inside /docs
- /docs/guides/<guide>.md for guides
- /docs/resources/<resource>.md for resources
- /docs/data-sources/<data-source>.md for data-sources
- /docs/functions/<function>.md for functions

Legacy Format:

all inside /website/docs
- /website/docs/guides/<guide>.[html.markdown|html.md] for guides
- /website/docs/r/<resource>.[html.markdown|html.md] for resources
- /website/docs/d/<data-source>.[html.markdown|html.md] for data-sources
- /website/docs/functions/<function>.[html.markdown|html.md] for functions

And both systems support the recently introduced cdktf docs format:

Current format:

/docs/cdktf/[python|typescript]/resources/<data-source>.md
/docs/cdktf/[python|typescript]/data-sources/<data-source>.md
/docs/cdktf/[python|typescript]/functions/<function>.md

Legacy format:

/website/docs/cdktf/[python|typescript]/r/<resource>.[html.markdown|html.md]
/website/docs/cdktf/[python|typescript]/d/<data-source>.[html.markdown|html.md]
/website/docs/cdktf/[python|typescript]/functions/<function>.[html.markdown|html.md]

Documentation Normalization

To ensure that documentation is accessible at any point in a unified manner, I propose that whilst we are ingesting the documentation data, we should normalize the filenames to match the 'Current' format as shown above. This ensures that any clients that want to consume that documentation can be simplified and don't need to have knowledge of the directory structures or filenames. And it also means that if the Doc Store is created in Cloudflare R2, that we can expose the R2 bucket directly with no extra requirements for a smart API to "know" the documentation format.

[!IMPORTANT] Whilst I am unable to find any examples of where the documentation format changes over time. I think it would be safe to assume that the documentation format could differ between each version of a provider or a module. This should be taken into account when normalizing documentation.

Documentation Ingesting

We will need to ingest documentation into the Doc Store periodically. This process could be triggered in multiple ways:

A scheduled task that runs every X hours
Triggered based on commits to the registry metadata repo (Github.com/opentofu/registry)
Triggered based on a github action on the registry metadata repo

I am unsure which is the best approach here, and we will probably want to try and keep things flexible where possible to ensure that we can switch things up later on. My initial reaction is to propose that we write code that can update individual repos instead of the entire registry all at once so that we can trigger individual ingesting in the future but make a wrapper today that ingests all repos.

This process will iterate across versions in the repository (Github releases for providers, tags for modules) and ingest the documentation, normalize the directory structure and filenames and store the documents in cloudflare R2.

[!TIP] The OpenTofu registry already contains information about each release, and we should intend to use this as much as possible so that we are not duplicating the work of scraping versions.

To fetch the documentation from the GitHub Repo, I propose that we git clone the repository and traverse each of the commits of the versions to fetch the documentation. This is much simpler than downloading the tarball for each release/tag from the GitHub API, and is less likely to result in us hitting API rate limits.

I propose that we take existing knowledge and code from the OpenTofu registry, and use a golang application and rclone to sync to r2. This reduces the amounts of unknowns.

Registry API

The registry API is the "magic glue" that is required to provide functionality to the frontend.

This API should be able to allow for the following functionality

Searching Providers/Modules (Maybe even including their resources/datasources/functions)
Serving a list of versions for providers and modules
Serving an "index" for each version of resources/data_sources/functions for providers and links to the Doc Store for each of these documents
Nice to have: Serving statistical data about providers and modules

This registry API should only need to fetch information from the Doc Store that cannot be served statically and should be kept to a bare minimum.

This could be implemented as either a cloudflare worker or just a normal application that has access to the R2 bucket of the Doc Store. We should cache all results heavily where possible to avoid repetitive access to the most popular resources too.

API Definition

To allow for decoupling of the frontend and the backend, I propose that we ship an OpenAPI or similar spec of the API. This will allow for development of the frontend and the API to be carried out in tandem. This also allows for third parties to consume information from the API easily (Including automation tooling).

Registry Frontend

This very thin react application will sit on top of the Doc Store and the API and act as a rendering engine for the information served by those 2 components. This should be kept as simple as possible to reduce load times.

By keeping this application very thin and stateless, we reduce the requirement for engineers to have knowledge of both the frontend and the backend. And it allows for development of the frontend to happen on its own.

The frontend should be crawlable by search engines and take SEO into acconut where possible. It should also have a very static routing pattern to allow consumers of the UI to send links to their peers that can last a long time. This may require some pre-planning before we write the "pretty" parts.

I propose this frontend has 3 main views: The Main Page, Search Results, Provider/Module Overview, Documentation View

The Main Page

Let's keep it simple, There is no need for much else on top of a search bar. We all know just a search bar on a page is good, however I'm open to other ideas for discoverability.

The Search Results page

This should show a list of providers and modules that match the search. In an ideal world these would be filterable by the API too, but that is a nice-to-have in the long run.

The Provider/Module overview

This page shows an overview of the provider/module, listing versions, statistics, a link to the github repo etc, and most importantly, a link to the latest version of documentation

The Documentation page

This page will render documentation for the user to consume. This documentation will be a 1:1 mapping from the markdown in the github repo to html. We should not be editing this content in any way, just purely rendering it. I propose this fetches docs in markdown format from the Doc Store and renders it clientside in react using something like remark.

Conclusion

By separating the project into three distinct components—the Doc Store, the API, and the Frontend—we ensure each part can be independently optimized and maintained. The Doc Store will host static documentation files, the API will handle data retrieval and versioning, and the Frontend will provide a user-friendly interface for interacting with the content. This structured approach will lead to a robust, secure, and maintainable platform, ultimately supporting the growth and innovation of the OpenTofu ecosystem.

cube2222 commented 2 months ago

Looks good to me in general. Here's a bunch of thoughts:

This process could be triggered in multiple ways

I think scheduled every few hours is good enough to start with.

Let's keep it simple, There is no need for much else on top of a search bar. We all know just a search bar on a page is good, however I'm open to other ideas for discoverability.

We'd most likely want to show info on how to submit new providers / modules, and how those get updated / scraped later.

We'll also want provider and module listing pages, ideally sortable, ideally by popularity. Though that might kind of be a subset of the search results page, in a sense.

Additionally, I don't think we need any kind of backend app / worker. That is, I think static files are all we need.

the search index should be rendered when rendering the static files as a single file (e.g. json). This will not be large. It will also make the search snappy, and will incur zero cost on us.
list of providers / modules should also be generated to a single file when scraping and served statically
dynamic stuff like the statistics don't necessarily need a worker; I know so far we've mostly been thinking about GitHub stats of various kinds; those can be fetched directly from github from the frontend, like we do on our homepage (opentofu.org) for github stars at the top; it's just a request to the github api, which, thanks to being on the frontend, won't encounter rate limits; it's also one indexer less to make

Yantrio commented 2 months ago

Additionally, I don't think we need any kind of backend app / worker. That is, I think static files are all we need.

I actually think you're right and I'm going to do some work on this area to figure out how we can do this nicely.

Yantrio commented 2 months ago

I've been doing some work today on putting together the DocStore and hit a few issues. Some of this I have discussed through with @janosdebugs , and some I have not. But the general concensus amongst the core team so far is that we need to scrape the docs into an R2 bucket so we can work on top of that.

For that reason, I'll like to use this comment as an addendum/expansion to my RFC above to discuss what I have found so far. I will attempt to update the initial RFC above once things have solidified more.

Update 1 : Architecture improvements

After talking to both @cube2222 and @janosdebugs I think that the correct approach here is to ditch the idea of a database and instead store "indexes" of information that can be consumed similar to an API. This will help both the frontend and the docstore by acting as a stateful representation of the current data stored in the registry.

With some rough back of the back of a napkin math, I have determined that we only update around 40 ish versions of modules and providers each day. This makes me want to split the process into 2 parts.

Part 1. Initial population of the document store. This should iterate across all git repos, clone all their versions and then scrape the documentation and parse it into a valid format for consumption in the DocStore. In theory we should only need to update this once.

Part 2. Incremental updates of the document store If we are storing an index file of what is in the document store, we can use that to instruct the incremental update process about which providers/modules need updating.

This means that we do not need to store a cache of every git repo we're consuming, or use a database/queue to handle processing. We just act upon a diff of 2 known states: The opentofu registry metadata files (in github.com/opentofu/registry), and the index json files that exist in the R2 bucket.

The downside of this is that to calculate the diff between the index in R2, and the opentofu registry, the process that will scrape the documentation will need to read from the R2 bucket instead of being instructed what has been changed. This goes against my initial proposal (quoted below)

We will need to ingest documentation into the Doc Store periodically. This process could be triggered in multiple ways:
A scheduled task that runs every X hours
Triggered based on commits to the registry metadata repo (Github.com/opentofu/registry)
Triggered based on a github action on the registry metadata repo
I am unsure which is the best approach here, and we will probably want to try and keep things flexible where possible to ensure that we can switch things up later on. My initial reaction is to propose that we write code that can update individual repos instead of the entire registry all at once so that we can trigger individual ingesting in the future but make a wrapper today that ingests all repos.

This process will iterate across versions in the repository (Github releases for providers, tags for modules) and ingest the documentation, normalize the directory structure and filenames and store the documents in cloudflare R2.

However I believe this makes the system much more flexible in the long run as it does not need to be too tightly tied to the opentofu registry github actions.

TLDR

Scrap API, Instead introduce a set of indexing json files to be used by the UI and DocStore
Split DocStore into 2 processes, initial hydration + incremental updates based on the registry metadata and the index files

Update 2. Parsing Frontmatter

To correctly build the navigation items for each of the providers' data sources, resources, functions and guides, we will need to parse the frontmatter of each document to scrape the title of the document.

This process should be done as part of indexing, and means that we will introduce 3 levels of indexing.

Index 1: Global This is a list of all providers and modules, Keep it as simple as possible, this is to enable searching in the frontend.

{
   "providers": [{ "namespace": "mynamespace", "name": "myprovider" }....],
   "modules": [{ "namespace": "mynamespace", "name": "mymodule", "target": "aws" }....],
   "lastUpdated": 1715873609,
   "otherInfo"....
}

Index 2: Version Index This is a list of all the versions that are attached to the provider/module. This will be used by the Doc Store to figure out which versions need updating. I have not yet decided if this is one large json file, or one per provider/module. I think if it's only being used by the Doc Store, then one large json file is fine.

# If stored in a single file, index.json
{
    {
   "providers": [{ 
       "namespace": "mynamespace", 
       "name": "myprovider" 
       "versions": ["0.1.0", "0.2.0"],
   },
   ....],
}

# If stored individually, <namespace>/<name>/index.json
{
    "versions": ["0.1.0", "0.2.0"],
    ....
}

Index 3: Docs Index This is a list of all the links to the documentation pages for the provider or module version. This will be used by the UI to construct a list of documentation navigation items.

# <namespace>/<name>/<version>/index.json
{
   "overview": {...},
   "nav": {...},
   "metadata": {...},
}

If we do go down this route, we should consider the size of this index as it grows, especially if we want to ensure we can search the following items:

Providers
Modules
Resources/DataSources/Functions

cube2222 commented 2 months ago

Re update 1 and the diffing logic - sounds really good and less complex.

Re version index - wouldn't it also be used to power the version selector? In that case I'd suggest index per provider/module.

Overall LGTM @Yantrio

RLRabinowitz commented 1 month ago

I read through and really like the approach (after the update comment 🙃 )

I have a few questions:

Do we think we're OK with periodic updates? Not sure if it'd be the best if it'll take a few hours for documentation for new versions to be updated. I guess that we can trigger this process via the registry's auto-sync process, but that would couple the two processes together
So when a new provider/module is added, it'll only appear in the documentation on the next scheduled run for synchronizing documentation?
I'm not sure I understand the following part: "I propose that we git clone the repository and traverse each of the commits of the versions to fetch the documentation". Does that mean cloning and then checking out specific revisions to get their documentation?

cam72cam commented 1 month ago

I've recently (in the past few months) optimized the "Generate and Sync" workflow to only focus on modules/providers with updates in the past hour. This means only the recent changes to the repository made by "Bump Versions" (run every 15 minutes) are synchronized to r2. In case of a github actions outage or other issue, we still run the full sync daily to catch any stragglers. In practice most provider and module releases are live in 15 minutes or less.

Perhaps a similar approach could be taken where we prioritize refreshing documentation when new releases are detected via the existing "git log filtered by date` approach? We could still do longer term full syncs to catch any issues. I believe this is what @Yantrio is describing above.

opentofu / registry