Note: This proposal is from myself personally, and should not be reviewed as the opinions of the OpenTofu team

Second Note: This proposal is a work in progress and I will be uploading parts as I complete them, However please feel free to comment/discuss etc in the meantime

Related to: #258

Introduction

One of the concepts core to ensuring great developer experience is to aggregate a wide array of providers and modules without burdening users to manually publish them every time. The automations and processes described in this proposal ensures the registry stays rich and updated. However, this doesn't sideline the crucial role of the users. We foresee and encourage user engagement, especially when it comes to tasks like adding GPG keys or validating content.

This proposal will consist of a few components:

1. Metadata Crawler

Dedicated to fetching provider and module versions, the Metadata Crawler will ensure that the registry is always up-to-date with the latest versions available. It will scour the repositories and keep a tab on any new releases, updates, or changes.

Link: https://github.com/opentofu/opentofu/issues/722#issuecomment-1761658302

2. Documentation Crawler

A crucial aspect of any tool or platform is its documentation. The Documentation Crawler will be tasked with populating the website's documentation section. It will fetch, process, and structure the documentation content from the respective repositories, ensuring users have access to accurate and comprehensive guides and references.

Link: https://github.com/opentofu/opentofu/issues/722#issuecomment-1761669467

3. GPG Key Validation

Security is paramount. The GPG Key Validation component will ensure that the content added to the registry is authenticated and comes from a trusted source. It will verify the GPG keys associated with the providers or modules, adding an extra layer of security and trust to the platform.

Link: https://github.com/opentofu/opentofu/issues/722#issuecomment-1761669990

4. v1 API

Acting as the backbone for data delivery, the v1 API will sit on top of the data generated by the Metadata Crawler. It will also integrate data from the GPG Key Validation component. This API will be designed to ensure quick and reliable data access for any tools or platforms interfacing with the OpenTofu Registry.

Link: https://github.com/opentofu/opentofu/issues/722#issuecomment-1761670254

5. Other Considerations

Link: Coming soon!

Summary

By adopting this componentized structure, I aim to create a cohesive yet modular system, where each component can evolve independently, reducing interdependencies and ensuring a robust and scalable registry.

Metadata Crawler Strategy for OpenTofu Registry

Introduction

One of the concepts core to ensuring great developer experience is to aggregate a wide array of providers and modules without burdening users to manually publish them every time. This automation ensures the registry stays rich and updated. However, this doesn't sideline the crucial role of the users. We foresee and encourage user engagement, especially when it comes to tasks like adding GPG keys or validating content.

This proposal will focus on the automation side of crawling existing information to provide access to a multitude of providers and modules.

Note: This proposal is from myself personally, and should not be reviewed as the opinions of the OpenTofu team

Existing Issues and Problem Set

Rate Limit Hurdles: The current system's dependence on GitHub's GraphQL API is becoming a bottleneck. With an ever-growing list of over 20,000 providers and modules, without any sort of strategy, we will easily be hitting GitHub's rate limits, which impedes any real-time update aspirations.
Handling Staleness: Not every repository gets updated at the same pace. While some are buzzing with activity, others haven't seen a change in a while. We will need a smarter way to prioritize our updates based on this varied activity.
Direct Fetch Drawbacks: Pulling data straight from GitHub releases has its issues. It's not the fastest method, and it's prone to occasional hiccups, mainly HTTP errors. We're aiming for a more reliable approach.
Balancing Automation and User Inputs: While automation is great for efficiency, we value the unique touch users bring when they publish and validate their content. It's about ensuring the registry is comprehensive but also trustworthy.

Overview

To address these challenges, I'm proposing a crawler system tailored to efficiently gather provider and module data while smartly managing rate limits and repository staleness. A primary objective in designing this system is to ensure lightning-fast response times for the API. The most effective way to achieve this speed is by serving static content that's pre-processed and ready for delivery, eliminating the need for on-the-fly logic or transformation. Think of it as a sophisticated caching mechanism, where the data is always ready to be served as soon as a request hits the API.

Key Components

Crawler Engine: At the heart of this proposal is a crawler engine that will navigate through repositories, fetching relevant data. t's designed to be resilient, adept at handling common errors, and recovering gracefully from any hiccups.
Staleness Analyzer: A component dedicated to continually assess the activity levels of repositories. By analyzing factors like the last update timestamp, release frequency, and user interactions, it will assign a staleness score to each repository.
Data Storage Strategy: Once data is fetched, we need an efficient and reliable storage mechanism. This will ensure that users get quick responses when they query the registry.

By integrating these components, my goal is to create a system where the OpenTofu Registry is not just comprehensive, but also efficient, fast, and trustworthy.

Sure, let's delve into the storage aspects.

Storage Strategy: Balancing Freshness and Efficiency

The backbone of any system, especially one that relies on swift and reliable data retrieval, is its storage mechanism. While the crawler fetches data, it's the storage system that ensures this data is available, fresh, and swiftly delivered to users. This section outlines my proposed strategies for storing both staleness values and the static content ready for the API.

Storing Staleness Values

The concept of staleness is pivotal in determining which repositories require updates and how frequently. Staleness encapsulates various factors, from the last update timestamp to the overall activity and user interaction with a repository. To effectively measure and utilize staleness, a robust storage strategy is paramount. Below, I detail two potential solutions, each with its own set of advantages tailored for this specific requirement.

1. PostgreSQL Database

Overview

PostgreSQL, a powerful open-source relational database system, boasts of reliability, extensibility, and SQL compliance. It excels at handling structured data, offering a solid foundation for managing the intricacies of repository staleness.

Why

Its robust indexing capabilities speed up the querying process, especially when determining which repositories need updating. Its relational nature ensures structured and organized data storage, essential for handling staleness values methodically.

How

Each repository can have a dedicated row with attributes like last_updated, release_frequency, and staleness_score. Regular recalibration of these scores ensures the data remains dynamic and reflective of the repository's status.

Pros:

Widespread Familiarity: PostgreSQL is well-known and widely adopted.
Cost-Effective: Options like Aurora Serverless on AWS offer cost optimization.
Versatility: Not restricted to AWS, allowing flexibility in deployment.
Speed: Known for its fast querying capabilities, especially with proper indexing.

Cons:

Operational Overhead: Requires regular backups, updates, and maintenance.

2. Time-Series Database (e.g., Amazon Timestream)

Overview

Time-series databases are tailored to manage and analyze time-stamped data. Given staleness is inherently a factor of time and repository activity, Amazon Timestream and similar databases can offer unique advantages.

Why

Its architecture is built for time-centric data, making it particularly efficient in recording and querying timestamped events. This is invaluable when analyzing update patterns over extended periods.

How

Every repository update is logged with a timestamp. Over time, this data can be used to glean insights into activity patterns, aiding in refining the staleness score and ensuring it accurately represents a repository's update frequency.

Pros:

Tailored for Time Data: Specifically designed for time-centric data.
Scalability: Built to handle vast volumes of time-stamped data.

Cons:

Potential Overengineering: Might be more than necessary for just tracking staleness.
Learning Curve: Might have a steeper learning curve compared to PostgreSQL.
Maintenance: Unique challenges might arise that teams should be prepared for.

In summary, while a time-series database has specialized advantages, the potential for overengineering remains a concern. The familiarity and ease of maintainability associated with PostgreSQL make it a strong contender, particularly when considering the long-term operation and flexibility.

Storage of Static Content

The primary goal of storing static content is to enhance the speed of the API by serving pre-processed data, effectively acting as a cache. This ensures users obtain the required data swiftly without waiting for real-time processing or transformation. The following storage solutions can be employed to achieve this goal.

1. Amazon S3 Buckets/Cloudflare R2/Other s3 compatible service

Overview

Amazon S3 (Simple Storage Service) and Cloudflare R2 provide scalable object storage solutions. Both these platforms are designed to hold vast amounts of static content, from files to structured datasets. The prevalence of the S3 API in storage solutions offers adaptability, allowing you to be flexible in choosing where and how to cache the data.

Why

Both S3 and R2 offer high durability and availability. Files stored are accessible via unique URLs, ensuring quick retrieval. The shared S3 API compatibility between the two ensures a seamless transition or integration if required.

How

Static content, once processed, can be uploaded as files to an S3 bucket or Cloudflare R2 storage. Upon an API request, the necessary file's URL can be provided for direct access, ensuring swift response times.

Pros:

High Durability and Availability: Data is redundantly stored across multiple facilities.
Scalable: Can handle vast amounts of data without performance degradation.
Flexibility: The S3 API's ubiquity ensures compatibility with various platforms and tools.
Cost-Effective with R2: Cloudflare R2 offers no egress fees, potentially leading to cost savings.

Cons:

Data Retrieval Costs with S3: Amazon S3 has associated data egress costs which can accumulate based on the amount of data transferred out.
Complexity in Management: Both solutions require proper configuration and policies for secure data storage.

By leveraging the flexibility of the S3 API, you can optimize for cost, speed, and flexibility, picking the storage solution that best aligns with your needs and budget. This might mean starting with S3 for its familiarity and then transitioning to R2 for its cost-saving benefits, or using a combination of the two.

2. Amazon DocumentDB

Overview

Amazon DocumentDB is a managed NoSQL database service that supports document data structures and offers MongoDB compatibility.

Why

It's designed to handle JSON-like documents, making it suitable for structured static content. The scalability and managed backups ensure consistent performance and data safety.

How

Processed static content is stored as documents within collections in DocumentDB. API requests would query the relevant document and serve it, maintaining fast response times.

Pros:

Scalable: Can handle large datasets while offering consistent performance.
Managed Backups: Automated backups ensure data safety.
Compatibility: Offers MongoDB compatibility, easing potential migrations.

Cons:

Cost: Can be pricier than other solutions, especially for large datasets.
Overhead: Requires regular maintenance, updates, and monitoring.

3. Other Static Content Hosting Services

Overview

Several platforms, like Netlify or Vercel, specialize in hosting static content, optimizing for speed and performance.

Why

These platforms are built for swift delivery of static content, ensuring users experience minimal latency.

How

Processed static content is deployed to these platforms. They handle the distribution, ensuring data is served swiftly to API requestors.

Pros:

Performance Optimization: Built specifically for swift static content delivery.
Ease of Use: Simplified deployment processes and automated optimizations.

Cons:

Limited Control: Might not offer the same level of control as other hosted solutions.
Cost Structure: Pricing can be based on traffic, which might not be ideal for high-demand scenarios.

When choosing a solution, it's crucial to weigh the speed of content delivery against the costs and management overhead. The ideal choice would offer a balance, ensuring users get the data they need quickly, without incurring unsustainable expenses or operational burdens.

Caveat

The decision on which storage solution to adopt—whether it's Amazon S3/Cloudflare R2, DocumentDB, or another static content hosting service—should be primarily driven by the requirements and architecture of the v1 API. The crawler's primary role is to collect and process data, and its optimal operation is certainly influenced by the chosen storage solution. However, the broader system architecture and the specific demands of the v1 API could be the decisive factors in selecting the most suitable storage method. It's essential to view the storage decision in the context of the entire system, not just in relation to the crawler's functionality.

Implementation Details

1. Crawling

Overview:
The primary function of the crawler is to iterate over a selection of providers and modules based on the available GitHub rate limit budget. Items are chosen in order of their staleness, ensuring that the most out-of-date items are refreshed first.

Flow:

sequenceDiagram
    participant Trigger as Trigger
    participant Crawler as Crawler
    participant Github as GitHub
    participant DB as Database

    Trigger->>Crawler: Initiate Crawling
    Crawler->>Github: Check GitHub Rate Limit Budget
    Github-->>Crawler: Remaining Rate Limit
    Crawler->>DB: Fetch Items Based on Staleness & Budget
    DB-->>Crawler: Provide Selected Items
    par For each selected item
      Crawler->>Github: Fetch Releases
      Github-->>Crawler: Return Release Data
      Crawler-->>Crawler: Generate static content
      Crawler-->>Storage: Store static content to be served by API
      Crawler->>DB: Reset Staleness to 0 for item
    end

2. Manual Update

Overview:
This flow facilitates an immediate update, allowing for real-time refreshes when needed, especially handy for specific providers or modules that may have undergone significant changes.

Flow:

sequenceDiagram
    participant Trigger as Trigger
    participant Crawler as Crawler
    participant Github as GitHub
    participant DB as Database

    Trigger->>Crawler: Request Manual Update
    Crawler->>Github: Check GitHub Rate Limit Budget
    Github-->>Crawler: Remaining Rate Limit
    Crawler->>Github: If Budget Allows, Fetch Latest Data
    Github-->>Crawler: Return Latest Data
    Crawler->>DB: Force Full Refresh & Reset Staleness to 0
    DB-->>Crawler: Confirm Update

3. Update Staleness Values Over Time

Overview:
After the primary crawling process, a job kicks in to recalibrate the staleness values of all items in the database. This recalibration considers various factors, such as the last update time, weighting, download counts, etc.

Flow:

sequenceDiagram
    participant Crawler as Crawler
    participant DB as Database

    Crawler->>DB: After Primary Crawling, Start Staleness Update
    DB-->>Crawler: Provide All Providers/Modules
    Crawler->>DB: Recalculate Staleness Based on Heuristics & Update Values
    DB-->>Crawler: Confirm Update

These diagrams and workflows provide a structured view of how the crawler functions, how manual updates occur, and how staleness values are kept in check over time.

Implementation Choices

When it comes to bringing the crawler strategy for the OpenTofu Registry to life, the implementation infrastructure plays a pivotal role. While the design decisions emphasize efficiency, scalability, and cost-effectiveness, the underlying technology stack can significantly influence these outcomes. Two leading contenders in this domain are AWS Lambda and Docker. Both offer distinctive advantages, tailored to varying use cases. This section delves deep into these options, weighing their pros and cons, to guide the optimal path forward.

AWS Lambda

Overview: AWS Lambda is quick and easy to get started with, and known by many. You can execute code in response to specific events without managing the underlying infrastructure. It's a serverless compute service, tailored for quick, small-scale operations.

Pros:

Cost-Effective: Lambda's pay-as-you-go model ensures you're only billed for the exact amount of resources your operations consume. No charge for idle time.
Scalability: AWS Lambda functions are inherently scalable. AWS manages the scaling automatically, running code in response to each trigger individually.
Easy Integration with AWS Services: Being an AWS service, Lambda seamlessly integrates with other AWS services like Amazon S3, DynamoDB, and more. This can be especially beneficial if the crawler's data sources or other components reside within the AWS ecosystem.
Event-Driven: Lambda is designed to use events like changes to data in an Amazon S3 bucket or updates in a DynamoDB table to trigger the execution of code. This could be leveraged to trigger crawling tasks.

Cons:

Runtime Limitation: Lambda functions have a maximum execution time of 15 minutes. This might be limiting if a single crawling task takes longer.
Memory and Storage: Lambda has preset limits on memory and temporary storage.
Cold Starts: If a Lambda function isn't called for some time, it can experience an initial latency, known as a cold start.

Docker Containers

Overview: Docker provides a platform to develop, ship, and run applications inside containers. It's like virtualization but more lightweight. Containers bundle up the application code and its dependencies, ensuring consistency across different environments.

Pros:

Portability: Docker containers can run anywhere – on a developer's local machine, on physical or virtual machines in a data center, cloud providers, etc. This is especially beneficial if there's a plan to move away from AWS or adopt a multi-cloud strategy.
Consistency: Docker ensures that the application runs the same, irrespective of where the container is being run.
Flexibility: Docker containers can be configured to use specific amounts of CPU, memory, and other resources.
Microservices Architecture: Docker fits perfectly in microservices architectures, allowing each component of an application to run in its container.

Cons:

Overhead: Running applications in containers introduces a slight overhead.
State Management: Containers are stateless. Managing and storing state can be tricky and often requires integrating with other services or tools.
Complexity: Setting up, managing, and orchestrating containers might introduce additional complexity compared to serverless functions.

Choices Summary: The choice between AWS Lambda and Docker hinges on the specific requirements and constraints of the crawler system. Lambda offers a simple, cost-effective, and scalable solution, ideal for sporadic tasks. However, its limitations in runtime and memory could be constraining. On the other hand, Docker offers more flexibility and portability, at the cost of added complexity.

Future Challenges and Considerations

As with any dynamic and expansive system, the OpenTofu Registry crawler strategy may face challenges as it evolves. Anticipating these challenges and planning for them in advance can help in ensuring the robustness and adaptability of the system. Here are some potential future problems and considerations:

Data Integrity and Consistency: As the data grows, ensuring that the crawled data remains consistent and error-free becomes paramount. There might be a need for periodic validation checks or mechanisms to handle corrupted data.
Dependency on External Services: The current reliance on platforms like GitHub and their rate limits could be a limiting factor. Diversifying the sources or having backup strategies can mitigate this.
Security Challenges: As the system becomes more popular, it might become a target for malicious actors. Ensuring the security of the stored data, especially with public keys and signatures, will be crucial.
User Feedback and Expectations: As more users interact with the registry, their feedback and expectations might lead to new requirements. Handling feature requests, bugs, and other feedback efficiently will be essential.
Cost Management: With the growth in data and traffic, costs associated with storage, data transfer, and compute resources might increase. Optimizing for cost without compromising on performance will be a balancing act.

Addressing these challenges head-on and continuously gathering feedback will be key to ensuring the long-term success and reliability of the OpenTofu Registry crawler.

Conclusion

In essence, the proposed crawler strategy for the OpenTofu Registry is designed to automate the aggregation of a vast array of providers and modules, ensuring a seamless and enriched developer experience. By balancing automation with user inputs, considering diverse storage solutions, and contemplating multiple implementation paths, this strategy aims to foster a robust, reliable, and efficient system.

Documentation Crawler Extension for OpenTofu Registry

Introduction

The presence of well-maintained documentation is paramount. For engineers, the presence and the discoverability of provider documentation in the registry is extremely important. The metadata crawler I proposed for the OpenTofu Registry effectively fetches provider and module details/metadata. However, there's a string need to further enhance this by systematically aggregating associated documentation. This not only enriches the data but also amplifies the usability of the platform.

The vision is to design an API that serves up version-specific markdown documentation. This approach ensures that the OpenTofu Registry remains lightweight and agile. By offloading the rendering responsibility to the client side, we can focus on the core competence of fetching, storing, and serving the documentation in its raw, markdown format. Such a design decision also offers flexibility to the consumers of the API, allowing them to render the documentation in ways best suited to their application or platform.

Given the convention-driven structure of Terraform documentation, the crawler can be optimized to fetch information from predefined paths and structures. This ensures that the documentation pulled is comprehensive, up-to-date, and consistent across different providers.

The ultimate objective is twofold: firstly, to provide developers with a seamless integration experience by offering comprehensive documentation, and secondly, to keep the backend operations streamlined by serving raw markdown, deferring the rendering operations to the client side.

Requirements

Version-Specific Documentation: For each version of every provider (not modules), the crawler needs to render the corresponding documentation.
Git-Based Fetching: By convention, the requisite documentation is generally located in the /docs/ directory at the root of the repository. This provides a standard path for the crawler to follow.
Structured Documentation: Terraform documentation follows a well-defined structure, which includes provider indices, resources, data sources, and guides. Each of these components is housed in specific locations, like docs/resources/ for resources, and has a predetermined format. For instance, resource documentation should have headers like "Example Usage", "Argument Reference", and "Attribute Reference".
Flexible Rendering: Some providers might have their documentation formatted for the legacy terraform website publishing. In such cases, the crawler should also support fetching from website/docs/ with .html.markdown or .html.md extensions.

Navigation and Hierarchy: The crawler should respect the inherent hierarchy of the documentation, ensuring that the navigation rendered on the Terraform Registry mimics the structure of resources, data sources, guides, and their potential subcategories.

Documentation Crawling Strategy

Building on the foundational principles set out in The metadata crawler, and it's section titled Storage Strategy: Balancing Freshness and Efficiency, the objective here is to adapt and extend a similar strategy to handle documentation.

Strategy Overview

Staleness at Release Level: While the previous issue dealt with staleness at a repository level, in the context of documentation, we're focusing on the granularity of individual releases. Each release might have its unique documentation updates, and we aim to ensure the latest and most relevant documentation is always available. This is because newer or more popular versions may require to be crawled before others.
Fetching and Storing: The crawler is tasked with fetching the necessary documentation files from each release of a repository. Leveraging the convention that documentation resides in the /docs/ directory at the root of the repo, the crawler will effectively navigate to and collect the necessary markdown files.
Static Content Storage: Once fetched, these documentation files are stored in a solution like Amazon S3 or a similar static content storage service. The key advantage here is the ability to serve this content directly via the API. By doing this, we drastically reduce the time and computational resources needed to fetch on-demand, leading to faster response times and a more streamlined user experience.

This proposed approach not only ensures that the documentation served is always up-to-date but also optimizes the efficiency of our system by minimizing real-time fetches and computations.

Operational Dynamics: Front-Heavy Initialization

The implementation of the documentation crawling strategy for the OpenTofu Registry, especially in its early stages, is inherently front-heavy. This trait is shaped by the initial requirements and the nature of the task at hand. Here’s why:

Initial Onboarding of Providers

At the outset of the OpenTofu Registry's life, there exists a vast array of providers, each with its own set of versions and corresponding documentation. The crawler's primary task is to fetch, process, and store the documentation for every version of every provider. Given the sheer volume of data to be handled, this initial phase is intensive and time-consuming, but over time this process should be much quicker as it will just need to keep things up to date.

However: It is worth noting that if we choose to allow mirror hosting of this process, that this documentation fetching may not just happen once overall, but once per hosted version of the registry.

Version-specific Staleness

Unlike the staleness handling at the repository level (as discussed in The metadata crawler proposal), the documentation crawler has to manage staleness at a more granular level — per release. This means that each version of a provider is treated as a unique entity, with its own staleness metric. The challenge here is not just to track the staleness of hundreds of repositories, but to manage the staleness of potentially thousands of individual releases.

Benefits of Front-Heavy Approach

Efficient Updates: Once the initial bulk of data is processed and stored, updates become more manageable. As new versions of providers are released, only those need to be fetched and processed. This significantly reduces the workload after the initial setup.
Predictable Workload Reduction: As the registry matures, the frequency of new version releases will likely stabilize. This means that, over time, the number of new versions that need to be crawled will be relatively constant, leading to a predictable and manageable workload.
Optimized User Experience: By bearing the brunt of the work upfront, the registry ensures that users have immediate access to documentation. The front-heavy approach prioritizes user experience by ensuring that, once initialized, documentation retrieval is fast and efficient.

In conclusion, while the front-heavy nature of the documentation crawler presents initial challenges, it is a strategic decision. It ensures that the OpenTofu Registry is well-prepared to serve its users efficiently from the get-go and can smoothly handle future updates.

Advantages of Separate Documentation Crawling

Handling documentation crawling separately from metadata parsing in the OpenTofu Registry offers distinct benefits:

1. Specialized Efficiency

Purpose-Driven Design: Each crawler can be tailored for its specific task. The metadata crawler is optimized for rapid parsing of provider/module metadata, while the documentation crawler is structured for fetching and storing larger documentation files.
Optimized Resource Allocation: By dividing the tasks, resources are allocated more effectively. The documentation crawler, which might need more storage and bandwidth, won't strain the resources meant for metadata parsing.

2. Fault Isolation

Reduced Impact of Failures: If an issue arises in one crawling process, it doesn't directly affect the other. For example, a problem fetching documentation doesn't compromise metadata integrity.
Easier Troubleshooting: With distinct processes, pinpointing and resolving problems becomes more straightforward.

3. Scalability and Flexibility

Independent Scaling: As the number of providers/modules grows, each crawler can be scaled based on its specific needs.
Adaptive Strategy: Modifying the strategy or introducing new features for one crawler doesn't necessitate changes in the other.

4. Streamlined Maintenance

Modular Updates: Enhancements or patches can be rolled out to one crawler without disrupting the other's operations.

5. Hosting Flexibility

Decoupled Frontend and Backend: A distinct advantage is the option for someone to host a registry without necessarily hosting the frontend. This means they could have tofu work seamlessly with the backend registry, without needing a dedicated UI.

In essence, separating metadata parsing from documentation crawling not only ensures higher efficiency and reliability but also provides flexibility in hosting options. This modular approach aligns with the goal of delivering a versatile and efficient experience in the OpenTofu Registry.

Implementation Details

Let's break down the documentation crawling process and represent it in sequence diagrams.

Flow 1: Documentation Crawling

Staleness Check for Documentation:
- The system checks the database to determine the staleness of documentation for each provider version, helping to prioritize which documentation needs fetching or refreshing.
Existence Check:
- Before accessing the repository, the system checks the storage solution to see if the documentation for the particular provider version already exists.
Fetch Repository:
- If the documentation is either stale or non-existent in the storage, the crawler accesses the repository of the targeted provider version.
Access and Store Documentation:
- Navigate to the /docs/ directory (or the legacy website/docs/).
- Retrieve the markdown files and any associated content.
- Store them in the designated storage solution (e.g., S3).
Update Staleness Value:
- After successful retrieval and storage, reset the staleness value for that particular provider version's documentation.

Sequence Diagram for Documentation Crawling:

sequenceDiagram
    participant C as Crawler
    participant DB as Database
    participant S as Storage (e.g., S3)
    participant R as Repository

    C->>DB: Check staleness of provider version's docs
    DB-->>C: Return staleness value
    C->>S: Check existence of documentation
    S-->>C: Confirm existence status
    alt Documentation is stale or non-existent
        C->>R: Fetch documentation from /docs/
        R-->>C: Return markdown files
        C->>S: Store markdown files
        S-->>C: Confirm storage
        C->>DB: Update staleness value to 0
    else Documentation is current and exists
        C->>+DB: No action needed
    end

This flow ensures that only necessary fetches are made, optimizing the process where possible, but still has a staleness check to ensure that manual re-fetching (or periodic refetching) can occur.

Flow 2: Manual Trigger for Documentation Update

Manual Update Trigger:
- An external action or user request initiates a manual update for specific provider version documentation.
Existence Check:
- The system checks the storage solution to see if the documentation for the particular provider version already exists.
Fetch Repository:
- Regardless of the outcome of the existence check, the crawler accesses the repository of the targeted provider version to get the latest version of the documentation.
Access and Store Documentation:
- Navigate to the /docs/ directory (or the legacy website/docs/).
- Retrieve the markdown files and any associated content.
- Overwrite or store them in the designated storage solution (e.g., S3).
Update Staleness Value:
- After successful retrieval and storage, reset the staleness value for that particular provider version's documentation.

Sequence Diagram for Manual Trigger:

sequenceDiagram
    participant U as User
    participant C as Crawler
    participant R as Repository
    participant S as Storage (e.g., S3)

    U->>C: Manual trigger for documentation update
    C->>S: Check existence of documentation
    S-->>C: Return existence status
    C->>R: Fetch documentation from /docs/ regardless of existence
    R-->>C: Return markdown files
    C->>S: Store or overwrite markdown files
    S-->>C: Confirm storage
    C->>U: Notify update completion
    C->>DB: Update staleness value to 0

By allowing manual triggers, you provide flexibility and assurance that documentation can always be updated to the latest version when needed, and manually re-fetched if issues occur that require human intervention.

Implementation Choices

For the documentation crawler, there are two primary routes we can take for the technical implementation:

AWS Lambda:
- Using AWS Lambda allows for a serverless architecture, which can be cost-effective as you only pay for the compute time you consume. Lambda integrates well with other AWS services and can be easily triggered by various AWS events. This makes it suitable for tasks that need to be run periodically or in response to specific events.
Docker Containers:
- Docker containers offer flexibility in deployment and can be run both on cloud platforms and on-premises. By packaging the crawler within a container, it ensures that the environment is consistent wherever it's run. This decouples the solution from AWS, offering portability across different cloud providers or even in hybrid environments.

For an in-depth comparison and considerations between these two options, please refer to the Implementation Choices section in the metadata crawler proposal.

Frontend API Considerations

While the primary focus of this proposal revolves around backend operations and data management, it's essential to acknowledge that the end result aims to serve developers and users. Therefore, a frontend API is inevitable in the grand scheme of things.

The envisioned API would essentially behave as a conduit, delivering static content that sits atop storage solutions like S3 or Cloudflare R2. By ensuring that the content is pre-processed and ready for delivery, the API can be optimized for speed and responsiveness, akin to serving any static resource, be it an image, a stylesheet, or a script.

However, the intricacies of designing, developing, and maintaining this frontend API warrant a detailed discussion and will be the primary focus of a subsequent RFC. The future document will delve into the design principles, scalability considerations, security practices, and other relevant aspects of building a robust and user-friendly interface.

GPG Key Management for OpenTofu Registry

Introduction:

GPG keys are the cornerstone of trust in the Terraform and OpenTofu ecosystem. They serve as the validation mechanism for provider signatures. For the OpenTofu Registry to maintain its integrity and trustworthiness, managing these GPG keys extremely important. These keys validate the authenticity of provider artifacts on GitHub, establishing a trusted link between the registry and the provider source. However, the management of these keys must remain with the provider authors, ensuring accountability. Furthermore, the availability and auditability of these keys should be transparent and readily accessible.

Why We Need This:

The introduction of GPG key validation in Terraform and OpenTofu has been pivotal in ensuring a secure infrastructure-as-code environment. However, a challenge arises when these keys are missing or not readily accessible. To address this, OpenTofu has temporarily disabled GPG validation until a robust mechanism for GPG key management is established. This decision ensures uninterrupted service but emphasizes the need for a permanent solution. For more context, see RFC: Provider GPG Key Handling in OpenTF Registry: Temporary Non-Failure Response to Missing Public Keys #266.

Proposed Solution:

To streamline the GPG key management process, I propose the following:

Centralized GitHub Repository for GPG Keys: Create a dedicated GitHub repository for GPG keys. This repository will act as the central hub where end users can upload their public keys.
Pull Request (PR) Mechanism for Key Uploads: Users can introduce their GPG keys by creating a pull request to this repository. The PR should place the keys within a folder named after the respective organization, allowing for structured storage.
Rigorous Review Process: Before merging any PRs, a meticulous validation process will be in place. This review will ensure that the keys are genuine and belong to the respective provider or module. This process can be designed later and evolved over time. I would like to keep discussions here to be just about the technical solution
Integration with OpenTofu APIs:
- v1 API Integration: The v1 API will source GPG key data directly from this GitHub repository (or its mirrored location). This ensures that the GPG key data in the provider version download response remains up-to-date.
- Website API Integration: With the availability of validated GPG keys, the Website API can flag providers as "verified" or use other markers to indicate the authenticity of a provider.

Key Revocation and its Challenges with the PR Process

Key revocation is an essential aspect of any GPG management system. It ensures that if a key gets compromised, it can be quickly invalidated, preventing potential misuse. The current approach of using a PR process, while robust and thorough, has its limitations when it comes to swift key revocation.

Human Dependency: The PR process relies heavily on human intervention. Key revocation requests need to be manually reviewed and approved. This dependence becomes a bottleneck, especially during off-hours or if the responsible parties are unavailable.
Time Sensitivity: In cases where a key is compromised, every minute counts. A delayed revocation could lead to potential security breaches. With the PR process, the time between a revocation request and its approval can vary, making it less reliable for urgent situations.
Break-Glass Situations: For instances where immediate action is required, there isn't a clear "break-glass" mechanism in place. A swift, automated, yet secure way to handle these emergencies is vital.

Given these challenges, it's crucial to explore and implement mechanisms that allow for quicker key revocation while ensuring the integrity of the system. Whether it's through automated checks, a dedicated emergency contact system, or other means, enhancing the key revocation process is important.

Alternatives Considered: Why Not Use Other GPG Key Storage Systems?

While formulating this proposal, several GPG key storage systems, such as OpenPGP, Keybase, and others, were evaluated. It's worth addressing why these weren't chosen as my primary solution:

Reliability on External Services: Using external services means we're dependent on their uptime, policies, and potential changes. Any disruption or alteration in their service can affect our registry's operations.
Security Bar is Lowered: By relying on external systems, we cede a degree of control over the validation process. While these systems might maintain high standards, we cannot guarantee a 100% validation rate, potentially leaving room for discrepancies.
Advantage of External Systems - Key Revocation: One notable advantage of systems like OpenPGP and Keybase is the ease of key revocation. They have established mechanisms that make key revocation straightforward and quick. However, the overall benefits of maintaining our own system outweigh this convenience, especially when considering the critical role of GPG keys in ensuring trust and security in the OpenTofu ecosystem.

Conclusion:

The adoption of a centralized GPG key management system, underpinned by a transparent GitHub repository and rigorous validation process, will fortify the trustworthiness of the OpenTofu Registry. It not only reinstates GPG validation in OpenTofu but also ensures that key revocation and updates are efficient and transparent. Looking ahead, the focus will also be on devising a mechanism for swift GPG key revocation, ensuring the continued security and reliability of the OpenTofu ecosystem.

OpenTofu v1 API

Introduction

The v1 API is pivotal for tofu to resolve providers and modules. Anticipating high traffic, it's engineered for speed, serving pre-generated content from the crawler. For detailed protocols and operations, refer to the Module Registry Protocol and the Provider Registry Protocol for more information. Efficiency and responsiveness are extremely important for this API's success and therefore are a hard requirement.

Designing an Agnostic Content Delivery System for OpenTofu Modules

Objective

Develop a streamlined and efficient system to route URLs to their corresponding static content, ensuring minimal latency and maximal speed, irrespective of the storage backend being S3, R2, or any document store.

Key Design Choices

Minimalistic Approach:
- Rationale: The fewer the components involved in a request-response cycle, the faster the response time.
- Implementation: Design the architecture to be as lean as possible. Eliminate unnecessary middle layers or intermediaries.
Storage-Agnostic Backend:
- Rationale: Flexibility to switch between different storage solutions without overhauling the system.
- Implementation: Use generic APIs or interfaces that can be implemented for any storage backend. When a request comes in, the system should know how to fetch the content from S3, R2, or any other storage seamlessly.
Direct CDN Integration:
- Rationale: CDNs enhance content delivery speed by caching content close to the end-users. A direct connection between the CDN and storage can reduce latency.
- Implementation: Instead of having a lambda or server in between, integrate the CDN (like Cloudflare or CloudFront) directly with the storage backend. The CDN would fetch content from storage only when it's not available in its cache.
Special Handling for Module Download Link:
- Rationale: The module download link content is in the header, not in the response body. This unique requirement necessitates special handling.
- Implementation: If using a CDN or edge compute like Cloudflare Workers or Lambda@Edge, the function/script should be tailored to fetch the module's URL from the header and serve it appropriately. This might involve rewriting headers or redirecting requests based on header values.

Considerations

Scalability: As the number of OpenTofu modules grows, the system should scale horizontally. Using a CDN inherently supports this, as CDNs are designed to handle large traffic volumes.
Security: Ensure that the storage backend, whether S3, R2, or any other, has proper security configurations. Unauthorized access should be strictly prohibited. If using Cloudflare Workers or Lambda@Edge, additional security checks or headers can be incorporated.
Cache Control: Define explicit caching rules. Determine the duration content stays cached, when it should be invalidated, and when the CDN should re-fetch from the original storage.
Monitoring and Alerts: Implement monitoring for request/response times, error rates, and other crucial metrics. Set up alerts for anomalies.
Cost: Serving content directly from storage like S3 might incur higher costs due to egress charges. Using a CDN can help in cost reduction as requests served from the cache do not touch the original storage, thus avoiding egress charges.

Exploring Content Delivery Solutions for OpenTofu Modules

Limitations:

Direct CDN on top of a Storage Bucket:

While it would be ideal to have a direct CDN layer like CloudFront or Cloudflare sitting on top of a storage solution like S3 or R2, this isn't feasible for OpenTofu. The primary hindrance is the module download link requirement. Specifically, the module download link content is located in the header, not the response body. This unique factor necessitates special handling which a direct CDN integration might not support out-of-the-box.

Potential Solutions:

If possible, I believe it would be best if we attempt to solve the problem of both modules and providers in the same way. This ensures that the complexity stays the same, instead of introducing 2 solutions just due to the limitation of the modules.

1. Lambda@Edge with AWS Caching:

Overview: Lambda@Edge allows you to run code in response to CloudFront events. This solution involves pairing Lambda@Edge with CloudFront to serve content.

Pros:

Flexibility: Lambda@Edge can read headers and modify them, making it suitable for the module download link requirement.
Integrated with AWS Ecosystem: If other parts of the system are on AWS, this solution will seamlessly integrate.
Performance: With CloudFront caching, frequently accessed content is served faster to the end-users.
Storage Backend: Can be easily integrated with S3 or any other AWS storage solution. With some additional configuration, it can also interface with external storage solutions.

Cons:

Cold Starts: Lambda functions can experience cold starts which might add to the latency.
Cost: Lambda@Edge can be more expensive than standard Lambda due to its global replication nature.

2. Cloudflare Workers:

Overview: Cloudflare Workers let you run JavaScript at the edge, close to the end-user. They can be used to intercept and modify requests and responses.

Pros:

Edge Computing: Since the code runs at the edge, latency is typically lower.
Header Manipulation: Cloudflare Workers can easily handle header-based requirements like the module download link.
Storage Backend Flexibility: Workers can interface with various backends, including Cloudflare's own KV store, external systems, or even AWS S3 with proper configuration.
Global Network: Cloudflare has a vast global network, ensuring content is served quickly from a nearby location.

Cons:

Learning Curve: If one is not familiar with Cloudflare's ecosystem, there might be an initial learning curve.
Cost: While Workers are generally cost-effective, heavy usage can increase costs.

Final Thoughts:

Both Lambda@Edge and Cloudflare Workers provide the flexibility and capability needed to serve OpenTofu modules efficiently. The choice between them depends on factors like existing infrastructure, familiarity with the platforms, budget considerations, and specific performance needs. Regardless of the choice, both solutions offer the versatility to work with different storage backends, ensuring a flexible and robust system.

Other Considerations

Cost of Running

The financial implications of deploying and maintaining the system are paramount. We must consider the expenses associated with the chosen AWS services, data storage costs, potential network transfer fees, and any additional services or plugins. An efficient design may minimize costs, but as the system scales, expenses can increase significantly.

Developer Experience

The ease with which developers can interact with, understand, and contribute to the system greatly influences the project's success. Factors like clear documentation, a well-defined development environment setup, intuitive API endpoints, efficient debugging tools but most importantly, a great local developer experience play a critical role.

Maintainability

The long-term sustainability of the system hinges on its maintainability. This includes the ease of updating components, implementing new features, fixing bugs, and adapting to changing requirements. Adopting best practices, keeping the system modular, writing comprehensive tests, and ensuring consistent code quality are essential.

Hey James, great proposal! I feel like it's somewhat in line with #724. Moreover, this RFC illustrates agnostic delivery approach. IMO it'd be beneficial for maintenance and adoption because it'd give hosting flexibility and provide an "off the shelf" solution to deploy mirrors.

You mentioned the cost factor as an important non-functional restriction. WDYT about trying to assess it? For example, could we use the number of available providers and modules, the registry RPM data, and the average payload size as a proxy to costs?

Another question - about the crawler and the storage layer (e.g. s3) - would you consider serving provider binary (as it's a static asset) from there rather than directly from github?

Thanks!

This is a fantastic proposal, James. It's a comprehensive, flexible, and agnostic approach to delivering this content to registry users. Throughout the document, you echoed the sentiment that your vision was for this to be a lean and rapid system for providing content. To support this philosophy, do we want to define storage limitations?

On another note, you mentioned that some providers might have their documentation formatted for the legacy Terraform website publishing and may require the crawler to fetch from those locations. Would we want to create a fork of tfplugindocs so that we can allow providers to automatically generate documentation for their provider in the format necessary for the Tofu Registry?

Hi @RoseSecurity ! thanks for taking the time to read through this.

Re: storage limitations

Overall I don't think there will be a huge amount of data in terms of actual storage size, as for now the registry only needs to handle json files, that can nicely be compressed. And if we did some rough math about the number of active providers and modules out there, and the number of versions each of those have we are not talking millions and millions of files to store here.

However, I do think that it raises a question about avoiding abuse of the system, and ensuring that we don't try and ingest data from a repo that has 10000s of fake releases. (I do see this as an EXTREME edge case!)

Overall right now I don't think we need any storage limitations but going forward it's something we should bear in mind.

Re: documentation formats

The tfplugindocs setup seems to work very well for existing providers and the format is well supported. I've actually been playing with legacy format of documentation recently and made some notes which I will dump here for background:

Right now in the existing registry implementations, provider documentation can come in a couple different formats: Latest format:

all inside /docs
/docs/guides/.md for guides
/docs/resources/.md for resources
/docs/data-sources/.md for resources.

Legacy format:

all inside /website/docs
/website/docs/guides/.[html.markdown|html.md]
/website/docs/r/.[html.markdown|html.md]
/website/docs/d/.[html.markdown|html.md]

And both systems support the new cdktf docs format:

Latest format:

/docs/cdktf/[python|typescript]/resources/.md
/docs/cdktf/[python|typescript]/data-sources/.md Legacy format:
/website/docs/cdktf/[python|typescript]/r/.[html.markdown|html.md]
/website/docs/cdktf/[python|typescript]/d/.[html.markdown|html.md]

The AWS provider is a great example of the usage of the legacy documentation system (docs in /website/docs) and it has support for the CDKTF documentation.

Overall, We need to be handling documentation across all providers and all of their legacy releases, so writing a tool to get people to ship documentation in a new format, sadly, isn't going to change the existing old releases. (I think this is why the AWS provider still uses the legacy format)

I've had some success and played around with this recently, and what I've done is write some code that transforms the legacy documentation into the new format as it reads it from the repo, and then our serving side/client side only ever needs to worry about this in the new format.

It's quite rudimentary right now but what it does in essence is: When we detect a directory of ./website/docs, move that to ./docs

move all the files from ./d to ./data-sources
move all the files from ./r to ./resources
rename all the files from .html.markdown to .md
rename all the files from .html.md to .md

This approach means that we don't require people go back in releases in their providers, and the opentofu registry can handle all formats of documentation.

Summary

Hopefully this answers some of your concerns here! I'm aware that I've not properly communicated some of the requirements around things like "storage" and "legacy documentation" and hopefully this fills in those gaps. It's difficult to write this without assuming the reader has some knowledge of the existing requirements without adding 1000s of more words to this already humongous RFC!

Hopefully also this acts a little bit as documentation for the other RFCs that also have to solve this documentation rendering problem.

@Yantrio Thank you for the explanations and background on the internal discussions! Excited to hear how the technical steering committee conversations go!

Closing as the steering committee has chosen a different design for implementation.

More details in today's (2023-11-03) update in https://github.com/opentofu/opentofu/issues/258

opentofu / opentofu

[RFC] Componentized Strategy for OpenTofu Registry #722

Introduction

1. Metadata Crawler

2. Documentation Crawler

3. GPG Key Validation

4. v1 API

5. Other Considerations

Summary

Metadata Crawler Strategy for OpenTofu Registry

Introduction

Existing Issues and Problem Set

Overview

Key Components

Storage Strategy: Balancing Freshness and Efficiency

Storing Staleness Values

1. PostgreSQL Database

Overview

Why

How

Pros:

Cons:

2. Time-Series Database (e.g., Amazon Timestream)

Overview

Why

How

Pros:

Cons:

Storage of Static Content

1. Amazon S3 Buckets/Cloudflare R2/Other s3 compatible service

Overview

Why

How

Pros:

Cons:

2. Amazon DocumentDB

Overview

Why

How

Pros:

Cons:

3. Other Static Content Hosting Services

Overview

Why

How

Pros:

Cons:

Caveat

Implementation Details

1. Crawling

2. Manual Update

3. Update Staleness Values Over Time

Implementation Choices

AWS Lambda

Docker Containers

Future Challenges and Considerations

Conclusion

Documentation Crawler Extension for OpenTofu Registry

Introduction

Requirements

Documentation Crawling Strategy

Strategy Overview

Operational Dynamics: Front-Heavy Initialization

Initial Onboarding of Providers

Version-specific Staleness

Benefits of Front-Heavy Approach

Advantages of Separate Documentation Crawling

1. Specialized Efficiency

2. Fault Isolation

3. Scalability and Flexibility

4. Streamlined Maintenance

5. Hosting Flexibility

Implementation Details

Flow 1: Documentation Crawling

Flow 2: Manual Trigger for Documentation Update

Implementation Choices

Frontend API Considerations

GPG Key Management for OpenTofu Registry

Introduction:

Why We Need This: