nebari-dev / governance

✨ Governance-related work for Nebari-dev
BSD 3-Clause "New" or "Revised" License

RFD - Backup and restore #49

Closed pt247 closed 4 months ago

pt247 commented 6 months ago
| Status | Open for comments 💬 |
| ----------------- | -------------------- |
| Author(s) | @pt247 |
| Date Created | 20-04-2024 |
| Date Last updated | 05-05-2024 |
| Decision deadline | 22-05-2024 |
# Backup and restore - RFD

A design proposal for a Backup and Restore service in Nebari.

## Summary

As Nebari becomes more popular, it's essential to have dependable backup and restore capabilities. Automated scheduled backups are also necessary. Planning is vital since Nebari has several components, including conda-store environments, user profiles in Keycloak, and user data in the NFS store.

## User benefit

1. Nebari admins will get a straightforward backup process using the Nebari CLI.
2. Admins will also be able to define a schedule for automated backups in the Nebari config.
3. Nebari upgrades can automatically save the state before upgrading.
4. User data and other Nebari components can be better protected against accidental deletion.

## Design considerations:

We need to look at the development, maintenance, administration, and support requirements to decide on an appropriate strategy for this service. The following is a list of key criteria for the service:

1. **Availability**: Service disruption to perform backup or restore.
2. **Observability**: Visibility of progress, error, and status.
3. **Maintainability**: Ease of building, maintaining and supporting.
4. **Composability**: Backup and restore in small chunks independently.
5. **Security**: Access control to the backup and restore service and the backup itself.
6. **Compatibility**: Forward and, if possible, backwards.
7. **Flexibility**: Multiple entry points to the backup and restore, e.g. scheduled API
8. **Scalability**: The solution should scale to large deployments.
9. **Feasibility**: In terms of development, maintenance, and computing resources.
10. **Compliance**: With various data protection regulations.
11. **On-prem**: On-prem and air-gapped deployments.

## Data protection considerations:

1. **Encryption at rest and in transit**: We must have data encrypted in motion and at rest to protect against unauthorized access.
2. **Backup location**: Several data protection directives in the US and EU limit where we can store certain data assets. We should design the backup and restore solution with this in mind.
3. **Day zero feature**: Encryption at rest and in transit needs to be available in the first version of the backup and restore service.
4. **PoLP** (principle of least privilege): Only authorized users should be able to access the backup and restore service.

## In the scope of this RFD:

This Request for Discussion (RFD) aims to establish a high-level strategy for backup and restoration. The goal is to reach a consensus on design choices, API, and a development plan for the backup and restoration of individual components. The implementation details of the identified design will be part of another RFD. The focus of this RFD is to develop a backup and restoration strategy for the following components:

- Nebari config
- Keycloak
- Conda-store
- User data in NFS

## Out of scope for this RFD:

The following Nebari components are not covered in this document.

- Nebari plugins
- Loki Logs + prometheus
- Nebari migration (e.g. from AWS to GCP)
- Custom backup schedules (e.g. component specific backup schedules)

## Existing backup process

You can find the existing docs for backup on this [page](https://www.nebari.dev/docs/how-tos/manual-backup).

## Backup and Restore strategies

There are several approaches to Nebari backup and restore. Some are closer to the current backup and restore, and some are entirely novel approaches. Each of these methods has its own set of advantages and disadvantages. In this section, we will summarise the various approaches suggested in the comments, outline the pros and cons, and briefly describe the implementation.

### Backup and restore by component Approach #1

```mermaid
flowchart TD
    Backup --> Storage
    Nebari --> |1. config| Backup
    Nebari --> |2. Keycloak | Backup
    Nebari --> |3. Conda Store | Backup
    Nebari --> |3. User Data | Backup
    Storage --> Restore1
    Restore1 --> |1. config| Nebari1
    Restore1 --> |2. Keycloak | Nebari1
    Restore1 --> |3. Conda Store | Nebari1
    Restore1 --> |3. User Data | Nebari1
```

This approach aims to automate the current manual backup and restore process. A typical Nebari deployment consists of several components like Keycloak, conda-store, user data and more.

#### Example Backup flow:

```mermaid
flowchart TD
    A1[CLI] --> B(Backup workflow)
    A2[Nebari config.backup.schedule] --> B
    A3[Argo workflows UI] --> B
    B --> F(Backup Nebari config)
    F --> D(Backup Keycloak)
    D --> C(Backup NFS)
    D --> E(Backup Conda Store)
    C --> X(Backup Location)
    D --> X
    E --> X
    F --> X
```

#### Example Restore flow:

```mermaid
flowchart TD
    A[Nebari Restore CLI - Specified backup] --> B(Backup workflow - latest backup)
    A1[Argo Workflows UI] --> B
    B --> B1(Restore workflow - Specified backup)
    B1 --> F(Restore Nebari config)
    F --> D(Restore Keycloak)
    D --> C(Restore NFS)
    D --> E(Restore Conda Store)
    C --> Z(Validate restore completion)
    D --> Z
    E --> Z
    Z --> |failure| X(Restore workflow - latest backup)
    Z --> |success| Y(Stop)
    X --> |success| Y
    X --> |failure| Y
```

`Note`: Both of these workflows are examples and must be refined/refactored.

Let's look at the pros and cons of this approach:

**Pros**

1. Feasibility: We can use tried and tested tools for database dumps, or Restic to sync files between source and destination.
2. Maintainability: Development of each task (say, backup conda-store) can happen separately and iteratively.
3. Compatibility: Excellent support is available for tried and tested production-ready tools like pg_dump, Restic, rsync and more. This design can use Nebari component agnostic tools, which means the same solution could work for multiple versions of Nebari, providing backwards and forward compatibility.

**Cons**

1. Observability: A single failed sub-task can cause the whole backup or restore to fail.
   This solution offers little observability.
2. Composability: The success of individual tasks does not guarantee overall success. For example, the backup might include a new user's data even though that user did not yet exist when Keycloak was backed up.
3. Scalability: As user data increases, this design might need to evolve to take incremental snapshots. If the time it takes to back up increases, so do the chances of Nebari state changing.
4. Availability: The solution must implement a maintenance window for the entire Nebari during backup and restore processes.

### Finer details

1. Backup location: This design assumes Nebari has read-write access to the backup location. Nebari will manage the backup location.
2. Local backup: If the backup location is a local directory, the client should have read-write access to that directory.

### Vertical slices per user migration Approach #2

We could look at Nebari from the perspective of the user. Each user has some shared and dedicated state in each Nebari component.

| Nebari      | Shared                     | Dedicated           |
|-------------|----------------------------|---------------------|
| Keycloak    | Groups, Roles, permissions | User profiles       |
| Conda store | Shared environments        | User environments   |
| JupyterHub  | Shared user data           | Dedicated user data |

The solution recommends backing up or restoring shared resources first. We can then backup/restore users in parallel or in any order.

User migration workflow

```mermaid
flowchart LR
    rc[Restore user] --> rs[Restore shared state] --> ru[Restore user]
    s[Storage] -.-> rs
    s -.-> ru
    bc[Backup user] --> bs[Migrate shared state] --> bu[Migrate user]
    bu -.-> Storage
    bs -.-> Storage
```

Nebari migration overall

Backup flowchart

```mermaid
flowchart LR
    nb[Nebari Backup] ==> rs[Backup shared state]
    rs ==> bu1[Backup user A] & bu2[Backup user B] & bu3[Backup user C] -.-> Storage
    rs --> | ... | Storage
    rs --> |Backup user n| Storage
```

Restore flowchart

```mermaid
flowchart LR
    nr[Nebari Restore] ==> rsr[Restore shared state]
    Storage -.-> ru1[Restore user A] & ru2[Restore user B] & ru3[Restore user C] & ru4[...] & ru5[Restore user N]
    rsr ==> ru1 & ru2 & ru3 & ru4 & ru5
    Storage -.-> rsr
```

Let's look at the pros and cons of this approach:

**Pros**

1. Fail fast approach: If all goes well, we will have backed up all users. If not, then there will be two possibilities:
   1. Shared state backup/restore failure, which will be immediate, or
   2. Single user state backup/restore failure, which can be local to a particular user, e.g. another user's backup/restore might still succeed.

   The likelihood of failure significantly decreases after the initial few users have been successfully migrated.
2. Composable: Admins can migrate a single user at a time or a batch of users simultaneously. We could easily create workflows to back up shared state at a higher or lower cadence than users.
3. Monitoring: The design allows for more granular status and progress monitoring.
4. Availability: Backup of a user should not impact service for other users.

**Cons**

1. Maintainability and Compatibility: This design depends on the APIs present in the individual components and our understanding of the state and data within them. However, our understanding and, therefore, the implementation may need to be updated with version upgrades. Hence, this backup/restore solution is only compatible with a limited number of component versions.
2. Feasibility: This design, although more evolved, will also require more building, maintenance, and support.

### Restful Interface Approach #3

The last two designs include the backup and restore functionality in Nebari. The central assumption was that Nebari should be able to back up and restore itself. However, thanks to helpful comments in this RFD, this design challenges this premise and proposes an alternative solution.
This design breaks the implementation into two: the interface and the strategy. It argues that Nebari should only provide the interface for importing/exporting data. The backup and restore strategy should be part of the client code. We can extend the interface by providing a Python library.

```mermaid
block-beta
columns 1
  j["Client Script (Backup strategy maintained by Nebari Admin)"]
  blockArrowId6<["   "]>(updown)
  L["Nebari backup and restore library (Python package)"]
  blockArrowId7<["   "]>(updown)
  D["Nebari Backup and restore REST API"]
  blockArrowId6<["   "]>(updown)
  block:ID
    A["Conda Store REST API"]
    B["User DATA REST API"]
    B2["Keycloak REST API"]
  end
```

The idea is simple: instead of building a backup and restore _**service**_, we could build a backup and restore _**interface**_. The only job of this interface will be to provide users' state and data to authenticated users outside Nebari. The entire backup and restore logic can be built and maintained outside Nebari. This backup and restore client can then be run from anywhere, providing admins with flexibility that other designs do not offer.

```mermaid
flowchart LR
    subgraph Backup and restore library
        Client
    end
    Client-->I
    subgraph Nebari
        I
        I[Backup and restore interface API]-->K[Keycloak API]
        I-->C[Conda store API]
        I-->J[JupyterHub API]
        I-->N[User data API]
    end
```

#### Serializable vs Non-Serializable data

An essential requirement for this design is to expose data and state. APIs like the Keycloak and conda-store APIs already provide the bulk of serializable state. However, not all state is serializable, e.g., user data and conda packages. In this case, the design recommends that the APIs return download location URLs. APIs in Nebari could be completely stateless.

Let's look at a few transactions with this proposed API.
Serializable data

```mermaid
sequenceDiagram
    Client->>API: GET /users
    API-->>Keycloak: GET /admin/realms/{realm}/users/
    Keycloak-->>API: [A, B, C]
    API-->>Client: [A, B, C]
```

Non-Serializable data

```mermaid
sequenceDiagram
    Client->>API: GET /users/A/environments
    API-->>conda-store: GET /api/v1/environment/?namespace={..}
    conda-store-->>API: [E1, E2, E3 ...]
    API->>Client: [{envs:[E1, E2]}]
    Client-->NFS: FTP/Rsync/Restic FETCH Artifact from E1, E2, E3 ...
```

Let's see the pros and cons of this design.

**Pros**

1. Flexibility: Nebari admins can write their custom backup strategy based on the organization's needs.
2. Observability: This solution gives Nebari admins a unique insight into the inner workings of Nebari. It makes it less opaque and, thus more
3. Compliance: The authenticated admins are responsible for enforcing compliance with company policies.
4. Availability: If the client interface is well-developed, this solution can achieve the highest level of availability.
5. Clear division of responsibility: Responsibilities are clearly divided among the component APIs, the Nebari backup and restore API, the client library, and the client code.

**Cons**

1. Maintainability & Support: This approach moves the complexity of the backup/restore strategy outside Nebari. It now requires Nebari admins to know and understand the inner workings of Nebari.
2. Flexibility: A misconfigured client script can wreak havoc with the Nebari ecosystem.

## Design Discussion

### Possible options

Each of the above-discussed designs has its pros and cons. We could also extend the designs. For example, we could extend Approach #2 and #1 via an API to provide simple interfaces like `/users/{uid}/backup/keycloak`. Let's look at a few possible options we can vote on. More suggestions are welcome.

1. Option#1: Start with [Restful Approach #3](#restful-interface-approach-3) to enable power users in the first iteration. Then extend this to [Sliced Approach #2](#vertical-slices-per-user-migration-approach-2) for normal users.
2. Option#2: Implement [Bulk Backup Approach #1](#backup-and-restore-by-component-approach-1) in the first iteration, then evolve it to [Sliced Approach #2](#vertical-slices-per-user-migration-approach-2) by exposing an API in the second iteration.
3. Option#3: Implement [Bulk Backup Approach #1](#backup-and-restore-by-component-approach-1).
4. Option#4: Implement [Sliced Approach #2](#vertical-slices-per-user-migration-approach-2).
5. Option#5: Implement [Restful Approach #3](#restful-interface-approach-3).

### Special note about conda-store

Conda-store is one of the more complicated pieces to replicate among the Nebari components. We will need to work with the conda-store team to come up with a detailed plan for backup and restore. But here is an initial analysis based on the conda-store docs.

> The S3 server is used to store all build artifacts for example logs, docker layers,
> and the Conda-Pack tarball. The PostgreSQL database is used for storing all states
> on environments and builds along with powering the conda-store web server UI, REST
> API, and Docker registry. Redis is used for keeping track of task state and results
> along with enabling locks and realtime streaming of logs.

#### The simplest approach (Compatible with [Approach #1](#backup-and-restore-by-component-approach-1))

Back up the object storage and dump the database. Restore would be the reverse. We might have to ensure that database location entries for artifacts and Conda-Pack are pointing to the right location. This might involve simple find and replace operations on the SQL dump.

#### Approach per user (Can be used in [Approach #2](#vertical-slices-per-user-migration-approach-2) and [Restful Approach #3](#restful-interface-approach-3))

![Image](https://github.com/users/pt247/projects/1/assets/8033215/b56cf6c5-c5c9-4a15-a39e-2a06b29afcaf)

- Getting the shared state:
  - Get an SQL dump of the entire conda-store.
  - Mark all entries in `environment` as deleted by setting the `deleted_on` field.
  - Get the global namespaces; for each:
    - Get the related `environment`s and reset `deleted_on` to make them available.
    - For each environment:
      - Get the build artifacts to back up: `environment` -> `build` -> `build_conda_package_build` -> `conda_package_build`
      - Back up the artifacts from source.
- Getting the user state:
  - Same as getting the shared state, except the namespace will be that of the given user.

Please note:

1. We need to get the conda-store team to review this, but it gives a general idea.
2. Most of this flow can be done via the API, except changing the environments' delete status.
3. We will need to create a separate RFD for conda-store.

## Relevant links:

1. https://www.nebari.dev/docs/how-tos/manual-backup
2. https://www.keycloak.org/server/importExport
3. https://argoproj.github.io/workflows/
4. https://www.keycloak.org/docs-api/22.0.1/rest-api/index.html#_users
5. https://conda.store/conda-store/references/api

## Unresolved questions:

1. Which design is most suitable?
2. Is there a hybrid design that we can develop iteratively?

viniciusdc commented 6 months ago

I just finished reading @pt247; this looks great! Some considerations below:


> Keycloak manages user authentication. There is a recommended way to back up and restore in the Keycloak docs - link

I would suggest doing this differently, as interacting with the kc client is troublesome, and we only care about the users and groups. This could be handled by API requests directly in a moderate way.


> nebari backup user-data --backup-location

If we do end up having this structure, I prefer that those commands (user-data, user-creds) are not exposed directly to the user (similar to how nebari render is called right now). The user should only handle this manually if the general backup fails midway through.


> Scheduled backup of Nebari config: First, we extend the existing Nebari configuration file to provide a backup schedule to the Argo workflow template. The Argo template will encrypt the Nebari config and back it up.

We already saved the kubeconfig as a secret on Kubernetes; we could reuse that as part of this and enable versioning for that secret.

viniciusdc commented 6 months ago

I also have a question: would we expect the S3 or storage to be managed by Nebari's terraform during the first deployment, or would the user be responsible for that? (I do prefer the latter, though we would need to make sure the cluster roles have access to that :smile: )

pt247 commented 6 months ago

> I also have a question: would we expect the S3 or storage to be managed by Nebari's terraform during the first deployment, or would the user be responsible for that? (I do prefer the latter, though we would need to make sure the cluster roles have access to that 😄 )

That's a good point. The backup location should not be managed by Nebari, but Nebari should have access and rights to write to the location. I will clarify this in the RFD.

pt247 commented 6 months ago

> If we do end up having this structure, I prefer that those commands (user-data, user-creds) are not exposed directly to the user (similar to how nebari render is called right now). The user should only handle this manually if the general backup fails midway through.

You are right; it's simpler to implement a catch-all "back up everything" command. However, an admin might, for good reasons, want to back up only specific components, for example, user data only.

viniciusdc commented 6 months ago

Some of the main points from our most recent discussion on the matter:

My 50 cents

We'll first need to discuss the data needed for state restoration and ensure each component is clearly defined in its role within the backup and restore operations. For instance:

Furthermore, addressing the dependencies and interactions between services during the backup and restore processes is essential. For example, restoring Keycloak user data and groups should ideally precede the restoration of corresponding directories to maintain coherence.
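The ordering constraint above can be made explicit as a small dependency graph. The following is only a sketch: the component names and the `RESTORE_DEPENDENCIES` map are assumptions for illustration, and only the "Keycloak before user directories" edge comes from this discussion (the conda-store edge is a guess).

```python
from graphlib import TopologicalSorter

# Hypothetical map: each component is restored only after everything in its set.
RESTORE_DEPENDENCIES = {
    "keycloak": set(),            # users and groups come first
    "user-data": {"keycloak"},    # directories belong to restored users
    "conda-store": {"keycloak"},  # assumption: namespaces refer to users/groups
}


def restore_order(dependencies):
    """Return component names in an order that respects the dependencies.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

Calling `restore_order(RESTORE_DEPENDENCIES)` always places `keycloak` before the components that depend on it; the relative order of independent components is not specified.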

Finally, our discussions have highlighted the importance of individually mapping out each service's backup and restore processes before we consider how to orchestrate these processes.

```mermaid
flowchart TD
    B(Orchestrator)
    C(NFS) --> B
    D(Keycloak) --> B
    E(Grafana?) --> B
    S(Conda Store) --> B
```

While managing other services solely through APIs is feasible, the same cannot be said for the EFS structure, which needs to be considered as its own category. As part of this RFD, we need to include the data that will be targeted as part of these stored components. Ideally, this would be facilitated through endpoints if we expose them somehow.

Let's leverage the existing CLI command descriptions already presented in this RFD to ensure that any system we implement in the future can communicate in a way that our CLI—or other necessary tools—can effectively manage.

Regarding data export versus backup

Exporting data in a serializable format does not necessarily ensure a complete service restoration to its previous state.

To better define these distinctions, it's essential to evaluate the behavior of each service. Exporting state data from one version of a service to another could restore the previous structural identity of the service but not suffice to promote the same state it was in. If classified as backup/restore, importing and exporting should ideally match the service's original structure level and state. Suppose the provided files fail to restore the original state. In that case, the process should not be considered a backup/restore but a mere export/import—often due to the service's limitations or the incompleteness of the files or sources used to "restore" it.

In discussing the RFD, we aim to identify and standardize these necessary components and files, ensuring that our state data are sufficient to equate importing/exporting with backup/restore as much as possible. In scenarios where the service offers robust API support and effectively handles new data, the distinction between backup and export becomes less significant and often negligible.

For example, although listing and restoring YAML files of namespaced environments from the Conda store might enable us to use these environments again (by rebuilding), this action does not replicate the original "build" of those same environments. As discussed, it also does not leverage the previous builds unless we manage to store all the available databases within it; in my opinion, I would prefer that the conda-store handled that by itself, and we could work together to develop such usability, but we also need to consider what we can do now.

However, this may only be the case for some services; for instance, Keycloak could adequately support backup and restore through simple import/export functions.

tylergraff commented 6 months ago

The comments by @viniciusdc are well organized and point the effort in a good direction. I propose the following principles and tactical plan for implementation.

Core Considerations

Nebari is a modular and configurable collection of disparate OSS components. This implies certain principles related to the backup/restore effort:

Tactical Plan

All APIs should be implemented as REST endpoints using administrator access tokens for authentication and accessible only within the VPC. Core atomic API capabilities:

Order of implementation:

1.) User accounts (highest priority because these cannot be recreated) Schema: username -> [password, [first-name, last-name], [groups] ]

2.) Conda Environments (high priority as these would be very difficult to recreate) Schema: environment name -> [ [package name, version, hash, source URL, retrieval date] ]

3.) User code, notebooks, apps Nebari should be configured to access and store user-created content via git repos. Reliability should be handled externally via integration with a git provider (github, gitlab, etc). This is a well-solved problem served by mature tooling and processes.

4.) Nebari deployment-wide asynchronous (e.g. cron) jobs Recurring / Cron jobs should be implemented within the platform as user-create apps and stored in git repos accordingly.
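The two schemas in items 1 and 2 could be written down as plain records with a JSON round trip. This is a sketch only: the class and field names are made up for illustration, and treating the password as an opaque credential string and the retrieval date as an ISO 8601 string are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class UserRecord:
    username: str
    password: str  # assumption: an opaque credential (e.g. a hash), not plaintext
    first_name: str
    last_name: str
    groups: List[str] = field(default_factory=list)


@dataclass
class PackageRecord:
    name: str
    version: str
    hash: str
    source_url: str
    retrieval_date: str  # assumption: ISO 8601 date string


@dataclass
class EnvironmentRecord:
    name: str
    packages: List[PackageRecord] = field(default_factory=list)


def user_to_json(user: UserRecord) -> str:
    return json.dumps(asdict(user), sort_keys=True)


def user_from_json(text: str) -> UserRecord:
    return UserRecord(**json.loads(text))


def environment_to_json(env: EnvironmentRecord) -> str:
    # asdict() recurses into the nested PackageRecord entries.
    return json.dumps(asdict(env), sort_keys=True)


def environment_from_json(text: str) -> EnvironmentRecord:
    data = json.loads(text)
    packages = [PackageRecord(**package) for package in data["packages"]]
    return EnvironmentRecord(name=data["name"], packages=packages)
```

Keeping each record flat and JSON-serializable means an administrator can inspect or edit a single element before it is restored.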

dcmcand commented 6 months ago

Both @viniciusdc and @tylergraff have some excellent thoughts here.

I agree with @tylergraff that having a standardized interface for backups that we can implement for each service is a good plan. That will improve the devex and make things far easier as far as maintainability. @tylergraff's proposed API endpoints would certainly provide coverage, but I would suggest that we go even simpler to start. Just have a /backup/keycloak endpoint that requires an admin token to access and takes an optional S3-compatible location as an argument. If the location is given, the files are written there. If not, they are just returned to the caller. That would be the simplest implementation imo.
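To make the endpoint's behaviour concrete, here is a framework-free sketch of the request handling. Everything in it is hypothetical: the function name, the injected `export_keycloak` and `write_to_location` callables, and the status codes are assumptions, not an existing Nebari API.

```python
import hmac
from typing import Callable, Optional, Tuple


def handle_keycloak_backup(
    token: str,
    admin_token: str,
    export_keycloak: Callable[[], bytes],
    location: Optional[str] = None,
    write_to_location: Optional[Callable[[str, bytes], None]] = None,
) -> Tuple[int, bytes]:
    """Handle a hypothetical `/backup/keycloak` request.

    Returns an (HTTP status code, response body) pair.
    """
    # Constant-time comparison avoids leaking token contents through timing.
    if not hmac.compare_digest(token.encode(), admin_token.encode()):
        return 401, b"invalid admin token"

    payload = export_keycloak()

    if location is None:
        # No location given: hand the backup straight back to the caller.
        return 200, payload

    if write_to_location is None:
        return 500, b"no writer configured for backup locations"
    # Location given: write the backup there and return an empty body.
    write_to_location(location, payload)
    return 200, b""
```

Injecting the exporter and the writer keeps the sketch independent of how Keycloak is exported and of any particular S3 client.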

I also agree with @viniciusdc that we should utilize built in backup mechanisms whenever possible. Keycloak already provides options for backup and restore which can be accessed through its rest API. Rather than reinvent the wheel here, we should wrap the functionality so that it implements our backup interface.

For prioritization, I also agree with @tylergraff. We should first ensure that each service has backup and restore functionality before worrying about any kind of orchestration between backups.

Users and groups are the obvious first backup target, and would be really straightforward to implement since it would just be wrapping Keycloak's REST API.

After that, I would agree with conda-store next. I think conda-store backup should just be a backup of the blob storage in some form and a dump of the postgres db to start with.

Finally, the NFS file system, which I think we can just make a tarball of.
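A tarball backup of a directory tree can be sketched with the standard library alone. The function names and paths are placeholders, and the sketch ignores concerns such as files changing during the backup, ownership mapping, and very large trees.

```python
import tarfile
from pathlib import Path


def backup_directory(source_dir, archive_path):
    """Write a gzip-compressed tarball of everything under source_dir."""
    source = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        for child in sorted(source.iterdir()):
            # Store paths relative to source_dir so the archive is relocatable.
            archive.add(child, arcname=child.name)
    return Path(archive_path)


def restore_directory(archive_path, target_dir):
    """Unpack a tarball created by backup_directory into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions can refuse unsafe members (e.g. "../" paths).
            archive.extractall(target, filter="data")
        else:
            archive.extractall(target)
    return target
```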

Restores could be the reverse.

This is not an end state, but it would be an MVP implementation that users could try out and that we could learn a lot from. Being an MVP, it will also be cheaper and quicker to implement while (hopefully) avoiding going too far down any incorrect paths.

tylergraff commented 6 months ago

I agree with re-using / wrapping existing capabilities, provided that the wrapper adopts a standardized authentication token pattern which would be used across future endpoints.

I'm not convinced of adding an optional S3 bucket for user backups. This adds S3 authentication implementation and administration. It also implies that a single operation would serialize all users to that S3 bucket. Bulk actions can introduce ornery complexities, such as: how to handle a fatal error which occurs after some of the users were backed up to S3? How would we get debug insight into (potentially) which individual user account caused the error?

My opinion is that we should implement list-all, serialize, and deserialize operations only; the latter two operate on single elements (e.g. users). Client-side tooling can perform S3 uploads separately and in a more modular fashion.
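The three operations could be captured as a small interface that each component implements. This is a sketch under assumptions: the `ComponentState` name, the method signatures, and the in-memory user store are all invented for illustration.

```python
import json
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class ComponentState(ABC):
    """Hypothetical per-component interface: list-all, serialize, deserialize."""

    @abstractmethod
    def list_all(self) -> List[str]:
        """Return the keys of every element (e.g. usernames)."""

    @abstractmethod
    def serialize(self, key: str) -> str:
        """Return one element as a JSON document."""

    @abstractmethod
    def deserialize(self, document: str) -> str:
        """Create or update one element from a JSON document; return its key."""


class InMemoryUsers(ComponentState):
    """Toy implementation standing in for a real user store."""

    def __init__(self, users: Optional[Dict[str, dict]] = None) -> None:
        self._users = {name: dict(attrs) for name, attrs in (users or {}).items()}

    def list_all(self) -> List[str]:
        return sorted(self._users)

    def serialize(self, key: str) -> str:
        return json.dumps({"username": key, **self._users[key]}, sort_keys=True)

    def deserialize(self, document: str) -> str:
        data = json.loads(document)
        username = data.pop("username")
        self._users[username] = data
        return username


def copy_all(source: ComponentState, target: ComponentState) -> List[str]:
    """Client-side loop: move elements one at a time from source to target."""
    return [target.deserialize(source.serialize(key)) for key in source.list_all()]
```

Because the loop lives in client code, uploading each serialized document to S3 (or anywhere else) can be added there without touching the interface.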

pt247 commented 6 months ago

From the comments, I can conclude the following:

  1. We all agree that each component in Nebari will have a different mechanism of backup and restore.
  2. We also see the importance of starting with Keycloak, which looks like a low-hanging fruit.
  3. I agree with @tylergraff that an 'Atomic "snapshot in time" of a Nebari deployment is not feasible ...'
  4. I agree on a few more points, but I would like to start with Keycloak

Let's start with the requirements for Keycloak. I have a few questions:

  1. Why is the ability to serialize/deserialize Keycloak data useful?

    Ex: deserialization of user content should tolerate non-existent users (and vice-versa). @tylergraff Is the plan to use nebari restore to add new users?

  2. @dcmcand has an interesting suggestion of simply adding an endpoint /backup/keycloak. I think it's a great idea, and we should do it. However, I am still not convinced that wrapping the Keycloak API is the simplest approach. The simplest approach, IMHO, is to back up the entire database and restore from that instead. Let's have a look at all the options:

    2.1. Keycloak REST API - docs
    • Pros: For the given version, we can back up and restore using the same shared codebase. Data can be serialized, making it easier to edit/amend if needed. For example, adding or removing users.
    • Cons: Upgrading Keycloak can result in API changes, which will break backup and restore. We need to have a good understanding of how to replicate Keycloak using the API. This includes all the relations of groups, users, etc.

    2.2. Importing and exporting realms - docs
    • Pros: It sounds straightforward to implement. (But I don't know if backing up just the realms is enough.)
    • Cons: I am not sure if this output is serializable.

    2.3. Database Dump - blog
    • Pros: Easy to implement. Data can be zipped, encrypted, and stored in object storage. Easy to make it security compliant.
    • Cons: Not exactly serializable. If users are added in the destination DB that are not in the dump, they will be deleted. Thus, having a maintenance window becomes necessary.

    @dcmcand: Should we try the Database dump approach first? Or would you recommend we try the Keycloak API?

PS: @tylergraff, I am going through your last comment just now. Can you explain in the case of Keycloak what you would like to see in "list-all, serialize, and deserialize"?

pt247 commented 6 months ago

> I'm not convinced of adding an optional S3 bucket for user backups. This adds S3 authentication implementation and administration. It also implies that a single operation would serialize all users to that S3 bucket. Bulk actions can introduce ornery complexities, such as: how to handle a fatal error which occurs after some of the users were backed up to S3? How would we get debug insight into (potentially) which individual user account caused the error?

I agree; whatever solution we pick, it needs to back up all or nothing. Luckily, pg_dump behaves like that. So, in case of failures, we can have the API report the status of the backup as failed, with a reason.
We can always add an option to download the backup asset locally instead of S3. Will that help?
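One way to get the all-or-nothing behaviour with a status report is to write the dump to a temporary file and publish it only on success. This is a sketch under assumptions: the function name and the status dictionary are made up, and the `pg_dump` invocation in the comment uses a placeholder database name.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def run_atomic_dump(command, destination):
    """Run a dump command and publish its output only if it succeeds.

    The command's stdout goes to a temporary file next to `destination`;
    the file is renamed into place only when the exit code is 0.
    """
    destination = Path(destination)
    fd, tmp_name = tempfile.mkstemp(dir=destination.parent, suffix=".partial")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            result = subprocess.run(command, stdout=tmp_file, stderr=subprocess.PIPE)
        if result.returncode != 0:
            reason = result.stderr.decode(errors="replace").strip()
            return {"status": "failed", "reason": reason}
        os.replace(tmp_name, destination)  # atomic rename on the same filesystem
        return {"status": "succeeded", "path": str(destination)}
    except OSError as error:
        # For example: the dump executable is not installed.
        return {"status": "failed", "reason": str(error)}
    finally:
        # Never leave a partial dump behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)


# Example call (database name "keycloak" is a placeholder):
# run_atomic_dump(["pg_dump", "--format=custom", "keycloak"], "/backups/keycloak.dump")
```

A caller such as an API route can return the dictionary as-is, so a failed backup is reported with its reason and never leaves a half-written file at the destination.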

> My opinion is that we should implement list-all, serialize, and deserialize operations only; the latter two operate on single elements (e.g. users). Client-side tooling can perform S3 uploads separately and in a more modular fashion.

We can expose the Keycloak REST API to authenticated admins. This will allow admins to write client-side tooling to manage users as needed, e.g., adding or removing users.

tylergraff commented 6 months ago

> Why is the ability to serialize/deserialize Keycloak data useful? ... what you would like to see in "list-all, serialize, and deserialize"?

Let me explain my reasoning and address those together:

My team's DR approach is to incrementally re-build a new Nebari deployment which can be used productively by our customers throughout that rebuild process. We are comfortable with this and are looking to minimize the risk and time (in that order) involved. We are not looking to precisely duplicate a Nebari deployment or its contents. We see substantial risk in the precise replication of internal Nebari state: internal state is opaque to us, may itself be the root cause of a disaster, or may cause a new disaster due to opaque consistency issues with other components. We know that deploying Nebari in an incremental fashion is low risk, because it is something we do frequently.

Our current DR approach is almost entirely manual, and we would like to improve by using automation to decrease the time involved. To reduce risk, it is critical that we retain visibility into (and thus confidence in) the changes effected by automation. We desire an approach of incremental modification, which allows us to understand changes and tailor risk. We want to maximize the observability of system state, allowing the effects of modification to be understood by administrators (who are likely learning as they go). And we’d like to decouple changes, to reduce the risk of unintended consequences.

To answer your questions:

viniciusdc commented 5 months ago

After reviewing the latest RFD contents and reflecting on our internal discussions and community feedback, Approach 3 seems most suited to our needs. As @tylergraff noted:

> We see substantial risk in the precise replication of the internal Nebari state: the internal state is opaque to us, may itself be the root cause of a disaster, or may cause a new disaster due to opaque consistency issues with other components.

Fully replicating Nebari's state can reintroduce the problems that necessitate a restoration, making it a challenging option.

However, I also see significant merits in Approach 2, especially when we consider 'user' as the basic unit for the backup/restore process. This approach offers the flexibility to restart the process after encountering any errors or exceptions, which is a limitation of the bulk process. Nevertheless, this should not be viewed as a separate approach IMO. If we proceed with the REST API approach (Approach 3), we can incorporate both bulk and per-user import/export endpoints.
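The restart property of a per-user process can be sketched as a loop that remembers which users are done and which failed. The function name and the injected `backup_one` callable are hypothetical; a real client would persist the `completed` set between runs.

```python
def backup_users(usernames, backup_one, completed=None):
    """Back up users one at a time, skipping those already completed.

    Returns (completed, failures), where failures maps username -> reason.
    A failure for one user does not stop the others.
    """
    completed = set(completed or ())
    failures = {}
    for username in usernames:
        if username in completed:
            continue  # finished in an earlier run
        try:
            backup_one(username)
        except Exception as error:  # record and continue with the next user
            failures[username] = str(error)
        else:
            completed.add(username)
    return completed, failures
```

Passing the returned `completed` set into the next call retries only the users that failed, which a single bulk operation cannot do.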

This integration allows us to optimize the workflow for backup/restore processes, which the user should consider.

In conclusion, I think everyone seems to be on the same page regarding the serialization and endpoints approach, and this should now be voted as is, and follow-up tasks can be created to start implementation details discussions.

dcmcand commented 4 months ago

Thanks to everyone for their feedback here. Based on this discussion, we will be moving forward with approach 3.

Currently, state lives in 3 main places:

  1. Keycloak - stores users, groups, and permissions
  2. Conda-store - stores conda environments and builds
  3. User storage - stores user data including code and datasets

We will create a backup controller within Nebari which will expose backup and restore routes for each of these services. Specifics of each service's backup and restore will be decided on a per-service basis and will be handled in individual tickets. There seems to be broad consensus that it makes sense to start with Keycloak as the first service to implement this on. @pt247 will open tickets for backup and restore for each service, and we can have specific discussions on the implementation details on those tickets.
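As a rough illustration of what "a controller exposing backup and restore routes per service" could look like, here is a minimal registry sketch. The class, the `/backup/<service>` and `/restore/<service>` route shapes, and the handler signatures are assumptions; the real design is left to the per-service tickets.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[], str]


class BackupController:
    """Hypothetical registry mapping routes to per-service handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[Tuple[str, str], Handler] = {}

    def register(self, service: str, backup: Handler, restore: Handler) -> None:
        """Register the backup and restore handlers for one service."""
        self._handlers[("backup", service)] = backup
        self._handlers[("restore", service)] = restore

    def routes(self) -> List[str]:
        """List every exposed route, e.g. '/backup/keycloak'."""
        return sorted(f"/{action}/{service}" for action, service in self._handlers)

    def dispatch(self, path: str) -> str:
        """Call the handler registered for a path like '/backup/keycloak'."""
        parts = path.strip("/").split("/")
        if len(parts) != 2 or (parts[0], parts[1]) not in self._handlers:
            raise KeyError(f"no route for {path!r}")
        return self._handlers[(parts[0], parts[1])]()
```

Each service can then be added independently, starting with Keycloak, without changing the controller itself.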