| Status | Open for comments 💬 |
|---|---|
| Author(s) | @pt247 |
| Date Created | 20-04-2024 |
| Date Last updated | 05-05-2024 |
| Decision deadline | 22-05-2024 |
I just finished reading this, @pt247; it looks great! Some considerations below:
Keycloak manages user authentication. There is a recommended way of backing up and restoring it in the Keycloak docs - link
I would suggest doing this differently: interacting with the `kc` client is troublesome, and we only care about the users and groups. These could be handled directly through API requests in a more manageable way.
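For illustration, a minimal sketch of what handling this directly through the Keycloak Admin REST API could look like, assuming an admin service account and a realm named `nebari` (the base URL, realm name, and credentials are placeholders, and passwords are not exported by this API):

```python
# Sketch: export Keycloak users and groups via the Admin REST API.
import json
import requests

KEYCLOAK_URL = "https://nebari.example.com/auth"  # assumption: Keycloak base URL (older versions use /auth)
REALM = "nebari"                                  # assumption: realm name

def admin_token(username: str, password: str) -> str:
    """Obtain an admin access token from the master realm."""
    resp = requests.post(
        f"{KEYCLOAK_URL}/realms/master/protocol/openid-connect/token",
        data={
            "grant_type": "password",
            "client_id": "admin-cli",
            "username": username,
            "password": password,
        },
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def export_users_and_groups(token: str) -> dict:
    """Fetch users and groups; password hashes are never returned by this API."""
    headers = {"Authorization": f"Bearer {token}"}
    users = requests.get(
        f"{KEYCLOAK_URL}/admin/realms/{REALM}/users", headers=headers
    ).json()
    groups = requests.get(
        f"{KEYCLOAK_URL}/admin/realms/{REALM}/groups", headers=headers
    ).json()
    return {"users": users, "groups": groups}

if __name__ == "__main__":
    token = admin_token("root", "...")  # credentials would come from the Nebari secret
    print(json.dumps(export_users_and_groups(token), indent=2))
```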
`nebari backup user-data --backup-location`
If we do end up having this structure, I would prefer that those commands (`user-data`, `user-creds`) are not exposed directly to the user (similar to how `nebari render` is handled right now). The user should only need to run these manually if the general backup fails midway through.
Scheduled backup of Nebari config: First, we extend the existing Nebari configuration file to provide a backup schedule to the Argo workflow template. The Argo template will encrypt the Nebari config and back it up.
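As a rough illustration only, here is a hypothetical `backup` section in `nebari-config.yaml` and how a deployment step might read it before rendering the Argo workflow template (none of these keys exist in Nebari today; every field name here is invented for this sketch):

```python
# Sketch: a hypothetical `backup` block and how it might be parsed at deploy time.
import yaml

EXAMPLE_CONFIG = """
backup:
  enabled: true
  schedule: "0 3 * * *"                          # daily at 03:00, standard cron syntax
  destination: s3://my-bucket/nebari-backups     # placeholder bucket
  encryption_key_secret: nebari-backup-key       # k8s secret holding the encryption key
"""

config = yaml.safe_load(EXAMPLE_CONFIG)["backup"]
if config.get("enabled"):
    # These values would be passed into the Argo CronWorkflow template that
    # encrypts and uploads the rendered nebari-config.
    print(config["schedule"], config["destination"])
```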
We already saved the kubeconfig as a secret on Kubernetes; we could reuse that as part of this and enable versioning for that secret.
I also have a question: would we expect the S3 or storage to be managed by Nebari's terraform during the first deployment, or would the user be responsible for that? (I do prefer the latter, though we would need to make sure the cluster roles have access to that :smile: )
> I also have a question: would we expect the S3 or storage to be managed by Nebari's terraform during the first deployment, or would the user be responsible for that? (I do prefer the latter, though we would need to make sure the cluster roles have access to that 😄 )
That's a good point. The backup location should not be managed by Nebari, but Nebari should have the access and rights to write to that location. I will clarify this in the RFD.
> If we do end up having this structure, I prefer that those commands (`user-data`, `user-creds`) are not exposed directly to the user (similar to how `nebari render` is handled right now). The user should only need to run these manually if the general backup fails midway through.
You are right; it's simpler to implement a catch-all back-up-everything command. But an admin might, for good reasons, want to back up only specific components, for example user data only.
Some of the main points from our most recent discussion on the matter:
We'll first need to discuss the data needed for state restoration and ensure each component is clearly defined in its role within the backup and restore operations. For instance:
Furthermore, addressing the dependencies and interactions between services during the backup and restore processes is essential. For example, restoring Keycloak user data and groups should ideally precede the restoration of corresponding directories to maintain coherence.
Finally, our discussions have highlighted the importance of individually mapping out each service's backup and restore processes before we consider how to orchestrate these processes.
```mermaid
flowchart TD
    B(Orchestrator)
    C(NFS) --> B
    D(Keycloak) --> B
    E(Grafana?) --> B
    S(Conda Store) --> B
```
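One way to read the diagram above is that each component implements a common backup/restore interface that the orchestrator drives. A minimal sketch under that assumption (all class and method names are illustrative, not an existing Nebari API):

```python
# Sketch: a per-component backup interface an orchestrator could drive.
from abc import ABC, abstractmethod
from pathlib import Path

class BackupComponent(ABC):
    name: str

    @abstractmethod
    def backup(self, destination: Path) -> Path:
        """Serialize this component's state and return the artifact path."""

    @abstractmethod
    def restore(self, artifact: Path) -> None:
        """Restore this component's state from a previously created artifact."""

class Orchestrator:
    def __init__(self, components: list[BackupComponent]):
        # Dependencies (e.g. Keycloak before the NFS home directories) would be
        # encoded in the ordering of `components`.
        self.components = components

    def backup_all(self, destination: Path) -> dict[str, Path]:
        return {c.name: c.backup(destination) for c in self.components}
```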
While managing other services solely through APIs is feasible, the same cannot be said for the EFS structure, which needs to be treated as its own category. As part of this RFD, we need to spell out which data will be targeted for each of these stored components. Ideally, this would be facilitated through endpoints, if we expose them somehow.
Let's leverage the existing CLI command descriptions already presented in this RFD to ensure that any system we implement in the future can communicate in a way that our CLI—or other necessary tools—can effectively manage.
Exporting data in a serializable format does not necessarily ensure a complete service restoration to its previous state.
To better define these distinctions, it's essential to evaluate the behavior of each service. Exporting state data from one version of a service to another could restore the service's previous structure but not suffice to reproduce the state it was in. To qualify as backup/restore, importing and exporting should ideally recover both the service's original structure and its state. If the provided files fail to restore the original state, the process should not be considered a backup/restore but a mere export/import, often due to the service's limitations or the incompleteness of the files or sources used to "restore" it.
In discussing the RFD, we aim to identify and standardize these necessary components and files, ensuring that our state data are sufficient to equate importing/exporting with backup/restore as much as possible. In scenarios where the service offers robust API support and effectively handles new data, the distinction between backup and export becomes less significant and often negligible.
For example, although listing and restoring the YAML files of namespaced environments from conda-store might let us use those environments again (by rebuilding them), this does not replicate the original builds of those same environments. As discussed, it also does not leverage the previous builds unless we manage to store all the associated databases as well; I would prefer that conda-store handled that itself, and we could work together to develop that capability, but we also need to consider what we can do now.
However, this may only be the case for some services; for instance, Keycloak could adequately support backup and restore through simple import/export functions.
The comments by @viniciusdc are well organized and point the effort in a good direction. I propose the following principles and tactical plan for implementation.
Nebari is a modular and configurable collection of disparate OSS components. This implies certain principles related to the backup/restore effort:
All APIs should be implemented as REST endpoints using administrator access tokens for authentication and accessible only within the VPC. Core atomic API capabilities:
Order of implementation:
1. User accounts (highest priority, because these cannot be recreated). Schema: `username -> [password, [first-name, last-name], [groups]]` (see the sketch after this list).
2. Conda environments (high priority, as these would be very difficult to recreate). Schema: `environment name -> [[package name, version, hash, source URL, retrieval date]]`
3. User code, notebooks, and apps. Nebari should be configured to access and store user-created content via git repos. Reliability should be handled externally via integration with a git provider (GitHub, GitLab, etc.). This is a well-solved problem served by mature tooling and processes.
4. Nebari deployment-wide asynchronous (e.g. cron) jobs. Recurring/cron jobs should be implemented within the platform as user-created apps and stored in git repos accordingly.
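A rough sketch of the first two schemas above as plain data structures (field names follow the text; nothing here is an existing Nebari or Keycloak type):

```python
# Illustrative data shapes for the serialization schemas proposed above.
from dataclasses import dataclass, field

@dataclass
class UserAccount:
    username: str
    password: str                  # in practice, likely a password hash or reset token
    first_name: str
    last_name: str
    groups: list[str] = field(default_factory=list)

@dataclass
class PackageRecord:
    name: str
    version: str
    hash: str
    source_url: str
    retrieval_date: str            # ISO 8601

@dataclass
class CondaEnvironment:
    name: str
    packages: list[PackageRecord] = field(default_factory=list)
```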
Both @viniciusdc and @tylergraff have some excellent thoughts here.
I agree with @tylergraff that having a standardized interface for backups that we can implement for each service is a good plan. That will improve the devex and make things far easier as far as maintainability. @tylergraff's proposed API endpoints would certainly provide coverage, but I would suggest that we go even simpler to start: just have a `/backup/keycloak` endpoint that requires an admin token to access and takes an optional S3-compatible location as an argument. If the location is given, the files are written there; if not, they are just returned to the caller. That would be the simplest implementation imo.
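A minimal sketch of that shape of endpoint, assuming a FastAPI-style service; the token check, the Keycloak export, and the S3 upload are placeholder stubs, not real Nebari or Keycloak calls:

```python
# Sketch: /backup/keycloak endpoint with an admin token and optional S3 target.
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()

def is_admin_token(token: str) -> bool:
    """Placeholder: validate the bearer token against Keycloak."""
    return token.startswith("Bearer ")

def export_keycloak_realm() -> dict:
    """Placeholder: wrap Keycloak's export / Admin REST API and return JSON."""
    return {"users": [], "groups": []}

def upload_to_s3(payload: dict, location: str) -> None:
    """Placeholder: write the payload to an S3-compatible location."""

@app.post("/backup/keycloak")
def backup_keycloak(s3_location: str | None = None,
                    authorization: str = Header(...)):
    if not is_admin_token(authorization):
        raise HTTPException(status_code=403, detail="admin token required")
    payload = export_keycloak_realm()
    if s3_location:
        upload_to_s3(payload, s3_location)
        return {"status": "ok", "location": s3_location}
    return payload  # if no location is given, return the export to the caller
```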
I also agree with @viniciusdc that we should utilize built-in backup mechanisms whenever possible. Keycloak already provides options for backup and restore which can be accessed through its REST API. Rather than reinvent the wheel here, we should wrap that functionality so that it implements our backup interface.
For prioritization, I also agree with @tylergraff. We should first ensure that each service has backup and restore functionality before worrying about any kind of orchestration between backups.
Users and groups are the obvious first backup target, and would be really straightforward to implement since it would just be wrapping Keycloak's REST API.
After that, I would agree with conda-store next. I think conda-store backup should just be a backup of the blob storage in some form and a dump of the postgres db to start with.
Finally, the NFS file system, which I think we can just handle with a tarball.
Restores could be the reverse.
This is not an end state, but it would represent an MVP implementation that users could try out and that we could learn a lot from. Being an MVP, it will also be cheaper and quicker to implement while (hopefully) avoiding going too far down any incorrect paths.
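For illustration, the conda-store and NFS pieces of that MVP could reduce to roughly the following under the hood (the paths, connection string, and destination are placeholders, and the conda-store blob storage copy is omitted here):

```python
# Sketch: dump the conda-store Postgres DB and tar the NFS share.
import subprocess
from pathlib import Path

BACKUP_DIR = Path("/backups")          # placeholder destination
BACKUP_DIR.mkdir(parents=True, exist_ok=True)

# 1. conda-store: pg_dump is all-or-nothing, so a failure leaves no partial dump.
subprocess.run(
    ["pg_dump", "--format=custom",
     "--file", str(BACKUP_DIR / "conda-store.dump"),
     "postgresql://conda_store@postgres:5432/conda-store"],   # placeholder DSN
    check=True,
)

# 2. NFS home directories: a plain tarball of the shared filesystem.
subprocess.run(
    ["tar", "--create", "--gzip",
     "--file", str(BACKUP_DIR / "nfs-home.tar.gz"),
     "/home"],                                                 # placeholder mount
    check=True,
)
```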
I agree with re-using / wrapping existing capabilities, provided that the wrapper adopts a standardized authentication token pattern which would be used across future endpoints.
I'm not convinced of adding an optional S3 bucket for user backups. This adds S3 authentication implementation and administration. It also implies that a single operation would serialize all users to that S3 bucket. Bulk actions can introduce ornery complexities, such as: how to handle a fatal error which occurs after some of the users were backed up to S3? How would we get debug insight into (potentially) which individual user account caused the error?
My opinion is that we should implement list-all, serialize, and deserialize operations only; the latter two operate on single elements (e.g. users). Client-side tooling can perform S3 uploads separately and in a more modular fashion.
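To make that concrete, client-side tooling built on those three operations might look roughly like this (the endpoint paths, bucket, and token are assumptions, not an agreed API), with each user uploaded individually so failures stay attributable:

```python
# Sketch: list users, serialize each one, upload each to S3 separately,
# so an error points at a single user rather than a bulk operation.
import json
import boto3
import requests

API = "https://nebari.example.com/api/backup"        # placeholder backup API base
HEADERS = {"Authorization": "Bearer <admin-token>"}  # placeholder admin token

s3 = boto3.client("s3")
failed = []

usernames = requests.get(f"{API}/users", headers=HEADERS).json()  # "list-all"
for username in usernames:
    try:
        user = requests.get(f"{API}/users/{username}", headers=HEADERS).json()  # "serialize"
        s3.put_object(
            Bucket="my-backup-bucket",               # placeholder bucket
            Key=f"users/{username}.json",
            Body=json.dumps(user).encode(),
        )
    except Exception as exc:
        failed.append((username, str(exc)))          # per-user error visibility

print(f"{len(usernames) - len(failed)} users backed up, {len(failed)} failed: {failed}")
```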
From the comments, I can conclude the following:
Let's start with the requirements for Keycloak. I have a few questions:
> Ex: deserialization of user content should tolerate non-existent users (and vice-versa).

@tylergraff Is the plan to use `nebari restore` to add new users?
On `/backup/keycloak`: I think it's a great idea, and we should do it. However, I am still not convinced that wrapping the Keycloak API is the simplest approach. The simplest approach, IMHO, is to simply back up the entire database and restore from that instead. Let's have a look at all the options:
2.1. Keycloak REST API - docs
PS: @tylergraff, I am going through your last comment just now. Can you explain, in the case of Keycloak, what you would like to see in "list-all, serialize, and deserialize"?
> I'm not convinced of adding an optional S3 bucket for user backups. This adds S3 authentication implementation and administration. It also implies that a single operation would serialize all users to that S3 bucket. Bulk actions can introduce ornery complexities, such as: how to handle a fatal error which occurs after some of the users were backed up to S3? How would we get debug insight into (potentially) which individual user account caused the error?
I agree; whatever solution we pick, it needs to back up all or nothing. Luckily, `pg_dump` behaves like that, so in case of failure we can have the API report the status of the backup as failed, with a reason.
We can always add an option to download the backup asset locally instead of to S3. Will that help?
> My opinion is that we should implement list-all, serialize, and deserialize operations only; the latter two operate on single elements (e.g. users). Client-side tooling can perform S3 uploads separately and in a more modular fashion.
We can expose the Keycloak REST API to authenticated admins. This will allow admins to write client-side tooling to manage users as needed, e.g., adding or removing users.
> Why is the ability to serialize/deserialize Keycloak data useful? ... what you would like to see in "list-all, serialize, and deserialize"?
Let me explain my reasoning and address those together:
My team's DR approach is to incrementally re-build a new Nebari deployment which can be used productively by our customers throughout that rebuild process. We are comfortable with this and are looking to minimize the risk and time (in that order) involved. We are not looking to precisely duplicate a Nebari deployment or its contents. We see substantial risk in the precise replication of internal Nebari state: internal state is opaque to us, may itself be the root cause of a disaster, or may cause a new disaster due to opaque consistency issues with other components. We know that deploying Nebari in an incremental fashion is low risk, because it is something we do frequently.
Our current DR approach is almost entirely manual, and we would like to improve by using automation to decrease the time involved. To reduce risk, it is critical that we retain visibility into (and thus confidence in) the changes effected by automation. We desire an approach of incremental modification, which allows us to understand changes and tailor risk. We want to maximize the observability of system state, allowing the effects of modification to be understood by administrators (who are likely learning as they go). And we’d like to decouple changes, to reduce the risk of unintended consequences.
To answer your questions:
We already have and use the capability to add (deserialize) individual users via endpoint. This is part of our current DR approach, it is low-risk, and we would like to further build on this.
A new capability to list existing users gives us clear visibility into that aspect of a deployment and a starting point to reproduce access to a new deployment.
A new capability to serialize [a critical subset of] a user’s account gives us a solution for user backup/restore that provides flexibility, visibility, and confidence in system state and operation. This also gives us the ability to audit users and/or migrate them to other systems, which could be valuable troubleshooting tools.
Providing these capabilities via Nebari serialize/deserialize endpoints (vs database dump) achieves the goals outlined above, and allows for automation without the need to generate database backup images nor rely on tooling and expertise to inspect them. We also get the ability to easily migrate users to other (potentially newer) systems without performing database migrations on software for which we have minimal experience. This approach also reduces the risk that a restored database contains the root cause of the original disaster, or otherwise introduces a new disaster via internal inconsistencies which are opaque to administrators.
After reviewing the latest RFD contents and reflecting on our internal discussions and community feedback, Approach 3 seems most suited to our needs. As @tylergraff noted:
> We see substantial risk in the precise replication of the internal Nebari state: the internal state is opaque to us, may itself be the root cause of a disaster, or may cause a new disaster due to opaque consistency issues with other components.
Fully replicating Nebari's state can reintroduce the problems that necessitate a restoration, making it a challenging option.
However, I also see significant merits in Approach 2, especially when we consider 'user' as the basic unit for the backup/restore process. This approach offers the flexibility to restart the process after encountering any errors or exceptions, which is a limitation of the bulk process. Nevertheless, this should not be viewed as a separate approach IMO. If we proceed with the REST API approach (Approach 3), we can incorporate both bulk and per-user import/export endpoints.
Combining them lets us optimize the backup/restore workflow while giving the user both options to choose from.
In conclusion, I think everyone is on the same page regarding the serialization and endpoints approach; this should now be voted on as is, and follow-up tasks can be created to start discussing implementation details.
Thanks to everyone for their feedback here. Based on this discussion, we will be moving forward with Approach 3.
Currently, state lives in three main places:
We will create a backup controller within Nebari which will expose backup and restore routes for each of these services. The specifics of each service's backup and restore will be decided on a per-service basis and handled in individual tickets. There seems to be broad consensus that it makes sense to start with Keycloak as the first service to implement this on. @pt247 will open tickets for backup and restore of each service, and we can have specific discussions on the implementation details in those tickets.
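As a closing illustration of that direction, the backup controller's route surface might look roughly like the following (the paths and service names are placeholders to be settled in the per-service tickets):

```python
# Sketch: a possible route surface for the backup controller described above.
from fastapi import FastAPI

app = FastAPI()
SERVICES = ("keycloak", "conda-store", "nfs")   # the three places state lives

@app.post("/backup/{service}")
def backup(service: str):
    # Per-service specifics will be decided in individual tickets; this only
    # sketches the common controller shape.
    return {"service": service, "status": "started"}

@app.post("/restore/{service}")
def restore(service: str):
    return {"service": service, "status": "started"}
```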