qcaas-nhs-sjt commented 3 months ago

Further to our conversation today, it appears that Keycloak may not be the right solution for our needs and may in fact be overcomplicating what we need to accomplish. This ticket outlines the architecture for a new workspace management solution that will ultimately remove the need for Keycloak, so that we can go straight to using Entra ID (Azure Active Directory), or indeed any SSO solution, for managing user identity.

Option 1: CRD Model

The suggestion is that we build two CRDs, the first representing a workspace:

apiVersion: xlscsde.nhs.uk/v1
kind: JupyterWorkspace
metadata:
   name: example-workspace 
   namespace: jupyterhub
spec:
   displayName: Example Workspace
   description: |
      This would be a long description identifying the
      workspace and giving more details surrounding it
   timeframe:
      availableFrom: "2024-02-23" # The date that the workspace can be used from
      expires: "2025-02-23" # The date that the workspace expires
   kubespawner:
      image: lscsde/data-science-notebook:0.1.0
      extraLabels:
          "xlscsde.nhs.uk/exampleLabel": "testing-123" # label values cannot contain spaces
   additionalStorage: []
status:
   statusText: Provisioned
   storage:
     default:
       name: example-workspace
       namespace: jupyterhub
     additional: []
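
As a rough illustration of how the timeframe fields might be consumed, the sketch below checks whether a workspace is usable on a given date. The function name `is_available` is hypothetical, and it assumes both dates are ISO `YYYY-MM-DD` strings with the expiry date treated as exclusive; the spec above does not settle either point.

```python
from datetime import date


def is_available(spec: dict, today: date) -> bool:
    """Return True if `today` falls inside the workspace's timeframe.

    Assumption: availableFrom is inclusive and expires is exclusive.
    """
    timeframe = spec.get("timeframe", {})
    available_from = date.fromisoformat(timeframe["availableFrom"])
    expires = date.fromisoformat(timeframe["expires"])
    return available_from <= today < expires


# Mirrors the timeframe block of the example workspace
example_spec = {
    "timeframe": {"availableFrom": "2024-02-23", "expires": "2025-02-23"}
}
print(is_available(example_spec, date(2024, 6, 1)))
```

Whether an expired workspace should be hidden from the user or shown as unavailable is a separate design decision that this sketch does not address.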

The second CRD would be a workspace to user binding, that says which workspace a user has access to:

apiVersion: xlscsde.nhs.uk/v1
kind: JupyterWorkspaceBinding
metadata:
   name: example-workspace-shaun-turner1-nhs-net # must be a valid DNS subdomain name, so no ":" or "_"
   namespace: jupyterhub
spec:
   userPrincipalId: shaun.turner1@nhs.net
   workspace: example-workspace
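
Kubernetes object names have to be valid DNS subdomain names (lowercase alphanumerics, "-" and "."), so the binding name needs to be derived from the user principal rather than copied from it. A minimal sketch of one way to do that follows; the helper name `binding_name` and the choice of "-" as the replacement character are assumptions, not an agreed convention.

```python
import re


def binding_name(workspace: str, user_principal_id: str) -> str:
    """Build a Kubernetes-safe name for a workspace-to-user binding."""
    # Collapse every run of characters outside [a-z0-9] into a single "-"
    safe_user = re.sub(r"[^a-z0-9]+", "-", user_principal_id.lower()).strip("-")
    name = f"{workspace}-{safe_user}"
    # DNS subdomain names are limited to 253 characters
    if len(name) > 253:
        raise ValueError(f"binding name is too long: {len(name)} characters")
    return name


print(binding_name("example-workspace", "shaun.turner1@nhs.net"))
# prints: example-workspace-shaun-turner1-nhs-net
```

This mapping is lossy (for example "a.b@nhs.net" and "a-b@nhs.net" would collide), so the userPrincipalId in the spec should remain the source of truth and the name should only be treated as an identifier.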

In Kubernetes we'd need to:

Create the CRDs for the above
Create the Roles allowing read and read/write access
Create the role bindings for jupyterhub
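
The Role and RoleBinding steps above could be sketched as plain manifests built in Python. The plural resource names (`jupyterworkspaces`, `jupyterworkspacebindings`), the role name and the `hub` service account are all assumptions for illustration; the real names depend on how the CRDs and the JupyterHub deployment are defined.

```python
import json

GROUP = "xlscsde.nhs.uk"
# Assumption: the plural names the two CRDs would register
RESOURCES = ["jupyterworkspaces", "jupyterworkspacebindings"]


def build_role(name: str, namespace: str, read_only: bool) -> dict:
    """Return a Role manifest granting read or read/write access."""
    verbs = ["get", "list", "watch"]
    if not read_only:
        verbs += ["create", "update", "patch", "delete"]
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {"apiGroups": [GROUP], "resources": list(RESOURCES), "verbs": verbs}
        ],
    }


def build_role_binding(role_name: str, namespace: str, service_account: str) -> dict:
    """Return a RoleBinding manifest tying a service account to a Role."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-{service_account}", "namespace": namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": role_name,
        },
        "subjects": [
            {"kind": "ServiceAccount", "name": service_account, "namespace": namespace}
        ],
    }


# Hypothetical names: a read-only role bound to the hub's service account
reader = build_role("workspace-reader", "jupyterhub", read_only=True)
binding = build_role_binding("workspace-reader", "jupyterhub", "hub")
print(json.dumps(reader, indent=2))
```

JupyterHub itself should only need the read-only role, since it just looks up workspaces and bindings; the read/write role would be for whatever manages the entries.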

We'd create a new Python module called kubespawner_workspace_mgmt, which would provide methods for accessing these from Kubernetes. It will have objects defining a structure equivalent to the above CRDs:

# objects.py
class JupyterWorkspace:
   def __init__(self, workspace_as_map):
      self.display_name = workspace_as_map.get("displayName")
      self.description = workspace_as_map.get("description")
      ...

There will then be classes for reading these from Kubernetes:

# k8sio.py
class JupyterWorkspaceClient:
   def __init__(self, api):
      ...
   def get_workspace(self, name, namespace):
      ...
   def get_workspaces(self, namespace):
      ...
   def get_workspaces_by_user(self, user_principal_id, namespace):
      ...
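
To make the lookup concrete, here is a minimal sketch of how the client might resolve a user's workspaces through their bindings. It assumes `api` behaves like `kubernetes.client.CustomObjectsApi` (only `list_namespaced_custom_object` is used) and that the CRD plurals are `jupyterworkspaces` and `jupyterworkspacebindings`; `FakeCustomObjectsApi` is a hypothetical stand-in so the example runs without a cluster.

```python
GROUP = "xlscsde.nhs.uk"
VERSION = "v1"


class JupyterWorkspaceClient:
    """Reads workspaces and bindings through a CustomObjectsApi-like object."""

    def __init__(self, api):
        # Expected to behave like kubernetes.client.CustomObjectsApi
        self.api = api

    def _list(self, namespace, plural):
        result = self.api.list_namespaced_custom_object(
            GROUP, VERSION, namespace, plural
        )
        return result.get("items", [])

    def get_workspaces(self, namespace):
        return self._list(namespace, "jupyterworkspaces")

    def get_workspaces_by_user(self, user_principal_id, namespace):
        # Collect the workspace names this user is bound to
        wanted = user_principal_id.lower()
        permitted = {
            binding["spec"]["workspace"]
            for binding in self._list(namespace, "jupyterworkspacebindings")
            if binding["spec"]["userPrincipalId"].lower() == wanted
        }
        return [
            workspace
            for workspace in self.get_workspaces(namespace)
            if workspace["metadata"]["name"] in permitted
        ]


class FakeCustomObjectsApi:
    """Hypothetical in-memory stand-in for the Kubernetes API."""

    def __init__(self, objects):
        self.objects = objects  # maps plural name -> list of objects

    def list_namespaced_custom_object(self, group, version, namespace, plural):
        return {"items": self.objects.get(plural, [])}


fake_api = FakeCustomObjectsApi({
    "jupyterworkspaces": [
        {"metadata": {"name": "example-workspace"}, "spec": {}},
        {"metadata": {"name": "other-workspace"}, "spec": {}},
    ],
    "jupyterworkspacebindings": [
        {"spec": {"userPrincipalId": "shaun.turner1@nhs.net",
                  "workspace": "example-workspace"}},
    ],
})
client = JupyterWorkspaceClient(fake_api)
names = [w["metadata"]["name"]
         for w in client.get_workspaces_by_user("shaun.turner1@nhs.net", "jupyterhub")]
print(names)
```

Comparing the principal case-insensitively is a design choice made here because email-style identifiers are often entered with mixed case; it may not match how the identity provider reports them.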

There would also be a helper class to make it easier for the kubespawner to interact with the client:

# spawner.py
class KubespawnerWorkspaceManager:
  def __init__(self, spawner):
     ...
  def get_permitted_workspaces(self):
     ...

The new module would then be referenced by jupyterhub_custom_config.py:

# jupyterhub_custom_config.py
def get_workspaces(spawner):
   workspace_manager = KubespawnerWorkspaceManager(spawner)
   permitted_workspaces = workspace_manager.get_permitted_workspaces()
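
One possible next step is turning the permitted workspaces into KubeSpawner profile entries. The sketch below assumes `get_permitted_workspaces()` returns dictionaries shaped like the JupyterWorkspace resource above; the function name `workspaces_to_profile_list` is hypothetical, while `display_name`, `description`, `slug` and `kubespawner_override` are the keys KubeSpawner uses in `profile_list`.

```python
def workspaces_to_profile_list(workspaces):
    """Convert JupyterWorkspace-shaped dicts into KubeSpawner profiles."""
    profiles = []
    for workspace in workspaces:
        name = workspace["metadata"]["name"]
        spec = workspace.get("spec", {})
        kubespawner = spec.get("kubespawner", {})

        # Only override the settings the workspace actually defines
        override = {}
        if "image" in kubespawner:
            override["image"] = kubespawner["image"]
        if "extraLabels" in kubespawner:
            override["extra_labels"] = dict(kubespawner["extraLabels"])

        profiles.append({
            "display_name": spec.get("displayName", name),
            "description": spec.get("description", ""),
            "slug": name,
            "kubespawner_override": override,
        })
    return profiles


example = [{
    "metadata": {"name": "example-workspace"},
    "spec": {
        "displayName": "Example Workspace",
        "kubespawner": {
            "image": "lscsde/data-science-notebook:0.1.0",
            "extraLabels": {"xlscsde.nhs.uk/exampleLabel": "testing-123"},
        },
    },
}]
profiles = workspaces_to_profile_list(example)
print(profiles[0]["slug"], profiles[0]["kubespawner_override"]["image"])
```

Since KubeSpawner accepts a callable for `profile_list`, `get_workspaces` could return this list and be assigned to `c.KubeSpawner.profile_list`, though that wiring is an assumption about how the config will be structured.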

The advantages of using CRDs are:

They are native to Kubernetes. Developing a CRD is relatively quick, and it already comes with storage, RBAC, auditing and so on.
The CRD functionality is core to Kubernetes, so while it evolves, it is unlikely to change so dramatically that we'd see breaking changes in this functionality.
It is also relatively easy to amend CRDs to add additional functionality as it comes up.
CRD schemas are relatively flexible, so they can cover off a multitude of sins with little actual development.
We could start off managing the entries in Flux itself, and develop a management portal for the solution at our leisure.
We could later potentially also create controllers for the CRDs that automatically take actions when a workspace is created, updated or deleted. For example, we could update the CRD to allow a workspace to be marked for archive, then create a controller that listens for changes; if a workspace is marked for archive that wasn't previously, it could transfer the contents of its workspace PVC into cold storage and tag it for deletion after a specific date.
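
The archive example hinges on detecting a transition rather than a state. A minimal sketch of that check follows, assuming a hypothetical boolean `spec.archive` field (not part of the CRD above) and that the controller sees both the previous and the new version of the workspace.

```python
from typing import Optional


def is_marked_for_archive(workspace: Optional[dict]) -> bool:
    """Return True if the workspace carries the (hypothetical) archive flag."""
    if not workspace:
        return False
    return bool(workspace.get("spec", {}).get("archive", False))


def newly_marked_for_archive(old: Optional[dict], new: dict) -> bool:
    """True only when the flag is set now and was not set before."""
    return is_marked_for_archive(new) and not is_marked_for_archive(old)


before = {"spec": {"displayName": "Example Workspace"}}
after = {"spec": {"displayName": "Example Workspace", "archive": True}}
print(newly_marked_for_archive(before, after))
```

Checking the transition means that repeated update events for an already archived workspace would not trigger the copy to cold storage again.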

The disadvantages are:

We'll have to ensure we have backups of these definitions in case the cluster dies. That said, we should be doing this anyway, as we'd lose a lot more if we lost our PVC and actual volume definitions.
CRDs will be significantly slower under high usage than a traditional RDBMS; however, this will not be a heavily used segment of the system.

Option 2: Database

Another option is that we build a database to manage this instead, presumably on a PostgreSQL server:

CREATE TABLE IF NOT EXISTS jupyter_workspace.workspaces
(
 workspace_id bigint NOT NULL,
 display_name character varying COLLATE pg_catalog."default" not null,
...
);

CREATE TABLE IF NOT EXISTS jupyter_workspace.workspace_storage
(
 workspace_storage_id bigint NOT NULL,
 workspace_id bigint NOT NULL,
 display_name character varying COLLATE pg_catalog."default" not null,
...
);

CREATE TABLE IF NOT EXISTS jupyter_workspace.users
(
  user_id bigint NOT NULL,
  user_principal_id character varying COLLATE pg_catalog."default" NOT NULL
);

CREATE TABLE IF NOT EXISTS jupyter_workspace.workspace_bindings
(
 workspace_id bigint NOT NULL,
 user_id bigint NOT NULL
);

As in option 1, we'd create a new Python module called kubespawner_workspace_mgmt, which would provide methods for accessing these from the database. It will have objects defining a structure equivalent to the above tables:

# objects.py
class JupyterWorkspace:
   def __init__(self, workspace_as_map):
      self.display_name = workspace_as_map.get("displayName")
      self.description = workspace_as_map.get("description")
      ...

There will then be classes for reading these from the database:

# k8sio.py
class JupyterWorkspaceClient:
   def __init__(self, connection_string):
      ...
   def get_workspace(self, name, namespace):
      ...
   def get_workspaces(self, namespace):
      ...
   def get_workspaces_by_user(self, user_principal_id, namespace):
      ...
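
The per-user lookup in this option comes down to a join across the three tables. The sketch below uses an in-memory SQLite database purely as a stand-in for PostgreSQL so that it runs anywhere; it assumes `users` maps `user_id` to `user_principal_id` and `workspace_bindings` maps `workspace_id` to `user_id`, and it drops the `jupyter_workspace` schema prefix.

```python
import sqlite3

# Simplified stand-in schema; column types differ from the PostgreSQL DDL
SCHEMA = """
CREATE TABLE workspaces (workspace_id INTEGER PRIMARY KEY, display_name TEXT NOT NULL);
CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_principal_id TEXT NOT NULL);
CREATE TABLE workspace_bindings (workspace_id INTEGER NOT NULL, user_id INTEGER NOT NULL);
"""


def get_workspaces_by_user(connection, user_principal_id):
    """Return the workspaces bound to the given user principal."""
    rows = connection.execute(
        """
        SELECT w.workspace_id, w.display_name
        FROM workspaces AS w
        JOIN workspace_bindings AS b ON b.workspace_id = w.workspace_id
        JOIN users AS u ON u.user_id = b.user_id
        WHERE u.user_principal_id = ?
        ORDER BY w.workspace_id
        """,
        (user_principal_id,),
    ).fetchall()
    return [{"workspace_id": row[0], "display_name": row[1]} for row in rows]


connection = sqlite3.connect(":memory:")
connection.executescript(SCHEMA)
connection.execute("INSERT INTO workspaces VALUES (1, 'Example Workspace')")
connection.execute("INSERT INTO workspaces VALUES (2, 'Other Workspace')")
connection.execute("INSERT INTO users VALUES (10, 'shaun.turner1@nhs.net')")
connection.execute("INSERT INTO workspace_bindings VALUES (1, 10)")

print(get_workspaces_by_user(connection, "shaun.turner1@nhs.net"))
```

Against PostgreSQL the same query should carry over, although the placeholder style depends on the driver (psycopg, for instance, uses `%s` rather than `?`).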

There would also be a helper class to make it easier for the kubespawner to interact with the client:

# spawner.py
class KubespawnerWorkspaceManager:
  def __init__(self, spawner):
     ...
  def get_permitted_workspaces(self):
     ...

The new module would then be referenced by jupyterhub_custom_config.py:

# jupyterhub_custom_config.py
def get_workspaces(spawner):
   workspace_manager = KubespawnerWorkspaceManager(spawner)
   permitted_workspaces = workspace_manager.get_permitted_workspaces()

We will then also need to develop a backend REST API to serve these to a management portal:

@app.post("/api/workspaces")
def create_or_update_workspace():
   ...

@app.get("/api/workspaces")
def list_workspaces():
   ...

@app.get("/api/workspaces/{id}")
def get_workspace(id):
   ...

@app.get("/api/workspaces/{id}/bindings")
def get_workspace_bindings(id):
   ...

@app.post("/api/workspaces/{id}/bindings")
def create_or_update_workspace_bindings(id):
   ...

We would then build a management portal for managing it.

The advantages of using a managed database are:

It is fairly common in traditional application development models.
Databases are performant due to their constrained nature.

The disadvantages are:

Database schemas are tightly constrained, which makes them rigid. This is beneficial in high-transaction models such as OLTP, but it can also be a limitation; our throughput for this service is not expected to be high, so it is unlikely that we will need that.
We will have to consider how our database build scripts are handled, and how to handle schema and data migrations in the event of schema changes.
We will need to build our own API to wrap around this for the backend and security models.
We are also fairly tied into building a UI for management from the beginning, rather than being able to step into it (unless we trust people to edit database entries directly).
We will need to build our own audit and access controls.
We will also have to consider the impact of updates to database products and drivers on our solution for future maintenance.
Slower development pathway.

qcaas-nhs-sjt commented 3 months ago

@vvcb further to our discussion today I've put together a quick overview of the options here

vvcb commented 3 months ago

@qcaas-nhs-sjt, thank you very much for putting this together. I sense that the CRD solution would be the simplest, most easily explainable and manageable solution of the three (with the third being Keycloak), with easy extensibility built in.

From a maintenance POV, it certainly makes it easier for me as it keeps this within k8s.

From an audit trail POV, the CRDs can be version controlled and will fit in with the existing Flux-based workflow. And I love the idea of using the CRDs to do any number of additional operations on workspaces as you describe - all under version control.

This also solves the expiring token issue we have seen with federated auth between Keycloak and Entra.

Looks like we have a clear winner.

qcaas-nhs-sjt commented 3 months ago

@vvcb great, I'll set about creating the tickets to get this work done. Can I assume that this will take priority over the work on the OHDSI applications?

qcaas-nhs-sjt commented 3 months ago

@vvcb Per our conversation this morning, this is the current priority

qcaas-nhs-sjt commented 3 months ago

This has been agreed and an initial version is deployed.

lsc-sde / lsc-sde

redefine workspace management in jupyter hub #21
