lsc-sde / lsc-sde

Lancashire and South Cumbria Secure Data Environment
MIT License

redefine workspace management in jupyter hub #21

Closed qcaas-nhs-sjt closed 3 months ago

qcaas-nhs-sjt commented 3 months ago

Further to our conversation today, it appears that Keycloak may not be the right solution for our needs and may in fact be overcomplicating what we need to accomplish. This ticket outlines the architecture for a new workspace management solution that will ultimately remove the need for Keycloak, so that we can go straight to using Entra ID (Azure Active Directory), or indeed any SSO solution, for managing user identity.

Option 1: CRD Model

The suggestion is that we build two CRDs, the first representing a workspace:

apiVersion: xlscsde.nhs.uk/v1
kind: JupyterWorkspace
metadata:
   name: example-workspace 
   namespace: jupyterhub
spec:
   displayName: Example Workspace
   description: |
      This would be a long description identifying the
      workspace and giving more details surrounding it
   timeframe:
      availableFrom: "2024-02-23" # The date that the workspace can be used from
      expires: "2025-02-23" # The date that the workspace expires
   kubespawner:
      image: lscsde/data-science-notebook:0.1.0
      extraLabels:
          "xlscsde.nhs.uk/exampleLabel": "testing 123"
   additionalStorage: []
status:
   statusText: Provisioned
   storage:
     default:
       name: example-workspace
       namespace: jupyterhub
     additional: []

The second CRD would be a workspace-to-user binding that says which workspace a user has access to:

apiVersion: xlscsde.nhs.uk/v1
kind: JupyterWorkspaceBinding
metadata:
   name: example-workspace:shaun_turner1___nhs_net
   namespace: jupyterhub
spec:
   userPrincipalId: shaun.turner1@nhs.net
   workspace: example-workspace

In kubernetes we'd need to:

We'd create a new Python module called kubespawner_workspace_mgmt, which would provide methods for accessing these resources from Kubernetes. It will have objects defining a structure equivalent to the above CRDs:

# objects.py
class JupyterWorkspace:
   def __init__(self, workspace_as_map):
      self.display_name = workspace_as_map.get("displayName")
      self.description = workspace_as_map.get("description")
      ...
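
To make the object model a little more concrete, here is a minimal sketch of how the workspace object could be filled in from the resource shown above. It is only an illustration, not the agreed design: the `from_resource` constructor, the `is_available` helper and the choice to treat the `expires` date as the last usable day are all assumptions on my part.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class JupyterWorkspace:
    name: str
    display_name: Optional[str]
    description: Optional[str]
    available_from: Optional[date]
    expires: Optional[date]
    image: Optional[str]

    @classmethod
    def from_resource(cls, resource: dict) -> "JupyterWorkspace":
        """Build a workspace from the full resource as a dict (hypothetical helper)."""
        spec = resource.get("spec", {})
        timeframe = spec.get("timeframe", {})
        kubespawner = spec.get("kubespawner", {})

        def parse(value: Optional[str]) -> Optional[date]:
            # Dates in the example are ISO formatted strings such as "2024-02-23".
            return date.fromisoformat(value) if value else None

        return cls(
            name=resource.get("metadata", {}).get("name", ""),
            display_name=spec.get("displayName"),
            description=spec.get("description"),
            available_from=parse(timeframe.get("availableFrom")),
            expires=parse(timeframe.get("expires")),
            image=kubespawner.get("image"),
        )

    def is_available(self, today: date) -> bool:
        """Assumption: usable from availableFrom up to and including expires."""
        if self.available_from and today < self.available_from:
            return False
        if self.expires and today > self.expires:
            return False
        return True
```

Keeping the date check on the object means the spawner side only has to ask `workspace.is_available(date.today())` before offering a workspace to the user.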

There will then be classes for reading these from kubernetes:

# k8sio.py
class JupyterWorkspaceClient:
   def __init__(self, api):
      ...
   def get_workspace(self, name, namespace):
      ...
   def get_workspaces(self, namespace):
      ...
   def get_workspaces_by_user(self, user_principal_id, namespace):
      ...
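
As a rough sketch of what the client could look like, assuming `api` behaves like `kubernetes.client.CustomObjectsApi` from the official Python client. The plural resource names `jupyterworkspaces` and `jupyterworkspacebindings` are guesses, since the CRD definitions themselves are not written yet, and matching bindings on an exact `userPrincipalId` string is also an assumption.

```python
GROUP = "xlscsde.nhs.uk"
VERSION = "v1"
WORKSPACE_PLURAL = "jupyterworkspaces"        # assumed plural name
BINDING_PLURAL = "jupyterworkspacebindings"   # assumed plural name


class JupyterWorkspaceClient:
    def __init__(self, api):
        # api is expected to behave like kubernetes.client.CustomObjectsApi
        self.api = api

    def get_workspace(self, name, namespace):
        return self.api.get_namespaced_custom_object(
            group=GROUP, version=VERSION, namespace=namespace,
            plural=WORKSPACE_PLURAL, name=name,
        )

    def get_workspaces(self, namespace):
        result = self.api.list_namespaced_custom_object(
            group=GROUP, version=VERSION, namespace=namespace,
            plural=WORKSPACE_PLURAL,
        )
        return result.get("items", [])

    def get_workspaces_by_user(self, user_principal_id, namespace):
        bindings = self.api.list_namespaced_custom_object(
            group=GROUP, version=VERSION, namespace=namespace,
            plural=BINDING_PLURAL,
        ).get("items", [])
        # Collect the workspace names this user is bound to.
        permitted = {
            binding["spec"]["workspace"]
            for binding in bindings
            if binding.get("spec", {}).get("userPrincipalId") == user_principal_id
        }
        return [
            workspace
            for workspace in self.get_workspaces(namespace)
            if workspace["metadata"]["name"] in permitted
        ]
```

Passing the API object in through the constructor keeps the class easy to exercise with a stand-in object, without needing a cluster.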

There would also be a helper class to make it easier for the kubespawner to interact with the client:

# spawner.py
class KubespawnerWorkspaceManager:
  def __init__(self, spawner):
     ...
  def get_permitted_workspaces(self):
     ...
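
One possible shape for the helper, as a sketch only. It assumes that `spawner.user.name` carries the user principal ID (which depends on how the authenticator is configured) and that `spawner.namespace` is the namespace holding the workspace resources. The second constructor argument, `client`, is an addition for illustration so the lookup can be swapped out; the outline above only takes the spawner.

```python
class KubespawnerWorkspaceManager:
    def __init__(self, spawner, client):
        # client: any object offering get_workspaces_by_user(user_principal_id, namespace)
        self.spawner = spawner
        self.client = client

    def get_permitted_workspaces(self):
        # Assumption: the hub username is the user principal ID used in the bindings.
        user_principal_id = self.spawner.user.name
        # Assumption: workspaces live in the same namespace the spawner uses.
        namespace = self.spawner.namespace
        return self.client.get_workspaces_by_user(user_principal_id, namespace)
```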

The new module would then be referenced by jupyterhub_custom_config.py:

# jupyterhub_custom_config.py
def get_workspaces(spawner):
   workspace_manager = KubespawnerWorkspaceManager(spawner)
   permitted_workspaces = workspace_manager.get_permitted_workspaces()
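
The function above stops once it has the permitted workspaces. A sketch of one way to finish it, assuming the workspaces are offered to the user through KubeSpawner's `profile_list`, is to map each workspace onto a profile entry. The helper names `workspace_to_profile` and `build_profile_list` are made up for this example, and it assumes each workspace arrives as the raw resource dict.

```python
def workspace_to_profile(workspace):
    """Map one workspace resource (as a dict) onto a KubeSpawner profile entry."""
    spec = workspace.get("spec", {})
    kubespawner = spec.get("kubespawner", {})
    name = workspace["metadata"]["name"]

    # Only override the settings the workspace actually specifies.
    override = {}
    if "image" in kubespawner:
        override["image"] = kubespawner["image"]
    if "extraLabels" in kubespawner:
        override["extra_labels"] = dict(kubespawner["extraLabels"])

    return {
        "display_name": spec.get("displayName", name),
        "slug": name,
        "description": spec.get("description", ""),
        "kubespawner_override": override,
    }


def build_profile_list(permitted_workspaces):
    return [workspace_to_profile(workspace) for workspace in permitted_workspaces]
```

If we go this way, `get_workspaces` could return `build_profile_list(permitted_workspaces)` and be assigned to `c.KubeSpawner.profile_list`, which accepts a callable taking the spawner.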

The advantages of using CRDs are:

The disadvantages are:

Option 2: Database

Another option is that we build a database to manage this instead, presumably on postgresql server:

CREATE TABLE IF NOT EXISTS jupyter_workspace.workspaces
(
 workspace_id bigint NOT NULL,
 display_name character varying COLLATE pg_catalog."default" not null,
...
);

CREATE TABLE IF NOT EXISTS jupyter_workspace.workspace_storage
(
 workspace_storage_id bigint NOT NULL,
 workspace_id bigint NOT NULL,
 display_name character varying COLLATE pg_catalog."default" not null,
...
);

CREATE TABLE IF NOT EXISTS jupyter_workspace.users
(
 user_id bigint NOT NULL,
 user_principal_id character varying COLLATE pg_catalog."default" NOT NULL
);

CREATE TABLE IF NOT EXISTS jupyter_workspace.workspace_bindings
(
 workspace_id bigint NOT NULL,
 user_id bigint NOT NULL
);

As in option 1, we'd create a new Python module called kubespawner_workspace_mgmt, which would provide methods for accessing these from the database. It will have objects defining a structure equivalent to the above tables:

# objects.py
class JupyterWorkspace:
   def __init__(self, workspace_as_map):
      self.display_name = workspace_as_map.get("displayName")
      self.description = workspace_as_map.get("description")
      ...

There will then be classes for reading these from the database:

# k8sio.py
class JupyterWorkspaceClient:
   def __init__(self, connection_string):
      ...
   def get_workspace(self, name, namespace):
      ...
   def get_workspaces(self, namespace):
      ...
   def get_workspaces_by_user(self, user_principal_id, namespace):
      ...
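
For the database variant, the lookup by user comes down to a join across the three tables. The sketch below uses an in-memory SQLite database purely as a stand-in for PostgreSQL so that it runs anywhere. It assumes `users` maps `user_id` to `user_principal_id` and `workspace_bindings` links `workspace_id` to `user_id`, drops the `jupyter_workspace` schema prefix, and uses SQLite's `?` placeholder. The sample rows are invented.

```python
import sqlite3

WORKSPACES_BY_USER = """
SELECT w.workspace_id, w.display_name
FROM workspaces AS w
JOIN workspace_bindings AS b ON b.workspace_id = w.workspace_id
JOIN users AS u ON u.user_id = b.user_id
WHERE u.user_principal_id = ?
ORDER BY w.workspace_id
"""


def get_workspaces_by_user(connection, user_principal_id):
    # Parameterised query, so the principal ID is never interpolated into the SQL.
    rows = connection.execute(WORKSPACES_BY_USER, (user_principal_id,)).fetchall()
    return [{"workspace_id": row[0], "display_name": row[1]} for row in rows]


def make_demo_connection():
    """Create a throwaway database with invented sample rows."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(
        """
        CREATE TABLE workspaces (workspace_id INTEGER PRIMARY KEY, display_name TEXT NOT NULL);
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_principal_id TEXT NOT NULL);
        CREATE TABLE workspace_bindings (workspace_id INTEGER NOT NULL, user_id INTEGER NOT NULL);
        INSERT INTO workspaces VALUES (1, 'Example Workspace'), (2, 'Other Workspace');
        INSERT INTO users VALUES (10, 'shaun.turner1@nhs.net');
        INSERT INTO workspace_bindings VALUES (1, 10);
        """
    )
    return connection
```

A real client would take the connection string as in the outline above and open a PostgreSQL connection instead; only the query shape is the point here.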

There would also be a helper class to make it easier for the kubespawner to interact with the client:

# spawner.py
class KubespawnerWorkspaceManager:
  def __init__(self, spawner):
     ...
  def get_permitted_workspaces(self):
     ...

The new module would then be referenced by jupyterhub_custom_config.py:

# jupyterhub_custom_config.py
def get_workspaces(spawner):
   workspace_manager = KubespawnerWorkspaceManager(spawner)
   permitted_workspaces = workspace_manager.get_permitted_workspaces()

We will then also need to develop a backend REST API to serve these to a management portal:

@app.post("/api/workspaces")
def create_or_update_workspace():
   ...

@app.get("/api/workspaces")
def list_workspaces():
   ...

@app.get("/api/workspaces/{id}")
def get_workspace():
   ...

@app.get("/api/workspaces/{id}/bindings")
def get_workspace_bindings():
   ...

@app.post("/api/workspaces/{id}/bindings")
def create_or_update_workspace_bindings():
   ...

We would then build a management portal for managing it.

The advantages of using a managed database are:

The disadvantages are:

qcaas-nhs-sjt commented 3 months ago

@vvcb further to our discussion today I've put together a quick overview of the options here

vvcb commented 3 months ago

@qcaas-nhs-sjt, thank you very much for putting this together. I sense that the CRD solution would be the simplest, most easily explainable and manageable solution of the three (with the third being KeyCloak), with easy extensibility built in.

From a maintenance POV, it certainly makes it easier for me as it keeps this within k8s.

From an audit trail POV, the CRDs can be version controlled and will fit in with the existing Flux-based workflow. And I love the idea of using the CRDs to do any number of additional operations on workspaces as you describe - all under version control.

This also solves the expiring token issue we have seen with federated auth between KeyCloak and Entra.

Looks like we have a clear winner.

qcaas-nhs-sjt commented 3 months ago

@vvcb great, I'll set about creating the tickets to get this work done. Can I assume that this will take priority over the work on the OHDSI applications?

qcaas-nhs-sjt commented 3 months ago

@vvcb Per our conversation this morning, this is the current priority

qcaas-nhs-sjt commented 3 months ago

This has been agreed and an initial version is deployed.