[ENH] - Add Argo Workflows integration to Qhub

Adam-D-Lewis commented 2 years ago

This issue will document the progress/choices made in integrating Argo Workflows with QHub.

Adam-D-Lewis commented 2 years ago

I've read these resources:

Some of my takeaways are:

I think we should do a namespace or managed namespace install (Argo can only execute workflows in 1 namespace rather than on all namespaces) of Argo Workflows since I've seen deployments of Qhub which occur alongside unrelated k8s client resources. So doing a namespace install would help maintain some isolation there. If we do so, we'd need to not allow the users to create workflows in the same namespace as the Argo Deployment since this introduces the ability for the user to create workflows which modify the argo workflow controller itself (according to Argo Workflows docs). So maybe we deploy Qhub in e.g. dev and all the workflows run in dev-workflows namespace.
The docs also say that using emissary executor + containerset is more secure than using a Docker executor since it can't run as root, and faster since all the containers in a workflow run on the same pod, and don't need to communicate through the k8s api. This could be good to try out, but not as important. The default executor and pod security context can be set with default workflow spec (https://argoproj.github.io/argo-workflows/default-workflow-specs/).
TLS is disabled by default on Argo Workflows Helm chart. I imagine we should enable this at some point, but I think it'd be better to get it running without it first.

Argo Server supports SSO. Can and should we use Keycloak to enable users to log into the web UI? The helm options for configuration are below:

# -- SSO configuration when SSO is specified as a server auth mode.    
sso: {}    
## All the values are required. SSO is activated by adding --auth-mode=sso    
## to the server command line.    
#    
## The root URL of the OIDC identity provider.    
# issuer: https://accounts.google.com    
## Name of a secret and a key in it to retrieve the app OIDC client ID from.    
# clientId:    
#   name: argo-server-sso    
#   key: client-id    
## Name of a secret and a key in it to retrieve the app OIDC client secret from.    
# clientSecret:    
#   name: argo-server-sso    
#   key: client-secret    
## The OIDC redirect URL. Should be in the form <argo-root-url>/oauth2/callback.    
# redirectUrl: https://argo/oauth2/callback    
# rbac:    
#   enabled: true    
## When present, restricts secrets the server can read to a given list.    
## You can use it to restrict the server to only be able to access the    
## service account token secrets that are associated with service accounts    
## used for authorization.    
#   secretWhitelist: []    
## Scopes requested from the SSO ID provider.  The 'groups' scope requests    
## group membership information, which is usually used for authorization    
## decisions.    
# scopes:    
# - groups

Eventually, we'll want to enable Prometheus to collect data from Argo Server. There are some helm chart options around this as well.

There are further additions that we may want to add eventually like artifact repository, adding a database to offload large workflows (https://argoproj.github.io/argo-workflows/offloading-large-workflows/), but we can get to it later.

costrouc commented 2 years ago

@Adam-D-Lewis great points.

I think we should do a namespace or managed namespace install

Totally agree. In fact I would have liked to do this with other qhub resources as well. For example the jupyterlab and dask workers being launched for users. There isn't much reason they need to be in the same namespace as other pods. I think argo would be a good test of this.

The docs also say that using emissary executor + containerset is more secure than using a Doc...

I don't know enough about this. But I'd guess that after we have a stable argo deployment we could investigate this. We should wait to implement this if it adds any complexity for our deployment.

TLS is disabled by default on Argo Workflows Helm chart. I imagine we should enable this at some point, but I think it'd be better to get it running without it first.

I'd like all our components to communicate over tls.. but I don't know exactly how to do this. I agree this would be nice to have but not something we can implement in the short term.

Argo Server supports SSO. Can and should we use Keycloak to enable users to log into the web UI? The helm options for configuration are below:

Yes we should absolutely do this in our deployment and configure the roles/groups appropriately. I think this is essential to the complete deployment.

Eventually, we'll want to enable Prometheus to collect data from Argo Server. There are some helm chart options around this as well.

Yeup we should enable this and I think this is part of the complete deployment.

Adam-D-Lewis commented 2 years ago

I think there are many different points at which we could call this "done". I think the minimum done does the following:

Minimum addition of Argo Workflows

[x] Ability to log in to Argo server with Keycloak (under review)
[ ] Can deploy sample workflows in the UI
- [ ] long running
- [ ] scheduled
[x] Can view workflow results in UI
[ ] Docs explain how to do the above
[ ] Prometheus collects Argo Server stats
- [ ] blocked by https://github.com/Quansight/qhub/issues/1127
[ ] Reviewed and merged into main branch

I think we should add Argo in the following order. Minimum done is Step 1

Step 1

[x] Add default helm chart deployment to Nebari
- [x] expose Argo UI at /argo endpoint
[x] use Keycloak for Auth on Argo Server (under review)
[x] deploy argo in same namespace as rest of Qhub
[x] give UI permission to start workflows also (don't address risk of user creating a workflow that modifies argo at this time (cluster install has the same issue))
[x] Add docs on how to enable/use Argo Workflows in Nebari
[ ] Add prometheus integration + simple dashboard
- blocked by https://github.com/Quansight/qhub/issues/1127

Step 2

[ ] add default security context
- [ ] use emissary executor by default
set up argo roles
fine tune permissions
deploy argo in it's own namespace
- set up cross namespace communication so Prometheus can scrape Argo metrics still

Step 3

turn on TLS for Argo
Add Artifact repo
Add database

iameskild commented 2 years ago

Thanks a lot for putting this together @Adam-D-Lewis! This gives us a clear target to aim for.

I do have a question around how we would go about "Can deploy Rich's sample workflows in the UI", does this not depend on having some user-facing solution in place? Jupyterflow, Yason etc?

Adam-D-Lewis commented 2 years ago

@iameskild

I do have a question around how we would go about "Can deploy Rich's sample workflows in the UI", does this not depend on having some user-facing solution in place? Jupyterflow, Yason etc?

Good point, maybe we need to just use dummy workflows (long running and scheduled) for now. We'd run his workflows later when we get to adding a front end.

Adam-D-Lewis commented 2 years ago

@iameskild and I are getting reprioritized on another issue, but we hope to get this to a point where it's added into Qhub. However, because there is a blocker on the Prometheus integration above, I suggest we finish integrating the argo server deployment with keycloak's permissions, add some rudimentary docs and try to merge leaving Prometheus and step 2 and 3 above for future work.

trallard commented 2 years ago

AFAIK this is now enable (at least the "backend" or the orchestrator) does this issue need closing?

iameskild commented 2 years ago

I'm fine with closing this. @viniciusdc and I did some preliminary research for the "front-end" interface. We can open another issue to track that discussion 👍

iameskild commented 2 years ago

Closing as completed. Opening #1495 to track front-end proposals.

nebari-dev / nebari