reanahub / reana

REANA: Reusable research data analysis platform
https://docs.reana.io
MIT License

Setting up on Google Cloud #356

Open · arokem opened this issue 6 years ago

arokem commented 6 years ago

Hello! We are interested in setting up a REANA cluster on Google Cloud Platform (GCP).

We followed the instructions in the Zero to JupyterHub documentation (https://zero-to-jupyterhub.readthedocs.io/en/stable/) to set up a Kubernetes cluster, and then followed the instructions here: https://reana-cluster.readthedocs.io/en/latest/gettingstarted.html#deploy-locally, but instead of using Minikube, we pointed it at our cluster-in-the-clouds. Pretty quickly, we discovered that we can't write to /reana on these cloud machines (see https://cloud.google.com/container-optimized-os/docs/concepts/security): all the pods come crashing down as soon as they try writing into this directory. So we edited the provided default configuration (https://reana-cluster.readthedocs.io/en/latest/userguide.html#configure-reana-cluster) to point to /etc/reana, which is writable. This solved most of the problems. The one remaining issue is that the database pod is still crashing. The logs in this pod are:

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... 
FATAL:  could not write to file "pg_xlog/xlogtemp.29": No space left on device
child process exited with exit code 1
initdb: removing contents of data directory "/var/lib/postgresql/data"

Which suggests that maybe it's still trying to write to a disallowed location.

We're not necessarily expecting you to fix this if it's not currently on your roadmap, but we thought it would be good to raise it, and at least document our experiments for future experimenters seeking guidance.

But of course: your thoughts would be appreciated. Thanks!

lukasheinrich commented 6 years ago

I haven't been involved in the development lately, so I might not be of much help, but I'm pretty sure you need distributed storage available. At CERN we use volumes provided by CephFS, which support the ReadWriteMany access mode (see this table: https://kubernetes.io/docs/concepts/storage/persistent-volumes/). I think on GCP the only option available is Cloud Filestore (https://cloud.google.com/filestore/docs/accessing-fileshares), but I haven't tried it yet. Maybe @diegodelemos or @tiborsimko can comment on whether a shared FS (or even Ceph) is still a hard requirement.
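For reference, a shared volume with that access mode is requested in Kubernetes through a PersistentVolumeClaim. A minimal sketch, assuming GKE's Filestore CSI driver and its `standard-rwx` storage class (both are assumptions; check what your cluster actually offers with `kubectl get storageclass`):

```yaml
# Hypothetical PVC for a REANA shared volume backed by Cloud Filestore.
# The storage class name "standard-rwx" is an assumption (it is the class
# the GKE Filestore CSI driver typically provides); verify it with
# `kubectl get storageclass` on your cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: reana-shared-volume
spec:
  accessModes:
    - ReadWriteMany        # required so multiple pods can mount it read-write
  storageClassName: standard-rwx
  resources:
    requests:
      storage: 1Ti         # Filestore instances have a large minimum size
```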

In any case: happy to see people interested in deploying REANA; we'll try to help as much as we can!

tiborsimko commented 5 years ago

> Which suggests that maybe it's still trying to write to a disallowed location.

@arokem The DB pod error might indeed be connected to writing to a disallowed location... We are using the /reana and /reanadb locations in the default configurations. Perhaps you have changed the former but not the latter?

$ git grep reanadb
reana_cluster/configurations/reana-cluster-dev.yaml:  db_persistence_path: "/reanadb"
reana_cluster/configurations/reana-cluster-latest.yaml:  db_persistence_path: "/reanadb"
reana_cluster/configurations/reana-cluster.yaml:  db_persistence_path: "/reanadb"
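Following that hint, the fix would be to override the DB persistence path in a custom cluster configuration as well, mirroring the /etc/reana change. A sketch (the `db_persistence_path` key name comes from the grep output above; the surrounding structure and the `/etc/reanadb` path are assumptions to be checked against your reana-cluster.yaml):

```yaml
# Sketch of a custom reana-cluster configuration override: point the DB
# persistence path at a writable location on Container-Optimized OS.
# Key name taken from the grep above; nesting is assumed to follow the
# shipped reana-cluster.yaml.
cluster:
  db_persistence_path: "/etc/reanadb"
```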

Alternatively, you could also switch to using a DB instance outside of the cluster.

P.S. We should perhaps switch /reana and /reanadb to some more reasonable defaults...

tiborsimko commented 5 years ago

> I think on GCP the only option available is Cloud Filestore

Indeed, REANA needs a shared filesystem at this stage. Support for other storage backends, such as S3, is planned for later.

We have not yet tried installing on GCP, but it would definitely be interesting to provide runnable configurations out of the box!

elibixby commented 2 years ago

FYI I am currently trying to get this running on GKE (v1.22).

Currently trying to get the bare bones running (ingress, quota, etc. turned off).

Some sticking points:

I might have missed options in the config that allow for this. If you're interested in contributions, I'd happily contribute some documentation etc. if people can help me with PRs as I work out these issues.

Some far future things I'm interested in:

tiborsimko commented 2 years ago

@elibixby Thanks for reaching out! This issue is quite old, so let me share a short update on REANA-on-GKE status since 2018.

About a year or two ago we tested a small REANA deployment on GKE, targeting mostly single-node deployments. The aim was just to test the general applicability of our Helm charts on various platforms. Everything worked well. This year we are about to start work on a bigger GKE deployment for an ATLAS physics use case (CC @lukasheinrich), which will need many nodes. So your message is very timely!

Here are a few technical notes. For example, to use an external database service instead of the in-cluster one, you can configure:

db_env_config:
  REANA_DB_NAME: "reana"
  REANA_DB_HOST: "db.example.org"
  REANA_DB_PORT: "5432"

and then disable the "internal" reana-db component:

components:
  reana_db:
    enabled: false

and introduce corresponding secrets for REANA_DB_USERNAME and REANA_DB_PASSWORD:

secrets:
  database:
    user: *******
    password: *********

This should be enough to make DB-as-external-service usable. We can update our documentation with a more detailed recipe if you are interested.
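Putting the three fragments above together, a single values override might look like the following (db.example.org and the credentials are placeholders, and the exact key layout should be verified against the chart's values.yaml before use):

```yaml
# Hypothetical combined Helm values file for a DB-as-external-service
# deployment, assembled from the fragments above. Host and credentials
# are placeholders; check key paths against the chart's values.yaml.
db_env_config:
  REANA_DB_NAME: "reana"
  REANA_DB_HOST: "db.example.org"   # placeholder hostname
  REANA_DB_PORT: "5432"
components:
  reana_db:
    enabled: false                  # use the external DB instead
secrets:
  database:
    user: "reana"                   # placeholder credentials
    password: "change-me"
```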

(BTW FWIW we have been using both DB-as-external-service and DB-as-internal-pod and the latter technique was working quite well for some of our deployments. But our primary mode of operation is DB-as-external-service as well.)

If you have some GKE documentation recipes and/or code to contribute, we'll be naturally happy to collaborate!

elibixby commented 2 years ago

> WRT auth, REANA currently offers either local accounts or CERN-specific SSO. However, CERN has a new OIDC-based authn/authz system in place, which we were thinking of migrating towards later in the year. If OIDC would be OK for your needs, there could be some synergy there, as for the storage need synergies.

My ideal solution is an "authless" mode where I can put something like https://github.com/travisghansen/external-auth-server/ in front of the API/UI and manage user quotas and auth myself.
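For what it's worth, with an ingress-nginx-based setup, that kind of fronting is usually wired up via the external-authentication annotations on the ingress rather than in the application itself. A sketch (the ingress name, host, backend service name, and auth URLs are all placeholders; the `auth-url`/`auth-signin` annotations are ingress-nginx's standard external-auth mechanism):

```yaml
# Hypothetical Ingress fronting the REANA UI/API with an external auth
# service such as external-auth-server. All names and URLs below are
# placeholders; only the annotation keys are the real ingress-nginx
# external-authentication knobs.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: reana-ingress
  annotations:
    nginx.ingress.kubernetes.io/auth-url: "https://auth.example.org/verify"
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.example.org/signin"
spec:
  ingressClassName: nginx
  rules:
    - host: reana.example.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: reana-server    # assumed service name
                port:
                  number: 80
```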

A "nice to have" would be to allow mapping forwarded user IDs to namespaces and service accounts, to better isolate workflows from each other, and then use cluster quotas.
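That isolation idea maps naturally onto stock Kubernetes objects: one namespace per user with a ResourceQuota attached, so workflows scheduled into the namespace are capped. A sketch, with all names hypothetical:

```yaml
# Hypothetical per-user namespace plus quota; names and limits are
# illustrative only. Pods created in reana-user-alice would be rejected
# once the aggregate requests exceed the quota.
apiVersion: v1
kind: Namespace
metadata:
  name: reana-user-alice
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: user-quota
  namespace: reana-user-alice
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    pods: "20"
```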

lukasheinrich commented 2 years ago

Hi @elibixby, thanks for your interest. As @tiborsimko said, we're in the process of working with some folks at Google to deploy REANA on GCP, and it'd be great to learn more about your use case. Would you be interested in sharing a short slide deck or similar in a call? (Feel free to reach out at lukas.heinrich at cern dot ch.)

elibixby commented 2 years ago

Hey @lukasheinrich, I got it working without much trouble in the end.

Main hiccups besides those above were:

I'd be happy to get on a call and discuss my use cases if you're interested. I'll shoot you an email from eli at cradle dot bio.