Thanks Outerbounds team for these resources.

I was interested in the k8s helm chart, since I wanted a relatively light-weight (and lower-cost) way of standing up Metaflow infrastructure on a civo.com cluster. I got most things working, but wanted to capture here some of the tweaks I needed to make along the way. It could be that the current helm chart is geared more towards running against a local cluster, so I haven't put any of this into a pull request, but I can help on that front if needed.
My set-up includes a minio service running within my cluster as well as argo-workflows. These were some of my observations:
I updated the metaflow_metadata_service image tag to v2.3.3 in metaflow-service/templates/values.yaml. In metaflow-ui/templates/values.yaml, I used the image public.ecr.aws/outerbounds/metaflow_metadata_service with tag v2.3.3 and updated the tag on public.ecr.aws/outerbounds/metaflow_ui to v1.1.4.
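The overrides are just the image name/tag fields; the key names below are an assumption following the usual Helm image convention, so check the chart's own values.yaml for the exact layout:

```yaml
# sketch only -- exact keys depend on the chart's values.yaml
image:
  name: public.ecr.aws/outerbounds/metaflow_ui
  tag: v1.1.4
backend:
  image:
    name: public.ecr.aws/outerbounds/metaflow_metadata_service
    tag: v2.3.3
```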
It turns out, based on here, that metaflow-ui/templates/static_deployment.yaml needs to be amended to include three extra lines (marked with + below). This was the reason the UI always showed a red unable-to-connect status. After making the change, run helm upgrade metaflow metaflow/ and the UI works.
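The original diff isn't reproduced here; a plausible sketch, assuming the fix is an env entry on the UI static container pointing it at the metadata-service API (the metaflow_ui image reads a METAFLOW_SERVICE variable), would be something like the following — the variable name and value are my assumption, so use whatever the linked issue shows for your set-up:

```yaml
# sketch of the kind of addition to the container spec in static_deployment.yaml
env:
  - name: METAFLOW_SERVICE
    value: "http://localhost:8083/api"  # wherever the browser can reach the backend API
```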
It seems Metaflow needs a combination of AWS_ACCESS_* variables as well as ~/.aws/credentials, since boto3 or the awscli are sometimes used within the code. I found that newly-spawned containers were unable to access s3 (in our case Minio) to download a code package based on the env variables from the ~/.metaflowconfig config file alone. So I needed to add a volumes/volumeMounts section to the metaflow-service/templates/deployment.yaml file; the effect is to create a /.aws/credentials file within a spawned container. I did the same for the metaflow-ui/templates/backend_deployment.yaml file, for good measure (see + lines below). For this volumes/volumeMounts section, a corresponding aws-secret.yaml secret needs to be added to the cluster with kubectl apply -f aws-secret.yaml.
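A minimal sketch of the shape of that change — the secret name, key, and profile name are assumptions (the profile matches the AWS_PROFILE=minio-metaflow used in the run commands below), so adapt them to your own set-up:

```yaml
# aws-secret.yaml (apply with: kubectl apply -f aws-secret.yaml)
apiVersion: v1
kind: Secret
metadata:
  name: aws-secret
type: Opaque
stringData:
  credentials: |
    [minio-metaflow]
    aws_access_key_id = <minio-access-key>
    aws_secret_access_key = <minio-secret-key>
```

```yaml
# excerpt from the deployment's pod spec: mounting the secret at /.aws
# yields a /.aws/credentials file inside the container
      containers:
        - name: metaflow-service
          # ...existing container fields...
          volumeMounts:
            - name: aws-credentials
              mountPath: /.aws
              readOnly: true
      volumes:
        - name: aws-credentials
          secret:
            secretName: aws-secret
```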
I tweaked the scripts/forward_metaflow_ports.py file to handle port-forwarding for minio too. When also running argo-workflows, I run it like: python metaflow-tools/scripts/forward_metaflow_ports.py --include-minio --include-argo
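For anyone not using the script, the additions are roughly equivalent to manual port-forwards like these (the service names and namespaces are assumptions based on my releases, so adjust to yours):

```sh
# minio, running in the default namespace
kubectl -n default port-forward svc/minio 9000:9000 &
# argo-workflows server UI/API from the argo-helm chart
kubectl -n argo port-forward svc/argo-workflows-server 2746:2746 &
```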
My metaflow config (~/.metaflowconfig/config_k8s-helm-civo.json) looked like this. The "METAFLOW_SERVICE_URL": "http://metaflow-metaflow-service:8080" reflects the in-cluster end-point. I know there is apparently a metaflow-service bundled with the front-end UI on the :8083 end-point, but I could never get that working in my set-up, so I point to the backend service on :8080 instead. Since I'm forwarding the various ports to localhost, I also needed some entries in my /etc/hosts file for these end-points to resolve.
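The actual file isn't reproduced above, but a minimal sketch of what such a config contains (the keys are standard Metaflow config options; bucket paths are placeholders) is:

```json
{
    "METAFLOW_DEFAULT_DATASTORE": "s3",
    "METAFLOW_DEFAULT_METADATA": "service",
    "METAFLOW_DATASTORE_SYSROOT_S3": "s3://<metaflow-bucket>/metaflow",
    "METAFLOW_DATATOOLS_S3ROOT": "s3://<metaflow-bucket>/data",
    "METAFLOW_S3_ENDPOINT_URL": "http://minio:9000",
    "METAFLOW_SERVICE_URL": "http://metaflow-metaflow-service:8080",
    "METAFLOW_KUBERNETES_NAMESPACE": "default"
}
```

And the kind of /etc/hosts entries I would expect to need, so the in-cluster hostnames resolve to the forwarded ports on localhost:

```
127.0.0.1   minio
127.0.0.1   metaflow-metaflow-service
```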
Running a basic flow against kubernetes, I found that the only way I could get METAFLOW_SERVICE_URL to point at the in-cluster endpoint was to include it as an env variable within the flow code like: @environment(vars=dict(METAFLOW_SERVICE_URL="http://metaflow-metaflow-service:8080/")). It wasn't enough to have that setting in my config file or to pass it as part of the run command.
METAFLOW_PROFILE=k8s-helm-civo METAFLOW_S3_ENDPOINT_URL=http://minio:9000 AWS_PROFILE=minio-metaflow python branch_flow_k8s_decorator.py run --with kubernetes
from metaflow import FlowSpec, step, resources, environment, kubernetes

class BranchFlow(FlowSpec):

    @kubernetes(memory=256, image="continuumio/miniconda3:4.12.0")
    @environment(vars=dict(METAFLOW_SERVICE_URL="http://metaflow-metaflow-service:8080/"))
    @step
    def start(self):
        print("hi")
        self.next(self.a, self.b)

    # ... steps a, b, the join step and end omitted for brevity ...

if __name__ == "__main__":
    BranchFlow()
To get argo-workflows running against this set-up, I used their official helm repo, https://github.com/argoproj/argo-helm. Since this is installed into an argo namespace, some secrets that I had under the default namespace (the aws-secret above, for instance) needed to be duplicated into the argo namespace.
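One way to copy a secret across namespaces is sketched below; it assumes jq is installed and that aws-secret is the secret your tasks need, so substitute whatever secrets apply in your case:

```sh
kubectl -n default get secret aws-secret -o json \
  | jq 'del(.metadata.namespace, .metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp)' \
  | kubectl -n argo apply -f -
```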
One crucial part of getting argo-workflows working in this context was the rbac element below, which I find isn't well documented anywhere.
kubectl -n argo create rolebinding default-admin --clusterrole=admin --serviceaccount=argo:default
rolebinding.rbac.authorization.k8s.io/default-admin created
To get a flow working in argo-workflows, I needed to append .default to the minio and metaflow-service end-points (since these were running in the default namespace on the cluster). However, as with the METAFLOW_SERVICE_URL issue above, it wasn't enough to hard-code the metaflow-service endpoint with @environment(vars=dict(METAFLOW_SERVICE_URL="http://metaflow-metaflow-service.default:8080/")), since the process of converting a metaflow flow to the argo-workflows spec shifted this back to http://localhost:8083/api each time, so I had to manually adjust it back in the argo-workflows template, which wasn't ideal. I triggered the flow like this:
METAFLOW_DEFAULT_METADATA=service METAFLOW_PROFILE=k8s-helm-civo METAFLOW_KUBERNETES_SERVICE_ACCOUNT_NAME=argo-workflow METAFLOW_KUBERNETES_NAMESPACE=argo METAFLOW_S3_ENDPOINT_URL=http://minio.default:9000 METAFLOW_SERVICE_URL=http://metaflow-metaflow-service.default:8080 AWS_PROFILE=minio-metaflow python branch_flow_argo.py --datastore=s3 argo-workflows trigger
from metaflow import FlowSpec, step, resources, environment, kubernetes

class BranchFlowNew(FlowSpec):

    @kubernetes(memory=256, image="continuumio/miniconda3:4.12.0")
    @environment(vars=dict(METAFLOW_SERVICE_URL="http://metaflow-metaflow-service.default:8080/"))
    @step
    def start(self):
        print("hi")
        self.next(self.a, self.b)

    # ... steps a, b, the join step and end omitted for brevity ...

if __name__ == "__main__":
    BranchFlowNew()
To make my set-up work with runs on AWS Batch, I created a separate ~/.metaflowconfig config file that mixed settings for AWS-based S3 with the civo-based metadata service. I created an ingress to port 8080 and could then feed "METAFLOW_SERVICE_URL": "http://mf-service.xxxxx-xxxxx-492c-a931-f244d9ccf9a0.k8s.civo.com" to any runs --with batch. This sort-of worked, in that it captures the run in the UI ... but the code doesn't allow for different s3 backends (or it wasn't obvious to me how you could dynamically switch s3 backends), so any DAGs / task logs (which are on AWS S3 proper for batch runs) are not reachable in the metaflow UI.
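A sketch of what such a mixed config looks like — the keys are standard Metaflow options, and the Batch/IAM values are placeholders:

```json
{
    "METAFLOW_DEFAULT_DATASTORE": "s3",
    "METAFLOW_DEFAULT_METADATA": "service",
    "METAFLOW_DATASTORE_SYSROOT_S3": "s3://<aws-bucket>/metaflow",
    "METAFLOW_SERVICE_URL": "http://mf-service.xxxxx-xxxxx-492c-a931-f244d9ccf9a0.k8s.civo.com",
    "METAFLOW_BATCH_JOB_QUEUE": "<batch-job-queue>",
    "METAFLOW_ECS_S3_ACCESS_IAM_ROLE": "<iam-role-arn>"
}
```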
I noticed this new video showcasing a single flow running across all the cloud providers, and I was wondering how the UI works in that scenario. Can the backend be wired to point at each of the different blob-storage backends on those cloud providers?
Thanks again. Hopefully these notes are of use to others that stumble on this.