Thanks Outerbounds team for these resources.

I was interested in the k8s helm chart, since I wanted a relatively light-weight (and lower-cost) way of standing up Metaflow infrastructure on a civo.com cluster. I got most things working, but wanted to capture here some of the tweaks I needed to make along the way. It could be that the current helm chart is geared more towards running against a local cluster, so I haven't put any of this into a pull request, but I can help on that front if needed.
My set-up includes a minio service running within my cluster as well as argo-workflows. These were some of my observations:
I updated the metaflow_metadata_service image tag to v2.3.3 in metaflow-service/templates/values.yaml. In metaflow-ui/templates/values.yaml, I used the image public.ecr.aws/outerbounds/metaflow_metadata_service with tag v2.3.3 and updated the tag on public.ecr.aws/outerbounds/metaflow_ui to v1.1.4.
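The overrides are just the image name/tag fields; the key names below are an assumption following the usual Helm image convention, so check the chart's own values.yaml for the exact layout:

```yaml
# sketch only -- exact keys depend on the chart's values.yaml
image:
  name: public.ecr.aws/outerbounds/metaflow_ui
  tag: v1.1.4
backend:
  image:
    name: public.ecr.aws/outerbounds/metaflow_metadata_service
    tag: v2.3.3
```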
It turns out, based on here, that metaflow-ui/templates/static_deployment.yaml needs to be amended to include three extra lines (marked with + below). This was the reason the UI always showed a red unable-to-connect status. After making the change, run helm upgrade metaflow metaflow/ and the UI works.
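The original diff isn't reproduced here; a plausible sketch, assuming the fix is an env entry on the UI static container pointing it at the metadata-service API (the metaflow_ui image reads a METAFLOW_SERVICE variable), would be something like the following — the variable name and value are my assumption, so use whatever the linked issue shows for your set-up:

```yaml
# sketch of the kind of addition to the container spec in static_deployment.yaml
env:
  - name: METAFLOW_SERVICE
    value: "http://localhost:8083/api"  # wherever the browser can reach the backend API
```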
It seems Metaflow needs a combination of AWS_ACCESS_* variables as well as ~/.aws/credentials, since boto3 or the awscli are sometimes used within the code. I found that newly-spawned containers were unable to access s3 (in our case Minio) to download a code package based on the env variables from the ~/.metaflowconfig config file alone. So I needed to add a volumes/volumeMounts section to the metaflow-service/templates/deployment.yaml file; the effect is to create a /.aws/credentials file within a spawned container. I did the same for the metaflow-ui/templates/backend_deployment.yaml file, for good measure (see + lines below). For this volumes/volumeMounts section, a corresponding aws-secret.yaml secret needs to be added to the cluster with kubectl apply -f aws-secret.yaml.
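A minimal sketch of the shape of that change — the secret name, key, and profile name are assumptions (the profile matches the AWS_PROFILE=minio-metaflow used in the run commands below), so adapt them to your own set-up:

```yaml
# aws-secret.yaml (apply with: kubectl apply -f aws-secret.yaml)
apiVersion: v1
kind: Secret
metadata:
  name: aws-secret
type: Opaque
stringData:
  credentials: |
    [minio-metaflow]
    aws_access_key_id = <minio-access-key>
    aws_secret_access_key = <minio-secret-key>
```

```yaml
# excerpt from the deployment's pod spec: mounting the secret at /.aws
# yields a /.aws/credentials file inside the container
      containers:
        - name: metaflow-service
          # ...existing container fields...
          volumeMounts:
            - name: aws-credentials
              mountPath: /.aws
              readOnly: true
      volumes:
        - name: aws-credentials
          secret:
            secretName: aws-secret
```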
I tweaked the scripts/forward_metaflow_ports.py file to handle port-forwarding for minio too. When also running argo-workflows, I run it like: python metaflow-tools/scripts/forward_metaflow_ports.py --include-minio --include-argo
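For anyone not using the script, the additions are roughly equivalent to manual port-forwards like these (the service names and namespaces are assumptions based on my releases, so adjust to yours):

```sh
# minio, running in the default namespace
kubectl -n default port-forward svc/minio 9000:9000 &
# argo-workflows server UI/API from the argo-helm chart
kubectl -n argo port-forward svc/argo-workflows-server 2746:2746 &
```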
My metaflow config (~/.metaflowconfig/config_k8s-helm-civo.json) looked like this. The "METAFLOW_SERVICE_URL": "http://metaflow-metaflow-service:8080" reflects the in-cluster end-point. I know there is apparently a metaflow-service bundled with the front-end UI on the :8083 end-point, but I could never get that working in my set-up, so I point to the backend service on :8080 instead. Since I'm forwarding the various ports to localhost, I also needed some entries in my /etc/hosts file for these end-points to resolve.
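The actual file isn't reproduced above, but a minimal sketch of what such a config contains (the keys are standard Metaflow config options; bucket paths are placeholders) is:

```json
{
    "METAFLOW_DEFAULT_DATASTORE": "s3",
    "METAFLOW_DEFAULT_METADATA": "service",
    "METAFLOW_DATASTORE_SYSROOT_S3": "s3://<metaflow-bucket>/metaflow",
    "METAFLOW_DATATOOLS_S3ROOT": "s3://<metaflow-bucket>/data",
    "METAFLOW_S3_ENDPOINT_URL": "http://minio:9000",
    "METAFLOW_SERVICE_URL": "http://metaflow-metaflow-service:8080",
    "METAFLOW_KUBERNETES_NAMESPACE": "default"
}
```

And the kind of /etc/hosts entries I would expect to need, so the in-cluster hostnames resolve to the forwarded ports on localhost:

```
127.0.0.1   minio
127.0.0.1   metaflow-metaflow-service
```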
Running a basic flow against kubernetes, I found that the only way I could get METAFLOW_SERVICE_URL to point at the in-cluster endpoint was to include it as an env variable within the flow code like: @environment(vars=dict(METAFLOW_SERVICE_URL="http://metaflow-metaflow-service:8080/")). It wasn't enough to have that setting in my config file or to pass it as part of the run command.
METAFLOW_PROFILE=k8s-helm-civo METAFLOW_S3_ENDPOINT_URL=http://minio:9000 AWS_PROFILE=minio-metaflow python branch_flow_k8s_decorator.py run --with kubernetes
from metaflow import FlowSpec, step, resources, environment, kubernetes

class BranchFlow(FlowSpec):

    @kubernetes(memory=256, image="continuumio/miniconda3:4.12.0")
    @environment(vars=dict(METAFLOW_SERVICE_URL="http://metaflow-metaflow-service:8080/"))
    @step
    def start(self):
        print("hi")
        self.next(self.a, self.b)

    # ... steps a, b, the join step and end omitted for brevity ...

if __name__ == "__main__":
    BranchFlow()
To get argo-workflows running against this set-up, I used their official helm repo, https://github.com/argoproj/argo-helm. Since this is installed into an argo namespace, some secrets that I had under the default namespace (the aws-secret above, for instance) needed to be duplicated into the argo namespace.
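One way to copy a secret across namespaces is sketched below; it assumes jq is installed and that aws-secret is the secret your tasks need, so substitute whatever secrets apply in your case:

```sh
kubectl -n default get secret aws-secret -o json \
  | jq 'del(.metadata.namespace, .metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp)' \
  | kubectl -n argo apply -f -
```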
One crucial part of getting argo-workflows working in this context was the rbac element below, which I find isn't well documented anywhere.
kubectl -n argo create rolebinding default-admin --clusterrole=admin --serviceaccount=argo:default
rolebinding.rbac.authorization.k8s.io/default-admin created
To get a flow working in argo-workflows, I needed to append .default to the minio and metaflow-service end-points (since these were running in the default namespace on the cluster). However, as with the METAFLOW_SERVICE_URL issue above, it wasn't enough to hard-code the metaflow-service endpoint with @environment(vars=dict(METAFLOW_SERVICE_URL="http://metaflow-metaflow-service.default:8080/")), since the process of converting a metaflow flow to the argo-workflows spec shifted this back to http://localhost:8083/api each time, so I had to manually adjust it back in the argo-workflows template, which wasn't ideal. I triggered the flow like this:
METAFLOW_DEFAULT_METADATA=service METAFLOW_PROFILE=k8s-helm-civo METAFLOW_KUBERNETES_SERVICE_ACCOUNT_NAME=argo-workflow METAFLOW_KUBERNETES_NAMESPACE=argo METAFLOW_S3_ENDPOINT_URL=http://minio.default:9000 METAFLOW_SERVICE_URL=http://metaflow-metaflow-service.default:8080 AWS_PROFILE=minio-metaflow python branch_flow_argo.py --datastore=s3 argo-workflows trigger
from metaflow import FlowSpec, step, resources, environment, kubernetes

class BranchFlowNew(FlowSpec):

    @kubernetes(memory=256, image="continuumio/miniconda3:4.12.0")
    @environment(vars=dict(METAFLOW_SERVICE_URL="http://metaflow-metaflow-service.default:8080/"))
    @step
    def start(self):
        print("hi")
        self.next(self.a, self.b)

    # ... steps a, b, the join step and end omitted for brevity ...

if __name__ == "__main__":
    BranchFlowNew()
To make my set-up work with runs on AWS Batch, I created a separate ~/.metaflowconfig config file that mixed settings for AWS-based S3 with the civo-based metadata service. I created an ingress to port 8080 and could then feed "METAFLOW_SERVICE_URL": "http://mf-service.xxxxx-xxxxx-492c-a931-f244d9ccf9a0.k8s.civo.com" to any runs --with batch. This sort-of worked, in that it captures the run in the UI ... but the code doesn't allow for different s3 backends (or it wasn't obvious to me how you could dynamically switch s3 backends), so any DAGs / task logs (which are on AWS S3 proper for batch runs) are not reachable in the metaflow UI.
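A sketch of what such a mixed config looks like — the keys are standard Metaflow options, and the Batch/IAM values are placeholders:

```json
{
    "METAFLOW_DEFAULT_DATASTORE": "s3",
    "METAFLOW_DEFAULT_METADATA": "service",
    "METAFLOW_DATASTORE_SYSROOT_S3": "s3://<aws-bucket>/metaflow",
    "METAFLOW_SERVICE_URL": "http://mf-service.xxxxx-xxxxx-492c-a931-f244d9ccf9a0.k8s.civo.com",
    "METAFLOW_BATCH_JOB_QUEUE": "<batch-job-queue>",
    "METAFLOW_ECS_S3_ACCESS_IAM_ROLE": "<iam-role-arn>"
}
```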
I noticed this new video showcasing a single flow running across all the cloud providers, and I was wondering how the UI works in that scenario. Can the backend be wired to point at each of the different blob-storage backends on those cloud providers?
Thanks again. Hopefully these notes are of use to others that stumble on this.