mlopsworks / charms

WIP charms
Apache License 2.0

connect mlflow to seldon core #6

Open lukemarsden opened 3 years ago

lukemarsden commented 3 years ago

goal: be able to publish a model from mlflow into seldon core (running in kubeflow) so that users can easily deploy models that they are tracking/managing in mlflow

egranell commented 3 years ago

It seems that seldon-core is not correctly configured:

root@aac9e2c0417cad2a:~# kubectl logs -n kf  seldon-core-6fdf95f864-nst87 | head -n 20           
2021-02-17T08:39:04.674Z        INFO    controller-runtime.metrics      metrics server is starting to listen    {"addr": ":8080"}
2021-02-17T08:39:04.696Z        INFO    controller-runtime.builder      Registering a mutating webhook  {"GVK": "machinelearning.seldon.io/v1alpha2, Kind=SeldonDeployment", "path": "/mutate-machinelearning-seldon-io-v1alpha2-seldondeployment"}
2021-02-17T08:39:04.696Z        INFO    controller-runtime.webhook      registering webhook     {"path": "/mutate-machinelearning-seldon-io-v1alpha2-seldondeployment"}
2021-02-17T08:39:04.696Z        INFO    controller-runtime.builder      Registering a validating webhook        {"GVK": "machinelearning.seldon.io/v1alpha2, Kind=SeldonDeployment", "path": "/validate-machinelearning-seldon-io-v1alpha2-seldondeployment"}
2021-02-17T08:39:04.696Z        INFO    controller-runtime.webhook      registering webhook     {"path": "/validate-machinelearning-seldon-io-v1alpha2-seldondeployment"}
2021-02-17T08:39:04.696Z        INFO    controller-runtime.builder      Registering a mutating webhook  {"GVK": "machinelearning.seldon.io/v1alpha3, Kind=SeldonDeployment", "path": "/mutate-machinelearning-seldon-io-v1alpha3-seldondeployment"}
2021-02-17T08:39:04.696Z        INFO    controller-runtime.webhook      registering webhook     {"path": "/mutate-machinelearning-seldon-io-v1alpha3-seldondeployment"}
2021-02-17T08:39:04.696Z        INFO    controller-runtime.builder      Registering a validating webhook        {"GVK": "machinelearning.seldon.io/v1alpha3, Kind=SeldonDeployment", "path": "/validate-machinelearning-seldon-io-v1alpha3-seldondeployment"}
2021-02-17T08:39:04.696Z        INFO    controller-runtime.webhook      registering webhook     {"path": "/validate-machinelearning-seldon-io-v1alpha3-seldondeployment"}
2021-02-17T08:39:04.696Z        INFO    controller-runtime.builder      Registering a mutating webhook  {"GVK": "machinelearning.seldon.io/v1, Kind=SeldonDeployment", "path": "/mutate-machinelearning-seldon-io-v1-seldondeployment"}
2021-02-17T08:39:04.696Z        INFO    controller-runtime.webhook      registering webhook     {"path": "/mutate-machinelearning-seldon-io-v1-seldondeployment"}
2021-02-17T08:39:04.697Z        INFO    controller-runtime.builder      Registering a validating webhook        {"GVK": "machinelearning.seldon.io/v1, Kind=SeldonDeployment", "path": "/validate-machinelearning-seldon-io-v1-seldondeployment"}
2021-02-17T08:39:04.697Z        INFO    controller-runtime.webhook      registering webhook     {"path": "/validate-machinelearning-seldon-io-v1-seldondeployment"}
2021-02-17T08:39:04.697Z        INFO    setup   starting manager
I0217 08:39:04.697745       1 leaderelection.go:242] attempting to acquire leader lease  kf/a33bd623.machinelearning.seldon.io...
2021-02-17T08:39:04.707Z        INFO    controller-runtime.manager      starting metrics server {"path": "/metrics"}
E0217 08:39:04.728161       1 leaderelection.go:331] error retrieving resource lock kf/a33bd623.machinelearning.seldon.io: configmaps "a33bd623.machinelearning.seldon.io" is forbidden: User "system:serviceaccount:kf:seldon-core" cannot get resource "configmaps" in API group "" in the namespace "kf"
2021-02-17T08:39:04.802Z        INFO    controller-runtime.webhook.webhooks     starting webhook server
2021-02-17T08:39:04.803Z        INFO    controller-runtime.certwatcher  Updated current TLS certificate
2021-02-17T08:39:04.804Z        INFO    controller-runtime.webhook      serving webhook server  {"host": "", "port": 9876}
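
The `leaderelection.go` error above says that the `seldon-core` service account in `kf` cannot read ConfigMaps, which the leader-election lock needs. A minimal sketch of a manual workaround follows, assuming a namespaced Role and RoleBinding are acceptable here. The function name, the `-leader-election` resource names and the verb list are illustrative assumptions, not taken from the charm, which would normally declare its own RBAC.

```shell
# Hypothetical helper: print a Role and RoleBinding that let a service
# account manage ConfigMaps in its namespace (used for the leader lock).
render_leader_election_rbac() {
  ns="$1"   # namespace, e.g. kf
  sa="$2"   # service account name, e.g. seldon-core
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ${sa}-leader-election
  namespace: ${ns}
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ${sa}-leader-election
  namespace: ${ns}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ${sa}-leader-election
subjects:
- kind: ServiceAccount
  name: ${sa}
  namespace: ${ns}
EOF
}

# Print the manifest for the account named in the log.
render_leader_election_rbac kf seldon-core

# Check the current permission:
#   kubectl auth can-i get configmaps -n kf --as=system:serviceaccount:kf:seldon-core
# Apply the manifest:
#   render_leader_election_rbac kf seldon-core | kubectl apply -f -
```

Rendering to stdout first makes it possible to review the manifest before applying it.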

egranell commented 3 years ago

Following the instructions from here:

  1. Set the namespace label serving.kubeflow.org/inferenceservice=enabled: `kubectl label namespace kf serving.kubeflow.org/inferenceservice=enabled`
  2. Istio Gateway:
    cat <<EOF | kubectl create -n kf -f -
    apiVersion: networking.istio.io/v1alpha3
    kind: Gateway
    metadata:
      name: kubeflow-gateway
    spec:
      selector:
        istio: ingressgateway
      servers:
      - hosts:
        - '*'
        port:
          name: http
          number: 80
          protocol: HTTP
    EOF
  3. Create a secret to access MinIO storage: https://github.com/mlopsworks/charms/blob/8f71b5449fc941d1a2d7e404f9e7317b39a63cf9/mlflow/src/charm.py#L272
  4. Train an MLflow model and store the artifacts in MinIO.
  5. Create a SeldonDeployment:
    cat <<EOF | kubectl create -n kf -f -
    apiVersion: machinelearning.seldon.io/v1
    kind: SeldonDeployment
    metadata:
      name: mlflow
    spec:
      annotations:
        seldon.io/executor: "true"
      name: wines
      predictors:
      - componentSpecs:
        - spec:
            containers:
            - name: classifier
              livenessProbe:
                initialDelaySeconds: 150
                failureThreshold: 300
                periodSeconds: 10
                successThreshold: 1
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
              readinessProbe:
                initialDelaySeconds: 150
                failureThreshold: 300
                periodSeconds: 10
                successThreshold: 1
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
        graph:
          children: []
          implementation: MLFLOW_SERVER
          modelUri: s3://mlflow/0/59bf5cb90345488289b4f4c5f702b560/artifacts/model/
          envSecretRefName: seldon-init-container-secret
          name: ElasticnetWineModel
        name: default
        replicas: 1
    EOF
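
For step 3, the secret referenced by `envSecretRefName` could be sketched roughly as below. This is an assumption, not a copy of what the linked charm code creates: the key names (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_ENDPOINT_URL`, `USE_SSL`) are the ones I believe the Seldon Core S3 storage initializer read at the time, and the credentials and endpoint are placeholders. The function name is hypothetical.

```shell
# Hypothetical helper: print a Secret manifest with S3-style credentials
# for the model storage initializer. All values are supplied by the caller.
render_init_container_secret() {
  ns="$1"          # namespace, e.g. kf
  access_key="$2"  # MinIO access key (placeholder)
  secret_key="$3"  # MinIO secret key (placeholder)
  endpoint="$4"    # MinIO endpoint URL (placeholder)
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: seldon-init-container-secret
  namespace: ${ns}
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "${access_key}"
  AWS_SECRET_ACCESS_KEY: "${secret_key}"
  AWS_ENDPOINT_URL: "${endpoint}"
  USE_SSL: "false"
EOF
}

# Placeholder values only; substitute the real MinIO credentials.
render_init_container_secret kf minio-access-key minio-secret-key http://minio:9000

# Apply the manifest:
#   render_init_container_secret kf "$KEY" "$SECRET" "$URL" | kubectl apply -f -
```

The secret name has to match the `envSecretRefName` value in the SeldonDeployment from step 5.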

Following these steps, I sometimes get the SeldonDeployment created (`seldondeployment.machinelearning.seldon.io/mlflow created`), but it never becomes available. Most of the time I get the following error: `Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "mseldondeployment.kb.io": Post https://seldon-core.kf.svc:443/mutate-machinelearning-seldon-io-v1-seldondeployment?timeout=30s: no service port 'ƻ' found for service "seldon-core"`
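
The odd `'ƻ'` in that error appears to be the port number printed as a character rather than as a number: `ƻ` is U+01BB, and 0x1BB is 443, the port in the webhook URL. If so, the message reads "no service port 443 found for service seldon-core". A small sketch that decodes the character, with a hypothetical helper name:

```shell
# Hypothetical helper: decode a 2-byte UTF-8 sequence into its code point.
# $1 and $2 are the two bytes as integers (hex constants are accepted).
decode_utf8_2byte() {
  echo $(( (($1 & 0x1f) << 6) | ($2 & 0x3f) ))
}

# 'ƻ' (U+01BB) is encoded in UTF-8 as the bytes 0xC6 0xBB.
port=$(decode_utf8_2byte 0xc6 0xbb)
echo "$port"

# To compare with what the service actually exposes:
#   kubectl get svc seldon-core -n kf -o jsonpath='{.spec.ports[*].port}'
```

If the service does not list that port, the webhook call cannot be routed, which would match the error.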