kubeflow / fairing

Python SDK for building, training, and deploying ML models
Apache License 2.0

Append Builder: Error seldon-core-microservice not found; Default image not correct #196

Open jlewi opened 5 years ago

jlewi commented 5 years ago

I tried building and deploying an endpoint:

fairing.config.set_deployer('serving', serving_class="LabelPrediction")
create_endpoint = fairing.config.fn(LabelPrediction)
create_endpoint()

The deployed pod ended up crashing with the following error.

kubectl logs fairing-deployer-9mkf8-57c554f979-c572d
container_linux.go:247: starting container process caused "exec: \"seldon-core-microservice\": executable file not found in $PATH"
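The error says the container's command could not be resolved on `$PATH` inside the image. A minimal sketch for checking this from a shell in a candidate image, assuming Python is available there; `missing_executables` is a hypothetical helper and not part of fairing:

```python
import shutil


def missing_executables(commands):
    """Return the commands that cannot be resolved on the current $PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    # The pod spec uses `seldon-core-microservice` as the container command.
    missing = missing_executables(["seldon-core-microservice"])
    if missing:
        print("not found in $PATH:", ", ".join(missing))
    else:
        print("all entrypoint executables found")
```

Running this inside the image that the deployer built should show whether the base image ships the Seldon entrypoint at all.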

The pod spec and status are:

spec:
  containers:
  - command:
    - seldon-core-microservice
    - LabelPrediction
    - REST
    - --service-type=MODEL
    - --persistence=0
    env:
    - name: FAIRING_RUNTIME
      value: "1"
    image: gcr.io/code-search-demo/fairing-job:BEF9445D
    imagePullPolicy: IfNotPresent
    name: model
    resources: {}
    securityContext:
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-rqqln
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: gke-label-issues-040-label-issues-040-cf06394a-t02q
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-rqqln
    secret:
      defaultMode: 420
      secretName: default-token-rqqln
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-04-10T03:43:37Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2019-04-10T03:43:37Z
    message: 'containers with unready status: [model]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: null
    message: 'containers with unready status: [model]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: 2019-04-10T03:43:37Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://dbe925a3d4fe386348ee5b6b4a2e533e3cf449af51af9335afe1298b5e9c656f
    image: gcr.io/code-search-demo/fairing-job:BEF9445D
    imageID: docker-pullable://gcr.io/code-search-demo/fairing-job@sha256:ddea06ddd16c92f80d9e14bf5e1c126635eedefdbe23c52067724c7430807262
    lastState:
      terminated:
        containerID: docker://dbe925a3d4fe386348ee5b6b4a2e533e3cf449af51af9335afe1298b5e9c656f
        exitCode: 127
        finishedAt: 2019-04-10T03:51:23Z
        message: |
          oci runtime error: container_linux.go:247: starting container process caused "exec: \"seldon-core-microservice\": executable file not found in $PATH"
        reason: ContainerCannotRun
        startedAt: 2019-04-10T03:51:23Z
    name: model
    ready: false
    restartCount: 6
    state:
      waiting:
        message: Back-off 5m0s restarting failed container=model pod=fairing-deployer-9mkf8-57c554f979-c572d_kubeflow(d2ee14f0-5b42-11e9-aa05-42010a8e0051)
        reason: CrashLoopBackOff
  hostIP: 10.142.0.10
  phase: Running
  podIP: 10.0.0.29
  qosClass: BestEffort
  startTime: 2019-04-10T03:43:37Z

jlewi commented 5 years ago

It looks like the default image is

from fairing import constants
constants.constants.DEFAULT_BASE_IMAGE
gcr.io/kubeflow-images-public/fairing:dev

jlewi commented 5 years ago

That image looks pretty old: Jan 23, 2019.

jlewi commented 5 years ago

https://github.com/kubeflow/fairing/blob/736c025e6d77f135bda345e5030398d5d2ef654a/examples/prediction/README.md

Looks like maybe we should be using: seldonio/seldon-core-s2i-python3:0.4
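One way to try that image is to pass it as the base image when configuring the append builder. The sketch below is an assumption, not the confirmed fix: it relies on `fairing.config.set_builder` accepting `base_image`, `registry`, and `push` keyword arguments, and `append_builder_kwargs` and `configure_builder` are hypothetical helper names.

```python
# Image suggested in this thread; whether it suits a given model is an assumption.
SELDON_BASE_IMAGE = "seldonio/seldon-core-s2i-python3:0.4"


def append_builder_kwargs(registry, base_image=SELDON_BASE_IMAGE):
    """Keyword arguments for fairing.config.set_builder (hypothetical helper)."""
    return {
        "name": "append",
        "base_image": base_image,
        "registry": registry,
        "push": True,
    }


def configure_builder(registry):
    """Point the append builder at a base image that ships seldon-core."""
    # Imported lazily so the sketch can be loaded without fairing installed.
    import fairing

    # Assumption: set_builder forwards these keyword arguments to the builder.
    fairing.config.set_builder(**append_builder_kwargs(registry))
```

Calling `configure_builder("gcr.io/my-project")` (a placeholder registry) before `fairing.config.set_deployer(...)` would only change the base image; it does not by itself guarantee that the model class ends up importable in the container.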

jlewi commented 5 years ago

Looks like that fixed that problem. Now I get a different error:

2019-04-10 04:12:01,837 - seldon_core.microservice:main:261 - INFO:  Starting microservice.py:main
2019-04-10 04:12:01,839 - seldon_core.microservice:main:292 - INFO:  Annotations: {}
Traceback (most recent call last):
  File "/usr/local/bin/seldon-core-microservice", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/seldon_core/microservice.py", line 294, in main
    interface_file = importlib.import_module(args.interface_name)
  File "/usr/local/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'LabelPrediction'

So it looks like it's a bug with the default image not being valid.
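The traceback shows `seldon_core.microservice` calling `importlib.import_module(args.interface_name)`, so a module named `LabelPrediction` has to be importable inside the container. A minimal sketch for checking that from inside the image, using only the standard library; `is_importable` is a hypothetical helper:

```python
import importlib.util


def is_importable(module_name):
    """Return True if `module_name` can be located on sys.path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or the name is malformed.
        return False


if __name__ == "__main__":
    for name in ("seldon_core", "LabelPrediction"):
        status = "importable" if is_importable(name) else "NOT importable"
        print(name, status)
```

If `LabelPrediction` is reported as not importable, the file defining the class was probably not copied into the image or is not in the working directory that the command runs from.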

jlewi commented 5 years ago

I think this might be fixed by https://github.com/kubeflow/fairing/pull/207

jlewi commented 5 years ago

@karthikv2k Any update on the status of this work?

I think one way to test this would be to use the demo notebook I used for kubecon https://github.com/jlewi/kubecon-demo/blob/master/ames-xgboost-build-train-deploy.ipynb

I was using a fork of fairing. It would be great if that fork wasn't needed and we could run against master. It's possible that is already the case because all the requisite fixes like #207 have been merged into master.

karthikv2k commented 5 years ago

No updates. I should be able to look into it next week.

jlewi commented 5 years ago

@karthikv2k Any update on this?

jlewi commented 5 years ago

It looks like this still isn't fixed. The code is still using gcr.io/kubeflow-images-public/fairing:dev: https://github.com/kubeflow/fairing/blob/87c1185cde356939494ff4e9631c0b490b27153a/fairing/constants/constants.py

And that image is still very old (January 22, 2019).

It looks like the code in the original bug report is using the higher-level API; using the append builder directly works just fine. See for example https://github.com/kubeflow/examples/tree/master/xgboost_synthetic

kierenj commented 4 years ago

I've been trying to see if I can get something cool going with fairing for the past couple of days. Unfortunately I just ran into this one. In my case, it's based on the XGBoost sample notebook code - the high level APIs one you spoke about (https://github.com/kubeflow/fairing/blob/master/examples/prediction/xgboost-high-level-apis.ipynb):

    endpoint = PredictionEndpoint(HousingServe, input_files=['model.dat'],
                                  service_type='LoadBalancer',
                                  docker_registry=DOCKER_REGISTRY,
                                  backend=BackendClass(build_context_source=BuildContext))
    endpoint.create()

I understand Kubeflow and fairing are super early alpha, but I'm really keen to get something together and working. The fairing docs/samples look to use the high level API. Am I understanding correctly that it's just not working at the moment (and that the low level APIs aren't documented)?

Is there anything I can do to inform myself of how to get something up and running? Should I try overriding the base image somehow in PredictionEndpoint - was there a known-good one or another image I can build myself somehow?

dylanpiergies commented 4 years ago

I've hit this as well, deploying my own model:

endpoint = PredictionEndpoint(MyModelServe, input_files=included_files,
                              service_type='LoadBalancer',
                              docker_registry='{}.dkr.ecr.{}.amazonaws.com'.format(AWS_ACCOUNT_ID, AWS_REGION),
                              backend=BackendClass(build_context_source=build_context))
endpoint.create()