kubeflow-kale / kale

Kubeflow’s superfood for Data Scientists
http://kubeflow-kale.github.io
Apache License 2.0

Failed to define a PVC size using kale deployment panel #18

Closed felihong closed 4 years ago

felihong commented 4 years ago

Hi there,

First of all, thanks for developing such a great and useful tool!

I installed Kale in my Kubeflow notebook server running on GKE (with a snapshot created) and cloned the titanic example to give it a try.

The pipeline can be successfully compiled and uploaded; however, the loaddata component cannot complete, and I get the warning This step is in Pending state with this message: Unschedulable: pod has unbound immediate PersistentVolumeClaims (repeated 2 times).

[Screenshot: 2019-11-20 at 18:08:32]

And here's the log output:

11-20 17:50 | kubeflow-kale |  DEBUG: ------------- Kale Start Run -------------
11-20 17:50 | kubeflow-kale |  INFO: Pipeline code saved at kfp_titanic-ml-pipeline.kfp.py
11-20 17:50 | kubeflow-kale |  INFO: Deployment Successful. Pipeline run at None/#/runs/details/52a84d9e-b1bc-41cf-af40-cc87785c5b7f

It would be much appreciated if someone could kindly point out whether I have configured the volumes correctly. Thanks!

StefanoFioravanzo commented 4 years ago

Hey @Felihong ! Thank you for trying out Kale. I would suggest trying out Kale from the kubecon-workshop branch (same for the JupyterLab extension). There are loads of new features there that we will merge into master in the coming weeks. Please note that we have been developing targeting MiniKF, so there might be issues when running Kale outside MiniKF.

StefanoFioravanzo commented 4 years ago

In any case, please report in detail any issues you find using the versions under the kubecon-workshop branches, so that we can track and solve them when running Kale in a Kubeflow cluster.

felihong commented 4 years ago

Thanks for the timely reply!

So I installed the kubecon-workshop branch of kubeflow-kale using pip install git+https://github.com/kubeflow-kale/kale.git@kubecon-workshop.

However, when installing the JupyterLab extension using pip install git+https://github.com/kubeflow-kale/jupyterlab-kubeflow-kale.git@kubecon-workshop, I got the error below:

Collecting git+https://github.com/kubeflow-kale/jupyterlab-kubeflow-kale.git@kubecon-workshop
  Cloning https://github.com/kubeflow-kale/jupyterlab-kubeflow-kale.git (to revision kubecon-workshop) to /tmp/pip-req-build-3y7xrwyy
  Running command git clone -q https://github.com/kubeflow-kale/jupyterlab-kubeflow-kale.git /tmp/pip-req-build-3y7xrwyy
  Running command git checkout -b kubecon-workshop --track origin/kubecon-workshop
  Switched to a new branch 'kubecon-workshop'
  Branch 'kubecon-workshop' set up to track remote branch 'kubecon-workshop' from 'origin'.
ERROR: Command errored out with exit status 1:
     command: /opt/conda/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-3y7xrwyy/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-3y7xrwyy/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-req-build-3y7xrwyy/pip-egg-info
         cwd: /tmp/pip-req-build-3y7xrwyy/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/opt/conda/lib/python3.7/tokenize.py", line 447, in open
        buffer = _builtin_open(filename, 'rb')
    FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-req-build-3y7xrwyy/setup.py'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Any ideas?

felihong commented 4 years ago

I just followed the instructions in the Contributing section, and the extension has been installed properly. Thanks!

felihong commented 4 years ago

Hi, here are some issues I found under the kubecon-workshop branch:

In the generated candies-sharing-urgrg.kale.py script, the experiment name fails to be generated (it comes out as an empty string):

if __name__ == "__main__":
    pipeline_func = auto_generated_pipeline
    pipeline_filename = pipeline_func.__name__ + '.pipeline.tar.gz'
    import kfp.compiler as compiler
    compiler.Compiler().compile(pipeline_func, pipeline_filename)

    # Get or create an experiment and submit a pipeline run
    import kfp
    client = kfp.Client()
    experiment = client.create_experiment('')

    # Submit a pipeline run
    run_name = 'candies-sharing-urgrg_run'
    run_result = client.run_pipeline(
        experiment.id, run_name, pipeline_filename, {})

The pipeline can be successfully uploaded, but it also fails to run.
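Presumably the script should end up calling create_experiment with a non-empty name. As a hand-edited sketch (not what Kale generated; the experiment name here is just a placeholder), something like this should avoid the empty string:

import kfp

# Hand-edited tail of the generated script: the only change is passing a
# non-empty experiment name instead of ''.
client = kfp.Client()
experiment = client.create_experiment('candies-sharing')  # placeholder name

run_name = 'candies-sharing-urgrg_run'
run_result = client.run_pipeline(
    experiment.id, run_name, 'auto_generated_pipeline.pipeline.tar.gz', {})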

I also found that even though I set the experiment name for the titanic-dataset-ml notebook to kale-titanic-experiment, the generated experiment name does not match:

if __name__ == "__main__":
    pipeline_func = auto_generated_pipeline
    pipeline_filename = pipeline_func.__name__ + '.pipeline.tar.gz'
    import kfp.compiler as compiler
    compiler.Compiler().compile(pipeline_func, pipeline_filename)

    # Get or create an experiment and submit a pipeline run
    import kfp
    client = kfp.Client()
    experiment = client.create_experiment('Titanic')

    # Submit a pipeline run
    run_name = 'titanic-ml-j9crb_run'
    run_result = client.run_pipeline(
        experiment.id, run_name, pipeline_filename, {})

This step is in Pending state with this message: Unschedulable: pod has unbound immediate PersistentVolumeClaims (repeated 2 times)

Status of the automatically generated PVC:

Name:          titanic-ml-xd79q-zdxbx-kale-marshal-pvc
Namespace:     kubeflow
StorageClass:  standard
Status:        Pending
Volume:        
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      
Access Modes:  
VolumeMode:    Filesystem
Events:
  Type       Reason              Age                  From                         Message
  ----       ------              ----                 ----                         -------
  Warning    ProvisioningFailed  38s (x597 over 22h)  persistentvolume-controller  Failed to provision volume with StorageClass "standard": invalid AccessModes [ReadWriteMany]: only AccessModes [ReadWriteOnce ReadOnlyMany] are supported
Mounted By:  <none>

PS: No volume mounts were manually defined in either experiment.

StefanoFioravanzo commented 4 years ago

Hey @Felihong , I suspect you are running Kale in your own Kubeflow cluster and not in MiniKF, is that right? Currently we have everything working in MiniKF, and in the coming weeks we will work to expand support to full Kubeflow clusters!

Can you provide more information about your environment?

felihong commented 4 years ago

Hi @StefanoFioravanzo , yes, I'm running Kale in a Kubeflow cluster deployed on GKE, and it would be great if Kale could be extended to a full Kubeflow cluster. Thank you, and I'm very glad to help!

And here are some specifications of the pod where the notebook is running:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    sidecar.istio.io/status: '{"version":"5f3ae3613c7945ef767cb9fd594596bc001ff3ab915f12e4379c0cb5648d2729","initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-certs"],"imagePullSecrets":null}'
  generateName: kale-test
  labels:
    app: kale-conda-test
    controller-revision-hash: kale-test-5f6587c7d5
    notebook-name: kale-test
    statefulset: kale-test
    statefulset.kubernetes.io/pod-name: kale-test-0
  name: kale-test-0
  namespace: [USER_NAMESPACE]
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: kale-test
    uid: 1cc5023c-0d37-11ea-ba66-42010a84025a
  resourceVersion: "2762529"
  selfLink: /api/v1/namespaces/[USER_NAMESPACE]/pods/kale-test-0
  uid: 1d47af68-0d37-11ea-ba66-42010a84025a
spec:
  containers:
  - env:
    - name: NB_PREFIX
      value: /notebook/[USER_NAMESPACE]/kale-test
    image: [IMAGE_NAME]
    imagePullPolicy: IfNotPresent
    name: kale-test
    ports:
    - containerPort: 8888
      name: notebook-port
      protocol: TCP
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-editor-token-6qblc
      readOnly: true
    workingDir: /home/jovyan
  - args:
    - proxy
    - sidecar
    - --domain
    - $(POD_NAMESPACE).svc.cluster.local
    - --configPath
    - /etc/istio/proxy
    - --binaryPath
    - /usr/local/bin/envoy
    - --serviceCluster
    - kale-test.$(POD_NAMESPACE)
    - --drainDuration
    - 45s
    - --parentShutdownDuration
    - 1m0s
    - --discoveryAddress
    - istio-pilot.istio-system:15010
    - --zipkinAddress
    - zipkin.istio-system:9411
    - --connectTimeout
    - 10s
    - --proxyAdminPort
    - "15000"
    - --concurrency
    - "2"
    - --controlPlaneAuthPolicy
    - NONE
    - --statusPort
    - "15020"
    - --applicationPorts
    - "8888"
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: INSTANCE_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: ISTIO_META_POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: ISTIO_META_CONFIG_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: ISTIO_META_INTERCEPTION_MODE
      value: REDIRECT
    - name: ISTIO_METAJSON_LABELS
      value: |
        {"app":"kale-test","controller-revision-hash":"kale-test-5f6587c7d5","notebook-name":"kale-test","statefulset":"kale-test","statefulset.kubernetes.io/pod-name":"kale-test-0"}
    image: docker.io/istio/proxyv2:1.1.6
    imagePullPolicy: IfNotPresent
    name: istio-proxy
    ports:
    - containerPort: 15090
      name: http-envoy-prom
      protocol: TCP
    readinessProbe:
      failureThreshold: 30
      httpGet:
        path: /healthz/ready
        port: 15020
        scheme: HTTP
      initialDelaySeconds: 1
      periodSeconds: 2
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: "2"
        memory: 128Mi
      requests:
        cpu: 10m
        memory: 40Mi
    securityContext:
      readOnlyRootFilesystem: true
      runAsUser: 1337
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/istio/proxy
      name: istio-envoy
    - mountPath: /etc/certs/
      name: istio-certs
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-editor-token-6qblc
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: kale-test-0
  initContainers:
  - args:
    - -p
    - "15001"
    - -u
    - "1337"
    - -m
    - REDIRECT
    - -i
    - '*'
    - -x
    - ""
    - -b
    - "8888"
    - -d
    - "15020"
    image: docker.io/istio/proxy_init:1.1.6
    imagePullPolicy: IfNotPresent
    name: istio-init
    resources:
      limits:
        cpu: 100m
        memory: 50Mi
      requests:
        cpu: 10m
        memory: 10Mi
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
      runAsNonRoot: false
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  nodeName: [NODE_NAME]
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 100
  serviceAccount: default-editor
  serviceAccountName: default-editor
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir:
      medium: Memory
    name: dshm
  - name: default-editor-token-6qblc
    secret:
      defaultMode: 420
      secretName: default-editor-token-6qblc
  - emptyDir:
      medium: Memory
    name: istio-envoy
  - name: istio-certs
    secret:
      defaultMode: 420
      optional: true
      secretName: istio.default-editor
elikatsis commented 4 years ago

Hi @Felihong ,

Regarding the unbound PVC: Kale creates PVCs with the access mode set to ReadWriteMany. Does your cluster support RWX PVCs?
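For reference, the claim Kale submits is roughly equivalent to the following sketch written with the official kubernetes Python client (this only illustrates the requested access mode, it is not Kale's actual code; the name, namespace and size are placeholders):

from kubernetes import client, config

config.load_incluster_config()  # use load_kube_config() when running outside the cluster

# Illustrative marshal-PVC request: note the ReadWriteMany access mode,
# which the default StorageClass must be able to provision.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="example-kale-marshal-pvc"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],
        resources=client.V1ResourceRequirements(requests={"storage": "1Gi"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="kubeflow", body=pvc)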

felihong commented 4 years ago

Hey @elikatsis ,

thanks for the info!

I'm using the gcePersistentDisk volume type, which is defined by default. Unfortunately, it seems it doesn't support the RWX access mode yet: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#types-of-persistent-volumes

I guess in this case I should create another StorageClass and use some type like NFS? Btw, should this volume be a data volume or a workspace volume, as created in the Kubeflow UI? I'm kind of confused...

Thank you in advance!

elikatsis commented 4 years ago

I guess in this case I should create another StorageClass and use some type like NFS?

That would work. But you should also set it as the default storage class, because it is the default storage class that gets chosen. We could add options to configure this in the future.
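As a hedged sketch of how to mark an RWX-capable class as the default (using the kubernetes Python client; I'm assuming the class is named nfs-client, and the same can be done with kubectl patch):

from kubernetes import client, config

config.load_kube_config()
storage_api = client.StorageV1Api()

# Mark the NFS-backed class as the cluster default so dynamically provisioned
# PVCs (like Kale's marshal volume) are created from it.
storage_api.patch_storage_class(
    "nfs-client",
    {"metadata": {"annotations": {
        "storageclass.kubernetes.io/is-default-class": "true"}}})

# Optionally drop the default flag from the old gce-pd class.
storage_api.patch_storage_class(
    "standard",
    {"metadata": {"annotations": {
        "storageclass.kubernetes.io/is-default-class": "false"}}})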

Btw, should this volume be a data volume or a workspace volume, as created in the Kubeflow UI? I'm kind of confused...

Volumes mounted on pipeline steps don't know anything about workspace or data volumes; let's say they are all considered data volumes. The marshal volume, which is the one you are having issues with, is a filesystem where data passed between steps is saved and loaded.
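Just to illustrate the idea (this is not Kale's actual marshalling code, and the mount point is a made-up placeholder): each step serializes its outputs onto the shared RWX volume, and downstream steps load them back from the same path.

import os
import pickle

MARSHAL_DIR = "/marshal"  # hypothetical mount point of the kale-marshal PVC

def save(name, obj):
    # A producing step writes its result onto the shared volume...
    with open(os.path.join(MARSHAL_DIR, name + ".pkl"), "wb") as f:
        pickle.dump(obj, f)

def load(name):
    # ...and a consuming step, running in a different pod, reads it back.
    with open(os.path.join(MARSHAL_DIR, name + ".pkl"), "rb") as f:
        return pickle.load(f)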

There is only one special case: If you use the default notebook image, along with your notebook server's workspace volume mounted under the same mount point (/home/jovyan), then any installed library will be present during a step's execution.

felihong commented 4 years ago

Hi there 👋,

so my pod is now using a Google Filestore-backed PV with RWX access mode, which is dynamically provisioned.

kubectl get storageclass     
NAME                   PROVISIONER                                   AGE
nfs-client (default)   cluster.local/nfs-cp-nfs-client-provisioner   39m
standard               kubernetes.io/gce-pd                          4d21h

The problem now seems different from before when I test-run the candies sharing example pipeline. In the first component, kale-marshal-volume, I can see the pipeline is now successfully bound to a volume:

kale-marshal-volume-manifest

map[apiVersion:v1 metadata:map[name:candies-sharing-5zg0x-wmr5v-kale-marshal-pvc namespace:kubeflow 
selfLink:/api/v1/namespaces/kubeflow/persistentvolumeclaims/candies-sharing-5zg0x-wmr5v-kale-marshal-pvc uid:5eff311e-14ea-11ea-a64c-42010a84021c resourceVersion:3100421 creationTimestamp:2019-12-02T09:59:03Z 
annotations:map[pv.kubernetes.io/bind-completed:yes pv.kubernetes.io/bound-by-controller:yes volume.beta.kubernetes.io/storage-provisioner:cluster.local/nfs-cp-nfs-client-provisioner] finalizers:[kubernetes.io/pvc-protection]] 
spec:map[resources:map[requests:map[storage:1Gi]] volumeName:pvc-5eff311e-14ea-11ea-a64c-42010a84021c storageClassName:nfs-client volumeMode:Filesystem accessModes:[ReadWriteMany]] 
status:map[phase:Bound accessModes:[ReadWriteMany] capacity:map[storage:1Gi]] kind:PersistentVolumeClaim]

kale-marshal-volume-name
candies-sharing-5zg0x-wmr5v-kale-marshal-pvc

kale-marshal-volume-size
1Gi

However, in the second component, sack, I hit an error somehow related to snapshots:

Traceback (most recent call last):
    File "<string>", line 36, in <module>
    File "<string>", line 16, in sack
    File "/opt/conda/lib/python3.7/site-packages/kale/utils/pod_utils.py", line 171, in snapshot_pipeline_step
       from rok_gw_client.client import RokClient
ModuleNotFoundError: No module named 'rok_gw_client'

I also tried to manually define the volume and created/defined a snapshot for it, but the error stays the same. Any ideas? Thank you!

StefanoFioravanzo commented 4 years ago

@Felihong It looks like you are now able to correctly provision and bind a volume, which is great. The issue now is that we have been building images using the rok_gw_client library, which is not publicly available. @elikatsis we should make sure that rok_gw_client is not a hard dependency, and that if it fails to import, the Rok integration is disabled.
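Something along these lines, as a minimal sketch of the guard (not the current pod_utils.py code):

# Treat rok_gw_client as an optional dependency: if it is missing,
# skip Rok snapshotting instead of crashing the step.
try:
    from rok_gw_client.client import RokClient
except ImportError:
    RokClient = None

def snapshot_pipeline_step(step_name):
    if RokClient is None:
        print("rok_gw_client is not installed, skipping Rok snapshot for step %s"
              % step_name)
        return None
    rok = RokClient()
    # ... take the Rok snapshot for this step as before ...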

StefanoFioravanzo commented 4 years ago

Opened #20 to track this

elikatsis commented 4 years ago

@Felihong, I'm glad you set up a storage class that can provide RWX PVCs. [I'd been trying to find out what's up with the glusterfs issue but couldn't find any info.]

I've also figured out what is wrong with the pipeline-name and experiment-name issues you mentioned in your first comments. Thank you for reporting them! A fix will be included in upcoming releases.

The rok_gw_client issue you mention should only occur if you have the Take Rok snapshots before each step UI option switched on. Try turning it off when deploying the pipeline.

Edit: We will make sure that Rok/MiniKF specific options are disabled when these features are properly released.

felihong commented 4 years ago

@elikatsis Thanks for pointing that out! Looking forward to the new releases :)

Regarding the rok_gw_client issue, I didn't manage to locate the Take Rok snapshots before each step option in the extension. Do you actually mean the KALE DEPLOYMENT PANEL? The truth is I only defined the experiment and pipeline names; no volumes were defined on my side. (I don't even have a snapshot API enabled in my cluster.)

Is there a way to edit my locally installed pod_utils.py script to disable the Rok integration?

elikatsis commented 4 years ago

Regarding the rok_gw_client issue, I didn't manage to locate the Take Rok snapshots before each step option in the extension. Do you actually mean the KALE DEPLOYMENT PANEL?

Yes, in the KALE DEPLOYMENT PANEL you should set your volumes pane like this:

[Screenshot: volumes pane settings]

I believe it should work if you set these options as shown. Rok is imported lazily, only when it is called, so if you disable all the related options [which should be possible at the moment], it should work. Please report back if you try it and it doesn't.


Is there a way to edit my locally installed pod_utils.py script to disable the Rok integration?

That would not be very easy. You would only modify the current container's filesystem, not the Docker image used for the pods. You would have to build a new Docker image with a custom Kale installation and pass that to all steps via Additional Settings. And if you, let's say, delete lines that seem related to Rok and snapshotting, that would be different from disabling the features, so something could break.

felihong commented 4 years ago

Hi @elikatsis , thank you so much for your suggestion! I've now noticed that my container was not up to date, which is why I couldn't get the latest extension version.

The good news is that I tried the image gcr.io/arrikto-public/tensorflow-1.14.0-notebook-cpu:kubecon-workshop (per https://codelabs.developers.google.com/codelabs/cloud-kubeflow-minikf-kale/#3) in my cluster to run the base-example, set the volume panel as above, and it works perfectly! 😊

Regarding my notebook image, I used the following commands in my Dockerfile to install Kale and build the Kale JupyterLab extension (based on JupyterLab 1.1.1):

RUN pip install git+https://github.com/kubeflow-kale/kale.git@kubecon-workshop    

RUN git clone https://github.com/kubeflow-kale/jupyterlab-kubeflow-kale.git \
    && cd jupyterlab-kubeflow-kale \
    && jlpm install \
    && jlpm run build \
    && jupyter labextension install .     

It works, but obviously doesn't give me the latest version. Did I miss something here? And would it be possible to share some details (Dockerfile etc.) about the gcr.io/arrikto-public/tensorflow-1.14.0-notebook-cpu:kubecon-workshop image?

Thanks!

elikatsis commented 4 years ago

JupyterLab extension development also lives in a kubecon-workshop branch. With those commands you have installed the master version.

The gcr.io/arrikto-public/tensorflow-1.14.0-notebook-cpu:kubecon-workshop image uses gcr.io/kubeflow-images-public/tensorflow-1.14.0-notebook-cpu:v-base-ef41372-1177829795472347138 as its base image. On top of that we install Rok, the latest KFP and Kale, and the Kale JupyterLab extension from the kubecon-workshop branch.

Finally, we run jupyter lab instead of jupyter notebook.

felihong commented 4 years ago

Hi @elikatsis , you are right about the branch; I had mistakenly pulled the master branch. Now I'm using the kubecon-workshop branch and everything works just fine! Thank you!