Hey @Felihong! Thank you for trying out Kale. I would suggest trying out Kale from the kubecon-workshop branch (same thing for the JupyterLab extension). There are loads of new features there that we will merge into master in the next weeks. Please note that we have been developing targeting MiniKF, so there might be issues when running Kale outside MiniKF.
In any case, please report in detail any issue you find using the versions under the kubecon-workshop branches, so that we can track and solve any issues when running Kale in a full Kubeflow cluster.
Thanks for the timely reply!
So I installed the kubecon-workshop branch of kubeflow-kale using pip install git+https://github.com/kubeflow-kale/kale.git@kubecon-workshop.
However, when installing the JupyterLab extension using pip install git+https://github.com/kubeflow-kale/jupyterlab-kubeflow-kale.git@kubecon-workshop, I got the error below:
Collecting git+https://github.com/kubeflow-kale/jupyterlab-kubeflow-kale.git@kubecon-workshop
Cloning https://github.com/kubeflow-kale/jupyterlab-kubeflow-kale.git (to revision kubecon-workshop) to /tmp/pip-req-build-3y7xrwyy
Running command git clone -q https://github.com/kubeflow-kale/jupyterlab-kubeflow-kale.git /tmp/pip-req-build-3y7xrwyy
Running command git checkout -b kubecon-workshop --track origin/kubecon-workshop
Switched to a new branch 'kubecon-workshop'
Branch 'kubecon-workshop' set up to track remote branch 'kubecon-workshop' from 'origin'.
ERROR: Command errored out with exit status 1:
command: /opt/conda/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-3y7xrwyy/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-3y7xrwyy/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-req-build-3y7xrwyy/pip-egg-info
cwd: /tmp/pip-req-build-3y7xrwyy/
Complete output (5 lines):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.7/tokenize.py", line 447, in open
    buffer = _builtin_open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-req-build-3y7xrwyy/setup.py'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Any ideas?
I just followed the instructions in the Contributing section instead, and the extension has been properly installed. Thanks!
Hi, here are some issues I found under the kubecon-workshop branch:
candies_sharing
After I defined the experiment name and pipeline name in the panel, an error appears saying the experiment name is empty:
(400)
Reason: Bad Request
HTTP response headers:
HTTPHeaderDict({'content-type': 'application/json', 'trailer': 'Grpc-Trailer-Content-Type', 'date': 'Fri, 22 Nov 2019 09:33:56 GMT', 'x-envoy-upstream-service-time': '0', 'server': 'envoy', 'transfer-encoding': 'chunked'})
HTTP response body: {
"error":"Validate experiment request failed.: Invalid input error: Experiment name is empty. Please specify a valid experiment name.",
"message":"Validate experiment request failed.: Invalid input error: Experiment name is empty. Please specify a valid experiment name.","code":3,
"details":[{"@type":"type.googleapis.com/api.Error","error_message":"Experiment name is empty. Please specify a valid experiment name.","error_details":"Validate experiment request failed.: Invalid input error: Experiment name is empty. Please specify a valid experiment name."}]}
In the generated candies-sharing-urgrg.kale.py script, the experiment name fails to be generated:
if __name__ == "__main__":
    pipeline_func = auto_generated_pipeline
    pipeline_filename = pipeline_func.__name__ + '.pipeline.tar.gz'
    import kfp.compiler as compiler
    compiler.Compiler().compile(pipeline_func, pipeline_filename)
    # Get or create an experiment and submit a pipeline run
    import kfp
    client = kfp.Client()
    experiment = client.create_experiment('')
    # Submit a pipeline run
    run_name = 'candies-sharing-urgrg_run'
    run_result = client.run_pipeline(
        experiment.id, run_name, pipeline_filename, {})
The pipeline can be successfully uploaded but fails to run.
I also found that even though I defined the experiment name for the titanic-dataset-ml notebook as kale-titanic-experiment, the generated experiment name does not match:
if __name__ == "__main__":
    pipeline_func = auto_generated_pipeline
    pipeline_filename = pipeline_func.__name__ + '.pipeline.tar.gz'
    import kfp.compiler as compiler
    compiler.Compiler().compile(pipeline_func, pipeline_filename)
    # Get or create an experiment and submit a pipeline run
    import kfp
    client = kfp.Client()
    experiment = client.create_experiment('Titanic')
    # Submit a pipeline run
    run_name = 'titanic-ml-j9crb_run'
    run_result = client.run_pipeline(
        experiment.id, run_name, pipeline_filename, {})
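For reference, the footer uses the plain KFP client API, so with the UI-defined name propagated correctly one would expect something like the following (a sketch using the names from this report, not Kale's actual output):

import kfp

client = kfp.Client()
# The experiment name defined in the Kale panel should end up here:
experiment = client.create_experiment('kale-titanic-experiment')
run_result = client.run_pipeline(
    experiment.id, 'titanic-ml-j9crb_run',
    'auto_generated_pipeline.pipeline.tar.gz', {})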
titanic-dataset-ml
It seems like the old issue of the unbound PVC is not solved in the workshop branch; the loaddata component cannot finish:
This step is in Pending state with this message: Unschedulable: pod has unbound immediate PersistentVolumeClaims (repeated 2 times)
Status of the automatically generated PVC:
Name: titanic-ml-xd79q-zdxbx-kale-marshal-pvc
Namespace: kubeflow
StorageClass: standard
Status: Pending
Volume:
Labels: <none>
Annotations: volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 38s (x597 over 22h) persistentvolume-controller Failed to provision volume with StorageClass "standard": invalid AccessModes [ReadWriteMany]: only AccessModes [ReadWriteOnce ReadOnlyMany] are supported
Mounted By: <none>
PS: No volume mounts are manually defined in either experiment.
Hey @Felihong, I suspect you are running Kale in your own Kubeflow cluster and not in MiniKF, is that right? Currently we have everything working in MiniKF, and in the next weeks we will work to expand support to full Kubeflow clusters!
Can you provide more information about your environment?
Hi @StefanoFioravanzo, yes, I'm running Kale in a Kubeflow cluster deployed on GKE, and it would be great if Kale could be extended to full Kubeflow clusters. Thank you, and I'm very glad to help!
And here is the spec of the pod where the notebook is running:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    sidecar.istio.io/status: '{"version":"5f3ae3613c7945ef767cb9fd594596bc001ff3ab915f12e4379c0cb5648d2729","initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-certs"],"imagePullSecrets":null}'
  generateName: kale-test
  labels:
    app: kale-conda-test
    controller-revision-hash: kale-test-5f6587c7d5
    notebook-name: kale-test
    statefulset: kale-test
    statefulset.kubernetes.io/pod-name: kale-test-0
  name: kale-test-0
  namespace: [USER_NAMESPACE]
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: kale-test
    uid: 1cc5023c-0d37-11ea-ba66-42010a84025a
  resourceVersion: "2762529"
  selfLink: /api/v1/namespaces/[USER_NAMESPACE]/pods/kale-test-0
  uid: 1d47af68-0d37-11ea-ba66-42010a84025a
spec:
  containers:
  - env:
    - name: NB_PREFIX
      value: /notebook/[USER_NAMESPACE]/kale-test
    image: [IMAGE_NAME]
    imagePullPolicy: IfNotPresent
    name: kale-test
    ports:
    - containerPort: 8888
      name: notebook-port
      protocol: TCP
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-editor-token-6qblc
      readOnly: true
    workingDir: /home/jovyan
  - args:
    - proxy
    - sidecar
    - --domain
    - $(POD_NAMESPACE).svc.cluster.local
    - --configPath
    - /etc/istio/proxy
    - --binaryPath
    - /usr/local/bin/envoy
    - --serviceCluster
    - kale-test.$(POD_NAMESPACE)
    - --drainDuration
    - 45s
    - --parentShutdownDuration
    - 1m0s
    - --discoveryAddress
    - istio-pilot.istio-system:15010
    - --zipkinAddress
    - zipkin.istio-system:9411
    - --connectTimeout
    - 10s
    - --proxyAdminPort
    - "15000"
    - --concurrency
    - "2"
    - --controlPlaneAuthPolicy
    - NONE
    - --statusPort
    - "15020"
    - --applicationPorts
    - "8888"
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: INSTANCE_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: ISTIO_META_POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: ISTIO_META_CONFIG_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: ISTIO_META_INTERCEPTION_MODE
      value: REDIRECT
    - name: ISTIO_METAJSON_LABELS
      value: |
        {"app":"kale-test","controller-revision-hash":"kale-test-5f6587c7d5","notebook-name":"kale-test","statefulset":"kale-test","statefulset.kubernetes.io/pod-name":"kale-test-0"}
    image: docker.io/istio/proxyv2:1.1.6
    imagePullPolicy: IfNotPresent
    name: istio-proxy
    ports:
    - containerPort: 15090
      name: http-envoy-prom
      protocol: TCP
    readinessProbe:
      failureThreshold: 30
      httpGet:
        path: /healthz/ready
        port: 15020
        scheme: HTTP
      initialDelaySeconds: 1
      periodSeconds: 2
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: "2"
        memory: 128Mi
      requests:
        cpu: 10m
        memory: 40Mi
    securityContext:
      readOnlyRootFilesystem: true
      runAsUser: 1337
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/istio/proxy
      name: istio-envoy
    - mountPath: /etc/certs/
      name: istio-certs
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-editor-token-6qblc
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: kale-test-0
  initContainers:
  - args:
    - -p
    - "15001"
    - -u
    - "1337"
    - -m
    - REDIRECT
    - -i
    - '*'
    - -x
    - ""
    - -b
    - "8888"
    - -d
    - "15020"
    image: docker.io/istio/proxy_init:1.1.6
    imagePullPolicy: IfNotPresent
    name: istio-init
    resources:
      limits:
        cpu: 100m
        memory: 50Mi
      requests:
        cpu: 10m
        memory: 10Mi
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
      runAsNonRoot: false
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  nodeName: [NODE_NAME]
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 100
  serviceAccount: default-editor
  serviceAccountName: default-editor
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir:
      medium: Memory
    name: dshm
  - name: default-editor-token-6qblc
    secret:
      defaultMode: 420
      secretName: default-editor-token-6qblc
  - emptyDir:
      medium: Memory
    name: istio-envoy
  - name: istio-certs
    secret:
      defaultMode: 420
      optional: true
      secretName: istio.default-editor
Hi @Felihong,
Regarding the unbound PVC: Kale creates PVCs with the access mode set to ReadWriteMany. Does your cluster support RWX PVCs?
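One quick way to check is to list the cluster's StorageClasses and their provisioners, since GCE PD-backed classes only support ReadWriteOnce/ReadOnlyMany. A minimal sketch with the Kubernetes Python client, assuming a configured kubeconfig:

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod
for sc in client.StorageV1Api().list_storage_class().items:
    print(sc.metadata.name, sc.provisioner)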
Hey @elikatsis, thanks for the info!
I'm using the volume type gcePersistentDisk, which is defined by default. Unfortunately, it seems like it doesn't support the RWX access mode yet: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#types-of-persistent-volumes
I guess in this case I should create another StorageClass and use some type like NFS?
Btw, should this volume be a data volume or a workspace volume, which can be created in the Kubeflow UI? I'm kind of confused...
Thank you in advance!
I guess in this case I should create another StorageClass and use some type like NFS?
That would work. But you should also set it as the default storage class, because the default storage class is the one that gets chosen. We could add options to choose such things in the future.
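For example, promoting an RWX-capable class to default could look like the following sketch with the Kubernetes Python client (the class names nfs-client and standard are assumptions based on a typical GKE + NFS-provisioner setup; the same can be done with kubectl patch storageclass):

from kubernetes import client, config

config.load_kube_config()
annotation = "storageclass.kubernetes.io/is-default-class"
storage_api = client.StorageV1Api()
# Promote the RWX-capable class to be the default...
storage_api.patch_storage_class(
    name="nfs-client",
    body={"metadata": {"annotations": {annotation: "true"}}})
# ...and demote the previous default, so only one class is marked default.
storage_api.patch_storage_class(
    name="standard",
    body={"metadata": {"annotations": {annotation: "false"}}})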
Btw, should this volume be a data volume or a workspace volume, which can be created in the Kubeflow UI? I'm kind of confused...
Volumes mounted on pipeline steps don't know anything about workspace volumes or data volumes.
Let's say that they are all considered data volumes.
So, the marshal volume, which is the one you have issues with, is a filesystem where data passed between steps is saved to and loaded from.
There is only one special case: if you use the default notebook image, along with your notebook server's workspace volume mounted under the same mount point (/home/jovyan), then any installed library will be present during a step's execution.
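To illustrate the idea (a conceptual sketch, not Kale's actual implementation), each step serializes its outputs onto the shared marshal volume and downstream steps load them back; the /marshal mount point below is hypothetical:

import os
import pickle

MARSHAL_DIR = "/marshal"  # hypothetical mount point of the marshal PVC

def save(name, obj):
    # Producer step: serialize an output onto the shared RWX volume.
    with open(os.path.join(MARSHAL_DIR, name + ".pkl"), "wb") as f:
        pickle.dump(obj, f)

def load(name):
    # Consumer step: read the upstream output back from the volume.
    with open(os.path.join(MARSHAL_DIR, name + ".pkl"), "rb") as f:
        return pickle.load(f)

This is also why the volume needs ReadWriteMany: multiple step pods, possibly on different nodes, mount it at the same time.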
Hi there 👋,
so my pod is now using a Google Filestore-backed PV with RWX access mode, which is dynamically provisioned.
kubectl get storageclass
NAME                   PROVISIONER                                    AGE
nfs-client (default)   cluster.local/nfs-cp-nfs-client-provisioner   39m
standard               kubernetes.io/gce-pd                           4d21h
The problem now seems different from before when I test-run the example pipeline candies_sharing.
In the first component, kale-marshal-volume, I can see that the pipeline is now successfully bound to a volume:
kale-marshal-volume-manifest
map[apiVersion:v1 metadata:map[name:candies-sharing-5zg0x-wmr5v-kale-marshal-pvc namespace:kubeflow
selfLink:/api/v1/namespaces/kubeflow/persistentvolumeclaims/candies-sharing-5zg0x-wmr5v-kale-marshal-pvc uid:5eff311e-14ea-11ea-a64c-42010a84021c resourceVersion:3100421 creationTimestamp:2019-12-02T09:59:03Z
annotations:map[pv.kubernetes.io/bind-completed:yes pv.kubernetes.io/bound-by-controller:yes volume.beta.kubernetes.io/storage-provisioner:cluster.local/nfs-cp-nfs-client-provisioner] finalizers:[kubernetes.io/pvc-protection]]
spec:map[resources:map[requests:map[storage:1Gi]] volumeName:pvc-5eff311e-14ea-11ea-a64c-42010a84021c storageClassName:nfs-client volumeMode:Filesystem accessModes:[ReadWriteMany]]
status:map[phase:Bound accessModes:[ReadWriteMany] capacity:map[storage:1Gi]] kind:PersistentVolumeClaim]
kale-marshal-volume-name: candies-sharing-5zg0x-wmr5v-kale-marshal-pvc
kale-marshal-volume-size: 1Gi
However, in the second component, sack, I got an error somehow related to snapshotting:
Traceback (most recent call last):
  File "<string>", line 36, in <module>
  File "<string>", line 16, in sack
  File "/opt/conda/lib/python3.7/site-packages/kale/utils/pod_utils.py", line 171, in snapshot_pipeline_step
    from rok_gw_client.client import RokClient
ModuleNotFoundError: No module named 'rok_gw_client'
I also tried to manually define the volume and create/define a snapshot for it, but the error stays the same. Any ideas? Thank you!
@Felihong It looks like you are now able to correctly provision and bind a volume, which is great. The issue now is that we have been building images using the rok_gw_client library, which is not publicly available. @elikatsis, we should make sure that rok_gw_client is not a hard dependency, and that if it fails to import, the Rok integration is disabled.
Opened #20 to track this
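A common pattern for such a soft dependency (a sketch of the idea, not Kale's actual code; the wrapper below is hypothetical) is to attempt the import once and degrade gracefully:

try:
    from rok_gw_client.client import RokClient  # proprietary Rok client
    _ROK_AVAILABLE = True
except ImportError:
    _ROK_AVAILABLE = False

def snapshot_pipeline_step(step_name):
    # Skip snapshotting entirely when the Rok client is not installed.
    if not _ROK_AVAILABLE:
        print("rok_gw_client not installed; skipping Rok snapshot for "
              "step '%s'." % step_name)
        return None
    rok = RokClient()
    # ... the actual Rok snapshot logic would go here ...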
@Felihong, I'm glad you set up a storage class that can provide RWX PVCs. [I'd been trying to find out what's up with the glusterfs issue but couldn't find any info.]
I've also figured out what is wrong with the pipeline name and experiment name issue you mentioned in your first comments. Thank you for reporting it! A fix for that will be included in upcoming releases.
The rok_gw_client issue you mention should only occur if you have the Take Rok snapshots before each step UI option switched on. Try turning it off when deploying the pipeline.
Edit: We will make sure that Rok/MiniKF-specific options are disabled when these features are properly released.
@elikatsis Thanks for pointing that out! Looking forward to the new releases :)
Regarding the rok_gw_client issue, I didn't manage to locate the Take Rok snapshots before each step option in the extension. Do you actually mean the KALE DEPLOYMENT PANEL?
The truth is I only defined the experiment and pipeline names; no volumes were defined on my side. (I don't even have a snapshot API enabled in my cluster.)
Is there a way to edit my locally installed pod_utils.py script to disable the Rok integration?
Regarding the rok_gw_client issue, I didn't manage to locate the Take Rok snapshots before each step option in the extension. Do you actually mean the KALE DEPLOYMENT PANEL?
Yes, in the KALE DEPLOYMENT PANEL you should set your volumes pane like this:
Rok is imported lazily, only when it is called, so if you disable all related options [which should be possible at the moment], it should work. Please report back if you try it and it doesn't work.
Is there a way to do some edits in my local installed pod_utils.py script to disable the rok integration?
That would not be very easy. You would only modify the current container's filesystem, not the Docker image which is used for the pods.
You would have to build a new Docker image with a custom Kale installation and pass that to all steps via Additional Settings.
But if you, let's say, delete lines that seem related to Rok and snapshotting, that would be different from disabling the features, so something could break.
Hi @elikatsis, thank you so much for your suggestion! I have now noticed that my container is not up to date, which is why I can't get the latest extension version.
The good news is that I tried the image gcr.io/arrikto-public/tensorflow-1.14.0-notebook-cpu:kubecon-workshop (following https://codelabs.developers.google.com/codelabs/cloud-kubeflow-minikf-kale/#3) in my cluster to run the base-example, set the volume panel as above, and it works perfectly! 😊
Regarding my notebook image, I used the following commands in my Dockerfile to integrate and build Kale and the Kale JupyterLab extension (based on JupyterLab 1.1.1):
RUN pip install git+https://github.com/kubeflow-kale/kale.git@kubecon-workshop
RUN git clone https://github.com/kubeflow-kale/jupyterlab-kubeflow-kale.git \
&& cd jupyterlab-kubeflow-kale \
&& jlpm install \
&& jlpm run build \
&& jupyter labextension install .
It works but obviously doesn't lead me to the latest version. Did I miss something here?
And would it be possible to share some details (Dockerfile etc.) about the image gcr.io/arrikto-public/tensorflow-1.14.0-notebook-cpu:kubecon-workshop?
Thanks!
JupyterLab extension development also lives in a kubecon-workshop branch. Using those commands you have installed the master version.
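For example (an untested sketch mirroring your own snippet; only the branch selection changes), cloning the workshop branch explicitly should give the matching extension version:

RUN git clone -b kubecon-workshop https://github.com/kubeflow-kale/jupyterlab-kubeflow-kale.git \
    && cd jupyterlab-kubeflow-kale \
    && jlpm install \
    && jlpm run build \
    && jupyter labextension install .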
The gcr.io/arrikto-public/tensorflow-1.14.0-notebook-cpu:kubecon-workshop image has this as its base image: gcr.io/kubeflow-images-public/tensorflow-1.14.0-notebook-cpu:v-base-ef41372-1177829795472347138.
On top of that we install Rok, the latest KFP and Kale, and the Kale JupyterLab extension from the kubecon-workshop branch.
Finally, we run jupyter lab instead of jupyter notebook.
Hi @elikatsis, you are right about the branch; I had mistakenly pulled the master branch.
Now I'm using the kubecon-workshop branch and everything works just fine!
Thank you!
Hi there,
first of all, thanks for developing such a great and useful tool!
I installed Kale in my Kubeflow notebook server based on GKE (with snapshots created) and cloned the titanic example to give it a try. The pipeline can be successfully compiled and uploaded; however, the loaddata component cannot be completed, as there comes a warning:
This step is in Pending state with this message: Unschedulable: pod has unbound immediate PersistentVolumeClaims (repeated 2 times)
And here's the log file
I would very much appreciate it if someone could kindly point out whether I configured the volumes correctly. Thanks!