intel / vck

Volume Controller for Kubernetes
https://ai.intel.com/kubernetes-volume-controller-kvc-data-management-tailored-for-machine-learning-workloads-in-kubernetes/
Apache License 2.0

Append Pod logs to CR in case of data download failure #41

Closed ashahba closed 6 years ago

ashahba commented 6 years ago

Append KVC pod logs to CR Status message in case of download failure.
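For readers skimming the thread, the idea can be sketched roughly as follows. This is an illustrative Python sketch, not the controller's actual code: `build_failure_message`, the `fetch_logs` callable, the `max_chars` limit, and the fallback wording are all hypothetical names and choices made up for this example. Only the `error during data download using pod [name: ...]` prefix is taken from the sample output later in this thread.

```python
def build_failure_message(pod_name, fetch_logs, max_chars=1024):
    """Compose a CR status message for a failed data download.

    fetch_logs is a hypothetical callable that returns the pod's log text
    and may raise if the logs cannot be retrieved.
    """
    prefix = f"error during data download using pod [name: {pod_name}]"
    try:
        logs = fetch_logs(pod_name)
    except Exception as err:
        # Assumed fallback wording: report that the logs were unavailable.
        return f"{prefix}; failed to fetch pod logs: {err}"

    logs = logs.strip()
    if not logs:
        # Nothing useful to append, so keep the generic message.
        return prefix

    # Keep only the tail of the logs so the status field stays small.
    return f"{prefix}: {logs[-max_chars:]}"


if __name__ == "__main__":
    sample_logs = "Added `s3` successfully.\nmc: <ERROR> Unable to validate source\n"
    print(build_failure_message("kvc-resource-example", lambda name: sample_logs))
```

Truncating to the tail of the logs is one possible design choice here; it assumes the most relevant error lines appear last, which may not hold for every downloader.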

k8s-ci-robot commented 6 years ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

To fully approve this pull request, please assign additional approvers. We suggest the following additional approver: jose5918

Assign the PR to them by writing /assign @jose5918 in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubeflow/experimental-kvc/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

k8s-ci-robot commented 6 years ago

Hi @ashahba. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/devel/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

ashahba commented 6 years ago

/label wip

ashahba commented 6 years ago

Once merged, fixes #30

ashahba commented 6 years ago

/label no-wip

ashahba commented 6 years ago

$ kubectl get volumemanager scratchpad-bad -o yaml
apiVersion: kvc.kubeflow.org/v1
kind: VolumeManager
metadata:
  clusterName: ""
  creationTimestamp: 2018-05-09T00:03:07Z
  generation: 0
  name: scratchpad-bad
  namespace: ashahba
  resourceVersion: "461786"
  selfLink: /apis/kvc.kubeflow.org/v1/namespaces/ashahba/volumemanagers/scratchpad-bad
  uid: 5a907fb8-531c-11e8-9a04-42010a8a00bf
spec:
  state: ""
  volumeConfigs:
  - accessMode: ReadWriteOnce
    capacity: 100Mi
    endpointURL: https://storage.googleapis.com
    id: scratchpad-bad
    labels:
      key1: scratchpad-bad
    options:
      awsCredentialsSecretName: gcs-creds
      dataPath: /var/datasets
      timeoutForDataDownload: 2m
    replicas: 2
    sourceType: S3
    sourceURL: s3://ashahba/scratchpad-bad/
status:
  message: failed to deploy all the sub-resources
  state: Failed
  volumes:
  - id: scratchpad-bad
    message: |
      error during data download using pod [name: kvc-resource-5a9396da-531c-11e8-9cab-0a580a300196]: Added `s3` successfully.
      mc: <ERROR> Unable to validate source s3/ashahba/scratchpad-bad/
    nodeAffinity: {}
    volumeSource: {}

ashahba commented 6 years ago

$ kubectl get volumemanager scratchpad-bad -o yaml
apiVersion: kvc.kubeflow.org/v1
kind: VolumeManager
metadata:
  clusterName: ""
  creationTimestamp: 2018-05-09T06:15:37Z
  generation: 0
  name: scratchpad-bad
  namespace: ashahba
  resourceVersion: "489483"
  selfLink: /apis/kvc.kubeflow.org/v1/namespaces/ashahba/volumemanagers/scratchpad-bad
  uid: 6444dbe4-5350-11e8-9a04-42010a8a00bf
spec:
  state: ""
  volumeConfigs:
  - accessMode: ReadWriteOnce
    capacity: 100Mi
    endpointURL: https://storage.googleapis.com
    id: scratchpad-bad
    labels:
      key1: scratchpad-bad
    options:
      awsCredentialsSecretName: gcs-creds
      dataPath: /var/datasets
      timeoutForDataDownload: 2m
    replicas: 2
    sourceType: S3
    sourceURL: s3://ashahba/scratchpad-bad/
status:
  message: failed to deploy all the sub-resources
  state: Failed
  volumes:
  - id: scratchpad-bad
    message: |
      error during data download and failed to fetch logs for pod [name: kvc-resource-6446d87f-5350-11e8-8c69-0a580a300199]: Added `s3` successfully.
      mc: <ERROR> Unable to validate source s3/ashahba/scratchpad-bad/
    nodeAffinity: {}
    volumeSource: {}

ashahba commented 6 years ago

@Ajay191191 and @balajismaniam, let's agree on the message text and go from there. I agree that `Error during data download` is generic enough and should be fine if we don't want to confuse end users with more details. If that works, I'll go with that.

balajismaniam commented 6 years ago

@ashahba The e2e tests are failing. Everything else looks good. We can merge as soon as the e2e tests pass.