intel / vck

Volume Controller for Kubernetes
https://ai.intel.com/kubernetes-volume-controller-kvc-data-management-tailored-for-machine-learning-workloads-in-kubernetes/
Apache License 2.0

Creating a PVC hangs, and causes all subsequent attempts to create PVCs to hang as well, if timeoutForDataDownload is a non-quoted integer value. #54

Open ashahba opened 6 years ago

ashahba commented 6 years ago

The steps to reproduce this are described in #50.

balajismaniam commented 6 years ago

This kind of error is already handled in https://github.com/kubeflow/experimental-kvc/blob/master/pkg/handlers/s3_handler.go#L75-L79. I also checked, and it seems to work as expected: https://play.golang.org/p/xk4SdVKwX-B.

Can you paste the exact CR YAML here, along with the commit you are using to reproduce this error? I will test it on my side.

balajismaniam commented 6 years ago

@ashahba do you get errors similar to "cannot use 10 (type int) as type string in argument to time.ParseDuration" in the controller logs?

The type for options is also map[string]string. I expect the non-quoted integer to be parsed as a string. Maybe there is some internal YAML parsing weirdness going on here.

ashahba commented 6 years ago

@balajismaniam I don't get any helpful error like that, but this one keeps being reported:

I0518 16:28:32.451026       1 reflector.go:240] Listing and watching *v1.VolumeManager from github.com/kubeflow/experimental-kvc/pkg/client/informers/externalversions/factory.go:60
E0518 16:28:32.453485       1 reflector.go:205] github.com/kubeflow/experimental-kvc/pkg/client/informers/externalversions/factory.go:60: Failed to list *v1.VolumeManager: v1.VolumeManagerList.Items: []v1.VolumeManager: v1.VolumeManager.Spec: v1.VolumeManagerSpec.VolumeConfigs: []v1.VolumeConfig: v1.VolumeConfig.Options: ReadString: expects " or n, but found 1, error found in #10 byte of ...|ownload":10},"replic|..., bigger context ...|SecretName":"gcs-creds","timeoutForDataDownload":10},"replicas":2,"sourceType":"S3","sourceURL":"s3:|...

Here is the YAML file:

apiVersion: kvc.kubeflow.org/v1
kind: VolumeManager
metadata:
  name: scratchpad-notimeout-bad-hangs
  namespace: ashahba
spec:
  volumeConfigs:
    - id: "scratchpad-notimeout-bad-hangs"
      sourceType: "S3"
      sourceURL: "SOME_VALID_S3_PATH"
      accessMode: "ReadWriteOnce"
      endpointURL: "https://storage.googleapis.com"
      capacity: 100Mi
      replicas: 2
      labels:
        key1: scratchpad-notimeout-bad-hangs
      options:
        awsCredentialsSecretName: gcs-creds
        timeoutForDataDownload: 10
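
The reflector error in the log suggests that the unquoted 10 is stored as a JSON number, which then cannot be decoded into a map[string]string field. Because the failure occurs while listing all VolumeManager resources, one bad resource would plausibly block the others too. The following is a minimal sketch of that decoding behavior, not the controller's code. It assumes the standard library encoding/json package rather than the decoder seen in the log, so the error text will differ, and volumeConfig and decodeConfig are hypothetical names that model only the options field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// volumeConfig is a hypothetical stand-in for the controller's VolumeConfig type.
// Only the options field, described in the thread as map[string]string, is modeled.
type volumeConfig struct {
	Options map[string]string `json:"options"`
}

// decodeConfig decodes the JSON form of a volume config into volumeConfig.
func decodeConfig(raw string) (volumeConfig, error) {
	var vc volumeConfig
	err := json.Unmarshal([]byte(raw), &vc)
	return vc, err
}

func main() {
	// An unquoted YAML scalar such as 10 is carried as a JSON number.
	unquoted := `{"options":{"awsCredentialsSecretName":"gcs-creds","timeoutForDataDownload":10}}`
	// A quoted YAML scalar such as "10" is carried as a JSON string.
	quoted := `{"options":{"awsCredentialsSecretName":"gcs-creds","timeoutForDataDownload":"10"}}`

	if _, err := decodeConfig(unquoted); err != nil {
		fmt.Println("unquoted value rejected:", err)
	}
	if vc, err := decodeConfig(quoted); err == nil {
		fmt.Println("quoted value accepted:", vc.Options["timeoutForDataDownload"])
	}
}
```

Under these assumptions, quoting the value in the CR (timeoutForDataDownload: "10") keeps it a string through decoding. Whether the handler then accepts that particular value as a timeout is a separate question that this sketch does not address.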