kubeflow / kfctl

kfctl is a CLI for deploying and managing Kubeflow
Apache License 2.0

How to troubleshoot the deployment of Kubeflow via the CLI on GCP? #391

Closed by theophilegervet 4 years ago

theophilegervet commented 4 years ago

Hi,

I'm following the instructions here to deploy Kubeflow on GCP via the CLI.

When I run kfctl apply -V -f ${CONFIG_FILE} after following all the previous instructions, I get:

INFO[0000] Creating default token source                 filename="gcp/gcp.go:184"
INFO[0000] Creating GCP client.                          filename="gcp/gcp.go:196"
INFO[0000] 
****************************************************************
Notice anonymous usage reporting enabled using spartakus
To disable it
If you have already deployed it run the following commands:
  cd $(pwd)
  kubectl -n ${K8S_NAMESPACE} delete deploy -l app=spartakus

For more info: https://www.kubeflow.org/docs/other-guides/usage-reporting/
****************************************************************
  filename="coordinator/coordinator.go:120"
INFO[0000] .cache/manifests exists; not resyncing        filename="kfconfig/types.go:468"
INFO[0000] folder gcp_config exists, skip gcp.Generate   filename="gcp/gcp.go:2063"
INFO[0000] folder kustomize exists, skip kustomize.Generate  filename="kustomize/kustomize.go:372"
INFO[0000] .cache/manifests exists; not resyncing        filename="kfconfig/types.go:468"
INFO[0000] GCP client already configured                 filename="gcp/gcp.go:169"
INFO[0000] Reading config file: /Users/thophile/kf-deployments/kf-test/gcp_config/storage-kubeflow.yaml  filename="gcp/gcp.go:288"
INFO[0000] Reading import file: /Users/thophile/kf-deployments/kf-test/gcp_config/storage.jinja  filename="gcp/gcp.go:324"
INFO[0000] Updating deployment kf-test-storage           filename="gcp/gcp.go:445"
INFO[0001] Reading config file: /Users/thophile/kf-deployments/kf-test/gcp_config/cluster-kubeflow.yaml  filename="gcp/gcp.go:288"
INFO[0001] Reading import file: /Users/thophile/kf-deployments/kf-test/gcp_config/cluster.jinja  filename="gcp/gcp.go:324"
INFO[0002] Updating deployment kf-test                   filename="gcp/gcp.go:445"
INFO[0003] Updating kf-test-storage status: RUNNING (op = operation-1596773068307-5ac41b385fdec-2c0e545a-7f82106f)  filename="gcp/gcp.go:390"
INFO[0006] Updating kf-test-storage status: RUNNING (op = operation-1596773068307-5ac41b385fdec-2c0e545a-7f82106f)  filename="gcp/gcp.go:390"
ERRO[0012] Updating kf-test-storage error: &{Code:NO_METHOD_TO_UPDATE_FIELD Location: Message:No method found to update field 'zone' on resource 'kf-test-storage-metadata-store' of type 'compute.v1.disk'. The resource may need to be recreated with the new field. ForceSendFields:[] NullFields:[]}  filename="gcp/gcp.go:386"
ERRO[0012] Updating kf-test-storage error: &{Code:NO_METHOD_TO_UPDATE_FIELD Location: Message:No method found to update field 'https' on resource 'kf-test-storage-metadata-store' of type 'compute.v1.disk'. The resource may need to be recreated with the new field. ForceSendFields:[] NullFields:[]}  filename="gcp/gcp.go:386"
ERRO[0012] Updating kf-test-storage error: &{Code:NO_METHOD_TO_UPDATE_FIELD Location: Message:No method found to update field 'zone' on resource 'kf-test-storage-artifact-store' of type 'compute.v1.disk'. The resource may need to be recreated with the new field. ForceSendFields:[] NullFields:[]}  filename="gcp/gcp.go:386"
ERRO[0012] Updating kf-test-storage error: &{Code:NO_METHOD_TO_UPDATE_FIELD Location: Message:No method found to update field 'https' on resource 'kf-test-storage-artifact-store' of type 'compute.v1.disk'. The resource may need to be recreated with the new field. ForceSendFields:[] NullFields:[]}  filename="gcp/gcp.go:386"
Error: failed to apply:  (kubeflow.error): Code 500 with message: coordinator Apply failed for gcp:  (kubeflow.error): Code 400 with message: gcp apply could not update deployment manager Error could not update deployment manager entries; Updating kf-test-storage error(400): BAD REQUEST
Usage:
  kfctl apply -f ${CONFIG} [flags]

Flags:
  -f, --file string   Static config file to use. Can be either a local path:
                            export CONFIG=./kfctl_gcp_iap.yaml
                        or a URL:
                            export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_gcp_iap.v1.0.0.yaml
                            export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_istio_dex.v1.0.0.yaml
                            export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_aws.v1.0.0.yaml
                            export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.0.yaml
                        kfctl apply -V --file=${CONFIG}
  -h, --help          help for apply
  -V, --verbose       verbose output default is false

failed to apply:  (kubeflow.error): Code 500 with message: coordinator Apply failed for gcp:  (kubeflow.error): Code 400 with message: gcp apply could not update deployment manager Error could not update deployment manager entries; Updating kf-test-storage error(400): BAD REQUEST

While troubleshooting, I ran gcloud --project=${PROJECT} deployment-manager deployments describe ${KF_NAME}, which gives:

---
fingerprint: YDZ2qtX5gyAKbxVu1_eDDg==
id: '747509287328219490'
insertTime: '2020-08-06T19:42:21.144-07:00'
name: kf-test
operation:
  endTime: '2020-08-06T21:04:36.399-07:00'
  error:
    errors:
    - code: NO_METHOD_TO_UPDATE_FIELD
      message: No method found to update field 'parent' on resource 'kf-test' of type
        'container-v1beta1'. The resource may need to be recreated with the new field.
    - code: NO_METHOD_TO_UPDATE_FIELD
      message: No method found to update field 'zone' on resource 'kf-test' of type
        'container-v1beta1'. The resource may need to be recreated with the new field.
  name: operation-1596773069714-5ac41b39b727b-7f54872e-a7e4340c
  operationType: update
  progress: 100
  startTime: '2020-08-06T21:04:29.832-07:00'
  status: DONE
  user: theophile.gervet@gmail.com
update:
  manifest: https://www.googleapis.com/deploymentmanager/v2/projects/kubeflow-test-285621/global/deployments/kf-test/manifests/manifest-1596773069790
NAME                 TYPE                                                               STATE      INTENT
kf-test              gcp-types/container-v1beta1:projects.locations.clusters            COMPLETED
kf-test-admin        iam.v1.serviceAccount                                              COMPLETED
kf-test-gpu-pool-v1  gcp-types/container-v1beta1:projects.locations.clusters.nodePools  FAILED     CREATE_OR_ACQUIRE
kf-test-ip           compute.v1.globalAddress                                           COMPLETED
kf-test-user         iam.v1.serviceAccount                                              COMPLETED
kf-test-vm           iam.v1.serviceAccount                                              COMPLETED

I am clueless as to what to do next... Thanks a lot for your help!

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.60
area/kfctl 0.93
platform/gcp 0.82

Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/question 0.62


theophilegervet commented 4 years ago

Fixed this by deleting the previous failed deployments.
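
For anyone hitting the same NO_METHOD_TO_UPDATE_FIELD error, here is a minimal sketch of what that cleanup might look like. It is an assumption about the fix, not a record of the exact commands that were run. The script scans kfctl output for deployments that failed with this error and builds a `gcloud deployment-manager deployments delete` command for each one, in the same form as the `describe` command used earlier in the thread. The function names and the `my-project` project ID are hypothetical. The commands are only printed, not run. Deleting a Deployment Manager deployment also deletes the resources it manages by default, so review the list first. In this thread the cluster deployment's errors (kf-test) appeared only in the `describe` output, not in the kfctl log, so a deployment like that would have to be added by hand.

```python
import re

# Matches kfctl log lines such as:
#   ERRO[0012] Updating kf-test-storage error: &{Code:NO_METHOD_TO_UPDATE_FIELD ...
ERROR_LINE = re.compile(
    r"Updating (?P<name>\S+) error: &\{Code:NO_METHOD_TO_UPDATE_FIELD"
)


def failed_deployments(log_text):
    """Return deployment names that hit NO_METHOD_TO_UPDATE_FIELD, in first-seen order."""
    names = []
    for match in ERROR_LINE.finditer(log_text):
        name = match.group("name")
        if name not in names:
            names.append(name)
    return names


def delete_commands(project, names):
    """Build (but do not run) one gcloud delete command per deployment name."""
    return [
        f"gcloud --project={project} deployment-manager deployments delete {name}"
        for name in names
    ]


if __name__ == "__main__":
    # Two lines abbreviated from the log earlier in this thread.
    sample_log = (
        "INFO[0003] Updating kf-test-storage status: RUNNING\n"
        "ERRO[0012] Updating kf-test-storage error: "
        "&{Code:NO_METHOD_TO_UPDATE_FIELD Location: Message:No method found "
        "to update field 'zone' on resource 'kf-test-storage-metadata-store'}\n"
    )
    # "my-project" is a placeholder project ID, not taken from the thread.
    for command in delete_commands("my-project", failed_deployments(sample_log)):
        print(command)
```

After the old deployments are gone, rerunning kfctl apply -V -f ${CONFIG_FILE} should create them from scratch instead of trying to update fields such as 'zone' in place.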