kubeflow / kfctl

kfctl is a CLI for deploying and managing Kubeflow
Apache License 2.0

kfctl_existing postsubmit is failing #60

Open jlewi opened 5 years ago

jlewi commented 5 years ago

Here's the postsubmit test grid: https://k8s-testgrid.appspot.com/sig-big-data#kubeflow-postsubmit&group-by-target=&group-by-hierarchy-pattern=%5B%5Cw-%5D%2B

Lots of red. Here's a failed run. https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/kubeflow_kubeflow/kubeflow-postsubmit/1186679529897201664

jlewi commented 4 years ago

@yanniszark any update on this?

jlewi commented 4 years ago

Here's a recent run https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/kubeflow-periodic-master/1189264851759796234

http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-periodic-master-kfctl-go-existing-v07-6234-0336?tab=workflow

Here are the logs from the test

kubeflow-periodic-master-kfctl-go-existing-v07-6234-0336-3078399875.log.txt

This looks like a problem with the test. The cluster creation script is failing because the cluster already exists.
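One way a test setup step can tolerate a leftover cluster is to check for it before creating it. The following is only a minimal sketch of that idea: `ensure_cluster`, the `existing` set, and the `create` callable are hypothetical stand-ins, not the actual cluster creation script used by this test.

```python
def ensure_cluster(name, existing, create):
    """Create `name` only if it is not already present.

    `existing` is a set of known cluster names and `create` is any callable
    that provisions a cluster. Both are hypothetical stand-ins for the real
    test tooling. Returns True if a cluster was created, False if reused.
    """
    if name in existing:
        # A previous run left the cluster behind; reuse it instead of failing.
        return False
    create(name)
    existing.add(name)
    return True


# Example with in-memory stand-ins (cluster names are made up).
created = []
clusters = {"kfctl-existing-test"}
first = ensure_cluster("kfctl-existing-test", clusters, created.append)
second = ensure_cluster("kfctl-existing-new", clusters, created.append)
```

Whether reusing the cluster, deleting it first, or generating a unique name per run is the right choice depends on how the test is expected to clean up after itself.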

Since the problem appears to be in the test rather than in Kubeflow, I'm going to say that this is not release blocking right now.

jlewi commented 4 years ago

@yanniszark Do you agree that this isn't release blocking right now, given that we have no signal indicating that Kubeflow itself, rather than the test, is broken?

yanniszark commented 4 years ago

I thought there was another issue for this where I commented. Installing Kubeflow by hand seems to work fine. I would still like to know why the tests are failing, though; there may be a bug hidden in there. I will update this issue with my findings today.

yanniszark commented 4 years ago

@jlewi my findings so far:

  1. The initial installation fails because of this code: https://github.com/kubeflow/kubeflow/blob/7f64d8b023147927b74139bbdbbffa1ffca536bc/py/kubeflow/kubeflow/ci/kfctl_go_test_utils.py#L261

The v0.7 existing_arrikto config doesn't use a Plugin, so the test fails with KeyError.

  2. After fixing that, I get timeouts waiting for the minio deployment. I'm not sure what the issue is yet; it could just be that the test doesn't wait long enough.
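
Both findings can be illustrated with a short sketch. It assumes the config is a dict parsed from YAML and that the failing lookup is a plain index on a `plugins` key under `spec`; those key names, and the helpers `get_plugins` and `wait_until`, are assumptions for illustration, not the code at the linked line.

```python
import time


def get_plugins(config_spec):
    """Return the plugin list from a parsed config dict, or [] if absent.

    Assumes plugins live under spec.plugins (an assumption, not confirmed
    by the thread). A missing or empty key yields [] instead of KeyError.
    """
    spec = config_spec.get("spec") or {}
    return spec.get("plugins") or []


def wait_until(check, timeout_s, interval_s=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll `check()` until it is truthy or `timeout_s` seconds elapse.

    Returns True on success and False on timeout. `sleep` and `clock` are
    injectable so the helper can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# A config without plugins no longer raises.
no_plugins = get_plugins({"spec": {"applications": []}})

# A readiness check that succeeds on the third poll (stand-in for a
# deployment becoming available).
attempts = iter([False, False, True])
ready = wait_until(lambda: next(attempts), timeout_s=5,
                   interval_s=0, sleep=lambda s: None)
```

Making the timeout an explicit parameter makes it easy to tell a wait that is simply too short apart from a deployment that never becomes ready.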

jlewi commented 4 years ago

@yanniszark Any update on this?

jlewi commented 4 years ago

@yanniszark any update?

yanniszark commented 4 years ago

@jlewi thanks for the ping. I haven't had many cycles to put into this; I will try to allocate some.

jlewi commented 4 years ago

@yanniszark any chance you will be able to work on this in the coming weeks? It would be great to have a working test in advance of the 1.0 release.

crobby commented 4 years ago

Any chance of an update on this? It would be great to have working tests for the 1.1 release. Thanks

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
area/testing 0.76
