kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0
306 stars 143 forks source link

Fix Unit Tests #293

Closed andreyvelich closed 3 years ago

andreyvelich commented 4 years ago

Related: https://github.com/kubeflow/pytorch-operator/issues/292.

I was able to run unit test locally after these changes .

  1. I changed expectedServiceDeletions from 0 to 1. Here: https://github.com/kubeflow/pytorch-operator/blob/master/pkg/controller.v1/pytorch/job.go#L163-L171, we don't check that job.Spec.CleanPodPolicy == common.CleanPodPolicyRunning before deleting services. @jiaqianjing is that correct behaviour? In TF operator we have one loop for services and pods: https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1/tensorflow/job.go#L162-L172, but in PyTorch operator we analyse services and pods in the different loops.

  2. To fix this error: Fail in goroutine after TestCopyLabelsAndAnnotation has completed, I changed run test controller, like here: https://github.com/kubeflow/pytorch-operator/blob/master/pkg/controller.v1/pytorch/controller_test.go#L341-L347.

Do you know how we can add travis check before PR submission?

/assign @gaocegege @johnugeorge /cc @jiaqianjing

k8s-ci-robot commented 4 years ago

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: jiaqianjing.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to [this](https://github.com/kubeflow/pytorch-operator/pull/293): >Related: https://github.com/kubeflow/pytorch-operator/issues/292. > >I was able to run unit test locally after these changes . > >1. I changed `expectedServiceDeletions` from 0 to 1. Here: https://github.com/kubeflow/pytorch-operator/blob/master/pkg/controller.v1/pytorch/job.go#L163-L171, we don't check that `job.Spec.CleanPodPolicy == common.CleanPodPolicyRunning` before deleting services. >@jiaqianjing is that correct behaviour? In TF operator we have one loop for services and pods: https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1/tensorflow/job.go#L162-L172, but in PyTorch operator we analyse services and pods in the different loops. > >2. To fix this error: `Fail in goroutine after TestCopyLabelsAndAnnotation has completed`, I changed run test controller, like here: https://github.com/kubeflow/pytorch-operator/blob/master/pkg/controller.v1/pytorch/controller_test.go#L341-L347. > >Do you know how we can add travis check before PR submission? > >/assign @gaocegege @johnugeorge >/cc @jiaqianjing > Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
andreyvelich commented 4 years ago

/hold Travis still fails: https://travis-ci.com/github/andreyvelich/pytorch-operator/builds/177732485. How we can re-sync with coveralls to fix error ?

Bad response status from coveralls: 422
{"message":"Couldn't find a repository matching this job.","error":true}
The command "goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/apis/pytorch/*/zz_generated.*.go"" exited with 1.
gaocegege commented 4 years ago

I think it is caused by the version of the codegen tools

andreyvelich commented 4 years ago

@gaocegege Any ideas which version can affect on this? Also, do you know how we can add travis to PR presubmit checks ?

gaocegege commented 4 years ago

Can we merge it now?

gaocegege commented 4 years ago

No idea about it..

Any ideas which version can affect on this?

andreyvelich commented 4 years ago

@gaocegege We can try to merge it, but unit test can still be failed because of goveralls error. Maybe we should try to migrate to travis-ci.comfrom travis-ci.org ? I believe PyTorch operator still uses old version: https://travis-ci.com/organizations/kubeflow/migrate.

gaocegege commented 4 years ago

https://travis-ci.com/organizations/kubeflow/migrate @jlewi Hi Jeremy, can you help do it?

jlewi commented 4 years ago

@andreyvelich and @gaocegege As WG leads can you please come up with an approach for maintaining and supporting your testing infrastructure that minimizes the burden on the GitHub org admins. If you need a GitHub org change please open an issue and assign it to one of the org admins (other than me) listed in: https://github.com/kubeflow/internal-acls/blob/11984d38190b030fecf6886379f1f0eda9fa0000/github-orgs/kubeflow/org.yaml#L7

gaocegege commented 4 years ago

As WG leads can you please come up with an approach for maintaining and supporting your testing infrastructure that minimizes the burden on the GitHub org admins.

I think the best approach is to give WG Leads the permission to manage projects in these test infras. I will try to promote it with GitHub org admins

jlewi commented 4 years ago

I think the best approach is to give WG Leads the permission to manage projects in these test infras

What does this mean in terms of GitHub permissions?

andreyvelich commented 4 years ago

@gaocegege @jlewi Will admin permission for pytorch-operator repository be enough to control Travis? Like what GitHub team has for KF Serving. Currently, we have write permission.

/cc @johnugeorge

gaocegege commented 4 years ago

Can we merge it now?

andreyvelich commented 4 years ago

@gaocegege Sure, let's create an issue to track this problem.

gaocegege commented 4 years ago

/lgtm /approve

andreyvelich commented 4 years ago

/hold cancel

andreyvelich commented 4 years ago

/retest

andreyvelich commented 4 years ago

@Jeffwan @jinchihe This PR is needed to fix Travis and e2e bugs.

@jinchihe Do you have any ideas how to fix: evel=error msg="No Major.Minor.Patch elements found. I saw that you have fixed it for tf-operator: https://github.com/kubeflow/tf-operator/pull/1179, but we deploy cluster from the bash script in pytorch-operator currently.

jinchihe commented 4 years ago

@jinchihe Do you have any ideas how to fix: evel=error msg="No Major.Minor.Patch elements found. I saw that you have fixed it for tf-operator: kubeflow/tf-operator#1179, but we deploy cluster from the bash script in pytorch-operator currently.

Yeah, I found that's OK if use old k8s cluster, but for now the k8s 1.14.x cannot be deployed any more.

andreyvelich commented 4 years ago

@jinchihe Do you have any ideas how to fix: evel=error msg="No Major.Minor.Patch elements found. I saw that you have fixed it for tf-operator: kubeflow/tf-operator#1179, but we deploy cluster from the bash script in pytorch-operator currently.

Yeah, I found that's OK if use old k8s cluster, but for now the k8s 1.14.x cannot be deployed any more.

@jinchihe We use the same command in Katib to deploy cluster and it is working (https://github.com/kubeflow/katib/blob/master/test/scripts/v1beta1/create-cluster.sh#L36), but use kubectl and kustomize to deploy Katib and operators.

I believe it fails on this step. Do you know how we should specify cluster version in ksonnet app?

jinchihe commented 4 years ago

Do you know how we should specify cluster version in ksonnet app?

Not sure, but guess here, need to confirm. Thanks. https://github.com/kubeflow/pytorch-operator/blob/61fefa88f75b126fd7672f44b87351db511299cb/scripts/create-cluster.sh#L36

k8s-ci-robot commented 4 years ago

@andreyvelich: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
kubeflow-pytorch-operator-presubmit 152b920b8a0499e25cf2a505665a7d2bf0ea7e1b link /test kubeflow-pytorch-operator-presubmit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
k8s-ci-robot commented 3 years ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, gaocegege

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/kubeflow/pytorch-operator/blob/master/OWNERS)~~ [gaocegege] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
PatrickXYS commented 3 years ago

/lgtm