Closed andreyvelich closed 3 years ago
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: jiaqianjing.
Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.
/hold Travis still fails: https://travis-ci.com/github/andreyvelich/pytorch-operator/builds/177732485. How we can re-sync with coveralls to fix error ?
Bad response status from coveralls: 422
{"message":"Couldn't find a repository matching this job.","error":true}
The command "goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/apis/pytorch/*/zz_generated.*.go"" exited with 1.
I think it is caused by the version of the codegen tools
@gaocegege Any ideas which version can affect on this? Also, do you know how we can add travis to PR presubmit checks ?
Can we merge it now?
No idea about it..
Any ideas which version can affect on this?
@gaocegege We can try to merge it, but unit test can still be failed because of goveralls
error.
Maybe we should try to migrate to travis-ci.com
from travis-ci.org
?
I believe PyTorch operator still uses old version: https://travis-ci.com/organizations/kubeflow/migrate.
https://travis-ci.com/organizations/kubeflow/migrate @jlewi Hi Jeremy, can you help do it?
@andreyvelich and @gaocegege As WG leads can you please come up with an approach for maintaining and supporting your testing infrastructure that minimizes the burden on the GitHub org admins. If you need a GitHub org change please open an issue and assign it to one of the org admins (other than me) listed in: https://github.com/kubeflow/internal-acls/blob/11984d38190b030fecf6886379f1f0eda9fa0000/github-orgs/kubeflow/org.yaml#L7
As WG leads can you please come up with an approach for maintaining and supporting your testing infrastructure that minimizes the burden on the GitHub org admins.
I think the best approach is to give WG Leads the permission to manage projects in these test infras. I will try to promote it with GitHub org admins
I think the best approach is to give WG Leads the permission to manage projects in these test infras
What does this mean in terms of GitHub permissions?
@gaocegege @jlewi Will admin permission for pytorch-operator repository be enough to control Travis? Like what GitHub team has for KF Serving. Currently, we have write permission.
/cc @johnugeorge
Can we merge it now?
@gaocegege Sure, let's create an issue to track this problem.
/lgtm /approve
/hold cancel
/retest
@Jeffwan @jinchihe This PR is needed to fix Travis and e2e bugs.
@jinchihe Do you have any ideas how to fix: evel=error msg="No Major.Minor.Patch elements found
.
I saw that you have fixed it for tf-operator: https://github.com/kubeflow/tf-operator/pull/1179, but we deploy cluster from the bash script in pytorch-operator currently.
@jinchihe Do you have any ideas how to fix: evel=error msg="No Major.Minor.Patch elements found. I saw that you have fixed it for tf-operator: kubeflow/tf-operator#1179, but we deploy cluster from the bash script in pytorch-operator currently.
Yeah, I found that's OK if use old k8s cluster, but for now the k8s 1.14.x cannot be deployed any more.
@jinchihe Do you have any ideas how to fix: evel=error msg="No Major.Minor.Patch elements found. I saw that you have fixed it for tf-operator: kubeflow/tf-operator#1179, but we deploy cluster from the bash script in pytorch-operator currently.
Yeah, I found that's OK if use old k8s cluster, but for now the k8s 1.14.x cannot be deployed any more.
@jinchihe We use the same command in Katib to deploy cluster and it is working (https://github.com/kubeflow/katib/blob/master/test/scripts/v1beta1/create-cluster.sh#L36), but use kubectl and kustomize to deploy Katib and operators.
I believe it fails on this step. Do you know how we should specify cluster version in ksonnet app?
Do you know how we should specify cluster version in ksonnet app?
Not sure, but guess here, need to confirm. Thanks. https://github.com/kubeflow/pytorch-operator/blob/61fefa88f75b126fd7672f44b87351db511299cb/scripts/create-cluster.sh#L36
@andreyvelich: The following test failed, say /retest
to rerun all failed tests:
Test name | Commit | Details | Rerun command |
---|---|---|---|
kubeflow-pytorch-operator-presubmit | 152b920b8a0499e25cf2a505665a7d2bf0ea7e1b | link | /test kubeflow-pytorch-operator-presubmit |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich, gaocegege
The full list of commands accepted by this bot can be found here.
The pull request process is described here
/lgtm
Related: https://github.com/kubeflow/pytorch-operator/issues/292.
I was able to run unit test locally after these changes .
I changed
expectedServiceDeletions
from 0 to 1. Here: https://github.com/kubeflow/pytorch-operator/blob/master/pkg/controller.v1/pytorch/job.go#L163-L171, we don't check thatjob.Spec.CleanPodPolicy == common.CleanPodPolicyRunning
before deleting services. @jiaqianjing is that correct behaviour? In TF operator we have one loop for services and pods: https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1/tensorflow/job.go#L162-L172, but in PyTorch operator we analyse services and pods in the different loops.To fix this error:
Fail in goroutine after TestCopyLabelsAndAnnotation has completed
, I changed run test controller, like here: https://github.com/kubeflow/pytorch-operator/blob/master/pkg/controller.v1/pytorch/controller_test.go#L341-L347.Do you know how we can add travis check before PR submission?
/assign @gaocegege @johnugeorge /cc @jiaqianjing