kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

e2e test failed sometimes #570

Closed lowang-bh closed 1 year ago

lowang-bh commented 1 year ago

https://github.com/kubeflow/mpi-operator/actions/runs/5341853866/jobs/9683080365?pr=569

• [SLOW TEST:48.021 seconds]
MPIJob
/home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:39
  with OpenMPI implementation
  /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:85
    when running as root
    /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:100
      when running with host network
      /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:130
        should succeed
        /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:140
------------------------------
W0622 05:27:11.815081   31530 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Launcher.template.metadata.creationTimestamp"
W0622 05:27:11.815092   31530 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Worker.template.metadata.creationTimestamp"
• [SLOW TEST:53.020 seconds]
MPIJob
/home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:39
  with OpenMPI implementation
  /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:85
    when running as non-root
    /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:147
      should succeed
      /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:164
------------------------------
W0622 05:28:04.845095   31530 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Launcher.template.metadata.creationTimestamp"
W0622 05:28:04.845129   31530 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Worker.template.metadata.creationTimestamp"
• [SLOW TEST:60.029 seconds]
MPIJob
/home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:39
  with Intel Implementation
  /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:172
    when running as root
    /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:211
      should succeed
      /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:212
------------------------------
W0622 05:29:04.866083   31530 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Launcher.template.metadata.creationTimestamp"
W0622 05:29:04.866092   31530 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Worker.template.metadata.creationTimestamp"
• [SLOW TEST:60.017 seconds]
MPIJob
/home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:39
  with Intel Implementation
  /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:172
    when running as non-root
    /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:218
      should succeed
      /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:232
------------------------------
W0622 05:30:04.889364   31530 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Launcher.template.metadata.creationTimestamp"
W0622 05:30:04.889375   31530 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Worker.template.metadata.creationTimestamp"
• [SLOW TEST:57.519 seconds]
MPIJob
/home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:39
  with MPICH Implementation
  /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:239
    when running as root
    /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:278
      should succeed
      /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:279
------------------------------
W0622 05:31:02.404477   31530 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Launcher.template.metadata.creationTimestamp"
W0622 05:31:02.404867   31530 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Worker.template.metadata.creationTimestamp"
• [SLOW TEST:57.514 seconds]
MPIJob
/home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:39
  with MPICH Implementation
  /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:239
    when running as non-root
    /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:285
      should succeed
      /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:299
------------------------------
NAME: scheduler-plugins
LAST DEPLOYED: Thu Jun 22 05:32:02 2023
NAMESPACE: scheduler-plugins
STATUS: deployed
REVISION: 1
TEST SUITE: None
W0622 05:32:09.086927   31530 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Launcher.template.metadata.creationTimestamp"
W0622 05:32:09.086941   31530 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Worker.template.metadata.creationTimestamp"
W0622 05:32:26.727324   31530 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Launcher.template.metadata.creationTimestamp"
W0622 05:32:26.727443   31530 warnings.go:70] unknown field "spec.mpiReplicaSpecs.Worker.template.metadata.creationTimestamp"
release "scheduler-plugins" uninstalled
namespace "scheduler-plugins" deleted
• [SLOW TEST:72.533 seconds]
MPIJob
/home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:39
  with scheduler-plugins
  /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:306
    should create pending pods
    /home/runner/work/mpi-operator/mpi-operator/test/e2e/mpi_job_test.go:376
------------------------------
Deleting cluster "kind" ...
panic: test timed out after 10m0s

goroutine 1079 [running]:
testing.(*M).startAlarm.func1()
    /opt/hostedtoolcache/go/1.19.10/x64/src/testing/testing.go:2036 +0x8e
created by time.goFunc
    /opt/hostedtoolcache/go/1.19.10/x64/src/time/sleep.go:176 +0x32

goroutine 1 [chan receive, 10 minutes]:
testing.(*T).Run(0xc0002e2b60, {0x1874bf0?, 0x51e1c5?}, 0x19543f8)
    /opt/hostedtoolcache/go/1.19.10/x64/src/testing/testing.go:1494 +0x37a
testing.runTests.func1(0xc00036cfc0?)
    /opt/hostedtoolcache/go/1.19.10/x64/src/testing/testing.go:1846 +0x6e
testing.tRunner(0xc0002e2b60, 0xc0000fbcd8)
    /opt/hostedtoolcache/go/1.19.10/x64/src/testing/testing.go:1446 +0x10b
testing.runTests(0xc000001220?, {0x26c22e0, 0x1, 0x1}, {0xc0000b4ed0?, 0x40?, 0x26f8240?})
    /opt/hostedtoolcache/go/1.19.10/x64/src/testing/testing.go:1844 +0x456
testing.(*M).Run(0xc000001220)
    /opt/hostedtoolcache/go/1.19.10/x64/src/testing/testing.go:1726 +0x5d9
main.main()
    _testmain.go:47 +0x1aa

goroutine 38 [syscall]:
syscall.Syscall6(0x3?, 0x3?, 0x15555555555555?, 0x7f2178d51ff0?, 0x24?, 0x0?, 0x8?)
    /opt/hostedtoolcache/go/1.19.10/x64/src/syscall/syscall_linux.go:90 +0x36
os.(*Process).blockUntilWaitable(0xc000634390)
    /opt/hostedtoolcache/go/1.19.10/x64/src/os/wait_waitid.go:32 +0x87
os.(*Process).wait(0xc000634390)
    /opt/hostedtoolcache/go/1.19.10/x64/src/os/exec_unix.go:22 +0x28
os.(*Process).Wait(...)
    /opt/hostedtoolcache/go/1.19.10/x64/src/os/exec.go:132
os/exec.(*Cmd).Wait(0xc0005562c0)
    /opt/hostedtoolcache/go/1.19.10/x64/src/os/exec/exec.go:599 +0x4b
os/exec.(*Cmd).Run(0x187dc40?)
    /opt/hostedtoolcache/go/1.19.10/x64/src/os/exec/exec.go:437 +0x39
github.com/kubeflow/mpi-operator/test/e2e.runCommand({0x187dc40?, 0xc0007db5f8?}, {0xc0007db5d8?, 0x0?, 0x4a97e6?})
    /home/runner/work/mpi-operator/mpi-operator/test/e2e/e2e_suite_test.go:201 +0x79
github.com/kubeflow/mpi-operator/test/e2e.glob..func3()
    /home/runner/work/mpi-operator/mpi-operator/test/e2e/e2e_suite_test.go:132 +0x179
github.com/onsi/ginkgo/internal/leafnodes.(*runner).runSync(0x0?)
    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/leafnodes/runner.go:113 +0xb1
github.com/onsi/ginkgo/internal/leafnodes.(*runner).run(0x15bbcc0?)
    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/leafnodes/runner.go:64 +0x125
github.com/onsi/ginkgo/internal/leafnodes.(*synchronizedAfterSuiteNode).Run(0xc000000fa0, 0x1, 0x1, {0x0, 0x0})
    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/leafnodes/synchronized_after_suite_node.go:30 +0x7e
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runAfterSuite(0xc000455e40)
    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/specrunner/spec_runner.go:138 +0xce
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).Run(0xc000455e40)
    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/specrunner/spec_runner.go:71 +0xdd
github.com/onsi/ginkgo/internal/suite.(*Suite).Run(0xc00012aaf0, {0x7f2178d825c0, 0xc0002e2d00}, {0x1877599, 0x9}, {0xc0000b1c80, 0x1, 0x1}, {0x1ac8980, 0xc0001108c0}, ...)
    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/suite/suite.go:79 +0x4e5
github.com/onsi/ginkgo.runSpecsWithCustomReporters({0x1ab41a0?, 0xc0002e2d00}, {0x1877599, 0x9}, {0xc000069718, 0x1, 0xf?})
    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/ginkgo_dsl.go:245 +0x189
github.com/onsi/ginkgo.RunSpecs({0x1ab41a0, 0xc0002e2d00}, {0x1877599, 0x9})
    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/ginkgo_dsl.go:220 +0x14a
github.com/kubeflow/mpi-operator/test/e2e.TestE2E(0x408599?)
    /home/runner/work/mpi-operator/mpi-operator/test/e2e/e2e_suite_test.go:99 +0x45
testing.tRunner(0xc0002e2d00, 0x19543f8)
    /opt/hostedtoolcache/go/1.19.10/x64/src/testing/testing.go:1446 +0x10b
created by testing.(*T).Run
    /opt/hostedtoolcache/go/1.19.10/x64/src/testing/testing.go:1493 +0x35f

goroutine 39 [chan receive, 9 minutes]:
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).registerForInterrupts(0xc000455e40, 0x19543f8?)
    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/specrunner/spec_runner.go:223 +0x9c
created by github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).Run
    /home/runner/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/specrunner/spec_runner.go:60 +0x98

goroutine 41 [syscall, 9 minutes]:
os/signal.signal_recv()
    /opt/hostedtoolcache/go/1.19.10/x64/src/runtime/sigqueue.go:152 +0x2f
os/signal.loop()
    /opt/hostedtoolcache/go/1.19.10/x64/src/os/signal/signal_unix.go:23 +0x19
created by os/signal.Notify.func1.1
    /opt/hostedtoolcache/go/1.19.10/x64/src/os/signal/signal.go:151 +0x2a
Deleted nodes: ["kind-control-plane"]
FAIL    github.com/kubeflow/mpi-operator/test/e2e   600.442s
FAIL
make: *** [Makefile:78: test_e2e] Error 1

alculquicondor commented 1 year ago

What's the relevant error message?

tenzen-y commented 1 year ago

Maybe this will be fixed by https://github.com/kubeflow/mpi-operator/pull/576.

alculquicondor commented 1 year ago

/close
we can reopen if there are more failures.

google-oss-prow[bot] commented 1 year ago

@alculquicondor: Closing this issue.

In response to [this](https://github.com/kubeflow/mpi-operator/issues/570#issuecomment-1607371943):

>/close
>we can reopen if there are more failures.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.