kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0
420 stars 211 forks

Implement E2E for integration with scheduler-plugins #540

Closed tenzen-y closed 1 year ago

tenzen-y commented 1 year ago

I implemented E2E for integrating with scheduler-plugins.

Part-of: #500

NOTE: This test will still fail because I forgot to implement the logic that updates the PodGroup when mpiJob.spec.runPolicy.schedulingPolicy is updated. We must implement that logic first, in another PR.

resolved in: #542
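As a rough illustration of the missing logic described in the note above, the sketch below recomputes a PodGroup spec from the scheduling policy and overwrites the existing spec only when the two differ. The types `SchedulingPolicy` and `PodGroupSpec` and the helpers `desiredPodGroupSpec` and `syncPodGroupSpec` are simplified, hypothetical stand-ins written for this example. They are not the operator's or scheduler-plugins' actual API, and defaulting `MinMember` to the total replica count is an assumption.

```go
package main

import "fmt"

// SchedulingPolicy is a simplified, hypothetical stand-in for
// mpiJob.spec.runPolicy.schedulingPolicy.
type SchedulingPolicy struct {
	MinAvailable           *int32
	ScheduleTimeoutSeconds *int32
}

// PodGroupSpec is a simplified, hypothetical stand-in for a PodGroup spec.
type PodGroupSpec struct {
	MinMember              int32
	ScheduleTimeoutSeconds *int32
}

// desiredPodGroupSpec derives the PodGroup spec from the policy.
// Assumption: MinMember defaults to the total number of replicas.
func desiredPodGroupSpec(policy *SchedulingPolicy, totalReplicas int32) PodGroupSpec {
	spec := PodGroupSpec{MinMember: totalReplicas}
	if policy == nil {
		return spec
	}
	if policy.MinAvailable != nil {
		spec.MinMember = *policy.MinAvailable
	}
	if policy.ScheduleTimeoutSeconds != nil {
		timeout := *policy.ScheduleTimeoutSeconds
		spec.ScheduleTimeoutSeconds = &timeout
	}
	return spec
}

// equalInt32Ptr compares two optional int32 values.
func equalInt32Ptr(a, b *int32) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

// syncPodGroupSpec overwrites existing with desired when they differ and
// reports whether an update call to the API server would be needed.
func syncPodGroupSpec(existing *PodGroupSpec, desired PodGroupSpec) bool {
	if existing.MinMember == desired.MinMember &&
		equalInt32Ptr(existing.ScheduleTimeoutSeconds, desired.ScheduleTimeoutSeconds) {
		return false
	}
	*existing = desired
	return true
}

func main() {
	// PodGroup created from a job with 3 replicas and no scheduling policy.
	existing := desiredPodGroupSpec(nil, 3)

	// The user later sets minAvailable to 2 on the job.
	minAvailable := int32(2)
	policy := &SchedulingPolicy{MinAvailable: &minAvailable}

	changed := syncPodGroupSpec(&existing, desiredPodGroupSpec(policy, 3))
	fmt.Println(changed, existing.MinMember)
}
```

Returning a boolean lets the caller skip the update request when nothing changed, which keeps reconcile loops from issuing redundant writes.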

tenzen-y commented 1 year ago

blocked on #542.

tenzen-y commented 1 year ago

The build/base/intel.Dockerfile seems to be broken... Maybe we must fix the Dockerfile.

    Fetched 10.2 kB in 0s (20.4 kB/s)
    Reading package lists...
    E: Failed to fetch https://apt.repos.intel.com/oneapi/dists/all/main/binary-amd64/Packages.bz2 File has unexpected size (265446 != 461276). Mirror sync in progress? [IP: 184.87.69.109 443]
       Hashes of expected file:
        - Filesize:461276 [weak]
        - SHA512:b57998a876a5016443cc926dcd890a47c0e579b64a87b5fed7566bf03e403a352c1c04bee9493016927f9fa3001d1faffc725b34174daa5f427a02feb86f650f
        - SHA256:20d2c9441b5b7b725b3105bb552c5be21c8a4562ba4985ab2794d78f0d5aad23
        - SHA1:289a921381b794a7e96f270244f4d2b18ae55d90 [weak]
        - MD5Sum:11d742e8223bc078a46da9984296f744 [weak]
       Release file created at: Mon, 27 Mar 2023 16:38:50 +0000
    E: Failed to fetch https://apt.repos.intel.com/oneapi/dists/all/main/binary-all/Packages.bz2
    E: Some index files failed to download. They have been ignored, or old ones used instead.
    The command '/bin/sh -c apt update && apt install -y --no-install-recommends gnupg2 ca-certificates && apt-key add /tmp/key.PUB && rm /tmp/key.PUB && echo "deb https://apt.repos.intel.com/oneapi all main" | tee /etc/apt/sources.list.d/oneAPI.list && apt remove -y gnupg2 ca-certificates && apt autoremove -y && apt update && apt install -y --no-install-recommends dnsutils intel-oneapi-mpi && rm -rf /var/lib/apt/lists/*' returned a non-zero code: 100
    make: *** [Makefile:107: test_images] Error 100

https://github.com/kubeflow/mpi-operator/actions/runs/4577714013/jobs/8083448553#step:4:1601

tenzen-y commented 1 year ago

I can build the image locally, so that error seems to be temporary.

tenzen-y commented 1 year ago

if you force push, it should trigger a rerun

Maybe we must wait for the error to be fixed:

https://community.intel.com/t5/oneAPI-Registration-Download/OneApi-apt-repository-seems-broken/m-p/1361597

alculquicondor commented 1 year ago

is this ready for review now?

tenzen-y commented 1 year ago

> is this ready for review now?

I'm still working on it.

tenzen-y commented 1 year ago

@alculquicondor Thanks for your patience. This PR is ready for review. PTAL :)

tenzen-y commented 1 year ago

Oh, this is a bug... I will create a separate PR to fix that.

W0403 20:47:56.968863   15661 podgroup.go:314] Ignore replica "Launcher" priority class "non-existence": priorityclass.scheduling.k8s.io "non-existence" not found
    podgroup_test.go:624: Unexpected calculatePGMinResources for the scheduler-plugins (-want,+got):
          &v1.ResourceList{
        -   s"cpu":    {i: resource.int64Amount{value: 7}, s: "7", Format: "DecimalSI"},
        +   s"cpu":    {i: resource.int64Amount{value: 12}, Format: "DecimalSI"},
        -   s"memory": {i: resource.int64Amount{value: 19327352832}, s: "18Gi", Format: "BinarySI"},
        +   s"memory": {i: resource.int64Amount{value: 36507222016}, Format: "BinarySI"},
          }

https://github.com/kubeflow/mpi-operator/actions/runs/4601155665/jobs/8128664833?pr=540#step:8:208
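To make the diff above easier to read, the following sketch converts the raw memory byte counts into binary units. It only does the arithmetic on the two values shown in the log. The helper name `toGi` is made up for this example, and the sketch does not attempt to explain why the test computed the larger value.

```go
package main

import "fmt"

// gi is the number of bytes in one gibibyte (2^30).
const gi = int64(1) << 30

// toGi converts a byte count to whole GiB and reports whether the
// conversion is exact (no remainder).
func toGi(bytes int64) (int64, bool) {
	return bytes / gi, bytes%gi == 0
}

func main() {
	// Memory values taken from the -want and +got lines of the diff.
	for _, v := range []int64{19327352832, 36507222016} {
		n, exact := toGi(v)
		fmt.Printf("%d bytes = %dGi (exact: %t)\n", v, n, exact)
	}
}
```

Read this way, the test wanted 7 CPUs and 18Gi of memory but got 12 CPUs and 34Gi.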

google-oss-prow[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS)~~ [alculquicondor]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

alculquicondor commented 1 year ago

/hold

alculquicondor commented 1 year ago

/hold cancel

tenzen-y commented 1 year ago

@alculquicondor squashed.

tenzen-y commented 1 year ago

@alculquicondor Can you add an lgtm label to this PR?

alculquicondor commented 1 year ago

/lgtm

alculquicondor commented 1 year ago

> Oh, this is a bug...

so the test passes sometimes?

tenzen-y commented 1 year ago

> Oh, this is a bug...
>
> so the test passes sometimes?

Yes, our unit tests only pass sometimes. Please take a look at #543.