opendatahub-io / notebooks

Notebook images for ODH
Apache License 2.0

Incorporate rocm-pytorch and rocm-tensorflow runtime images #626

Closed atheo89 closed 1 month ago

atheo89 commented 1 month ago

Related to: https://issues.redhat.com/browse/RHOAIENG-9680

Depends on: https://github.com/opendatahub-io/notebooks/pull/620

Description

Include the PyTorch AMD and TensorFlow AMD images in the pre-included runtime image lists. We need to provide runtime images with AMD support so that they can be used when creating pipelines, either through the Elyra component or directly.

NOTE: This PR depends on https://github.com/opendatahub-io/notebooks/pull/620, because the runtime builds use amd-ubi9-python-3.9 as their base image.

Follow-up items that will be split into separate tracking tasks:

Merge criteria:

atheo89 commented 1 month ago

Build Notebooks (pr) / Generate job matrix (pull_request) is failing because https://github.com/opendatahub-io/notebooks/pull/620 must be merged first (see the note in the PR description).

atheo89 commented 1 month ago

/retest-required

atheo89 commented 1 month ago

Based on the CI builds:

ROCm Runtime TensorFlow build: ghcr.io/atheo89/notebooks/workbench-images:rocm-runtime-tensorflow-ubi9-python-3.9-RHOAIENG-9680_d8788081cac625bf2e1edf64ed8140c3c7223531

ROCm Runtime PyTorch build: ghcr.io/atheo89/notebooks/workbench-images:rocm-runtime-pytorch-ubi9-python-3.9-RHOAIENG-9680_d8788081cac625bf2e1edf64ed8140c3c7223531

atheo89 commented 1 month ago

This PR is ready for a final review

jiridanek commented 1 month ago

Is the openshift-ci still supposed to be failing?

ERRO[2024-07-23T09:53:09Z] Some steps failed:
ERRO[2024-07-23T09:53:09Z]

  • could not sort nodes
  • steps are missing dependencies
  • step [images] is missing dependencies: <&api.externalImageLink{namespace:"", name:"stable", tag:"runtime-rocm-pytorch-ubi9-python-3.9"}>, <&api.externalImageLink{namespace:"", name:"stable", tag:"runtime-rocm-tensorflow-ubi9-python-3.9"}>
  • step [output:stable:runtime-rocm-pytorch-ubi9-python-3.9] is missing dependencies: <&api.internalImageStreamTagLink{name:"pipeline", tag:"runtime-rocm-pytorch-ubi9-python-3.9", unsatisfiableError:""}>
  • step [output:stable:runtime-rocm-tensorflow-ubi9-python-3.9] is missing dependencies: <&api.internalImageStreamTagLink{name:"pipeline", tag:"runtime-rocm-tensorflow-ubi9-python-3.9", unsatisfiableError:""}>
  • step runtime-rocm-pytorch-ubi9-python-3.9 is missing dependencies: <&api.internalImageStreamTagLink{name:"pipeline", tag:"amd-ubi9-python-3.9", unsatisfiableError:""}>
  • step runtime-rocm-tensorflow-ubi9-python-3.9 is missing dependencies: <&api.internalImageStreamTagLink{name:"pipeline", tag:"amd-ubi9-python-3.9", unsatisfiableError:""}>

INFO[2024-07-23T09:53:09Z] Reporting job state 'failed' with reason 'building_graph'
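For readers unfamiliar with this class of failure, here is a minimal sketch of how a build graph can fail to sort when a step requires an image tag that no other step provides. It is only an illustration, not the CI tool's actual implementation: the `steps` layout, the `requires`/`provides` keys, and the helper names `find_missing` and `order_steps` are all hypothetical. Only the step and tag names are taken from the log above.

```python
from graphlib import TopologicalSorter


def find_missing(steps):
    """Return {step name: [required tags that no step provides]}."""
    provided = {tag for spec in steps.values() for tag in spec["provides"]}
    missing = {}
    for name, spec in steps.items():
        unmet = [tag for tag in spec["requires"] if tag not in provided]
        if unmet:
            missing[name] = unmet
    return missing


def order_steps(steps):
    """Return a build order, or raise if any dependency is unsatisfied."""
    missing = find_missing(steps)
    if missing:
        raise ValueError(f"steps are missing dependencies: {missing}")
    # Map each provided tag back to the step that produces it.
    provider = {tag: name for name, spec in steps.items() for tag in spec["provides"]}
    graph = {
        name: {provider[tag] for tag in spec["requires"]}
        for name, spec in steps.items()
    }
    # static_order() yields every step after the steps it depends on.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical graph: the runtime step needs a base image tag nobody provides.
steps = {
    "runtime-rocm-pytorch-ubi9-python-3.9": {
        "requires": ["pipeline:amd-ubi9-python-3.9"],
        "provides": ["pipeline:runtime-rocm-pytorch-ubi9-python-3.9"],
    },
}
print(find_missing(steps))

# Adding a step that provides the base tag makes the graph sortable.
steps["amd-ubi9-python-3.9"] = {
    "requires": [],
    "provides": ["pipeline:amd-ubi9-python-3.9"],
}
print(order_steps(steps))
```

In this simplified model, the error disappears as soon as some step provides `pipeline:amd-ubi9-python-3.9`. Whether that matches the real cause of the failure here is not confirmed in the thread.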
atheo89 commented 1 month ago

> Is the openshift-ci still supposed to be failing?

I'm not sure what is happening on CI. There is, however, a follow-up PR that incorporates more changes: https://github.com/openshift/release/pull/54579

atheo89 commented 1 month ago

/test runtime-rocm-tensorflow-ubi9-python-3-9-pr-image-mirror

atheo89 commented 1 month ago

The images fail to build because the build pod is evicted: the node was low on resource: ephemeral-storage. Threshold quantity: 32127475555, available: 31169256Ki.

 * could not run steps: step rocm-ubi9-python-3.9 failed: error occurred handling build rocm-ubi9-python-3.9-amd64: build not successful after 5 attempts: [the build rocm-ubi9-python-3.9-amd64 failed after 15m15s with reason BuildPodEvicted: The node was low on resource: ephemeral-storage. Threshold quantity: 32127475555, available: 31169256Ki. Container docker-build was using 51577548Ki, request is 0, has larger consumption of ephemeral-storage. , the build rocm-ubi9-python-3.9-amd64 failed after 19m33s with reason BuildPodEvicted: The node was low on resource: ephemeral-storage. Threshold quantity: 32127475555, available: 30127664Ki. Container docker-build was using 45128112Ki, request is 0, has larger consumption of ephemeral-storage. 
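The quantities in the eviction message mix units: the threshold is given in plain bytes, while the available and used amounts carry the binary `Ki` suffix. A small sketch of the comparison, assuming a simplified subset of the Kubernetes quantity notation (integer values with an optional `k`/`M`/`G`/`T` or `Ki`/`Mi`/`Gi`/`Ti` suffix); the helper name `to_bytes` is hypothetical:

```python
import re

# Multipliers for decimal (k, M, G, T) and binary (Ki, Mi, Gi, Ti) suffixes.
_SUFFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def to_bytes(quantity: str) -> int:
    """Convert an integer quantity such as '31169256Ki' to bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti|k|M|G|T)?", quantity.strip())
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    number, suffix = match.groups()
    return int(number) * _SUFFIXES[suffix or ""]


# Values copied from the first eviction message in the log above.
threshold = to_bytes("32127475555")
available = to_bytes("31169256Ki")
build_usage = to_bytes("51577548Ki")

print("below threshold:", available < threshold)
print("shortfall in bytes:", threshold - available)
print(f"docker-build usage: {build_usage / 2**30:.2f} GiB")
```

Converted to the same unit, the available storage sits slightly below the threshold, while the docker-build container alone was using roughly 49 GiB of ephemeral storage with no request set.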
atheo89 commented 1 month ago

/retest

atheo89 commented 1 month ago

/override ci/prow/runtimes-ubi9-e2e-tests
/override ci/prow/runtimes-ubi8-e2e-tests

openshift-ci[bot] commented 1 month ago

@atheo89: Overrode contexts on behalf of atheo89: ci/prow/runtimes-ubi8-e2e-tests, ci/prow/runtimes-ubi9-e2e-tests

In response to [this](https://github.com/opendatahub-io/notebooks/pull/626#issuecomment-2247825923):

> /override ci/prow/runtimes-ubi9-e2e-tests
> /override ci/prow/runtimes-ubi8-e2e-tests

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

atheo89 commented 1 month ago

/override ci/prow/rocm-notebooks-e2e-tests

openshift-ci[bot] commented 1 month ago

@atheo89: Overrode contexts on behalf of atheo89: ci/prow/rocm-notebooks-e2e-tests

In response to [this](https://github.com/opendatahub-io/notebooks/pull/626#issuecomment-2247826931):

> /override ci/prow/rocm-notebooks-e2e-tests

atheo89 commented 1 month ago

/override ci/prow/rocm-notebooks-e2e-tests
/override ci/prow/runtimes-ubi8-e2e-tests
/override ci/prow/runtimes-ubi9-e2e-tests

/test rocm-runtimes-ubi9-e2e-tests
/test images

openshift-ci[bot] commented 1 month ago

@atheo89: Overrode contexts on behalf of atheo89: ci/prow/rocm-notebooks-e2e-tests, ci/prow/runtimes-ubi8-e2e-tests, ci/prow/runtimes-ubi9-e2e-tests

In response to [this](https://github.com/opendatahub-io/notebooks/pull/626#issuecomment-2249198361):

> /override ci/prow/rocm-notebooks-e2e-tests
> /override ci/prow/runtimes-ubi8-e2e-tests
> /override ci/prow/runtimes-ubi9-e2e-tests
>
> /test rocm-runtimes-ubi9-e2e-tests
> /test images

atheo89 commented 1 month ago

/test rocm-runtimes-ubi9-e2e-tests
/test images

openshift-ci[bot] commented 1 month ago

@atheo89: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/notebook-rocm-ubi9-python-3-9-pr-image-mirror | 767e2853a45a72eb18e0ac8800fb65a3f4d6484c | link | true | /test notebook-rocm-ubi9-python-3-9-pr-image-mirror |
| ci/prow/amd-runtimes-ubi9-e2e-tests | d8788081cac625bf2e1edf64ed8140c3c7223531 | link | true | /test amd-runtimes-ubi9-e2e-tests |
| ci/prow/images | d8788081cac625bf2e1edf64ed8140c3c7223531 | link | true | /test images |
| ci/prow/rocm-runtimes-ubi9-e2e-tests | d8788081cac625bf2e1edf64ed8140c3c7223531 | link | true | /test rocm-runtimes-ubi9-e2e-tests |

openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: harshad16

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/opendatahub-io/notebooks/blob/main/OWNERS)~~ [harshad16]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.