opendatahub-io / notebooks

Notebook images for ODH
Apache License 2.0

Intel tensorflow notebook failed to get tested on OCP-CI #562

Open atheo89 opened 3 weeks ago

atheo89 commented 3 weeks ago

What steps did you take and what happened:

The builds are failing with various errors,

such as https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/opendatahub-io_notebooks/554/pull-ci-opendatahub-io-notebooks-main-notebooks-e2e-tests/1799062273747062784

 statefulset.apps/jupyter-intel-tensorflow-ubi9-python-3-9-notebook created
# Running tests for jupyter-intel-tensorflow-ubi9-python-3-9 notebook...
# Verify the notebook's readiness by pinging the /api endpoint
bin/kubectl wait --for=condition=ready pod -l app=jupyter-intel-tensorflow-ubi9-python-3-9 --timeout=600s
error: timed out waiting for the condition on pods/jupyter-intel-tensorflow-ubi9-python-3-9-notebook-0
make: *** [Makefile:389: test-jupyter-intel-tensorflow-ubi9-python-3.9] Error 1
{"component":"entrypoint","error":"wrapped process failed: exit status 2","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2024-06-07T15:02:23Z"}
error: failed to execute wrapped command: exit status 2 
INFO[2024-06-07T15:02:24Z] Step notebooks-e2e-tests-jupyter-intel-tf-ubi9-python-3.9-test-e2e failed after 10m9s. 
INFO[2024-06-07T15:02:24Z] Step phase test failed after 39m58s.

or https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/opendatahub-io_notebooks/554/pull-ci-opendatahub-io-notebooks-main-notebooks-e2e-tests/1799474409388380160

 bin/kubectl port-forward svc/jupyter-intel-tensorflow-ubi9-python-3-9-notebook 8888:8888 & curl --retry 5 --retry-delay 5 --retry-connrefused http://localhost:8888/notebook/opendatahub/jovyan/api ; EXIT_CODE=$?; echo && pkill --full "^bin/kubectl.*port-forward.*"; \
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0Warning: Transient problem: connection refused Will retry in 5 seconds. 5 
Warning: retries left.
Forwarding from 127.0.0.1:8888 -> 8888
Forwarding from [::1]:8888 -> 8888

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0Handling connection for 8888
E0608 17:13:06.596980      87 portforward.go:406] an error occurred forwarding 8888 -> 8888: error forwarding port 8888 to pod c6d6c3701ca83ce4403a47a7a515d9bb18bb0844c3ee790e85735d88080e1b2c, uid : port forward into network namespace "/var/run/netns/1f1daabb-d736-479e-87c1-11ac527e9937": failed to connect to localhost:8888 inside namespace c6d6c3701ca83ce4403a47a7a515d9bb18bb0844c3ee790e85735d88080e1b2c: dial tcp [::1]:8888: connect: connection refused
E0608 17:13:06.597994      87 portforward.go:234] lost connection to pod

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (52) Empty reply from server
make: *** [Makefile:390: test-jupyter-intel-tensorflow-ubi9-python-3.9] Error 1
{"component":"entrypoint","error":"wrapped process failed: exit status 2","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2024-06-08T17:13:06Z"}
error: failed to execute wrapped command: exit status 2 
INFO[2024-06-08T17:13:07Z] Step notebooks-e2e-tests-jupyter-intel-tf-ubi9-python-3.9-test-e2e failed after 4m12s. 
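In the log above, curl retries once on "connection refused" and then gives up on `curl: (52) Empty reply from server`. As far as I know, `--retry` together with `--retry-connrefused` does not treat exit code 52 as transient, which would explain the early exit. A minimal sketch of a generic retry wrapper follows; `retry_with_backoff` and its arguments are hypothetical names, not something that exists in the Makefile.

```shell
# Hypothetical helper: rerun a command until it exits 0, doubling the delay
# between attempts. Returns the last exit code if all attempts fail.
# Usage: retry_with_backoff <attempts> <initial_delay_seconds> <command...>
retry_with_backoff() {
  _tries=$1
  _wait=$2
  shift 2
  _n=1
  while :; do
    _rc=0
    "$@" || _rc=$?
    if [ "$_rc" -eq 0 ]; then
      return 0
    fi
    if [ "$_n" -ge "$_tries" ]; then
      return "$_rc"
    fi
    sleep "$_wait"
    _wait=$((_wait * 2))
    _n=$((_n + 1))
  done
}

# Assumed usage (URL taken from the log above):
#   retry_with_backoff 5 5 curl --fail --silent \
#     http://localhost:8888/notebook/opendatahub/jovyan/api
```

Because the log also shows the port-forward losing its connection to the pod, the retried command would probably need to re-establish the forward as well, not just repeat the curl call.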

or https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/opendatahub-io_notebooks/554/pull-ci-opendatahub-io-notebooks-main-notebooks-e2e-tests/1799755627656908800

(this one is a Kubernetes bug: https://github.com/kubernetes/kubectl/issues/1516)

 # Running tests for jupyter-minimal-ubi8-python-3-8 notebook...
# Verify the notebook's readiness by pinging the /api endpoint
bin/kubectl wait --for=condition=ready pod -l app=jupyter-minimal-ubi8-python-3-8 --timeout=600s
error: no matching resources found
make: *** [Makefile:389: test-jupyter-minimal-ubi8-python-3.8] Error 1
{"component":"entrypoint","error":"wrapped process failed: exit status 2","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2024-06-09T11:16:16Z"}
error: failed to execute wrapped command: exit status 2 
INFO[2024-06-09T11:16:17Z] Step notebooks-e2e-tests-jupyter-minimal-ubi8-python-3.8-test-e2e failed after 14s. 
INFO[2024-06-09T11:16:17Z] Step phase test failed after 18s.  
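Here `kubectl wait` fails after 14s with "no matching resources found", presumably because the pod does not exist yet when the command runs. One possible workaround, sketched under that assumption, is to poll until the selector matches something and only then call `kubectl wait`. The helper name `wait_for_output` is hypothetical.

```shell
# Hypothetical helper: poll a command until it prints something, or give up.
# Usage: wait_for_output <attempts> <delay_seconds> <command...>
wait_for_output() {
  _attempts=$1
  _delay=$2
  shift 2
  _i=0
  while [ "$_i" -lt "$_attempts" ]; do
    # A failing command is treated the same as empty output.
    _out=$("$@" 2>/dev/null) || _out=""
    if [ -n "$_out" ]; then
      printf '%s\n' "$_out"
      return 0
    fi
    _i=$((_i + 1))
    sleep "$_delay"
  done
  return 1
}

# Assumed usage (label and timeout taken from the log above):
#   wait_for_output 30 2 bin/kubectl get pod -l app=jupyter-minimal-ubi8-python-3-8 -o name \
#     && bin/kubectl wait --for=condition=ready pod -l app=jupyter-minimal-ubi8-python-3-8 --timeout=600s
```

The attempt count and delay are illustrative; they only need to cover the gap between creating the StatefulSet and the pod object appearing.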

or even https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/opendatahub-io_notebooks/554/pull-ci-opendatahub-io-notebooks-main-habana-notebooks-e2e-tests/1799062273579290624

 INFO[2024-06-10T11:53:19Z] Ran for 2h48m46s                             
ERRO[2024-06-10T11:53:19Z] Some steps failed:                           
ERRO[2024-06-10T11:53:19Z] 
  * could not run steps: step notebooks-e2e-tests failed: "notebooks-e2e-tests" pre steps failed: "notebooks-e2e-tests" pod "notebooks-e2e-tests-ipi-install-install" failed: could not watch pod: the pod ci-op-wlchhsii/notebooks-e2e-tests-ipi-install-install failed after 56m50s (failed containers: test): ContainerFailed one or more containers exited
Container test exited with code 1, reason Error
---
mt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 401: {"code":"ACCT-MGMT-15","href":"/api/accounts_mgmt/v1/errors/15","id":"15","kind":"Error","operation_id":"14011fc6-72f3-4074-9019-e115b5daab3c","reason":"Unable to get payload details from JWT token: Unable to retrieve JWT token from request context"}
level=info msg=Cluster operator insights UploadDegraded is True with NotAuthorized: Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: {"errors":[{"detail":"UHC services authentication failed","meta":{"response_by":"gateway"},"status":401}]}
level=info
level=info msg=Cluster operator network ManagementStateDegraded is False with : 
level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=fatal msg=failed to initialize the cluster: Cluster operator authentication is not available
Installer exit with code 1

See the comment at https://github.com/opendatahub-io/notebooks/pull/554#issuecomment-2156017703 for analysis. It appears the image is large and does a lot of work on startup, so it is not surprising that it has trouble starting.

What did you expect to happen: The tests run without errors.


jiridanek commented 2 weeks ago

https://issues.redhat.com/browse/RHOAIENG-8388