os-climate / aicoe-osc-demo

This repository is the central location for the demos the ET data science team is developing within the OS-Climate project. This demo shows how to use the tools provided by Open Data Hub (ODH) running on the Operate First cluster to perform ETL and create training and inference pipelines.

Reorganize README to split container infra from pipeline construction #200

Open · MichaelTiemannOSC opened this issue 2 years ago

MichaelTiemannOSC commented 2 years ago

I am trying to use the latest documentation as a guide for creating a pipeline for the (still private) PCAF sovereign footprint POC. I appreciate that the AICoE demo is trying to address two audiences: those who are building the actual containers that will run the jobs, and those who are building the notebooks that use those containers. The second group is much more concerned with the calculations inside the notebooks and the topology of the notebooks than with the underlying infrastructure.

For example, when I select Custom Elyra Notebook or AICoE Demo as a notebook type, how many of the infrastructure decisions can I expect to have already been made by that selection, so that I only need to make simple GUI-based choices within a constrained environment? And how much do I need to grovel in the details of copy-pasting and editing every line of a Dockerfile to get the right sort of "Hello, world" pipeline functionality?

Following along with the demo video (https://www.youtube.com/watch?v=lGeT615YNlM) I do see that users must create both a YAML file and a Dockerfile to define the container image. When the demo shows the construction of pipelines, it does not mention how much additional work is needed behind the scenes to make the demo2 notebooks magically link up with everything the YAML file and Dockerfile imply. For a Jupyter notebook user, it does not even explain how to edit /opt/app-root/src/PCAF-sovereign-footprint/.aicoe-ci.yaml, which is a hidden file that the file browser cannot even open (see the workaround sketch below).
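One workaround is to read (and, if needed, rewrite) the hidden file from a notebook cell instead of the file browser. This is only a minimal sketch, assuming the file really lives at the path above:

```python
from pathlib import Path

# Path taken from my checkout above; adjust if your project is cloned elsewhere.
cfg_path = Path("/opt/app-root/src/PCAF-sovereign-footprint/.aicoe-ci.yaml")

# Dump the current contents so the hidden file can at least be inspected.
print(cfg_path.read_text())

# Edits can then be written back from the same cell, e.g.:
# cfg_path.write_text(updated_yaml_text)
```

But nothing in the README says this is how a notebook-focused user is expected to proceed.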

In the part of the video that shows how runtime images are selected (https://youtu.be/lGeT615YNlM?t=701) there is no mention of how to find the quay.io server, nor any explanation of the relationship between what a project should magically inherit from the AICoE template and the OperateFirst instance values for projects that are part of an Op1st environment (such as os-climate). The requirement that os-climate users create a redhat.com account to access quay.io repositories is confusing for an ODH user in a different organization. (The README does offer the name of a quay.io image that takes me to the right place, but that is buried well past the point where I run into trouble trying to follow the other directions first; see the tag-listing sketch below.)
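A way to at least see which runtime image tags exist, without a redhat.com account, is to query the public quay.io API. This is only a sketch; the repository name below is a placeholder for whatever image the README actually names:

```python
import requests

# Placeholder repository name; substitute the runtime image named in the README.
repo = "os-climate/aicoe-osc-demo"

# Public quay.io repositories expose their tag list without authentication.
resp = requests.get(f"https://quay.io/api/v1/repository/{repo}/tag/", params={"limit": 20})
resp.raise_for_status()

for tag in resp.json().get("tags", []):
    print(tag["name"], tag.get("last_modified"))
```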

I tried using the default https://ml-pipeline-ui.kubeflow.svc.cluster.local:80/pipeline advertised by the documentation, but that did not work. I interpolated a different endpoint by scraping what is in the demo video browser URL and changing CL1 to CL2: http://ml-pipeline-ui.kubeflow.apps.odh-cl2.apps.os-climate.org/pipeline but that gave this error message:

Error making request
Failed to initialize `kfp.Client()` against: 'http://ml-pipeline-ui.kubeflow.apps.odh-cl2.apps.os-climate.org/pipeline' - Check Kubeflow Pipelines runtime configuration: 'pcaf_kubeflow'

Error details:
Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.8/site-packages/elyra/pipeline/kfp/processor_kfp.py", line 123, in process
    client = TektonClient(
  File "/opt/app-root/lib64/python3.8/site-packages/kfp/_client.py", line 161, in __init__
    if not self._context_setting['namespace'] and self.get_kfp_healthz().multi_user is True:
  File "/opt/app-root/lib64/python3.8/site-packages/kfp/_client.py", line 363, in get_kfp_healthz
    raise TimeoutError('Failed getting healthz endpoint after {} attempts.'.format(max_attempts))
TimeoutError: Failed getting healthz endpoint after 5 attempts.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.8/site-packages/tornado/web.py", line 1704, in _execute
    result = await result
  File "/opt/app-root/lib64/python3.8/site-packages/elyra/pipeline/handlers.py", line 120, in post
    response = await PipelineProcessorManager.instance().process(pipeline)
  File "/opt/app-root/lib64/python3.8/site-packages/elyra/pipeline/processor.py", line 134, in process
    res = await asyncio.get_event_loop().run_in_executor(None, processor.process, pipeline)
  File "/usr/lib64/python3.8/asyncio/futures.py", line 260, in __await__
    yield self  # This tells Task to wait for completion.
  File "/usr/lib64/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
  File "/usr/lib64/python3.8/asyncio/futures.py", line 178, in result
    raise self._exception
  File "/usr/lib64/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/app-root/lib64/python3.8/site-packages/elyra/pipeline/kfp/processor_kfp.py", line 148, in process
    raise RuntimeError(
RuntimeError: Failed to initialize `kfp.Client()` against: 'http://ml-pipeline-ui.kubeflow.apps.odh-cl2.apps.os-climate.org/pipeline' - Check Kubeflow Pipelines runtime configuration: 'pcaf_kubeflow'
Check the JupyterLab log for more details at 2022-08-26 09:39:48
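For debugging, here is a minimal reachability check that can be run from a notebook cell against both endpoints. It is only a sketch and assumes the /apis/v1beta1/healthz path that kfp.Client() appears to poll internally:

```python
import requests

# The two endpoints tried above; neither is confirmed to be the right one for the cl2 cluster.
candidates = [
    "https://ml-pipeline-ui.kubeflow.svc.cluster.local:80/pipeline",
    "http://ml-pipeline-ui.kubeflow.apps.odh-cl2.apps.os-climate.org/pipeline",
]

for host in candidates:
    url = f"{host}/apis/v1beta1/healthz"  # assumed healthz path used by the KFP client
    try:
        r = requests.get(url, timeout=10)
        print(host, "->", r.status_code)
    except requests.RequestException as exc:
        print(host, "->", type(exc).__name__, exc)
```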

Happy to try again with some guidance.

schwesig commented 2 years ago

/kind bug

sesheta commented 2 years ago

@schwesig: The label(s) kind/bug cannot be applied, because the repository doesn't have them.

In response to [this](https://github.com/os-climate/aicoe-osc-demo/issues/200#issuecomment-1228648231):

> /kind bug

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.