opendatahub-io / architecture-decision-records

Collection of Architectural Decision Records

[Questions] Regarding Data Science Pipelines #11

Open andreaTP opened 1 year ago

andreaTP commented 1 year ago

Hi all! It's amazing that we can actually look up the ADRs in this repository! Thanks a lot for the openness! 🙏

I was going through this one: https://github.com/opendatahub-io/architecture-decision-records/blob/main/ODH-ADR-0002-data-science-pipelines-multi-user-approach.md and I have some follow-up questions, I hope this is the right place 🙂

integrating Data Science Pipelines (upstream = Kubeflow Pipelines)

Here I read that Kubeflow is the technology to be used, is it a requirement? An assumption? Or have we evaluated alternatives and decided to use Kubeflow? In the latter case, I'm super interested in having access to the comparison!

we propose to roll out multiple individual single-user Kubeflow Pipelines stack into multiple namespaces

On the Kubeflow documentation, I can see that the minimum requirements are pretty significant. Do we have estimations of the system requirements for the proposed configurations (e.g. shared vs. local MinIO and Postgres), plus what the operator itself needs? How much do we expect this to scale on a user's cluster? Have we explored alternatives, or do we expect users to use a single installation in a dedicated namespace on each cluster? I think the answers to these questions are relevant information that should validate the decision taken.

Thanks a lot in advance!

accorvin commented 1 year ago

@andreaTP thanks for the questions! I manage the core team developing Data Science Pipelines.

@opendatahub-io/data-science-pipelines-maintainers can you guys chime in here?

rimolive commented 1 year ago

@andreaTP Thank you so much for the questions!

Our decision with Kubeflow Pipelines came from the following assumptions:

- The decision to go with Kubeflow Pipelines is because we are already working in the Kubeflow community.
- It is cloud-native, different from other solutions.
- Argo vs. Tekton.

As for the minimum requirements and proposed configurations, the reference you sent is very outdated (it comes from v0.6 of the Kubeflow docs; we are currently using v1.6), but let me explain how we implemented the solution. We created an operator to deploy the whole stack in multiple namespaces. We compared a single shared stack vs. multiple stacks, one per namespace, and decided to go with multiple stacks (the ADR describes some of the alternatives we considered). When we say "stack", we mean the whole Kubeflow Pipelines installation, including the database and object store, but these components can also be external services that the stack uses.

We also found other issues where a single shared stack would make Data Science Pipelines more complicated to use, including security concerns. We also ran a perf test where we deployed 2k stacks, and the resource consumption for the operator seemed reasonable to us.
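To make the per-namespace rollout a bit more concrete, here is a rough sketch of the idea using the Kubernetes Python client. The CRD group/version/kind and the spec fields shown are illustrative assumptions based on the description above, not the operator's exact API:

```python
# Illustrative sketch only: the CRD group/version/kind and the spec fields below
# are assumptions for the sake of the example, not the operator's exact API.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# One single-user pipelines stack per data-science project namespace.
namespaces = ["ds-project-a", "ds-project-b"]

for ns in namespaces:
    stack = {
        "apiVersion": "datasciencepipelinesapplications.opendatahub.io/v1alpha1",  # assumed
        "kind": "DataSciencePipelinesApplication",                                  # assumed
        "metadata": {"name": "pipelines", "namespace": ns},
        "spec": {
            # Each namespace can get its own database and object store...
            "database": {"mariaDB": {"deploy": True}},      # field names assumed
            "objectStorage": {"minio": {"deploy": True}},   # field names assumed
            # ...or point at shared external services instead, e.g.:
            # "database": {"externalDB": {"host": "shared-db.example.com"}},
        },
    }
    api.create_namespaced_custom_object(
        group="datasciencepipelinesapplications.opendatahub.io",  # assumed
        version="v1alpha1",
        namespace=ns,
        plural="datasciencepipelinesapplications",
        body=stack,
    )
```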

If we compare against the minimum requirements in that link, it may not be a valid comparison with our solution, since our configuration is quite different.

Hope that clarifies why we decided on Kubeflow Pipelines and the solution we chose to implement.

andreaTP commented 1 year ago

Hi @rimolive thanks a ton for taking the time to share those answers!

This sheds light on the motivations and background work supporting the decisions. Let me ask a few follow-up questions to make sure I understand the full picture 🙂

the decision to go with Kubeflow Pipelines is because we are already working in the Kubeflow community

Here I read that Kubeflow was considered a "natural fit". Does this mean that no other technology was evaluated in this context?

Cloud-Native, different from other solutions

Can you expand on the "other solutions" compared?

Argo vs. Tekton

This is a very interesting decision! Is there any document I can look up on the motivations for using one vs. the other?

As for the minimum requirements and proposed configurations, the reference you sent is very outdated

Do you have a reference for updated numbers?

We also ran a perf test where we deployed 2k stacks and the resource consumption for the operator seemed reasonable to us.

This sounds great! Where can I find more information about it? Do you have a private or public repository documenting the setup used and how the test was executed? Did you collect any data during the experiment? Even a simple `kubectl top nodes` would give pretty valuable information!
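For reference, here is a minimal sketch of how that kind of node-level data could be collected programmatically during such a test, using the Kubernetes Python client and assuming metrics-server is installed so the metrics.k8s.io API is available:

```python
# Rough programmatic equivalent of `kubectl top nodes`, assuming metrics-server
# is installed so the metrics.k8s.io API is available on the cluster.
from kubernetes import client, config

config.load_kube_config()
metrics_api = client.CustomObjectsApi()

node_metrics = metrics_api.list_cluster_custom_object(
    group="metrics.k8s.io", version="v1beta1", plural="nodes"
)

for item in node_metrics["items"]:
    name = item["metadata"]["name"]
    cpu = item["usage"]["cpu"]        # e.g. "1250m"
    memory = item["usage"]["memory"]  # e.g. "3456Mi"
    print(f"{name}: cpu={cpu}, memory={memory}")
```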

If we compare the minimum requirements in that link, maybe it's not a valid comparison with our solution

Fair, is there a plan to have an updated estimation?

Thanks a lot in advance, your answers are really appreciated!

rimolive commented 1 year ago

Here I read that Kubeflow was considered a "natural fit". Does this mean that no other technology was evaluated in this context?

Not sure how long you have been following our roadmap, but we tried to bring Airflow into the ODH components list. Airflow, along with Argo, were the options we considered before kfp. The fact that those weren't cloud-native solutions at the time we were evaluating options, in addition to kfp being more focused on MLOps tasks, made us decide on kfp.

Can you expand on the "other solutions" compared?

See my previous answer

Any document I can look up on the motivations using one vs the other?

I don't know if we have publicly documented it somewhere. I'll check if we have, and share it in this issue.

Do you have a reference for updated numbers?

Unfortunately, no. Those were the numbers collected by the Kubeflow team, and since we have a different configuration we expect to run these perf tests ourselves.

Where can I find more information about it? Do you have a private or public repository collecting the used setup and how the test has been executed? Have you collected any data during the experiment?

I'll check that info and share it in this issue.

Fair, is there a plan to have an updated estimation?

We'd like to run a perf test to verify the current configuration constraints, but the engineering team has other priorities right now, such as integrating the rest of the kfp components and the v2 migration when GA is released.