opendatahub-io / architecture-decision-records

Collection of Architectural Decision Records

[Questions] Regarding Data Science Pipelines #11

Open andreaTP opened 1 year ago

andreaTP commented 1 year ago

Hi all! It's amazing that we can actually look up the ADRs in this repository! Thanks a lot for the openness! 🙏

I was going through this one: https://github.com/opendatahub-io/architecture-decision-records/blob/main/ODH-ADR-0002-data-science-pipelines-multi-user-approach.md and I have some follow-up questions, I hope this is the right place 🙂

integrating Data Science Pipelines (upstream = Kubeflow Pipelines)

Here I read that Kubeflow is the technology to be used, is it a requirement? An assumption? Or have we evaluated alternatives and decided to use Kubeflow? In the latter case, I'm super interested in having access to the comparison!

we propose to roll out multiple individual single-user Kubeflow Pipelines stack into multiple namespaces

On the Kubeflow documentation, I can see that the minimum requirements are pretty significant. Do we have estimations of the system requirements for the proposed configurations (e.g. shared vs. local MinIO and Postgres), plus what the operator itself needs? How much do we expect this to scale on a user's cluster? Have we explored alternatives, or do we expect users to use a single installation in a dedicated namespace on each cluster? I think the answers to these questions are relevant information that should validate the decision taken.

Thanks a lot in advance!

accorvin commented 1 year ago

@andreaTP thanks for the questions! I manage the core team developing Data Science Pipelines.

@opendatahub-io/data-science-pipelines-maintainers can you guys chime in here?

rimolive commented 1 year ago

@andreaTP Thank you so much for the questions!

Our decision with Kubeflow Pipelines came from the following assumptions:

- The decision to go with Kubeflow Pipelines is because we are already working in the Kubeflow community.
- It is cloud-native, different from other solutions.
- Argo vs. Tekton.

As for the minimum requirements and proposed configurations, the reference you sent is very outdated (it comes from v0.6 of the Kubeflow docs; we are currently using v1.6), but let me explain how we implemented the solution. We created an operator to deploy the whole stack in multiple namespaces. We compared a single shared stack vs. multiple stacks, one per namespace, and decided to go with multiple stacks (the ADR describes some of the alternatives we considered). When we say "stack", we mean the whole Kubeflow Pipelines installation, including the database and object store, but these components can also be external services that the stack uses.

We also found other issues where a single shared stack would make Data Science Pipelines more complicated to use, including security concerns. We also ran a perf test where we deployed 2k stacks, and the resource consumption for the operator seemed reasonable to us.
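To make the per-namespace rollout a bit more concrete, here is a rough sketch of the idea using the Kubernetes Python client. The CRD group/version/kind and the spec fields shown are illustrative assumptions based on the description above, not the operator's exact API:

```python
# Illustrative sketch only: the CRD group/version/kind and the spec fields below
# are assumptions for the sake of the example, not the operator's exact API.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# One single-user pipelines stack per data-science project namespace.
namespaces = ["ds-project-a", "ds-project-b"]

for ns in namespaces:
    stack = {
        "apiVersion": "datasciencepipelinesapplications.opendatahub.io/v1alpha1",  # assumed
        "kind": "DataSciencePipelinesApplication",                                  # assumed
        "metadata": {"name": "pipelines", "namespace": ns},
        "spec": {
            # Each namespace can get its own database and object store...
            "database": {"mariaDB": {"deploy": True}},      # field names assumed
            "objectStorage": {"minio": {"deploy": True}},   # field names assumed
            # ...or point at shared external services instead, e.g.:
            # "database": {"externalDB": {"host": "shared-db.example.com"}},
        },
    }
    api.create_namespaced_custom_object(
        group="datasciencepipelinesapplications.opendatahub.io",  # assumed
        version="v1alpha1",
        namespace=ns,
        plural="datasciencepipelinesapplications",
        body=stack,
    )
```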

If we compare against the minimum requirements in that link, it may not be a valid comparison with our solution, since our configuration is quite different.

Hope that clarifies why we decided on Kubeflow Pipelines and the solution we chose to implement.

andreaTP commented 1 year ago

Hi @rimolive thanks a ton for taking the time to share those answers!

This sheds light on the motivations and background work supporting the decisions. Let me ask a few follow-up questions to make sure I understand the full picture 🙂

the decision to go with Kubeflow Pipelines is because we are already working in the Kubeflow community

Here I read that Kubeflow was considered a "natural fit". Does this mean that no other technology was evaluated in this context?

Cloud-Native, different from other solutions

Can you expand on the "other solutions" compared?

Argo vs. Tekton

This is a very interesting decision! Is there any document I can look up on the motivations for using one vs. the other?

As for the minimum requirements and proposed configurations, the reference you sent is very outdated

Do you have a reference for updated numbers?

We also ran a perf test where we deployed 2k stacks and the resource consumption for the operator seemed reasonable to us.

This sounds great! Where can I find more information about it? Do you have a private or public repository documenting the setup used and how the test was executed? Did you collect any data during the experiment? Even a simple `kubectl top nodes` would give pretty valuable information!
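For reference, here is a minimal sketch of how that kind of node-level data could be collected programmatically during such a test, using the Kubernetes Python client and assuming metrics-server is installed so the metrics.k8s.io API is available:

```python
# Rough programmatic equivalent of `kubectl top nodes`, assuming metrics-server
# is installed so the metrics.k8s.io API is available on the cluster.
from kubernetes import client, config

config.load_kube_config()
metrics_api = client.CustomObjectsApi()

node_metrics = metrics_api.list_cluster_custom_object(
    group="metrics.k8s.io", version="v1beta1", plural="nodes"
)

for item in node_metrics["items"]:
    name = item["metadata"]["name"]
    cpu = item["usage"]["cpu"]        # e.g. "1250m"
    memory = item["usage"]["memory"]  # e.g. "3456Mi"
    print(f"{name}: cpu={cpu}, memory={memory}")
```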

If we compare the minimum requirements in that link, maybe it's not a valid comparison with our solution

Fair, is there a plan to have an updated estimation?

Thanks a lot in advance, your answers are really appreciated!

rimolive commented 1 year ago

Here I read that Kubeflow was considered a "natural fit". Does this mean that no other technology was evaluated in this context?

Not sure how long you have been following our roadmap, but we tried to bring Airflow into the ODH components list. Airflow, along with Argo, were the options we considered before kfp. The fact that those weren't cloud-native solutions at the time we were evaluating options, in addition to kfp being more focused on MLOps tasks, made us decide on kfp.

Can you expand on the "other solutions" compared?

See my previous answer

Any document I can look up on the motivations using one vs the other?

I don't know if we have publicly documented it somewhere. I'll check if we have, and share it in this issue.

Do you have a reference for updated numbers?

Unfortunately, no. Those were the numbers collected by the Kubeflow team, and since we have a different configuration we expect to run these perf tests ourselves.

Where can I find more information about it? Do you have a private or public repository collecting the used setup and how the test has been executed? Have you collected any data during the experiment?

I'll check that info and share it in this issue.

Fair, is there a plan to have an updated estimation?

We'd like to run a perf test to verify the current configuration constraints, but the engineering team has other priorities right now, such as integrating the rest of the kfp components and the v2 migration when GA is released.