kedro-org / kedro-devrel

The Kedro developer relations team uses this for content ideation, creation, and execution
Apache License 2.0

Conduct external market research #94

Closed astrojuanlu closed 6 months ago

astrojuanlu commented 1 year ago

Objective: Assess perceptions of Kedro and competitors.

astrojuanlu commented 10 months ago

Postponing this for now.

astrojuanlu commented 9 months ago

Expanding the scope of this to

  1. Reassess the evolution of the more established Kedro competitors (DVC and MLflow gained pipelines, dbt gained Python support)
  2. Evaluate nascent competitors (Hamilton, Databricks bundles, brickflow)
  3. Understand Kedro's connection with other pieces of a typical MLOps stack (feature stores, experiment tracking solutions, data & model observability)
  4. Explore the current status of using Kedro with large structured and semi-structured data

Possibly intersecting with https://github.com/kedro-org/kedro/issues/3012

astrojuanlu commented 8 months ago

More axes worth exploring. All of my "conclusions" here are preliminary and should be treated as starting points for further exploration.

Data orchestration vs workflow orchestration

I contend that Kedro is a great data orchestrator (allow me to abuse the term "orchestrator" here to refer to pipelines) but not so good a workflow orchestrator. In fact, we've seen time and again how users create "dummy datasets" to artificially connect two nodes that aren't otherwise related, purely to control the execution order.
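
A minimal sketch of that workaround, using plain Python and `graphlib` to simulate how a runner derives execution order from dataset dependencies (all node and dataset names are hypothetical, and this is not actual Kedro code):

```python
# Simulate how a pipeline runner derives execution order from dataset
# dependencies, and how a "dummy dataset" forces an ordering between two
# nodes that share no real data. All names are illustrative.
from graphlib import TopologicalSorter

# Each node declares the datasets it consumes and produces, mirroring
# Kedro-style node(inputs=..., outputs=...) declarations.
nodes = {
    "load_customers": {"inputs": [], "outputs": ["customers"]},
    "build_report": {"inputs": ["customers"], "outputs": ["report"]},
    # "_report_sent" carries no real data: it exists only so that
    # "cleanup" is guaranteed to run after "send_report".
    "send_report": {"inputs": ["report"], "outputs": ["_report_sent"]},
    "cleanup": {"inputs": ["_report_sent"], "outputs": []},
}

# Wire node-to-node edges through shared datasets, as a runner would.
producer = {ds: n for n, spec in nodes.items() for ds in spec["outputs"]}
graph = {
    n: {producer[ds] for ds in spec["inputs"] if ds in producer}
    for n, spec in nodes.items()
}

order = list(TopologicalSorter(graph).static_order())
print(order)
```

The point is that `cleanup` only runs after `send_report` because of the fake `_report_sent` dataset, which is exactly the kind of workaround a workflow orchestrator would make unnecessary.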

Is this something Kedro should improve? Or should it continue to stay away from workflow orchestration?

Data pipelines

Speaking of ETL vs ELT, I contend that Kedro is an excellent framework if you're doing ETL, less so if you're doing ELT. Why? Because ELT more or less assumes loading structured data directly into a data warehouse, and structured data in a warehouse is very amenable to SQL. That said, teams that want to do the "T" of ELT in Python rather than SQL can still be served well by Kedro.
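
To make the contrast concrete, a toy sketch (in-memory SQLite stands in for the warehouse; table and column names are made up):

```python
# Toy contrast of ETL vs ELT, with in-memory SQLite standing in for the
# warehouse. All table and column names are illustrative.
import sqlite3

raw = [("alice", "42"), ("bob", "17")]  # extracted source records

# ETL: transform in Python *before* loading -- the shape Kedro handles well.
transformed = [(name.title(), int(age)) for name, age in raw]
etl_db = sqlite3.connect(":memory:")
etl_db.execute("CREATE TABLE users (name TEXT, age INTEGER)")
etl_db.executemany("INSERT INTO users VALUES (?, ?)", transformed)

# ELT: load the raw data as-is, then transform inside the warehouse with SQL.
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE raw_users (name TEXT, age TEXT)")
elt_db.executemany("INSERT INTO raw_users VALUES (?, ?)", raw)
elt_db.execute(
    "CREATE TABLE users AS "
    "SELECT upper(substr(name, 1, 1)) || substr(name, 2) AS name, "
    "       CAST(age AS INTEGER) AS age "
    "FROM raw_users"
)

# Both routes end at the same table; the difference is where the T happens.
print(etl_db.execute("SELECT * FROM users").fetchall())
print(elt_db.execute("SELECT * FROM users").fetchall())
```

Both routes produce the same `users` table; the difference is that the ELT transform lives in the warehouse's SQL engine, outside the Python process Kedro orchestrates.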

image

Machine learning pipelines

Following Hopsworks' FTI (Feature, Training, Inference) mental map, I contend that Kedro is perfect for Feature and Training pipelines, but not very useful for Inference pipelines (which are basically model serving).
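
The FTI split can be sketched as three decoupled stages that communicate only through shared stores (everything below is illustrative, not Hopsworks or Kedro API):

```python
# Minimal sketch of the FTI (Feature, Training, Inference) mental map:
# three decoupled pipelines that only talk through shared stores.
# All names and the "model" itself are illustrative.
feature_store, model_registry = {}, {}

def feature_pipeline(raw_rows):
    # Batch job: turn raw data into features -- a natural fit for Kedro.
    feature_store["spend"] = [r["amount"] * 2 for r in raw_rows]

def training_pipeline():
    # Batch job: read features, "train", publish a model -- also Kedro-friendly.
    features = feature_store["spend"]
    model_registry["v1"] = sum(features) / len(features)  # trivial "model"

def inference_pipeline(x):
    # Online serving: low-latency lookup + predict. This is where a
    # batch-oriented framework stops being the right tool.
    return x * model_registry["v1"]

feature_pipeline([{"amount": 1}, {"amount": 3}])
training_pipeline()
print(inference_pipeline(10))  # → 40.0
```

The first two stages are batch pipelines with clear inputs and outputs; the third is a long-running service, which is why "Inference" maps onto model serving rather than onto a pipeline framework.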

image

This mental map, by the way, greatly helps make sense of architecture diagrams like these:

(screenshots: "Build your MLOps stack" from MyMLOps, and the MLOps stack diagram from ml-ops.org)

(https://ml-ops.org/content/state-of-mlops, https://mymlops.com/)

What do data practitioners care about?

There's sufficient evidence that data scientists (or, to avoid somewhat outdated categorizations, "machine learning scientists") don't care about orchestration or pipelines. They do care about data modelling, statistical significance, confounding factors, experiment tracking, and many other things.

image

(strawman proposal of a "how much data scientists care" pyramid, originally from https://venturebeat.com/business/mlops-vs-devops-why-data-makes-it-different/ then reproduced in https://outerbounds.com/metaflow/)

So, if "data scientists" don't care about orchestration, how do we serve them well? And what do data engineers and machine learning engineers care about?

The "infant mortality" problem of ML

Some (a few? many?) models don't make it to production ("early failures" in the Bathtub curve). But is that a bad thing? Or a natural result of the experimentation process?

image

(https://ml-ops.org/content/crisp-ml)

And if it's a natural result, does it constitute a problem worth solving?

astrojuanlu commented 8 months ago

And one last thing I forgot

Batch vs streaming

Kedro is not a streaming system. At best, it can simulate streaming the way most people do: with a micro-batch approach. But Kedro startup times are notoriously high (https://github.com/kedro-org/kedro/issues/1476), so the latency would be noticeable.
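
A sketch of "streaming via micro-batches", in plain Python (the function names are illustrative; in Kedro, each tick would be a full pipeline run, so the startup cost is paid on every batch):

```python
# Sketch of simulating streaming with micro-batches: chop an event stream
# into small chunks and run one (batch) pipeline invocation per chunk.
# With high per-run startup overhead, small chunks mean high relative latency.
from itertools import islice

def micro_batches(source, batch_size):
    """Yield consecutive fixed-size chunks from an event source."""
    it = iter(source)
    while chunk := list(islice(it, batch_size)):
        yield chunk

events = range(7)  # stand-in for an incoming event stream
results = []
for batch in micro_batches(events, batch_size=3):
    # Here, one full pipeline run would happen per batch; we just aggregate.
    results.append(sum(batch))

print(results)  # → [3, 12, 6]
```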

astrojuanlu commented 8 months ago

From The State of Applied Machine Learning 2023 (https://resources.tecton.ai/the-state-of-applied-machine-learning-2023-report), based on 1,700+ respondents surveyed during February 2023.

Very good insights. It defines five pieces of an MLOps stack:

  1. model serving
  2. model registry and versioning
  3. feature store
  4. model monitoring
  5. data monitoring

The Feature Store / Feature Platform and Monitoring & Observability components will see the largest increases (~43 percentage points increase for both) in adoption in the next 12 months ... Nearly 70% of respondents say they either have or plan to have a central MLOps platform in the next 12 months

More:

Respondents who shared that their companies have only batch models in production also shared that they struggle more with simpler organizational problems, such as demonstrating business ROI (41.5%) and lack of engineering and data science resources (21.5% and 24.8%, respectively). Meanwhile, respondents who shared that their companies have real-time models in production struggle more with "advanced" challenges, such as collaboration between engineering and data science teams (28.0%) and serving models with enterprise SLAs (21.5%).

Also, "building production data pipelines" was the second-most-cited challenge for both groups.

on the other hand:

Deploying a new model to production is a long process (>1 month for 65.0% of respondents and >3 months for 31.7%). 71.4% of respondents shared that their companies aim to improve deployment time by at least 10% in the next 12 months.

But (1) it doesn't explain why, or what "in production" entails! And (2) a 10% improvement doesn't seem like a particularly ambitious target to me (only 30% want to make it 50% faster, and only 3.6% want to make it 2x faster). A 10% improvement sounds to me like incremental progress = not a bottleneck.

more insights:

astrojuanlu commented 8 months ago

From https://www.comet.com/site/ty/report-2023-machine-learning-practitioner-survey/: "41% of their machine learning experiments had to be scrapped", mainly due to "API integration errors (26%), lack of resources (25%), inaccurate or misrepresentative data (25%) and manual mismanagement (25%)". Also, "machine learning practitioners surveyed say it takes their team seven months to deploy a single machine learning project".

And from https://imerit.net/the-2023-state-of-mlops-report/: "Data’s often the culprit for model failures", and "when evaluating the reason for the failure of ML projects, almost half of professionals (46%) said lack of data quality or precision was the number-one reason, followed by a lack of expertise".

astrojuanlu commented 7 months ago

https://learn.microsoft.com/en-us/azure/machine-learning/concept-ml-pipelines?view=azureml-api-2#which-azure-pipeline-technology-should-i-use

Azure also separates data pipelines from machine learning pipelines.

image

astrojuanlu commented 6 months ago

Split the research in two: data pipelines (ETL/ELT) and machine learning pipelines.

Data pipelines

Tool survey from August 2021 on Reddit (n=597) https://www.reddit.com/r/dataengineering/comments/pbaw2f/what_etl_tool_do_you_use/?utm_source=share&utm_medium=web2x&context=3

(image: Reddit poll results)

Another survey from 2023 (n=189, 89% were Metabase customers) https://www.metabase.com/data-stack-report-2023/#data-ingestion-in-house

(image: Metabase survey results)

In conclusion:

Machine learning pipelines

"Pipelines are a buzzword" https://www.reddit.com/r/datascience/comments/vmhurh/comment/ie2lai3/?utm_source=share&utm_medium=web2x&context=3, https://www.reddit.com/r/datascience/comments/vmhurh/comment/ie1xdro/?utm_source=share&utm_medium=web2x&context=3 and "pipelines are just automation of data processing" https://www.reddit.com/r/datascience/comments/vmhurh/comment/ie2a1xc/?utm_source=share&utm_medium=web2x&context=3

Hence "machine learning pipelines" is basically MLOps, or a subset of it https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

image

The one tool that was consistently mentioned was MLflow, not only for experiment tracking but as a broader MLOps solution.

According to the Tecton report, commercial MLOps platforms are much more widespread than open source solutions:

image

with adoption numbers reflecting broader trends on cloud market share https://www.statista.com/chart/18819/worldwide-market-share-of-leading-cloud-infrastructure-service-providers/

image

So in this space there's no clear winner either, but it's evident that commercial platforms outcompete open source solutions.

astrojuanlu commented 5 months ago

Adding one more interesting industry survey about data engineering https://seattledataguy.substack.com/p/the-state-of-data-engineering-part-b61

image