Closed — astrojuanlu closed this 6 months ago
Postponing this for now.
Expanding the scope of this to
Possibly intersecting with https://github.com/kedro-org/kedro/issues/3012
More axes worth exploring. All of my "conclusions" here are preliminary and should be taken as starting points for further exploration.
I contend that Kedro is a great data orchestrator (allow me to abuse the term "orchestrator" here to refer to pipelines) but not such a good workflow orchestrator. In fact, we've seen time and time again how users create "dummy datasets" to artificially connect two nodes that aren't otherwise connected, with the goal of controlling the execution order.
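For illustration, here is a minimal sketch of the trick: this is not Kedro's actual API (the node dicts and the toy runner below are simplified stand-ins), but it shows how a throwaway "dummy dataset" output forces an otherwise unrelated node to run second.

```python
# Toy dataflow runner (NOT Kedro): a node runs only once all of its declared
# input datasets exist in the catalog, so data dependencies dictate order.

def run_pipeline(nodes):
    """Execute nodes in dependency order; return the order and the catalog."""
    catalog, pending, order = {}, list(nodes), []
    while pending:
        # Pick any node whose inputs are all available.
        ready = next(n for n in pending if all(i in catalog for i in n["inputs"]))
        result = ready["func"](*(catalog[i] for i in ready["inputs"]))
        catalog[ready["output"]] = result
        order.append(ready["name"])
        pending.remove(ready)
    return order, catalog

def create_tables():
    return "done"          # dummy output whose only purpose is ordering

def load_data(_marker):    # consumes the dummy dataset just to run second
    return [1, 2, 3]

nodes = [
    {"name": "load_data", "func": load_data,
     "inputs": ["tables_created"], "output": "raw_data"},
    {"name": "create_tables", "func": create_tables,
     "inputs": [], "output": "tables_created"},
]
order, catalog = run_pipeline(nodes)
print(order)  # ['create_tables', 'load_data']
```

The two functions share no real data; the `tables_created` dataset exists purely so that the scheduler sees a dependency. That is the workaround users reach for when a framework offers data dependencies but no explicit "run after" relation.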
Is this something Kedro should improve? Or should it continue to stay away from workflow orchestration?
Speaking of ETL vs ELT, I contend that Kedro is an excellent framework if you're doing ETL, less so if you're doing ELT. Why? Because ELT sort of assumes direct storage of structured data in a data warehouse, and structured data is very amenable to SQL. Many teams will want to do ELT with Python though, and Kedro will serve them well.
Following Hopsworks' FTI (Feature, Training, Inference) mental map, I contend that Kedro is perfect for Feature and Training pipelines, but not very useful for Inference pipelines (which are basically model serving).
This mental map, by the way, greatly helps make sense of architecture diagrams like these:
(https://ml-ops.org/content/state-of-mlops, https://mymlops.com/)
There's sufficient evidence that data scientists (or, to avoid somewhat outdated categorizations, "machine learning scientists") don't care about orchestration or pipelines. They do care about data modelling, statistical significance, confounding factors, experiment tracking, and many other things.
(strawman proposal of a "how much data scientists care" pyramid, originally from https://venturebeat.com/business/mlops-vs-devops-why-data-makes-it-different/ then reproduced in https://outerbounds.com/metaflow/)
So, if "data scientists" don't care about orchestration, how do we serve them well? And what do data engineers and machine learning engineers care about?
Some (a few? many?) models don't make it to production ("early failures" in the Bathtub curve). But is that a bad thing? Or a natural result of the experimentation process?
(https://ml-ops.org/content/crisp-ml)
And if it's a natural result, does it constitute a problem worth solving?
And one last thing I forgot:
Kedro is not a streaming system. If anything, it can simulate streaming the way most people do: with a micro-batch approach. But Kedro startup times are notoriously high (https://github.com/kedro-org/kedro/issues/1476), so the latency would be noticeable.
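As a back-of-the-envelope model of why that latency hurts (the interval, startup, and processing figures below are illustrative assumptions, not measurements):

```python
# Micro-batch "streaming": a batch pipeline is launched every `interval_s`
# seconds and pays a fixed `startup_s` framework startup cost plus
# `processing_s` of actual work on every run.

def record_latency(arrival_s, interval_s, startup_s, processing_s):
    """Seconds from a record's arrival until its batch finishes processing."""
    next_batch_start = ((arrival_s // interval_s) + 1) * interval_s
    return next_batch_start + startup_s + processing_s - arrival_s

# A record arriving 1 s into a 60 s window, with an illustrative 5 s startup
# and 2 s of processing, waits 66 s end to end:
print(record_latency(arrival_s=1, interval_s=60, startup_s=5, processing_s=2))
```

The worst case is roughly `interval + startup + processing`, so a per-run startup cost that would be negligible in a nightly batch becomes a visible fraction of every micro-batch's latency.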
From The State of Applied Machine Learning 2023 (https://resources.tecton.ai/the-state-of-applied-machine-learning-2023-report), based on 1,700+ respondents surveyed during the month of February.
Very good insights. The report defines five pieces of an MLOps stack:
The Feature Store / Feature Platform and Monitoring & Observability components will see the largest increases (~43 percentage points increase for both) in adoption in the next 12 months ... Nearly 70% of respondents say they either have or plan to have a central MLOps platform in the next 12 months
More:
Respondents who shared that their companies have only batch models in production also shared that they struggle more with simpler organizational problems, such as demonstrating business ROI (41.5%) and lack of engineering and data science resources (21.5% and 24.8%, respectively). Meanwhile, respondents who shared that their companies have real-time models in production struggle more with "advanced" challenges, such as collaboration between engineering and data science teams (28.0%) and serving models with enterprise SLAs (21.5%).
Also, "building production data pipelines" was the second most cited challenge for both groups.
On the other hand:
Deploying a new model to production is a long process (>1 month for 65.0% of respondents and >3 months for 31.7%). 71.4% of respondents shared that their companies aim to improve deployment time by at least 10% in the next 12 months.
But (1) it doesn't explain why, or what "in production" entails! And (2) a 10% improvement doesn't seem like a particularly ambitious target to me (only 30% want to make it 50% faster, and only 3.6% want to make it 2x faster). A 10% improvement sounds to me like incremental progress = not a bottleneck.
More insights:
From https://www.comet.com/site/ty/report-2023-machine-learning-practitioner-survey/: "41% of their machine learning experiments had to be scrapped", mainly due to "API integration errors (26%), lack of resources (25%), inaccurate or misrepresentative data (25%) and manual mismanagement (25%)", and "machine learning practitioners surveyed say it takes their team seven months to deploy a single machine learning project".
And https://imerit.net/the-2023-state-of-mlops-report/: "Data's often the culprit for model failures"; "when evaluating the reason for the failure of ML projects, almost half of professionals (46%) said lack of data quality or precision was the number-one reason, followed by a lack of expertise".
Azure also separates data pipelines from machine learning pipelines.
Split the research in two: data pipelines (ETL/ELT) and machine learning pipelines.
Tool survey from August 2021 on Reddit (n=597) https://www.reddit.com/r/dataengineering/comments/pbaw2f/what_etl_tool_do_you_use/?utm_source=share&utm_medium=web2x&context=3
Another survey from 2023 (n=189, 89% were Metabase customers) https://www.metabase.com/data-stack-report-2023/#data-ingestion-in-house
In conclusion:
"Pipelines are a buzzword" https://www.reddit.com/r/datascience/comments/vmhurh/comment/ie2lai3/?utm_source=share&utm_medium=web2x&context=3, https://www.reddit.com/r/datascience/comments/vmhurh/comment/ie1xdro/?utm_source=share&utm_medium=web2x&context=3 and "pipelines are just automation of data processing" https://www.reddit.com/r/datascience/comments/vmhurh/comment/ie2a1xc/?utm_source=share&utm_medium=web2x&context=3
Hence "machine learning pipelines" is basically MLOps, or a subset of it https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
The one tool that was consistently mentioned was MLflow, not only for experiment tracking but as a broader MLOps solution.
According to the Tecton report, commercial MLOps platforms are much more widespread than open source solutions:
with adoption numbers reflecting broader trends on cloud market share https://www.statista.com/chart/18819/worldwide-market-share-of-leading-cloud-infrastructure-service-providers/
So in this space there's no clear winner either, but it's evident that commercial platforms beat open-source solutions in adoption.
Adding one more interesting industry survey about data engineering https://seattledataguy.substack.com/p/the-state-of-data-engineering-part-b61
Objective: Assess perceptions of Kedro and competitors.