
Deployment: UX Research Playback #4325

Open iamelijahko opened 1 week ago

iamelijahko commented 1 week ago

Why We Are Doing This Research

We know deployment is challenging because the ecosystem is constantly evolving, and industry trends shift rapidly. However, with the growing maturity of machine learning practices across the industry, we believe that Kedro’s next big milestone is to address the deployment issue. Solving this would truly set Kedro apart from other frameworks.

Through this research, we aim to understand what users mean by ‘deployment’—their goals, needs, and processes—and how they currently deploy Kedro pipelines. By uncovering these insights, we can focus on building Kedro features specifically for deployment, improve our plugins to better meet user needs, and refine our documentation to guide users more effectively.

Ultimately, this research is about empowering our users and helping Kedro stand out as a framework that makes deployment smoother and more achievable.

Research Approach

Our Kedro deployment research involved 10 user interviews across three user groups to capture diverse perspectives and deployment needs. Alongside the interviews, we also received 40 survey responses to enrich our understanding of user experiences.

This research took place over three weeks, allowing us to dive deep into deployment workflows, gather key insights, and identify areas for improvement.

Users and Platforms

Starting with user roles, we see that nearly half of our users, 47.5%, are data scientists. Machine learning engineers make up the next largest group at 22.5%, followed by data engineers at 20%. The remaining 10% fall into the 'Others' category, which includes roles like architects, platform engineers, and MLOps specialists.

Moving to the tools they use, Databricks and Docker stand out as the two most popular deployment platforms. They are followed by Google Vertex AI and Airflow, and then AWS SageMaker, serving users focused on cloud-based and orchestrated deployments. Kubernetes, Argo Workflows, Azure Data Factory, Kubeflow, and a handful of other tools also appear, but far less commonly.

Finally, 60% of our users rely on CI/CD automation for deployment. This shows a significant commitment to streamlined, repeatable deployment processes, emphasizing the importance of efficiency and consistency in their workflows.
[Image: survey result charts]
Based on 40 survey responses:
- Roles: Data Scientists (19), Machine Learning Engineers (9), Data Engineers (8), Architects (1), Data Analysts (1), MLOps Engineers (1)
- Kedro experience: 1–2 years (15), 3–5 years (14), less than 1 year (6), more than 5 years (5)
- Platforms: Databricks (17), Docker (14), Google Vertex AI (3), Airflow (3), AWS SageMaker (2); others: Kubernetes (1), Argo Workflows (1), Python packages (1), Azure Data Factory (1), Kubeflow (1), Snowflake (1), Azure ML (1)

User Groups

Based on 10 user interviews across three user groups.

User groups:
1. Kedro users who deploy on Databricks
2. Kedro users who deploy on other platforms
3. Users who opt out of Kedro for deployment

How each group defines deployment:
- Group 1: a reproducible, automated CI/CD process, with Kedro projects packaged as .whl files or Docker images (sketched below)
- Group 2: moving pipelines and datasets from DEV → TEST → PROD with continuous inference
- Group 3: a system for handling large datasets in batch processing, with fine-grained control over pipeline execution

Needs: data catalog configuration, parameter management, version control, infrastructure setup, node grouping, data access, secrets configuration, environment variables, and production systems in which an LLM triggers Kedro pipelines via an API.

Platforms: Databricks, Amazon SageMaker, Argo Workflows, Airflow, Kubeflow, and Google Cloud Platform.
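
Group 1's definition hinges on packaging: `kedro package` builds the project into a .whl (and also archives conf/ separately), which CI/CD then ships to the platform. Below is a minimal sketch of a job entry point for such a packaged project; the package name `my_project`, the `databricks` environment, and the assumption that conf/ is available on the worker are illustrative, not part of the research.

```python
# Hedged sketch of a job entry point for a packaged Kedro project.
# Assumptions (not from the research above): the wheel built by `kedro package`
# is installed on the worker, the project's conf/ is available in the working
# directory (or supplied via conf_source), and the package is named "my_project".
from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession


def run_packaged_pipeline(pipeline_name: str = "__default__", env: str = "databricks") -> None:
    configure_project("my_project")  # register the installed package with Kedro
    with KedroSession.create(env=env) as session:
        session.run(pipeline_name=pipeline_name)


if __name__ == "__main__":
    run_packaged_pipeline()
```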

Unified Deployment Journey

Based on 10 user interviews with 3 user groups
[Image: unified deployment journey map]

Key Insights

The findings below are grouped into five themes: plugins, node grouping, Kedro-Databricks deployment, online-inference deployment, and container / dependency management.

Insights
- Plugins: Users who rely on Kedro's connection plugins for third-party platforms run into outdated releases and compatibility issues, which makes converting Kedro nodes into platform components difficult and pushes them towards alternative solutions.
- Node grouping: Users value merging multiple nodes into a single task on the deployment platform for clarity and efficiency, but current plugins provide only limited support for this (see the sketch after this list).
- Kedro-Databricks deployment: Users deploy Kedro projects on Databricks in two ways: a longer method that generates a .whl file on DBFS, and a quicker method that makes the project code directly accessible in a Databricks repo, with the option of running it in notebooks.
- Online-inference deployment: Users increasingly want to deploy online inference pipelines (such as LLM calls) in isolated environments for real-time predictions, but Kedro offers limited support for this.
- Container / dependency management: Users often deploy with Docker images, but for larger projects a single container for the entire project can be inefficient.
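
As a rough illustration of what "grouping" could mean in framework terms (not a committed design), Kedro's public Pipeline API can already slice a pipeline by tag, and each tagged slice could map to one platform task instead of one task per node. The tag names and the tag-to-task mapping below are hypothetical.

```python
# Hedged sketch: derive one sub-pipeline per tag so a deployment plugin could emit
# one platform task per group of nodes instead of one task per Kedro node.
from kedro.pipeline import Pipeline


def group_by_tags(full_pipeline: Pipeline, group_tags: list[str]) -> dict[str, Pipeline]:
    """Return {tag: sub_pipeline} using Pipeline.only_nodes_with_tags."""
    groups = {tag: full_pipeline.only_nodes_with_tags(tag) for tag in group_tags}
    # Each non-empty group would become a single task on the target platform,
    # executed there with something like `kedro run --pipeline <name> --tags <tag>`.
    return {tag: sub for tag, sub in groups.items() if sub.nodes}
```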
Quotes "We were exploring how can we use a Kubeflow plugin to compile a Kedro pipeline and deploy on Kubeflow services."

"We opted not to use Kedro-Databricks plugin, as manually interacting with the Databricks API via Asset Bundles gives us greater flexibility."
"Combining nodes into single tasks improves overview, but we currently have to manually group them in Databricks." "Code stored in GitHub is packaged as a .whl with CI/CD via GitHub Actions and deployed to Databricks file system using Databricks Asset Bundles."

"I use GitHub Enterprise with CI/CD to sync each push from DEV to MAIN, deploy Databricks jobs to Azure workflows with Terraform."

"We use the VSCode-Databricks extension to keep the local codebase and Databricks repo synchronized."
"Real-time inference is managed separately through Azure Data Explorer."

"The inference pipeline operates continuously in PROD, processing live data for real-time predictions."

"...you're five years behind. I think API wrapping is becoming common... using an LLM to call a Kedro pipeline and create an API would be a great idea."
"How easy it is to split dependencies? Doesn’t the pipeline registry end up importing everything? "
Painpoints "The most difficult part... is translating our Kedro pipelines to Argo Workflows manifests."

"Kedro-SageMaker plugin was considered for client data storage but was not used due to incompatibility with Kedro 0.19."

"I rely on Kedro-Kubeflow plugins, which are outdated and maintained by community members."
"We can convert a single node to a Kubeflow Component, but deploying 400 nodes as separate containers adds complexity."

"Running each Kedro node in a separate container could make a small node execute in one or two seconds, but Argo’s longer pod startup time would make this inefficient."
"Kedro-Databricks plugin creates one task per node. For pipelines with a large number of nodes, this is impractical."

"Customizing deployment often requires working directly with the REST API, but configurations don’t always align with Databricks Asset Bundles."
"You can call this via an API, which will run the pipeline and return the output. I haven't found a proper way to do this."

"I’d prefer not to have YAML parameters and catalog registration... Maybe Kedro isn't the right tool."
“Some Kedro projects have too messy dependencies, the container becomes too big”

"We have one container for the whole project... Having to install Java as well as the whole PySpark library and some jars specifically related to Hadoop and AWS to read from S3—that’s huge."
Opportunities
- Plugins: How can we ensure third-party connection plugins remain reliable, with long-term compatibility and timely updates?
- Node grouping: How can we design node grouping—by tags, namespaces, pipelines, or other methods—to maximize usefulness for users?
- Kedro-Databricks deployment: How can we help users discover the most suitable and straightforward method for deploying on Databricks?
- Online-inference deployment: How can we provide users with a seamless experience for online-inference pipeline deployments, with API support and dynamic parameter and catalog management?
- Container / dependency management: How can we optimize deployment for large Kedro projects to avoid the inefficiencies of running everything from a single container?
Directions
- Plugins: Standardize and maintain the conversion process while preserving the existing API interactions for uploading and running pipelines. Enable automated deployment to reduce reliance on manual CI/CD steps. (Note: each connection plugin covers (1) converting Kedro pipelines for a given platform and (2) API interactions to upload and execute the converted pipelines.)
- Node grouping: Centralize grouping functionality within the framework instead of developing it separately for each plugin.
- Kedro-Databricks deployment: Update the Kedro-Databricks deployment guide, explaining the options at each step to offer flexibility and clarity for different deployment needs. Add a documentation section on simplified notebook-based deployment and on using the VS Code extension.
- Online-inference deployment: Add clear online deployment recommendations to our documentation, with links to relevant plugins and successful use cases. Improve dynamic parameter handling and catalog management in Kedro.
- Container / dependency management: Consider redesigning the micro-packaging concept to allow independent dependency management, containerization, and execution for each pipeline (see the sketch below).
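
As one concrete way to explore the last direction before any micro-packaging redesign, the pipeline registry itself can be made selective, so a slim inference image never imports the Spark/Hadoop-heavy modules quoted in the pain points. A hedged sketch follows; the environment variable, module paths, and grouping are illustrative rather than a Kedro convention.

```python
# src/my_project/pipeline_registry.py — illustrative sketch, not the standard template.
# A slim image sets KEDRO_PIPELINE_GROUP=inference and never imports the heavy
# Spark/Hadoop-dependent modules used by the batch pipelines.
import importlib
import os

from kedro.pipeline import Pipeline

_GROUPS = {
    "batch": [
        "my_project.pipelines.data_processing",
        "my_project.pipelines.training",  # depends on PySpark, Hadoop/AWS jars, Java
    ],
    "inference": ["my_project.pipelines.inference"],  # lightweight dependencies only
}


def register_pipelines() -> dict[str, Pipeline]:
    group = os.environ.get("KEDRO_PIPELINE_GROUP", "batch")  # hypothetical variable name
    registry: dict[str, Pipeline] = {}
    for module_path in _GROUPS[group]:
        module = importlib.import_module(module_path)  # import only what this image runs
        registry[module_path.rsplit(".", 1)[-1]] = module.create_pipeline()
    registry["__default__"] = sum(registry.values(), Pipeline([]))
    return registry
```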
datajoely commented 1 week ago

❤️