jorgeorpinel closed this issue 9 months ago
Do we need a development/pipelines-related use case? We have https://dvc.org/doc/use-cases/versioning-data-and-model-files, which addresses model development but focuses on versioning rather than pipelines. My model development may include data validation and preprocessing, followed by model training and evaluation, and I iteratively update data, add features, tune models, etc. Pipelines can help compose this as a DAG with distinct stages, where I can easily and efficiently execute the pipeline and re-run only the necessary stages when I make changes.
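For illustration, the kind of DAG described above could be sketched in a `dvc.yaml` roughly like this (stage, script, and file names are hypothetical, not from any existing project):

```yaml
# Hypothetical dvc.yaml sketch: each stage declares its inputs (deps)
# and outputs (outs), so `dvc repro` only re-runs stages whose
# dependencies actually changed.
stages:
  preprocess:
    cmd: python preprocess.py data/raw data/prepared
    deps:
      - data/raw
      - preprocess.py
    outs:
      - data/prepared
  train:
    cmd: python train.py data/prepared model.pkl
    deps:
      - data/prepared
      - train.py
    outs:
      - model.pkl
  evaluate:
    cmd: python evaluate.py model.pkl metrics.json
    deps:
      - model.pkl
      - evaluate.py
    metrics:
      - metrics.json
```

E.g. editing only `train.py` would leave the `preprocess` stage cached and re-run just `train` and `evaluate`.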
> model development may include data validation and preprocessing followed by model training and evaluation... compose this as a DAG where I can easily and efficiently run only the necessary stages
Good catch! And in fact that probably goes even "before" experiment mgmt or production envs/MLOps.
> iteratively update data, add features, tune models, etc.
This overlaps with experiment management, which is fine. But if it's too much, we can leave it for the experiments-related use case (just a mention here).
UPDATE: Added to description
> Airflow (e.g. batch scoring) ... End-to-end scenario
Cc @mnrozhkov, I know you've worked quite a bit on this topic, so just pinging you here for visibility.
P.S. Our docs use cases are not enterprise-level so far; they're rather high-level and short. If you'd be interested in drafting one around these topics using your existing material, please let me know!
Guys, I'm giving this priority again per our current roadmap (now that #2587 is basically finished). I think Experiment Management is the most needed topic now, and it's along the lines of what @iesahin and I are working on (rel. #2548). But if anyone thinks another direction should have higher priority, please comment.
And if we agree on Experiment Management, what should be the spin? I.e. the user-perspective problem/solution and key concepts. I discussed this briefly with @shcheklein, and we think it could be centered around running and managing rapid iterations in DS projects (without Git overhead), and the concepts of bookkeeping, hyperparameters, metrics, and visualization.
What do you think? Cc @dberenbaum @flippedcoder @jendefig @casperdcl @tapadipti @dmpetrov @pmrowla
Bookkeeping + visualization seems the most relevant path to follow. Something along the lines of "push experiments to a central repository and see their comparative plots."
Some ideas for 3 (re: production environments/MLOps):
> The path from development to production could be better... as a mode of operation, I would favor a model where runs (e.g. artifacts, metrics, params, etc.) are pushed to production from a development environment. I am arguing for a model like git with remotes... where runs are captured locally first, and then, if confirmed, a run can be pushed to a remote server. A model like this just keeps things more tidy... authentication could also be directly supported to make it easier to deploy for production... For more production-oriented organizations... for example, production model monitoring
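That git-with-remotes flow roughly maps to DVC's existing experiment commands; a sketch (the remote name `origin` and experiment name `exp-tuned` are assumptions for illustration):

```shell
# Run and capture experiments locally first
dvc exp run

# Compare runs locally (params, metrics) before sharing anything
dvc exp show

# Only once a run is confirmed, push it to the shared Git remote,
# the way you'd push a reviewed branch with git
dvc exp push origin exp-tuned

# Upload the run's cached artifacts to remote storage
dvc push
```

The appeal of this model, as the quote argues, is tidiness: nothing reaches the shared/production side until it's explicitly promoted.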
From https://megagon.ai/blog/whatmlflowsolvesanddoesntforus/
Interesting diagram inspiration for 1.3 or 1.4
From https://medium.com/google-cloud/migrate-kedro-pipeline-on-vertex-ai-fa3f2c6f7aad
1. Data Management
2. Data Pipeline development
3. Experiment Management

   Preliminary ideas:
   - Hyperspace exploration [Tuning/Optimization]? May be too low level. There's a blog about this now.
   - Bookkeeping
   - Tracking (with Git): Rapid iterations. UPDATE: https://github.com/iterative/dvc.org/pull/2782 (exp + machine + CML?)

4. Production environments/MLOps

   4.1 DVC in Production
   - Training remotely
   - Deploying models (CLI or API)
   - Keeping pipelines and artifacts in sync between environments
   - Batch scoring a.k.a. "DVC for ETL" - see https://github.com/iterative/dvc.org/issues/2512#issuecomment-854999981
   - + Distributed/parallel computing

   4.2 ML Model Registry
   - Model lifecycle (training, shadow, active, inactive)
   - Automated/Continuous training (remotely)
   - Discovery and reusability
   - Deploying models
   - Batch scoring example + Real-time inference

   4.3 Production Integrations
   - Databases (e.g. SQL dump versioning/preprocessing)
   - Spark (e.g. remote training)
   - Airflow (e.g. batch scoring)
   - Kafka (e.g. real-time predictions)

   4.4 End-to-end scenario with a combination from the above, e.g.:
   - Importing data from Spark
   - Training remotely
   - Model Registry Ops
   - Batch scoring (Airflow integration)