Closed by @jorgeorpinel 3 years ago
And a side question: is this direction a higher priority than Experiments-related use cases atm? (see #2270)
Some thoughts:
Deploying models for real-time inference
- yep, feels too narrow, need to find a better angle

Model zoo
- too high-level a concept I think (a model zoo is close to a product). I think we can start with a model registry
Some ideas for this list:
Model Management and/or Model Lifecycle
- explain DVC from the models angle - we capture all information that is relevant to models (data, weights, metrics, experiments) and allow people to navigate it

Model Registry
- discovery and reusability

Experiments tracking/management
- here we should sell vs. W&B, MLflow, etc - rapid iterations, live metrics + other metrics + navigation

The title cases: MLOps direction is confusing for me... why not just cases: list of use cases to write?
OK, we're going to try to make this into a spike to come up with actionable items within 7 days or less, hopefully. Please help if you can, guys. I'll tag people via chat...
It might help to start with thesis statements instead of topics. Thesis statements would be like single-sentence use cases arguing for the utility of the products in given scenarios. Use cases are more persuasive writing compared to the explanatory writing of other docs, so a topic may not clarify what we plan to say about it. This will probably take more time and debate, but hopefully we will have more clarity in deciding which use cases to pursue and in writing the use cases. What do you think?
The title cases: MLOps direction is confusing for me... why not just cases: list of use cases to write?
Because we also have cases: Experiments #2270. That seemed like a totally different direction from all the previous ideas summarized here (mainly from #820), which I think at least somewhat relate to MLOps? Happy to change the title, but this is not a comprehensive list of use case ideas in all possible product directions.
Model Management and/or Model Lifecycle - explain DVC from the models angle - we capture all information that is relevant to models - data, weights, metrics, experiments - and allow people to navigate Model Registry - discovery and reusability
I have a feeling that model registries aren't different enough from data registries to write another full use case on that. But maybe it can be part of a Model Mgmt/Lifecycle use case. I like that idea! It could also cover or mention some of the topics above (training remotely, deployment, real-time predictions).
Happy to change the title, but this is not a comprehensive list of use case ideas in all possible product directions.
The way I initially understood the title (cases: new directions), the point of this research is to consolidate all possible ideas (w/o this split into experiments vs. ml models, which is hard for me to understand tbh - e.g. why aren't experiments about models?).
The title for that ticket you mention about experiments was about one specific use case to my mind.
I have a feeling that model registries aren't different enough from data registries to write another full use case on that.
it's a matter of what we are optimizing for here. I would not try to generalize by sacrificing the initial goal - more people come and see a high-level title that resonates with them. It's fine that they overlap internally.
In this specific case - I think model registry can be significantly different.
why experiments are not about models
Sure, it all connects. But here I'm thinking mostly about solutions for deploying and using ml models via DVC/CML e.g. production environments, model deployment, etc. Sorry for the confusion...
So it looks like so far the better-defined scenarios are
It might help to start with thesis statements instead of topics - single-sentence use cases arguing for the utility of the products in given scenarios.
@dberenbaum
Keep in mind a) this is not my area of expertise and b) this is based on preliminary understanding of the proposals, so my explanations above may be inexact.
Thanks, @jorgeorpinel! I didn't mean to suggest that you should bear responsibility for developing each thesis statement, or that each one needs to be perfected.
1. you can use DVC and CML to deploy ml models to production, and sync back the model learning to your development env/team
We have a few use case ideas around "production" and/or "deployment," and it's not clear to me what they mean. There are different scenarios that I have seen described as production deployments:

a. Automated training: Run a scheduled, automated training pipeline to keep your model updated with the latest data (this seems to be #862). The retrained model might then be used for the scoring scenarios below.
b. Batch scoring: Run a scheduled, automated scoring pipeline to always have updated predictions.
c. Real-time scoring: Submit data as needed to an API that returns model scores (see #2431).
I'd probably vote to focus on b, since a solution for c might not be fully developed yet. a could maybe be included as part of it if it's not too complex, but to me it's being covered by the CI/CD use case in development.
3. DVC helps you develop and manage ml models throughout their whole lifecycle (needs detailing)
As @shcheklein has mentioned, this can either be about a single model or many models, which might be different use cases.
For a single model, track, visualize, and analyze everything about your experiment, including code, parameters, metrics, plots, data, training DAG, and any other artifacts included in your repo.
For many models, try many different experiments and track them, enabling you to compare, select, reproduce, and iterate on any experiments.
a. Automated training
This could or could not be considered related to "in production". Training somewhere seems rather like a pre-requisite. I think it has more to do with CI/CD (which can be part of a prod deployment workflow, so there's overlap). This can probably be covered initially in #2404 indeed. Cc @casperdcl
b. Batch scoring
Is this basically ETL, where E = get a chunk of data, T = run a pre-trained model, L = store/upload the scores? That could be part of a use case but may still not be high-level enough.
c. Real-time scoring
Not sure I get how DVC plays a part in this. Probably just in the way the model is deployed (e.g. via the DVC API, which would be similar to this - going back to the "model registry" idea). Still not high-level enough IMO, but b and c def. seem related.
ml model lifecycle/management
this can either be about a single model or many models
Hmmm... By many models, do you mean actually different models with different goals (which would relate to "model registry"), or multiple versions of the same model in development? I usually assume the typical ML pipeline/project ends up in a single model.
BTW can we clarify what we mean by "model lifecycle"? Maybe training, active, inactive (related to "in production") or planning, data eng, modeling (much broader topic). Cc @shcheklein
initial goal - more people come, see the high level title that resonates with them. It's fine that they will overlap
Going back to this (which is why titles are important too), I think "DVC in Production" is a really good umbrella concept to begin with, keeping in mind it would be the first use case in this direction. It can have a story (maybe sections) that cover several of the scenarios we've discussed above. Later on we could split into multiple use cases if that's better. WDYT?
UPDATE: See quick draft (idea) in #2506
Is this basically ETL, where E = get a chunk of data, T = run a pre-trained model, L = store/upload the scores? That could be part of a use case but may still not be high-level enough.
Yup, although T could include other things in your pipeline (feature engineering).
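To make that ETL reading concrete, here is a minimal, hypothetical sketch of the batch-scoring pattern: E pulls a chunk of data, T does feature engineering plus scoring, L stores the results. The model is a stand-in lambda (not a real DVC-tracked artifact), and all names are illustrative.

```python
# Hedged sketch of batch scoring as ETL. Nothing here is DVC-specific;
# in practice each step could be a stage in a dvc.yaml pipeline.

def extract(batch):
    """E: get a chunk of raw data, dropping missing rows."""
    return [row for row in batch if row is not None]

def transform(rows, model):
    """T: feature engineering + running a pre-trained model."""
    features = [r * 10 for r in rows]   # toy feature engineering
    return [model(f) for f in features]

def load(scores, store):
    """L: store/upload the computed scores."""
    store.extend(scores)
    return store

model = lambda x: round(x / 100, 2)     # stand-in for model.predict
store = []
load(transform(extract([1, None, 3]), model), store)
print(store)  # [0.1, 0.3]
```

In a DVC setup, each of these functions would plausibly be its own pipeline stage, so a scheduler (e.g. Airflow, discussed below) only has to run the pipeline and push the outputs.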
Not sure I get how DVC plays a part in this. Probably just in the way the model is deployed (e.g. via the DVC API, which would be similar to this - going back to the "model registry" idea). Still not high-level enough IMO, but b and c def. seem related.
Right, other than the model registry idea, there's not much of a clear pattern here for how to use DVC.
Hmmm... By many models, do you mean actually different models with different goals (which would relate to "model registry"), or multiple versions of the same model in development? I usually assume the typical ML pipeline/project ends up in a single model.
Sorry, I meant many experiments from the same pipeline.
More feedback (from https://iterativeai.slack.com/archives/C6YHPP2TB/p1621617453043300):
From @mnrozhkov
- Batch Scoring project use case: it's common for large companies like Telecoms, Banks & FinTech
- for running it in production we could use Airflow
From @dmpetrov
- E2E from getting data from DB (or Spark) to training and setting up batch scoring (AirFlow prod)
- An external dvc-airflow integration https://github.com/covid-genomics/airflow-dvc
From these comments I take it that 1) there's support for covering the "batch scoring" scenario, 2) there's interest in certain integrations, specifically Airflow (I need to play with it) - maybe also MLflow? and 3) an e2e case could be a meaningful way to present some of these topics.
Also, @shcheklein shared https://neptune.ai/blog/model-registry-makes-mlops-work with me (on the "model registry" idea). I think this answers the Q of how model registries relate to MLOps/ "in production". Summary:
collaborative hub where teams can work together at different stages of the ML lifecycle [from (after) experimentation to production]... allows you to publish, test, monitor, govern and share [models]; all the key values (data, config, env, code, versions, and docs) are in one place
centralized tracking system that stores lineage, versioning, and related metadata for published ML models. (1) provide a mechanism to store model metadata (2) connect independent model training and inference processes by acting as a communication layer [metadata:] identifier, name, desc?, version, date, performance, path to the serialized model, and stage of deployment (dev, shadow-mode, prod, etc.)
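For illustration, the registry record described above could be sketched roughly like this. The field names follow the metadata list from the blog summary (identifier, version, performance, path to the serialized model, deployment stage), but are hypothetical, not an actual DVC or Neptune schema:

```python
# Hedged sketch of a model registry record and a stage transition.
# STAGES mirrors the "dev, shadow-mode, prod, etc." stages quoted above.
STAGES = ("dev", "shadow-mode", "prod", "archived")

def make_record(model_id, version, path, performance, stage="dev"):
    """Store model metadata in a centralized record (illustrative shape)."""
    assert stage in STAGES
    return {
        "id": model_id,
        "version": version,
        "path": path,                # path to the serialized model
        "performance": performance,  # e.g. {"auc": 0.91}
        "stage": stage,              # stage of deployment
    }

def promote(record, new_stage):
    """The 'communication layer' role: move a model between stages."""
    assert new_stage in STAGES
    return {**record, "stage": new_stage}

rec = make_record("churn-predictor", "1.2.0", "models/churn.pkl", {"auc": 0.91})
prod = promote(rec, "prod")
print(prod["stage"])  # prod
```

The point of the sketch is that the registry only holds metadata and pointers; the serialized model itself would stay in DVC-tracked storage.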
Nice, @jorgeorpinel! The comments on batch scoring and model registry use cases look good to me.
there's interest in certain integrations, specifically Airflow (I need to play with it) - maybe also MLflow?
Yes to Airflow since it is the default choice for pipeline orchestration, although might be worth looking into some alternatives like prefect (see https://neptune.ai/blog/best-workflow-and-pipeline-orchestration-tools).
MLFlow is probably better left for the experiment management use case since its focus is on tracking and comparing experiments rather than executing pipelines.
Here's a list proposal with 4 big ideas that group most of the concepts we've discussed (with overlaps):
1. DVC in Production (rel. https://github.com/iterative/dvc.org/pull/2506)
   - Intro to MLOps
   - Training remotely
   - Deploying models (CLI or API)
   - Keep pipelines, artifacts in sync between environments
   - Batch scoring a.k.a. "DVC for ETL" + distributed computing + parallel exec?

2. ML Model Registry
   - Model lifecycle (training, shadow, active, inactive)
   - Automated/continuous training (remotely)
   - Discovery and reusability
   - Deploying models
   - Batch scoring example + real-time inference

3. Production Integrations
   - Databases (e.g. SQL dump versioning/preprocessing)
   - Spark (e.g. remote training)
   - Airflow (e.g. batch scoring)
   - Kafka (e.g. real-time predictions)

4. End-to-end scenario with a combination from above, e.g.:
   - Importing (versioning?) data from Spark
   - (Automated) training remotely
   - MLOps via model registry
   - Batch scoring (Airflow integration)
Thanks @jorgeorpinel! Sounds good. What/where can we get the full list of use cases that we write / consider writing, etc.? (I assume that this ticket is still about "prod envs"?)
E.g. where should we put "Experiments tracking/management" / "ML bookkeeping" case, for example?
All use case ideas we have in GH have been consolidated here (see original desc.) - we could even close some/all - except https://github.com/iterative/dvc.org/issues/2270 (an epic itself) and https://github.com/iterative/dvc.org/issues/2512 (new, discussing).
I should prob make an epic/story ticket to close this and maybe some of the other issues linked above.
Resulting list of ideas: #2544
Closing spike.
CI/CD for ML (WIP)
- #2404 is a bit broad so far but in the process of being narrowed down. Not clear what it will cover exactly, but likely at least continuous integration using DVC + CML mentions. Cc @casperdcl

Model zoo / registry
- https://github.com/iterative/dvc/issues/2719#issuecomment-555373498. Whether using a DVC repo to centralize ML model management is fundamentally different from general data registries (possibly our most popular UC) has been a recurrent question. I think probably yes, but not sure how exactly. Cc @shcheklein

UPDATE: Jump to https://github.com/iterative/dvc.org/issues/2490#issuecomment-845449951