[SPIKE] cases: list of ideas (related to prod envs)

jorgeorpinel commented 3 years ago

Most of the existing ideas summarized here have something to do with ML models, I think.

Training on a remote machine #194 was the first idea and would narrowly focus on remote SSH executors. We can prob consider it part of 3.0 experiments now? (I can't find the related discussion but there's one cc @pmrowla)
DVC in Production #862 sounds broad but it intends to define and doc a best method around "synchronizing" DVC pipelines between experimental local dev and working prod envs (I think). Cc @dmpetrov
CI/CD for ML (WIP) #2404 is a bit broad so far but in the process of being narrowed down. Not clear what it will cover exactly but likely at least continuous integration using DVC + CML mentions. Cc @casperdcl
Deploying models for real-time inference #2431 may be a bit narrow but adds interesting edge to the model deployment topic, specifically using "model registries" and real-time data streaming. Cc @dberenbaum
Model ~~zoo~~ registry https://github.com/iterative/dvc/issues/2719#issuecomment-555373498. Whether using a DVC repo to centralize ML model management is fundamentally different from general data registries (possibly our most popular UC) has been a recurrent question. I think that probably yes but not sure how exactly. Cc @shcheklein

Extracted from #820

UPDATE: Jump to https://github.com/iterative/dvc.org/issues/2490#issuecomment-845449951

jorgeorpinel commented 3 years ago

And a side question: is this direction a higher priority than Experiments-related use cases atm? (see #2270)

shcheklein commented 3 years ago

Some thoughts:

CI/CD is taken care of by @casperdcl . It takes time to iterate but we will get there. And I think we'll be fine, at lease with this specific title.
Deploying models for real-time inference - yep, feel too narrow, need to find a better angle
Model zoo - is too high level concept I think (model zoo is close to a product). I think we can start with a model registry?

Some ideas for this list:

Model Management and/or Model Lifecycle - explain DVC from the models angle - we capture all information that is relevant to models - data, weights, metrics, experiments - and allow people to navigate
Model Registry - discovery and reusability
Experiments tracking/management - here we should sell W&B, MlFlow, etc - rapid iterations, live metrics + other metrics + navigation

shcheklein commented 3 years ago

title is confusing cases: MLOps direction for me ... why not just - cases: list of use case to write ?

jorgeorpinel commented 3 years ago

OK we're going to try to make this into a spike to come up with actionable items within 7 days or less hopefully. Please help if you can guys. I'll tag people via chat... ⌛

dberenbaum commented 3 years ago

It might help to start with thesis statements instead of topics. Thesis statements would be like single-sentence use cases arguing for the utility of the products in given scenarios. Use cases are more persuasive writing compared to the explanatory writing of other docs, so a topic may not clarify what we plan to say about it. This will probably take more time and debate, but hopefully we will have more clarity in deciding which use cases to pursue and in writing the use cases. What do you think?

jorgeorpinel commented 3 years ago

title is confusing cases: MLOps direction for me ... why not just - cases: list of use case to write ?

Because we also have cases: Experiments #2270. That seemed like a totally different direction from all the previous ideas summarized here (mainly from #820), which I think at least somewhat relate to MLOps? Happy to change the title but this is not the a comprehensive list of use case ideas in all possible product directions.

jorgeorpinel commented 3 years ago

Model Management and/or Model Lifecycle - explain DVC from the models angle - we capture all information that is relevant to models - data, weights, metrics, experiments - and allow people to navigate Model Registry - discovery and reusability

I have a feeling that model registries aren't different enough from data registries to write another full use case on that. But maybe it can be part of a Model Mgmt/Lifecycle use case. I like that idea! It could also cover or mention some of the topics above (training remotely, deployment, real-time predictions).

shcheklein commented 3 years ago

Happy to change the title but this is not the a comprehensive list of use case ideas in all possible product directions.

The way I initially understood the title cases: new directions and the meaning of this research is to consolidate all possible ideas (w/o this split - experiments, ml models - which is hard for me to understand tbh - e.g. why experiments are not about models?).

The title for that ticket you mention about experiments was about one specific use case to my mind.

I have a feeling that model registries aren't different enough from data registries to write another full use case on that.

it's a matter of what we are optimizing here. I would not be trying to generalize by sacrificing the initial goal - more people come, see the high level title that resonates with them . It's fine that they will overlap internally.

In this specific case - I think model registry can be significantly different.

jorgeorpinel commented 3 years ago

why experiments are not about models

Sure, it all connects. But here I'm thinking mostly about solutions for deploying and using ml models via DVC/CML e.g. production environments, model deployment, etc. Sorry for the confusion...

So it looks like so far the better-defined scenarios are

synchronizing between development and production ml models (#862)
ml model registry (construction? usage?)
ml model lifecycle/management (see https://github.com/iterative/dvc.org/issues/2490#issuecomment-843685144)

jorgeorpinel commented 3 years ago

It might help to start with thesis statements instead of topics — single-sentence use cases arguing for the utility of the products in given scenarios.

@dberenbaum

you can use DVC and CML to deploy ml models to production, and sync back results/status with the master repo
you can package and ship (pre-trained) ml models to a central registry and make DVC projects downstream that use and depend on them.
DVC helps you develop and manage ml models throughout their whole lifecycle (needs detailing)

Keep in mind a) this is not my area of expertise and b) this is based on preliminary understanding of the proposals, so my explanations above may be inexact.

dberenbaum commented 3 years ago

Thanks, @jorgeorpinel! I didn't mean to suggest that you should bear responsibility for developing each thesis statement, or that each one needs to be perfected.

1. you can use DVC and CML to deploy ml models to production, and sync back the model learning to your development env/team

We have a few use case ideas around "production" and/or "deployment," and it's not clear to me what they mean. There are different scenarios that I have seen described as production deployments: a. Automated training: Run a scheduled, automated training pipeline to keep your model updated with the latest data (this seems to be #862). The retrained model might then be used for the scoring scenarios below. b. Batch scoring: Run a scheduled, automated scoring pipeline to always have updated predictions. c. Real-time scoring: Submit data as needed to an API that returns model scores (see #2431).

I'd probably vote to focus on b since a solution for c might not be fully developed yet. a could maybe be included as part of it if it's not too complex, but to me it's being covered by the CI/CD use case in development.

3\. DVC helps you develop and manage ml models throughout their whole lifecycle (needs detailing)

As @shcheklein has mentioned, this can either be about a single model or many models, which might be different use cases.

For a single model, track, visualize, and analyze everything about your experiment, including code, parameters, metrics, plots, data, training DAG, and any other artifacts included in your repo.

For many models, try many different experiments and track them, enabling you to compare, select, reproduce, and iterate on any experiments.

jorgeorpinel commented 3 years ago

a. Automated training

This could or could not be considered related to "in production". Training somewhere seems rather like a pre-requisite. I think it has more to do with CI/CD (which can be part of a prod deployment workflow, so there's overlap). This can probably be covered initially in #2404 indeed. Cc @casperdcl

b. Batch scoring

Is this basically ETL where E=get chunk of data, T=run pre-trained model, L=store/upload scores ? That could be part of a use case but may still not be high-level enough.

c. Real-time scoring

Not sure I get how DVC play a part in this. Probably just in the way to deploy the model (e.g. via the DVC API which would be similar to this -- going back to the "model registry" idea). Still not high-level enough IMO but b and c def. seem related.

jorgeorpinel commented 3 years ago

ml model lifecycle/management

this can either be about a single model or many models

Hmmm... By many models do you mean actually different models with different goals (would relate with "model registry'), or multiple versions of a same model in development? I usually assume the typical ML pipeline/project ends up in a single model.

BTW can we clarify what we mean by "model lifecycle"? Maybe training, active, inactive (related to "in production") or planning, data eng, modeling (much broader topic). Cc @shcheklein

initial goal - more people come, see the high level title that resonates with them . It's fine that they will overlap

Going back to this (which is why titles are important too), I think "DVC in Production" is a really good umbrella concept to begin with, keeping in mind it would be the first use case in this direction. It can have a story (maybe sections) that cover several of the scenarios we've discussed above. Later on we could split into multiple use cases if that's better. WDYT?

UPDATE: See quick draft (idea) in #2506

dberenbaum commented 3 years ago

Is this basically ETL where E=get chunk of data, T=run pre-trained model, L=store/upload scores ? That could be part of a use case but may still not be high-level enough.

Yup, although T could include other things in your pipeline (feature engineering).

Not sure I get how DVC play a part in this. Probably just in the way to deploy the model (e.g. via the DVC API which would be similar to this -- going back to the "model registry" idea). Still not high-level enough IMO but b and c def. seem related.

Right, other than the model registry idea, there's not much of a clear pattern here for how to use DVC.

Hmmm... By many models do you mean actually different models with different goals (would relate with "model registry'), or multiple versions of a same model in development? I usually assume the typical ML pipeline/project ends up in a single model.

Sorry, I meant many experiments from the same pipeline.

jorgeorpinel commented 3 years ago

More feedback (from https://iterativeai.slack.com/archives/C6YHPP2TB/p1621617453043300):

From @mnrozhkov

Batch Scoring project use case: it’s a common for large companies like Telecoms, Banks & FinTech

for production running we could use Airflow

From @dmpetrov

E2E from getting data from DB (or Spark) to training and setting up batch scoring (AirFlow prod)

An external dvc-airflow integration https://github.com/covid-genomics/airflow-dvc

☝️ From these comments I take 1) there's support for covering the "batch scoring" scenario, 2) there's interest in certain integrations, specifically Airflow (I need to play with it ⌛) -- maybe also MLFlow? and 3) an e2e case could be a meaningful way to present some of these topics.

Also, @shcheklein shared https://neptune.ai/blog/model-registry-makes-mlops-work with me (on the "model registry" idea). I think this answers the Q of how model registries relate to MLOps/ "in production". Summary:

collaborative hub where teams can work together at different stages of the ML lifecycle [from (after) experimentation to production]... allows to publish, test, monitor, govern and share [models] all the key values (data, config, env, code, versions, and docs) are in one place

centralized tracking system that stores lineage, versioning, and related metadata for published ML models. (1) provide a mechanism to store model metadata (2) connect independent model training and inference processes by acting as a communication layer [metadata:] identifier, name, desc?, version, date, performance, path to the serialized model, and stage of deployment (dev, shadow-mode, prod, etc.)

dberenbaum commented 3 years ago

Nice, @jorgeorpinel! The comments on batch scoring and model registry use cases look good to me.

there's interest in certain integrations, specifically Airflow (I need to play with it ⌛) -- maybe also MLFlow?

Yes to Airflow since it is the default choice for pipeline orchestration, although might be worth looking into some alternatives like prefect (see https://neptune.ai/blog/best-workflow-and-pipeline-orchestration-tools).

MLFlow is probably better left for the experiment management use case since its focus is on tracking and comparing experiments rather than executing pipelines.

jorgeorpinel commented 3 years ago

Summary (again)

Here's a list proposal with 4 big ideas that group most of the concepts we've discussed (with overlaps):

DVC in Production (rel. https://github.com/iterative/dvc.org/pull/2506) (intro to MLOps) Training remotely Deploying models (CLI or API) Keep pipelines, artifacts in sync between environments Batch scoring a.k.a. "DVC for ETL" + Distributed computing + Parallel exec?

ML Model Registry Model lifecycle (training, shadow, active, inactive) Automated/Continuous training (remotely) Discovery and reusability Deploying models Batch scoring example + Real-time inference

Production Integrations Databases (e.g. SQL dump versioning/preprocessing) Spark (e.g. remote training) AirFlow (e.g. batch scoring) Kafka (e.g. real-time predictions)

End-to-end scenario with a combination from above e.g.: Importing (versioning?) data from Spark (Automated) Training remotely MLOps via Model Registry Batch scoring (AirFlow integration)

shcheklein commented 3 years ago

Thanks @jorgeorpinel ! Sounds good, what/where can we get the full list of uses case that we write/consider to write, etc? (I assume that this ticket is still about "prod envs"?

E.g. where should we put "Experiments tracking/management" / "ML bookkeeping" case, for example?

jorgeorpinel commented 3 years ago

All use case ideas we have in GH have been consolidated here (see original desc.) — we could even close some/all — except https://github.com/iterative/dvc.org/issues/2270 (an epic itself) and https://github.com/iterative/dvc.org/issues/2512 (new, discussing).

I should prob make an epic/story ticket to close this and maybe some of the other issues linked above ⌛

jorgeorpinel commented 3 years ago

Resulting list of ideas: #2544

Closing spike.

iterative / dvc.org

[SPIKE] cases: list of ideas (related to prod envs) #2490

Summary (again)