iterative / dvc.org

📖 DVC website and documentation
https://dvc.org

blog/use-case: Unit tests for data using DVC #2512

Closed (iesahin closed this issue 2 years ago)

iesahin commented 3 years ago

In Towards ML Engineering: A Brief History Of TensorFlow Extended (TFX), authors state

> Similarly to how code is at the heart of software, data is at the heart of ML. Data management represents serious challenges in production ML. Perhaps the simplest analogy would be to think about what constitutes a unit test for data. Unit tests verify expectations on how code should behave, by testing the contracts of the pertinent code and instilling trustworthiness in said contracts. Similarly, setting explicit expectations on the form of the data (including its schema, invariants and value distributions), and checking that the data agrees with implicit expectations embedded in the training code can, more so together, make the data trustworthy enough to train models on.

DVC can be used to write unit tests for data retrieved from the wild: checking basic statistics and distributions, and cleaning and sanitizing the data before making it available to the model. Models usually have (implicit) assumptions about the distribution of the data, and when these change (data drift), the model no longer performs well.
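
A minimal sketch of what such a check could look like (all file names, column names, and thresholds below are hypothetical), as a script that a DVC pipeline stage could run:

```python
# check_data.py -- sketch of a "unit test" for incoming data (hypothetical names and thresholds)
import sys

import pandas as pd

EXPECTED_COLUMNS = {"age", "income", "label"}  # schema assumption
EXPECTED_MEAN_AGE = (25.0, 45.0)               # distribution assumption

df = pd.read_csv("data/incoming.csv")

errors = []
missing_cols = EXPECTED_COLUMNS - set(df.columns)
if missing_cols:
    errors.append(f"missing columns: {sorted(missing_cols)}")

if "age" in df.columns:
    mean_age = df["age"].mean()
    if not EXPECTED_MEAN_AGE[0] <= mean_age <= EXPECTED_MEAN_AGE[1]:
        errors.append(f"mean(age)={mean_age:.1f} outside expected range {EXPECTED_MEAN_AGE}")

if errors:
    print("\n".join(errors), file=sys.stderr)
    sys.exit(1)  # a non-zero exit makes the calling pipeline stage fail

print("data checks passed")
```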

A blog post or UC document about this case may be useful.

shcheklein commented 3 years ago

I think this is related to what @casperdcl is working on (CI/CD for ML, #2404); that work should include at least some part about this.

dberenbaum commented 3 years ago

🤔 This diagram from https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning#mlops_level_2_cicd_pipeline_automation describes a high-level mature ML workflow:

[image: MLOps level 2 CI/CD pipeline automation diagram from the linked Google Cloud article]

IMO #2404 addresses the final step and highest level of maturity for the model experimentation/development/test phase as shown in boxes 2 and 3 in the diagram.

What @iesahin wrote seems more focused on the production phase (data in the wild) where new unknown data is input to a deployed model. I'd probably see it as being the "data validation" step in box 4 in the diagram, and maybe also related to box 6 (at least the mention of data drift).

Ultimately, I guess it depends on what @iesahin intended. Either way, agree that using DVC to validate data could be valuable.

iesahin commented 3 years ago

Thanks for the diagram @dberenbaum

I read the paper as part of a Coursera MLOps course. The idea is to continuously improve models by training them with new data, but this new data may have different properties than the original data used to train the model and may cause a decrease in performance.

For example, if we have a face detection model trained before the pandemic and now want to improve it, newer data will contain a larger proportion of masked faces. Some of the assumptions the original model makes about the original dataset may no longer hold. If we write unit tests for these assumptions, improving the model becomes safer, with less technical debt.

This can be used with CML as well; a data validation step is almost always required for online improvements. I think #2404 is more of a higher-level description. What I have in mind is a more specific, concrete example: we have two different sources of data and a model trained on the first, but we want to improve the model with the second without a complete retraining.

This is about improving pipeline reliability by using statistical unit tests. If something weird is supplied to the pipeline, it should raise a notification before trying to retrain the model. A similar document for CML may also be nice, but here I would like to keep the scope limited to manual retraining.

shcheklein commented 3 years ago

Hmm. To me, "unit testing" is part of "CI", and I would never consider it part of production. It's a development (or pre-prod) phase. Data here is not that different from code to my mind: either we use it in the existing model (production) or as input for researchers (implementing new models, etc.). CI/CD can work the same way; it's just that in the data lifecycle even the model development phase can be considered "production".

The scenario (not even the right term; it's more about creating a sense of what is possible) I would try to describe as part of #1942:

I have a data registry (in our case, a repo) and there are datasets that we keep updating there. We would use PRs; on a PR we can run simple safety checks, legal checks, schema checks, etc. using CML (CI for data). When that's done we merge (the CD-for-data part) and people can use this updated dataset.

> I think #2404 is more of a higher-level description.

Yes, which means you're talking about a small tutorial that would go alongside that use case. Or a blog post indeed. Or a how-to.

iesahin commented 3 years ago

> I have a data registry (in our case, a repo) and there are datasets that we keep updating there. We would use PRs; on a PR we can run simple safety checks, legal checks, schema checks, etc. using CML (CI for data). When that's done we merge (the CD-for-data part) and people can use this updated dataset.

I don't assume a predefined data registry. What I propose is something like: if you write unit tests up front, you can be safer when new data comes along. "Safer" may mean different things in different cases, but it's similar to software development, where unit tests are used to check for regressions (in performance) and against updated requirements.

I think a tabular, structured dataset may illustrate this better. Suppose we have built a logistic regression model on some data with 10% missing values in a column, and we used some method to fill them in. Now a new dataset comes along with 80% of the values in that column missing. We need to be aware of this before supplying the data to the model for training.
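
For instance, that assumption could be captured in a small test (the file path, column name, and threshold here are made up for illustration); when a new batch with 80% missing values arrives, the test fails before the data ever reaches training:

```python
# test_data_assumptions.py -- hypothetical example; run with `pytest`
import pandas as pd

MAX_MISSING_RATIO = 0.15  # the original data had ~10% missing values in this column


def test_income_missing_ratio():
    df = pd.read_csv("data/new_batch.csv")
    ratio = df["income"].isna().mean()
    assert ratio <= MAX_MISSING_RATIO, (
        f"{ratio:.0%} of 'income' is missing; "
        f"the imputation strategy assumed at most {MAX_MISSING_RATIO:.0%}"
    )
```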

"Writing unit tests for your assumptions (in DVC) saves a lot of headaches later" is the story what I would like to tell. We can automate it and tell this in CML as well but for this first instance, I would focus on the problem manually.

dberenbaum commented 3 years ago

@shcheklein I think applying traditional software terms like "unit testing" and "production" to these ML scenarios is what's confusing me. Here's what I had in mind related to the diagram above:

shcheklein commented 3 years ago

> I think applying traditional software terms like "unit testing" and "production"

Yep, that is, I guess, why I was thinking more about the dev phase rather than monitoring.

> This definitely sounds like CI/CD, it's just distinct from the manual model development phase where the data is static and the focus is on experimentation.

Yep. It's stretching it a bit, though. I would still call this the production phase, and it indeed requires monitoring (like any production engineering system; running some data tests here falls into the data engineering realm for me).

shcheklein commented 3 years ago

So, if we talk more about data engineering (automated pipelines, prod monitoring, etc.) here, how do you see DVC helping?

jorgeorpinel commented 3 years ago

- "tests on data retrieved from the wild"
- "assumptions in the original model about the original data set may not hold"
- "data validation" and data drift
- "using DVC to validate data could be valuable"
- "improving the pipeline reliability by using statistical tests (on the data)"

These are the key words I see above. From my (non-ML) PoV it seems like a very narrow thing (maybe a best practice?). I also wonder what DVC's role is here, other than codifying a pipeline stage.

The broader topic of data quality may be interesting though, e.g. as mentioned above: releasing data like code via PRs with tests, QA against staging/prod models, etc. It may relate to the existing Data Registry and Model Registry cases (idea, see https://github.com/iterative/dvc.org/issues/2490#issuecomment-853561046).

shcheklein commented 3 years ago

I think DVC's role here (at least in the "unit testing" a data registry scenario) is that it kind of implies a Git flow: the data "is" in Git, a PR has to be made to update it, CI can run with access to that data, it can show you pass/fail, etc.

dberenbaum commented 3 years ago

> So, if we talk more about data engineering (automated pipelines, prod monitoring, etc.) here, how do you see DVC helping?

In automated pipelines, those may be DVC pipelines that include data validation stages. Failures may trigger alerts and stop automatic deployment of updated models. Using DVC not only makes it easy to automate this pipeline, but also to check out the data and outputs from failed pipeline runs for further analysis.
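
As a sketch of that idea (stage and file names are hypothetical): a validation script can write a small report as a tracked output and exit non-zero when checks fail; since `dvc repro` stops at a failing stage, downstream train/deploy stages never run, and the data and report that caused the failure can be checked out later for analysis.

```python
# validate.py -- hypothetical data validation stage; if it exits non-zero,
# `dvc repro` stops here and downstream train/deploy stages are not executed
import json
import sys

import pandas as pd

df = pd.read_csv("data/features.csv")
report = {
    "rows": len(df),
    "missing_income_ratio": float(df["income"].isna().mean()),
}

# write the report as a stage output, so it can be checked out later
# together with the exact data version that produced it
with open("reports/validation.json", "w") as f:
    json.dump(report, f, indent=2)

if report["rows"] == 0 or report["missing_income_ratio"] > 0.15:
    sys.exit("data validation failed -- see reports/validation.json")
```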

In prod monitoring, it's less clear whether DVC has value yet. I think it potentially does if the production data and model are tracked by DVC, since DVC takes care of provenance, so monitoring results can be tied directly back to specific versions of data and models.

shcheklein commented 3 years ago

Makes sense, @dberenbaum. It's a bit of an indirect value (relative to this specific topic): it's not like DVC helps you with the "data tests" themselves, it's more that DVC could be used instead of Airflow (or alongside it) and cover some of those scenarios that a data engineering pipeline would cover. In this case, would it be good to start with a "DVC for ETL" / "DVC for production pipelines" / "needs a better name" use case?

Mention benefits like:

And then write short tutorials/examples on how this can be applied, including the possibility of doing unit tests on data (e.g. using libraries like Great Expectations).
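
For instance, a minimal sketch using the pandas-style Great Expectations API (the dataset path, column names, and thresholds are invented for illustration, and the exact shape of the validation result varies between library versions):

```python
# expectations_check.py -- hypothetical sketch; assumes the pandas-style
# Great Expectations API (ge.read_csv returning a dataset with expect_* methods)
import great_expectations as ge

df = ge.read_csv("data/new_batch.csv")

df.expect_column_to_exist("income")
df.expect_column_values_to_not_be_null("income", mostly=0.9)  # allow at most ~10% missing
df.expect_column_mean_to_be_between("age", min_value=18, max_value=90)

results = df.validate()
# depending on the library version, `results` is a dict or a result object;
# both expose a "success" flag
if not results["success"]:
    raise SystemExit("data expectations failed")
```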

jorgeorpinel commented 3 years ago

Overlaps a bit with "DVC in Production" from https://github.com/iterative/dvc.org/issues/2490#issuecomment-853561046.

UPDATE: Mentioned in #2544 in case you want to close this for consolidation purposes.