kurianbenoy commented 4 years ago

Title

ML Models and Dataset versioning

Description

In this talk we will discuss the current best practices for organizing ML projects and why traditional open-source tools like Git and Git-LFS fall short. I will focus on one of those best practices: versioning ML models and datasets.

Duration

[x] 30 min
[ ] 45 min

Audience

Intermediate

Outline

In this talk we will discuss the current best practices for organizing ML projects and why traditional open-source tools like Git and Git-LFS won't help us here.

Currently, the life cycle of a machine learning model goes through the following process:

An ML practitioner tries out a new image classification algorithm on an input dataset.
She tweaks the algorithm, tries other ideas and fixes bugs, all on her local system.
Some of her training runs take a long time, and the code may change while the weights remain the same.
She keeps around the model weights and evaluation scores for all her runs, and picks which weights to release as the final model once she’s out of time to run more experiments.
She publishes her results, with code and the trained weights.
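
The bookkeeping in this workflow can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular tool does it: the `Run` record, the `fingerprint` and `best_run` helpers, the commit hashes and the scores are all made up for the example.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass
class Run:
    """One training run: which code and weights produced which score."""
    name: str
    code_version: str   # e.g. a Git commit hash (hypothetical values below)
    weights_sha256: str  # fingerprint of the saved weights file
    accuracy: float     # evaluation score for this run (made-up numbers)


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest that identifies a blob of weights."""
    return hashlib.sha256(data).hexdigest()


def best_run(runs):
    """Pick the run with the highest evaluation score."""
    return max(runs, key=lambda r: r.accuracy)


# Three runs; the last two share the same code version but differ in weights.
runs = [
    Run("baseline", "a1b2c3d", fingerprint(b"weights-v1"), 0.91),
    Run("more-augmentation", "d4e5f6a", fingerprint(b"weights-v2"), 0.94),
    Run("bigger-model", "d4e5f6a", fingerprint(b"weights-v3"), 0.93),
]

release = best_run(runs)
print(json.dumps(asdict(release), indent=2))
```

Keeping a record like this by hand works for a few runs, but it does not store the weights or the dataset themselves, which is the gap that versioning tools aim to fill.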

Git can’t handle large amounts of data, on the order of gigabytes. Git-LFS comes with the built-in difficulty of supporting at most 2 GB of data (a GitHub limitation), and other problems exist as well.

Data Version Control, or DVC (dvc.org), is an open-source command-line tool written in Python. We will show how to version datasets with dozens of gigabytes of data and version ML models, how to use your favourite cloud storage (S3, GCS, or a bare-metal SSH server) as a data file backend, and how to embrace the best engineering practices in your ML projects. I will also discuss tools on the market for both experiment tracking and dataset versioning, and the best features of each (PS: no comparison of one against another).
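
To make the idea of dataset versioning concrete, here is a minimal Python sketch of one common approach: keep the large file in a content-addressed cache and let Git track only a tiny pointer file. It is a simplified illustration under my own assumptions, not DVC's actual implementation, file format, or API. The `track` and `restore` helpers, the `.ptr.json` suffix and the toy CSV data are all hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into the cache under its hash and write a pointer file."""
    digest = file_hash(data_file)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".ptr.json")
    pointer.write_text(json.dumps({"path": data_file.name, "sha256": digest}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file named in a pointer file from the cache."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / meta["sha256"], target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    workspace, cache = Path(tmp) / "project", Path(tmp) / "cache"
    workspace.mkdir()
    dataset = workspace / "train.csv"

    # Version 1 of the dataset; keep the pointer text as Git would.
    dataset.write_text("label,pixels\ncat,0 1 2\n")
    pointer_v1 = track(dataset, cache).read_text()

    # Version 2 adds a row; the cache now holds both versions.
    dataset.write_text("label,pixels\ncat,0 1 2\ndog,3 4 5\n")
    pointer_path = track(dataset, cache)

    # Roll back: put the old pointer in place and restore from the cache.
    pointer_path.write_text(pointer_v1)
    restored_text = restore(pointer_path, cache).read_text()
    cached_versions = len(list(cache.iterdir()))
    print(restored_text)
```

In this sketch the cache is a local directory; pointing it at remote storage such as S3, GCS or an SSH server is the part a real tool handles for you.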

Talk Outline

Startup Adventures
Challenges
Model and Dataset versioning?
How I discovered DVC
Use case: Versioning the Cats vs Dogs Deep Learning problem (8 min)
Conclusion

Slides

Additional notes

Kurian Benoy is an open-source contributor at CloudCV and DVC. He is the lead organiser of School of AI, Kochi, and an AI enthusiast working on Deep Learning and Computer Vision. Kurian is a FOSSASIA Open TechNights winner and gave a talk at the FOSSASIA Open Tech Summit about the keralarescue.in team.

I am an active Kaggler, was the first person to introduce Data Version Control on Kaggle, and am among the top 10 contributors to DVC so far.

[ ] Don't record this talk.

Check this if you don't want your talk to be recorded.

vinayak-mehta commented 4 years ago

@kurianbenoy Thanks for the proposal! Are you available to give this talk at next Saturday's meetup (Oct 19)?

kurianbenoy commented 4 years ago

@vinayak-mehta, I realised I won't be able to come to the meetup because: 1) I am new to Bangalore, and I will be there to attend the InOut Hackathon, which starts at 9 AM. I thought juggling both at once wouldn't be a good idea. 2) I don't have my personal laptop with me, so I am doubtful about how much of the demo I could show.

I hope I can come to the PyData Bangalore community one day :)

pydatabangalore / talks

ML Models and Dataset versioning #20