Closed iesahin closed 2 years ago
Thanks for the clear summary! A few quick thoughts:
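* I'm a little unclear on the point of `9-cnn-model`. Why convert to a CNN other than to match the existing `dvc-checkpoints-mnist` implementation?
* Is there an equivalent to what is now in https://dvc.org/doc/start/experiments?
* You mention there's no clear order in 9-12, but I wonder if we should have some order here. I could see an order like 9->12->10->11, where the tutorial first adds `checkpoints: true` to `dvc.yaml`, then adds checkpoints to the script in the most manual and language-agnostic way, then consolidates to use `make_checkpoint`, and then further consolidates/enhances to use `dvclive`.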
I'm a little unclear on the point of `9-cnn-model`. Why convert to a CNN other than to match the existing `dvc-checkpoints-mnist` implementation?
The first 8 tags use an MLP model for quick iteration. I think it's good for exposing data access & versioning, pipelines, and params, but not for checkpoints. My intention was to provide a very simple model that we can run on Katacoda, and actually both MLP and CNN run on it. The CNN model is also written in TensorFlow and fits the rest of the pipeline. The change from MLP to CNN needs a single parameter change in `params.yaml`. We can add more models in this fashion to the `models.py` file.
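The single-parameter switch described above can be sketched as a small registry in a `models.py`-style module. This is a minimal sketch, not the project's actual code: the `model.name` key, the `MODEL_BUILDERS` registry, and the layer descriptions are hypothetical, and plain dictionaries stand in for the TensorFlow layers so the example runs without extra dependencies.

```python
"""Hypothetical sketch of selecting a model from a value in params.yaml."""

from typing import Callable, Dict, List

# Each builder returns a list of layer descriptions. In the real project these
# would be TensorFlow/Keras layers; plain dicts keep the sketch dependency-free.
LayerSpec = Dict[str, object]
MODEL_BUILDERS: Dict[str, Callable[[dict], List[LayerSpec]]] = {}


def register(name: str):
    """Decorator that adds a builder function to the registry under `name`."""
    def decorator(builder):
        MODEL_BUILDERS[name] = builder
        return builder
    return decorator


@register("mlp")
def build_mlp(params: dict) -> List[LayerSpec]:
    # One hidden dense layer, mirroring the MLP used for the first tags.
    return [
        {"type": "flatten"},
        {"type": "dense", "units": params.get("units", 16)},
        {"type": "dense", "units": 10, "activation": "softmax"},
    ]


@register("cnn")
def build_cnn(params: dict) -> List[LayerSpec]:
    return [
        {"type": "conv2d", "filters": params.get("filters", 8)},
        {"type": "flatten"},
        {"type": "dense", "units": 10, "activation": "softmax"},
    ]


def get_model(params: dict) -> List[LayerSpec]:
    """Look up the builder named by params["model"]["name"] and call it."""
    model_params = params["model"]
    name = model_params["name"]
    if name not in MODEL_BUILDERS:
        raise ValueError(
            f"Unknown model {name!r}; choose from {sorted(MODEL_BUILDERS)}"
        )
    return MODEL_BUILDERS[name](model_params)
```

With a layout like this, adding a model means writing one more decorated builder, and switching models is a one-line change of the (assumed) `model.name` value in `params.yaml`.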
Is there an equivalent to what is now in https://dvc.org/doc/start/experiments?
Our Katacoda Experiments scenario uses the new project. It's not as extensive as the document but there is enough room to play with the parameter values for experimentation, so yes, I can update the document quickly.
You mention there's no clear order in 9-12, but I wonder if we should have some order here. I could see an order like 9->12->10->11, where the tutorial first adds `checkpoints: true` to `dvc.yaml`, then adds checkpoints to the script in the most manual and language-agnostic way, then consolidates to use `make_checkpoint`, and then further consolidates/enhances to use `dvclive`.
It's possible to change the tags to reflect this order. I think I see the language agnostic way as a last resort, so I put it to the end. If we decide to merge checkpoints and this project, I can use this order. (BTW, I can create a similar R or Julia (or any other language) project for the signal file demo.)
I think these approaches are alternative to each other and if the user decides on one of them, they will stick to it. Branches like you used in the checkpoints project may make more sense. After we get the full pipeline, we either use `checkpoint: true`, or `make_checkpoint()` or a signal file, not all of them.
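The signal-file alternative mentioned above can be sketched in a few lines. This is a sketch under assumptions, not DVC's implementation: it assumes the DVC 2.x behavior in which a script running under `dvc exp run` creates a signal file and waits until DVC removes it. The function name, the `timeout` parameter, and the polling interval are hypothetical, and the file path is left to the caller.

```python
import time
from pathlib import Path


def signal_checkpoint(signal_file: Path, poll_seconds: float = 0.1,
                      timeout: float = 60.0) -> bool:
    """Create the signal file and wait until another process removes it.

    Returns True if the file was removed within `timeout` seconds,
    False otherwise. The path is supplied by the caller.
    """
    signal_file.parent.mkdir(parents=True, exist_ok=True)
    signal_file.touch()

    deadline = time.monotonic() + timeout
    while signal_file.exists():
        if time.monotonic() > deadline:
            return False
        time.sleep(poll_seconds)
    return True
```

In a training loop this would be called once per epoch, after the model and metrics files are written, for example with `Path(".dvc/tmp/DVC_CHECKPOINT")` (the location is an assumption here). The same create-then-wait protocol can be written in any language, which is what makes this approach language-agnostic.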
Thank you @dberenbaum
Good stuff @iesahin !
Quick question - does it make sense to reverse the order and start with a simple project, single script (or even Jupyter?) and go into pipelines after that? To highlight the experiments first and "grow" into more advanced stuff? (It would contradict with branching though - we would need to keep applying the pipelines changes to every branch?)
Some comments:
The source code and related files will be copied to this repository instead of downloading from S3. I don't think of a use case to put the files to S3 instead of a Github repository. Why was so?
since we start with `git init` in the get started flow, we needed a way to quickly "download" files
The new project repository will be https://github.com/iterative/dvc-get-started
is this project enough to completely replace the existing get started? then it makes sense. Otherwise we can rename both - put some prefix/suffix to both?
We can put 0-8 to the main branch and convert other tags to branches as there is no clear progress between these.
if we can do meaningful branches this is even better, we'll have a good features coverage for the DVC Studio. But on the other hand it might complicate the get started experience?
@dberenbaum @flippedcoder @dmpetrov what's your take so far - can this project and this approach replace/cover the existing mnist and be enough to cover the experiments + checkpoints.
does it make sense to reverse the order and start with a simple project, single script (or even Jupyter?) and go into pipelines after that? To highlight the experiments first and "grow" into more advanced stuff? (It would contradict with branching though - we would need to keep applying the pipelines changes to every branch?)
That's certainly possible. I think we can have multiple branches and multiple tags in each of them, like
* `main` branch same as the above
* `experiments` branch ➡️ `01-exp-run` ➡️ `02-exp-apply` ➡️ `03-exp-branch` ...
* `checkpoints` branch ➡️ `01-checkpoint_true` ➡️ `02-make-checkpoint` ➡️ `03-signal-file` ...
* `data-access` branch ➡️ `01-dvc-get` ➡️ `02-dvc-import-url` ➡️ `03-dvc-import` ...
* `data-versioning` branch ➡️ `01-dvc-add` ➡️ `02-dvc-push` ....
* `cml` branch ...

(`experiments` and `checkpoints` can be merged IMO, but that's not a strong opinion.)
I set the tags similar to the current GS project because it's easier to discuss, and after that we can use the same material for further improvements. This is basically v0, so it seems a better approach to use something we're used to.
It will be a bit more difficult but we can have multiple `generate` scripts for each branch, using the same source and data with different DVC commands.
since we start with `git init` in the get started flow, we needed a way to quickly "download" files
I mean, they are already within the `code/` directory. Why don't we just copy them? :)
is this project enough to completely replace the existing get started? then it makes sense. Otherwise we can rename both - put some prefix/suffix to both?
`0-8` is ready now. I'll test the whole GS material with this next week, also writing the modifications for the checkpoints. Whether to actually replace the current docs with this is a matter of discussion and management. I'm in favor of replacing: I can write fully tested GS documentation and put it into CI to check the command changes, but you may think it's not worth it and I don't feel strongly about this.
I'm putting the `dvc-` prefix on all DVC-related repositories, because iterative has other projects not directly related to DVC, but I'm completely open to suggestions and changes in that regard. Once decided, it won't change much.
if we can do meaningful branches this is even better, we'll have a good features coverage for the DVC Studio. But on the other hand it might complicate the get started experience?
I don't think having multiple storylines in the same repository makes it difficult to understand. We'll write about certain use cases in certain tutorials and won't mention the rest. Having a single repository with the well-defined dataset(s) will reduce the burden of telling our setup/data/source code every time, e.g., if a reader starts from the experiments tutorial but doesn't know how to init the DVC repository, we can put a link there to check that particular step in the docs.
BTW today is `20210428` that I used as a seed in `params.yaml`. The current project has `20170428` and I think it's an anniversary 🥳 🎉
@dberenbaum @shcheklein @flippedcoder @dmpetrov @jorgeorpinel
The latest `generate.bash` version generates the repository and creates a push script. I've used it to generate https://github.com/iterative/dvc-get-started (that doesn't have `-mnist` as a suffix.)
That's certainly possible. I think we can have multiple branches and multiple tags in each of them, like
* `main` branch same as the above
* `experiments` branch ➡️ `01-exp-run` ➡️ `02-exp-apply` ➡️ `03-exp-branch` ...
* `checkpoints` branch ➡️ `01-checkpoint_true` ➡️ `02-make-checkpoint` ➡️ `03-signal-file` ...
* `data-access` branch ➡️ `01-dvc-get` ➡️ `02-dvc-import-url` ➡️ `03-dvc-import` ...
* `data-versioning` branch ➡️ `01-dvc-add` ➡️ `02-dvc-push` ....
* `cml` branch ...
I like the idea of these different scenarios as different ways to get started, regardless of whether they are in the same repo or not.
For checkpoints, I don't think we need all these branches for getting started. I'd vote to pick one workflow (probably `make_checkpoint` or `live`).
After yesterday's meeting and @dmpetrov 's comments, I thought of some more interesting projects for getting started:
We can have a transfer learning project that shows how to download models, modify them and experiment on them, e.g., a project that downloads VGG-16 and reuses it for face recognition in a personal photo album might be more interesting for the general reader as well. A Model Access & Experimentation tutorial can be written with this. Another pipelines/dependencies tutorial in which the user adds images to a directory and runs the tool for image classification might also look good.
It's possible to use GANs in checkpoints tutorial. GANs can be used to generate images that look like Van Gogh (or Kandinsky or whom they like) paintings. The user can compare the resulting pictures instead of a bunch of metrics.
An OCR pipeline that creates a searchable index of all the images. You save all the images in a directory and the pipeline indexes them after running OCR. A similar voice recognition pipeline for audio/podcast files may be interesting as well.
A video classifier that downloads videos from YouTube (or some public resource with permissible license or Elle's videos) as a dependency, splits into I-frames, and builds some index, using speech recognition and OCR. (On second thought, this could become a product on its own but let's leave it as an exercise. 😄 )
There can be a Data Science project with semi-structured data as well, like analyzing website statistics continuously or analyzing/predicting cryptocurrency prices. CML for BTC has a sales pitch these days.
All of these can be provided in Docker containers to try and can employ different DL/ML libraries as well. (I suppose we won't try to run them on Katacoda. :) )
Some of these take less time for me. I did face recognition, video OCR, handwriting/OCR before but all of them are doable.
@shcheklein @dberenbaum @jorgeorpinel @flippedcoder
These are great ideas for projects! My take is that a fairly straightforward project with which most users are already familiar (like MNIST) works best for getting started, so that people can start using dvc as quickly as possible. However, more interesting projects are way better for other docs like in-depth tutorials, blog posts, etc.
Hi. This discussion is getting long 🙂 Once it seems mostly resolved could we update the issue description with a summary of decisions? For now I added a subtask (to review docs examples using the resulting example repo).
My take is that a fairly straightforward project with which most users are already familiar (like MNIST) works best for getting started, so that people can start using dvc as quickly as possible.
This was what we discussed with @shcheklein too. I'm just thinking out loud some possibilities. 💭 I'll stick to the simpler project for the introduction, then maybe we can write on some more interesting use cases. I agree. @dberenbaum
Once it seems mostly resolved could we update the issue description with a summary of decisions? For now I added a subtask (to review docs examples using the resulting example repo).
Thank you @jorgeorpinel I'm keeping it to sync the progress of the new GS project. For the longer run, it may be better to have a discussion in the docs repository. I'll close this after we decide on `checkpoints`' status.
I added checkpoints features to the checkpoints branch. The tags are created sequentially but don't have to be. I can remove the `checkpoints` branch and have only `checkpoints-` tags as well. The `base` branch is supposed to be the basis of all other branches but I decided to use `main-6-evaluate` for `checkpoints` for a cleaner codebase. We can use it for future branches or merge it to `main`.
You can review the resulting repository at https://github.com/iterative/dvc-get-started
My preliminary tests are ok but I want to create docker containers for each of these tags and test thoroughly while writing READMEs. I'll use these containers also in the new Katacoda checkpoints scenario.
Please let me know if a Fashion-MNIST based experimentation/checkpoints branch is more favorable. I'll add it also to the dataset registry and use it for checkpoints.
Also, any comments about the naming of tags are welcome.
Thank you.
@jorgeorpinel @dberenbaum @shcheklein @flippedcoder
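I see it has both a `main` (default) branch, and a `base` branch.
1. Why not leave the default branch name as `master`? As per standard usage
2. On first glance I'm not getting the difference between the default one and `base`. Isn't the default branch the "base" of other ones too? Maybe rename `base` to `basic` or `common` or make it a tag (e.g. `0-basic`)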
🙏🏼 @jorgeorpinel
1. Why not leave the default branch name as `master`? As per standard usage
Github changed the default branch name to `main` for some time. I don't know if anyone really finds the name of `master` branch offensive but I'm not from those parts of the world. My other repositories are all default to `master` but this is an easy change. We can discuss this in #32 with @dberenbaum and @shcheklein .
On first glance I'm not getting the difference between the default one and base. Isn't the default branch the "base" of other ones too?
You're right. Initially `main` and `checkpoints` were both using `base`. I changed this in 8d2c31dcb6 as:
We can even have a single branch now. Your recommendation is also fine and we already have `base-1-dvc-init` and `base-0-git-init` tags. I can move these (including the `checkpoints` or not) under a single branch now.
Github changed the default branch name to main for some time.
Interesting. I haven't created repos on GH in a long time then! Git's default is still `master` though. But OK whatever is the default in the iterative org should be good, then (p.s. we can change that default).
Agree on using the `base` tags (instead of a branch).
@iesahin it looks good to me!
A few questions/comments:
* `metrics` in `params` looks a bit weird and feels unnecessary, wdyt? If we remove it- we can simplify the code, params files, etc.
* `dvc-get-started-mnist` - could we get rid of that one?

Bonus things:

* `main` branch - let's adhere to the current GH defaults, it makes sense and we want to care about those folks for whom this topic is important.
* `metrics` in `params` looks a bit weird and feels unnecessary, wdyt? If we remove it- we can simplify the code, params files, etc.
Hmm, I couldn't decide what to put to metrics list there and put whatever seems relevant. Can I keep all metrics there?
it makes sense to do apply and do a commit in one of the branches with improved results and final metrics. In this case it's even good to have branches. But ideally it should be a bit more realistic - we complicate params file in the branch, we add checkpoints, we add live, we improve metrics and capture live metrics, we do commit. It will look way better in the Studio.
I can modify `src/` and `params.yaml` per branch/tag. I tried to make modifications minimally across the tags but we can have completely different params/pipelines/source in these. I need to wrap my mind around to make features more visible.
writing actual Get Started will drive certain requirements, I think we can start drafting it?
First, I would like to update Katacoda scenarios with `dvc-get-started`. Then I'll update the relevant docs one-by-one, or draft new GS docs.
Our requirements will always change, I can adapt the project to all.
can we make it look realistic - spread commits in time, use two people for different forks
You want me to write a screenplay with Git history? 😆
Would you prefer fictional characters like `Raul Owl <raul.owl@dvc.org>` and `Laurie Owlet <laurie.owlet@dvc.org>`, or not so fictional names? 😄
I'll start the history of the main branch from DVC 1.0 release date and the checkpoints from DVC 2.0.
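Spreading commits in time under different authors can be done with Git's standard environment variables. A minimal sketch, assuming the generation script shells out to `git`: `GIT_AUTHOR_DATE`, `GIT_COMMITTER_DATE`, and the author/committer name and email variables are real Git variables, while the helper names below are hypothetical and the dates are left to the caller.

```python
import os
import subprocess
from datetime import datetime
from email.utils import format_datetime
from typing import Dict


def commit_env(name: str, email: str, when: datetime) -> Dict[str, str]:
    """Build an environment that makes `git commit` record this identity and date.

    `when` should be timezone-aware; it is rendered in RFC 2822 format,
    which Git accepts for its date variables.
    """
    stamp = format_datetime(when)
    env = dict(os.environ)
    env.update({
        "GIT_AUTHOR_NAME": name,
        "GIT_AUTHOR_EMAIL": email,
        "GIT_AUTHOR_DATE": stamp,
        "GIT_COMMITTER_NAME": name,
        "GIT_COMMITTER_EMAIL": email,
        "GIT_COMMITTER_DATE": stamp,
    })
    return env


def backdated_commit(repo_dir: str, message: str, env: Dict[str, str]) -> None:
    """Commit the staged changes in `repo_dir` using the prepared environment."""
    subprocess.run(["git", "commit", "-m", message],
                   cwd=repo_dir, env=env, check=True)
```

Setting both the author and committer variables keeps the two dates consistent, so hosting services that display committer dates show the same timeline as `git log`.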
can we capture more plots - e.g. confusion matrix, we have a template for this
I put TP/FP/TN etc. metrics because of this. I'll take care of this.
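The link between TP/FP/TN-style metrics and a confusion matrix can be made concrete with a short sketch. It is not the project's `evaluate.py`: the function names are hypothetical, and it uses plain Python lists where the real code would more likely rely on NumPy or Keras metrics.

```python
from typing import Dict, List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     n_classes: int) -> List[List[int]]:
    """Return matrix[t][p]: number of samples with true class t predicted as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_counts(matrix: List[List[int]], cls: int) -> Dict[str, int]:
    """Derive one-vs-rest TP/FP/FN/TN counts for `cls` from the matrix."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[cls][cls]
    fn = sum(matrix[cls]) - tp                 # true cls, predicted as another class
    fp = sum(row[cls] for row in matrix) - tp  # predicted cls, true class differs
    tn = total - tp - fn - fp
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}
```

Because the per-class counts can be derived from the matrix, writing out the raw true/predicted labels is enough to feed both the scalar metrics and a confusion-matrix plot.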
Hmm, I couldn't decide what to put to metrics list there and put whatever seems relevant. Can I keep all metrics there?
yep, there are not many of them for now, should be fine to keep all of them
First, I would like to update Katacoda scenarios with dvc-get-started. Then I'll update the relevant docs one-by-one, or draft new GS docs.
👍
Looks like this is outdated.
TODO
* `experiments` branch for various experiments (#38) (Done as a separate repository)
* `studio` branch to be used as a studio showcase
* `main` to `pipeline` branch (#39) (Done as a separate repository)
* `main` to contain only README (#40) (Cancelled)
* `dvc-get-started-mnist`
Older Proposal Below.
We have a new Get Started project aimed towards experimentation features in DVC 2.0.
A step-by-step approach similar to the current example-get-started project is necessary for exposition. Creating the project from development sources to provide a clear history, without manual intervention also provides easier maintenance.
Tags
* `0-git-init`: Empty Git repository initialized.
* `1-dvc-init`: DVC has been initialized. `.dvc/` with the cache directory created.
* `2-import-data`: Raw data files `data/raw/` downloaded and tracked with DVC using `dvc import-url` from the dataset-registry. First `.dvc` file created.
* `3-config-remote`: Remote HTTP storage initialized. It's a shared read only storage that contains all data artifacts produced during next steps.
* `4-source-code`: Source code copied from the generation repository.
* `5-prepare-stage`: Create `dvc.yaml` and the first pipeline stage with `dvc stage add`. It converts MNIST data from IDX format to NumPy format and stores in `data/prepared/`.
* `6-preprocess-stage`: Create a new stage named `preprocess` to convert the data files produced in the previous stage into a format ready to supply to Keras.
* `7-train-stage`: Create a new stage that depends on the data files produced in `preprocess` and `model.py` files. The model file provides the model according to the parameters in `params.yaml`. This stage also produces `train.log.csv` file to track the training performance metrics.
* `8-evaluation`: Evaluation stage. Runs the whole pipeline to produce `metrics.json` file.
* `9-cnn-model`: Up until this stage, the model we use is MLP with a single hidden layer. This stage demonstrates the use of `dvc exp run` to create more fine grained experiments and implicit checkpoints at the end of `dvc exp run`. This corresponds to `basic` branch of `dvc-checkpoints-mnist`.
* `10-make-checkpoint`: Updates `evaluate.py` with calls to `make_checkpoint` to show the Python API usage. This corresponds to `make_checkpoint` branch of `dvc-checkpoints-mnist`.
* `11-dvclive`: Adds DVClive features to `train.py` to demonstrate their usage. This corresponds to `full_pipeline` branch of `dvc-checkpoints-mnist`.
* `12-signal-file`: Uses `signal-file` features in `evaluate.py` to show the language-agnostic features checkpoints. This corresponds to `signal_file` branch of `dvc-checkpoints-mnist`.

Discussion Points
* I'll defer checkpoints' implementation until we decide on a path to merge or keep the checkpoints project separate.
* For the tags 9-12, we can have renamed files within the same repository to make comparisons, e.g., in `10-make-checkpoint`, we can have a `evaluate-make-checkpoint.py` file that's similar to `evaluate.py` except the checkpoint usage.
* We can put `0-8` to the `main` branch and convert other tags to branches as there is no clear progress between these.
* There is no `dvc add` step in this setup unlike in `example-get-started`. It's possible to use Fashion-MNIST for this. It can be downloaded manually and put inside `data/` to update the pipeline. It's possible to parameterize `prepare.py` with the path of the dataset or we can demonstrate how to replace the dataset by overwriting it. Or we can do vice versa, `dvc add` MNIST in step 2 and have a `dvc import-url` later as step 9 to update the dataset with Fashion-MNIST.
* The script will use `dvc stage add` instead of `dvc run` for all stages. `dvc exp run` will be favored to `dvc repro` also.
* The source code and related files will be copied to this repository instead of downloading from S3. I don't think of a use case to put the files to S3 instead of a Github repository. Why was so?
* The new project repository will be `https://github.com/iterative/dvc-get-started`.
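The `dvc stage add` point above can be illustrated with a small helper that assembles the command line for one stage. This is a sketch under assumptions: `-n`, `-d`, `-o`, and `-p` are existing `dvc stage add` options, but the helper names are hypothetical and the stage name, paths, and parameter key passed in are up to the caller; none of this is taken from the actual `generate.bash`.

```python
import subprocess
from typing import List, Sequence


def stage_add_command(name: str, command: str,
                      deps: Sequence[str] = (),
                      outs: Sequence[str] = (),
                      params: Sequence[str] = ()) -> List[str]:
    """Build the argument list for `dvc stage add` without running it."""
    args = ["dvc", "stage", "add", "-n", name]
    for param in params:
        args += ["-p", param]
    for dep in deps:
        args += ["-d", dep]
    for out in outs:
        args += ["-o", out]
    # The trailing argument is the command DVC stores for the stage.
    args.append(command)
    return args


def add_stage(repo_dir: str, args: List[str]) -> None:
    """Run the prepared command inside the repository (requires DVC installed)."""
    subprocess.run(args, cwd=repo_dir, check=True)
```

Building the argument list separately from running it makes the generated commands easy to print or check in CI before they touch a repository. After the stages are defined this way, the pipeline would be executed with `dvc exp run`, as the discussion point states.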