get-started-X: Write `generate` scripts for each project

iesahin commented 3 years ago

TODO

[x] Add an experiments branch for various experiments (#38) (Done as a separate repository)
(CANCEL) Add a studio branch to be used as a studio showcase
[x] Move current main to pipeline branch (#39) (Done as a separate repository)
[x] Add main to contain only README (#40) (Cancelled)
[x] Create Docker containers for each tag
[x] Update Katacoda GS:Experiments scenario and get rid of dvc-get-started-mnist
[x] Remove model metrics from params
(MOVE) Keep only the relevant parameters for each repository
[x] Create two fictional people for each branch, spreading the commits across time
[x] Add additional plots like confusion matrix
[ ] Review dvc.org docs examples (besides Get Started) that could use the resulting example repo. Mainly the Experiments docs

Older Proposal Below.

We have a new Get Started project aimed towards experimentation features in DVC 2.0.

A step-by-step approach similar to the current example-get-started project is necessary for exposition. Creating the project from development sources, without manual intervention, gives a clear history and also makes maintenance easier.

Discussion Points

I'll defer checkpoints' implementation until we decide on a path to merge or keep the checkpoints project separate.
For the tags 9-12, we can have renamed files within the same repository to make comparisons, e.g., in 10-make-checkpoint, we can have an evaluate-make-checkpoint.py file that's similar to evaluate.py except for the checkpoint usage.
We can put 0-8 to the main branch and convert other tags to branches as there is no clear progress between these.
There is no dvc add step in this setup unlike in example-get-started. It's possible to use Fashion-MNIST for this. It can be downloaded manually and put inside data/ to update the pipeline. It's possible to parameterize prepare.py with the path of the dataset or we can demonstrate how to replace the dataset by overwriting it. Or we can do vice versa, dvc add MNIST in step 2 and have a dvc import-url later as step 9 to update the dataset with Fashion-MNIST.
The script will use dvc stage add instead of dvc run for all stages. Likewise, dvc exp run will be favored over dvc repro.
The source code and related files will be copied to this repository instead of being downloaded from S3. I can't think of a use case for putting the files on S3 instead of in a GitHub repository. Why was it done that way?
The new project repository will be https://github.com/iterative/dvc-get-started.
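
As a rough illustration of the `dvc stage add` point above, a generate script could assemble its stage definitions along these lines. This is only a sketch: the stage name, file paths, parameter key, and the `stage_add_argv` helper are hypothetical and not taken from the actual generate script. It builds the argument list without invoking DVC.

```python
from typing import List, Sequence


def stage_add_argv(
    name: str,
    command: str,
    deps: Sequence[str] = (),
    outs: Sequence[str] = (),
    params: Sequence[str] = (),
) -> List[str]:
    """Build the argv for `dvc stage add`, which defines a stage without running it."""
    argv = ["dvc", "stage", "add", "-n", name]
    for dep in deps:
        argv += ["-d", dep]
    for out in outs:
        argv += ["-o", out]
    for param in params:
        argv += ["-p", param]
    # The stage command goes last, after all options.
    argv += command.split()
    return argv


# Hypothetical stage; every name and path below is a placeholder.
prepare = stage_add_argv(
    name="prepare",
    command="python src/prepare.py",
    deps=["src/prepare.py", "data/raw"],
    outs=["data/prepared"],
    params=["prepare.seed"],
)
print(" ".join(prepare))
```

Each list could then be handed to `subprocess.run(argv, check=True)`, with `dvc exp run` executing the pipeline afterwards.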

dberenbaum commented 3 years ago

Thanks for the clear summary! A few quick thoughts:

I'm a little unclear on the point of 9-cnn-model. Why convert to a CNN other than to match the existing dvc-checkpoints-mnist implementation?
Is there an equivalent to what is now in https://dvc.org/doc/start/experiments?
You mention there's no clear order in 9-12, but I wonder if we should have some order here. I could see an order like 9->12->10->11, where the tutorial first adds checkpoints: true to dvc.yaml, then adds checkpoints to the script in the most manual and language-agnostic way, then consolidates to use make_checkpoint, and then further consolidates/enhances to use dvclive.

iesahin commented 3 years ago

I'm a little unclear on the point of 9-cnn-model. Why convert to a CNN other than to match the existing dvc-checkpoints-mnist implementation?

The first 8 tags use an MLP model for quick iteration. It's good for demonstrating data access & versioning, pipelines, and params, but not for checkpoints, I think. My intention was to provide a very simple model that we can run on Katacoda, and actually both MLP and CNN run on it. The CNN model is also written in Tensorflow and fits the rest of the pipeline. The change from MLP to CNN needs a single parameter change in params.yaml. We can add more models in this fashion to the models.py file.
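
To make the single-parameter switch concrete, here is a minimal sketch of how a models.py could dispatch on a value read from params.yaml. It is an assumption about the shape of the code, not the project's actual file: the `model.name` key, the builder functions, and the layer strings are all hypothetical, and plain lists stand in for the real Tensorflow models so the example runs without it.

```python
from typing import Callable, Dict, List


def build_mlp(options: dict) -> List[str]:
    """Placeholder for the MLP; returns layer descriptions instead of a real model."""
    units = options.get("units", 16)
    return ["flatten", f"dense({units})", "dense(10)"]


def build_cnn(options: dict) -> List[str]:
    """Placeholder for the CNN; returns layer descriptions instead of a real model."""
    filters = options.get("filters", 8)
    return [f"conv2d({filters})", "maxpool", "flatten", "dense(10)"]


# Adding another model means adding one function and one entry here.
MODEL_BUILDERS: Dict[str, Callable[[dict], List[str]]] = {
    "mlp": build_mlp,
    "cnn": build_cnn,
}


def get_model(params: dict) -> List[str]:
    """Pick the builder named by the (hypothetical) `model.name` parameter."""
    model_params = params["model"]
    name = model_params["name"]
    if name not in MODEL_BUILDERS:
        raise ValueError(f"unknown model: {name!r}")
    return MODEL_BUILDERS[name](model_params.get(name, {}))


# Stands in for the parsed contents of params.yaml.
params = {"model": {"name": "cnn", "mlp": {"units": 16}, "cnn": {"filters": 8}}}
print(get_model(params))
```

With this layout, changing `name` from `mlp` to `cnn` is the only edit needed to swap models.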

Is there an equivalent to what is now in https://dvc.org/doc/start/experiments?

Our Katacoda Experiments scenario uses the new project. It's not as extensive as the document but there is enough room to play with the parameter values for experimentation, so yes, I can update the document quickly.

You mention there's no clear order in 9-12, but I wonder if we should have some order here. I could see an order like 9->12->10->11, where the tutorial first adds checkpoints: true to dvc.yaml, then adds checkpoints to the script in the most manual and language-agnostic way, then consolidates to use make_checkpoint, and then further consolidates/enhances to use dvclive.

It's possible to change the tags to reflect this order. I think I see the language agnostic way as a last resort, so I put it to the end. If we decide to merge checkpoints and this project, I can use this order. (BTW, I can create a similar R or Julia (or any other language) project for the signal file demo.)

I think these approaches are alternative to each other and if the user decides on one of them, they will stick to it. Branches like you used in the checkpoints project may make more sense. After we get the full pipeline, we either use checkpoint: true, or make_checkpoint() or a signal file, not all of them.
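
For readers unfamiliar with the signal file option, the sketch below shows the general idea in Python, although the same steps apply in any language. It rests on an assumption I have not re-verified against the DVC 2.0 docs: that the script signals a checkpoint by creating `.dvc/tmp/DVC_CHECKPOINT` under the repository root and that DVC deletes the file once the checkpoint is recorded. The `signal_checkpoint` helper and its parameters are hypothetical.

```python
import time
from pathlib import Path


def signal_checkpoint(repo_root: str, timeout: float = 5.0, poll: float = 0.1) -> bool:
    """Create the signal file and wait until it disappears.

    Returns True if the file was removed within `timeout` seconds
    (assumed to mean the checkpoint was recorded), False otherwise.
    """
    # Assumed location of the signal file, relative to the repository root.
    signal_file = Path(repo_root) / ".dvc" / "tmp" / "DVC_CHECKPOINT"
    signal_file.parent.mkdir(parents=True, exist_ok=True)
    signal_file.touch()

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not signal_file.exists():
            return True
        time.sleep(poll)
    return False
```

A training loop would call `signal_checkpoint(".")` after saving the model at the end of an epoch, which is the step that `make_checkpoint()` wraps for Python users.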

Thank you @dberenbaum

shcheklein commented 3 years ago

Good stuff @iesahin !

Quick question - does it make sense to reverse the order and start with a simple project, single script (or even Jupyter?) and go into pipelines after that? To highlight the experiments first and "grow" into more advanced stuff? (It would conflict with branching though - we would need to keep applying the pipelines changes to every branch?)

Some comments:

The source code and related files will be copied to this repository instead of being downloaded from S3. I can't think of a use case for putting the files on S3 instead of in a GitHub repository. Why was it done that way?

since we start with git init in the get started flow, we needed a way to quickly "download" files

The new project repository will be https://github.com/iterative/dvc-get-started

is this project enough to completely replace the existing get started? then it makes sense. Otherwise we can rename both - put some prefix/suffix to both?

We can put 0-8 to the main branch and convert other tags to branches as there is no clear progress between these.

if we can do meaningful branches this is even better, we'll have a good features coverage for the DVC Studio. But on the other hand it might complicate the get started experience?

@dberenbaum @flippedcoder @dmpetrov what's your take so far - can this project and this approach replace/cover the existing mnist and be enough to cover the experiments + checkpoints.

iesahin commented 3 years ago

does it make sense to reverse the order and start with a simple project, single script (or even Jupyter?) and go into pipelines after that? To highlight the experiments first and "grow" into more advanced stuff? (It would conflict with branching though - we would need to keep applying the pipelines changes to every branch?)

That's certainly possible. I think we can have multiple branches and multiple tags in each of them, like

main branch same as the above
experiments branch ➡️ 01-exp-run ➡️ 02-exp-apply ➡️ 03-exp-branch ...
checkpoints branch ➡️ 01-checkpoint_true ➡️ 02-make-checkpoint ➡️ 03-signal-file ...
data-access branch ➡️ 01-dvc-get ➡️ 02-dvc-import-url ➡️ 03-dvc-import ...
data-versioning branch ➡️ 01-dvc-add ➡️ 02-dvc-push ....
cml branch ...

(experiments and checkpoints can be merged IMO, but that's not a strong opinion.)

I set the tags similar to the current GS project because it's easier to discuss and after that, we can use the same material for further improvements. This is basically v0, so it seems a better approach to use something we're used to.

It will be a bit more difficult but we can have multiple generate scripts for each branch, using the same source and data with different DVC commands.

since we start with git init in the get started flow, we needed a way to quickly "download" files

I mean, they are already within code/ directory. Why don't we just copy them? :)

is this project enough to completely replace the existing get started? then it makes sense. Otherwise we can rename both - put some prefix/suffix to both?

0-8 is ready now. I'll test the whole GS material with this next week, and also write the modifications for the checkpoints. Whether to actually replace the current docs with this is a matter of discussion and management. I'm in favor of replacing: I can write a fully tested GS documentation and put these into CI to check the command changes, but you may think it's not worth it, and I don't feel strongly about this.

I'm putting dvc- prefix to all DVC-related repositories, because iterative has other projects not directly related with DVC but I'm completely open to suggestions and changes in that regard. Once decided, it won't change much.

if we can do meaningful branches this is even better, we'll have a good features coverage for the DVC Studio. But on the other hand it might complicate the get started experience?

I don't think having multiple storylines in the same repository makes it difficult to understand. We'll write about certain use cases in certain tutorials and won't mention the rest. Having a single repository with the well-defined dataset(s) will reduce the burden of telling our setup/data/source code every time, e.g., if a reader starts from the experiments tutorial but doesn't know how to init the DVC repository, we can put a link there to check that particular step in the docs.

BTW today is 20210428 that I used as a seed in params.yaml. The current project has 20170428 and I think it's an anniversary 🥳 🎉

@dberenbaum @shcheklein @flippedcoder @dmpetrov @jorgeorpinel

iesahin commented 3 years ago

The latest generate.bash version generates the repository and creates a push script. I've used it to generate https://github.com/iterative/dvc-get-started (that doesn't have -mnist as a suffix.)

dberenbaum commented 3 years ago

That's certainly possible. I think we can have multiple branches and multiple tags in each of them, like

* `main` branch same as the above

* `experiments` branch ➡️ `01-exp-run` ➡️ `02-exp-apply` ➡️ `03-exp-branch` ...

* `checkpoints` branch ➡️ `01-checkpoint_true` ➡️ `02-make-checkpoint` ➡️ `03-signal-file` ...

* `data-access` branch ➡️ `01-dvc-get` ➡️ `02-dvc-import-url` ➡️ `03-dvc-import` ...

* `data-versioning` branch ➡️ `01-dvc-add` ➡️ `02-dvc-push` ....

* `cml` branch ...

I like the idea of these different scenarios as different ways to get started, regardless of whether they are in the same repo or not.

For checkpoints, I don't think we need all these branches for getting started. I'd vote to pick one workflow (probably make_checkpoint or live).

iesahin commented 3 years ago

After yesterday's meeting and @dmpetrov 's comments, I thought of some more interesting projects for getting started:

We can have a transfer learning project that shows how to download models, modify them and experiment on them, e.g., a project that downloads VGG-16 and reuses it for face recognition in a personal photo album might be more interesting for the general reader as well. A Model Access & Experimentation tutorial can be written with this. Another pipelines/dependencies tutorial in which the user adds images to a directory and runs the tool for image classification might also look good.
It's possible to use GANs in checkpoints tutorial. GANs can be used to generate images that look like Van Gogh (or Kandinsky or whom they like) paintings. The user can compare the resulting pictures instead of a bunch of metrics.
An OCR pipeline that creates a searchable index of all the images. You save all the images in a directory and the pipeline indexes them after running OCR. A similar voice recognition pipeline for audio/podcast files may be interesting as well.
A video classifier that downloads videos from YouTube (or some public resource with permissible license or Elle's videos) as a dependency, splits into I-frames, and builds some index, using speech recognition and OCR. (On second thought, this could become a product on its own but let's leave it as an exercise. 😄 )
There can be a Data Science project with semi-structured data as well, like analyzing website statistics continuously or analyzing/predicting cryptocurrency prices. CML for BTC has a sales pitch these days.

All of these can be provided in Docker containers to try and can employ different DL/ML libraries as well. (I suppose we won't try to run them on Katacoda. :) )

Some of these take less time for me. I did face recognition, video OCR, handwriting/OCR before but all of them are doable.

@shcheklein @dberenbaum @jorgeorpinel @flippedcoder

dberenbaum commented 3 years ago

These are great ideas for projects! My take is that a fairly straightforward project with which most users are already familiar (like MNIST) works best for getting started, so that people can start using dvc as quickly as possible. However, more interesting projects are way better for other docs like in-depth tutorials, blog posts, etc.

jorgeorpinel commented 3 years ago

Hi. This discussion is getting long 🙂 Once it seems mostly resolved could we update the issue description with a summary of decisions? For now I added a subtask (to review docs examples using the resulting example repo).

iesahin commented 3 years ago

My take is that a fairly straightforward project with which most users are already familiar (like MNIST) works best for getting started, so that people can start using dvc as quickly as possible.

This was what we discussed with @shcheklein too. I'm just thinking out loud about some possibilities. 💭 I'll stick to the simpler project for the introduction, then maybe we can write on some more interesting use cases. I agree. @dberenbaum

Once it seems mostly resolved could we update the issue description with a summary of decisions? For now I added a subtask (to review docs examples using the resulting example repo).

Thank you @jorgeorpinel I'm keeping it to sync the progress of the new GS project. For the longer run, it may be better to have a discussion in the docs repository. I'll close this after we decide on checkpoints' status.

iesahin commented 3 years ago

I added checkpoints features to the checkpoints branch. The tags are created sequentially but don't have to be. I can remove the checkpoints branch and have only checkpoints- tags as well. The base branch is supposed to be the basis of all other branches but I decided to use main-6-evaluate for checkpoints for a cleaner codebase. We can use it for future branches or merge it to main.

You can review the resulting repository at https://github.com/iterative/dvc-get-started

My preliminary tests are ok but I want to create docker containers for each of these tags and test thoroughly while writing READMEs. I'll use these containers also in the new Katacoda checkpoints scenario.

Please let me know if a Fashion-MNIST based experimentation/checkpoints branch is more favorable. I'll add it also to the dataset registry and use it for checkpoints.

Also, any comments about the naming of tags are welcome.

Thank you.

@jorgeorpinel @dberenbaum @shcheklein @flippedcoder

jorgeorpinel commented 3 years ago

I see it has both a main (default) branch, and a base branch.

Why not leave the default branch name as master? As per standard usage
On first glance I'm not getting the difference between the default one and base. Isn't the default branch the "base" of other ones too?

Maybe rename base to basic or common or make it a tag (e.g. 0-basic)

iesahin commented 3 years ago

🙏🏼 @jorgeorpinel

1. Why not leave the default branch name as `master`? As per standard usage

GitHub changed the default branch name to main some time ago. I don't know if anyone really finds the name of the master branch offensive but I'm not from those parts of the world. My other repositories all default to master but this is an easy change. We can discuss this in #32 with @dberenbaum and @shcheklein .

On first glance I'm not getting the difference between the default one and base. Isn't the default branch the "base" of other ones too?

You're right. Initially main and checkpoints were both using base. I changed this in 8d2c31dcb6 as:

[Screenshot: Screen Shot 2021-05-04 at 19 15 46]

We can even have a single branch now. Your recommendation is also fine and we already have base-1-dvc-init and base-0-git-init tags. I can move these (including the checkpoints or not) under a single branch now.

jorgeorpinel commented 3 years ago

GitHub changed the default branch name to main some time ago.

Interesting. I haven't created repos on GH in a long time then! Git's default is still master though. But OK whatever is the default in the iterative org should be good, then (p.s. we can change that default).

Agree on using the base tags (instead of a branch).

shcheklein commented 3 years ago

@iesahin it looks good to me!

A few questions/comments:

metrics in params looks a bit weird and feels unnecessary, wdyt? If we remove it, we can simplify the code, params files, etc.
dvc-get-started-mnist: could we get rid of that one?
it makes sense to do apply and do a commit in one of the branches with improved results and final metrics. In this case it's even good to have branches. But ideally it should be a bit more realistic - we complicate the params file in the branch, we add checkpoints, we add live, we improve metrics and capture live metrics, we do a commit. It will look way better in the Studio.
writing actual Get Started will drive certain requirements, I think we can start drafting it?

Bonus things:

can we make it look realistic - spread commits in time, use two people for different forks
can we capture more plots - e.g. confusion matrix, we have a template for this

main branch - let's adhere to the current GH defaults; it makes sense, and we want to show care for the people to whom this topic is important.

iesahin commented 3 years ago

* `metrics` in `params` looks a bit weird and feels unnecessary, wdyt? If we remove it, we can simplify the code, params files, etc.

Hmm, I couldn't decide what to put in the metrics list there, so I put whatever seemed relevant. Can I keep all metrics there?

it makes sense to do apply and do a commit in one of the branches with improved results and final metrics. In this case it's even good to have branches. But ideally it should be a bit more realistic - we complicate the params file in the branch, we add checkpoints, we add live, we improve metrics and capture live metrics, we do a commit. It will look way better in the Studio.

I can modify src/ and params.yaml per branch/tag. I tried to keep modifications minimal across the tags but we can have completely different params/pipelines/source in these. I need to wrap my mind around how to make features more visible.

writing actual Get Started will drive certain requirements, I think we can start drafting it?

First, I would like to update Katacoda scenarios with dvc-get-started. Then I'll update the relevant docs one-by-one, or draft new GS docs.

Our requirements will always change, I can adapt the project to all.

can we make it look realistic - spread commits in time, use two people for different forks

You want me to write a screenplay with Git history? 😆

Would you prefer fictional characters like Raul Owl <raul.owl@dvc.org> and Laurie Owlet <laurie.owlet@dvc.org>, or not so fictional names? 😄

I'll start the history of the main branch from DVC 1.0 release date and the checkpoints from DVC 2.0.
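
One way to script such a history is to set Git's author and committer environment variables for each commit. The sketch below only prepares the environments and does not run Git. The two names come from the comment above, while the start date, spacing, commit messages, and helper names are arbitrary placeholders rather than the values the generate script actually uses.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Sequence, Tuple

Author = Tuple[str, str]

AUTHORS: List[Author] = [
    ("Raul Owl", "raul.owl@dvc.org"),
    ("Laurie Owlet", "laurie.owlet@dvc.org"),
]


def commit_env(author: Author, when: datetime) -> Dict[str, str]:
    """Return a process environment that makes `git commit` use this author and date."""
    name, email = author
    stamp = when.isoformat()  # Git accepts ISO 8601 timestamps
    env = dict(os.environ)
    env.update(
        {
            "GIT_AUTHOR_NAME": name,
            "GIT_AUTHOR_EMAIL": email,
            "GIT_AUTHOR_DATE": stamp,
            "GIT_COMMITTER_NAME": name,
            "GIT_COMMITTER_EMAIL": email,
            "GIT_COMMITTER_DATE": stamp,
        }
    )
    return env


def plan_commits(
    messages: Sequence[str], start: datetime, step: timedelta
) -> List[Tuple[str, Dict[str, str]]]:
    """Alternate the authors and space the commits `step` apart."""
    plan = []
    for index, message in enumerate(messages):
        author = AUTHORS[index % len(AUTHORS)]
        plan.append((message, commit_env(author, start + index * step)))
    return plan


# Placeholder start date and messages, chosen only for the example.
START = datetime(2021, 1, 4, 10, 0, tzinfo=timezone.utc)
plan = plan_commits(
    ["Initialize Git", "Initialize DVC", "Add pipeline"], START, timedelta(days=3)
)
for message, env in plan:
    print(env["GIT_AUTHOR_DATE"], env["GIT_AUTHOR_NAME"], message)
```

Each entry could then be used as `subprocess.run(["git", "commit", "-m", message], env=env, check=True)`.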

can we capture more plots - e.g. confusion matrix, we have a template for this

I put TP/FP/TN etc. metrics because of this. I'll take care of this.
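
A confusion matrix plot needs per-sample labels rather than aggregate TP/FP counts, so the evaluate stage would have to write them out. The following is a minimal sketch under that assumption. The `actual` and `predicted` column names, the sample labels, and both helper functions are hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io
from collections import Counter
from typing import Sequence, TextIO


def write_predictions(
    actual: Sequence[int], predicted: Sequence[int], stream: TextIO
) -> None:
    """Write one row per sample with its true and predicted label."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["actual", "predicted"])
    writer.writerows(zip(actual, predicted))


def confusion_counts(actual: Sequence[int], predicted: Sequence[int]) -> Counter:
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))


# Made-up labels for illustration only.
actual = [3, 5, 5]
predicted = [3, 3, 5]

buffer = io.StringIO()
write_predictions(actual, predicted, buffer)
print(buffer.getvalue())
print(confusion_counts(actual, predicted))
```

As far as I know, a CSV in this shape can be rendered with the confusion template by naming the two columns as the x and y fields of `dvc plots show`.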

shcheklein commented 3 years ago

Hmm, I couldn't decide what to put in the metrics list there, so I put whatever seemed relevant. Can I keep all metrics there?

yep, there are not many of them for now, should be fine to keep all of them

First, I would like to update Katacoda scenarios with dvc-get-started. Then I'll update the relevant docs one-by-one, or draft new GS docs.

👍

shcheklein commented 2 years ago

Looks like this is outdated.
