iterative / example-repos-dev

Source code and generator scripts for example DVC projects
https://dvc.org/doc

get-started-X: Write `generate` scripts for each project #34

Closed iesahin closed 2 years ago

iesahin commented 3 years ago

TODO

Older Proposal Below.

We have a new Get Started project aimed towards experimentation features in DVC 2.0.

A step-by-step approach similar to the current example-get-started project is necessary for exposition. Generating the project from development sources, without manual intervention, gives it a clear history and also makes maintenance easier.

Tags

Discussion Points

dberenbaum commented 3 years ago

Thanks for the clear summary! A few quick thoughts:

iesahin commented 3 years ago

I'm a little unclear on the point of 9-cnn-model. Why convert to a CNN other than to match the existing dvc-checkpoints-mnist implementation?

The first 8 tags use an MLP model for quick iteration. It's good for exposing data access and versioning, pipelines, and params, but not for checkpoints, I think. My intention was to provide a very simple model that we can run on Katacoda, and actually both MLP and CNN run on it. The CNN model is also written in TensorFlow and fits the rest of the pipeline. The change from MLP to CNN needs a single parameter change in params.yaml. We can add more models in this fashion to the models.py file.
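To illustrate the "single parameter change" idea, here is a minimal sketch of how a models.py could dispatch on a value read from params.yaml. It is not the project's actual code: the `MODEL_BUILDERS` registry, the `build_model` function, the `model.name` key, and the placeholder builders (which return layer descriptions instead of real TensorFlow models so the sketch runs anywhere) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


# Hypothetical stand-ins for the real TensorFlow builders in models.py.
# Each returns a list of layer descriptions so the sketch needs no TensorFlow.
def build_mlp() -> List[str]:
    return ["flatten", "dense(128, relu)", "dense(10, softmax)"]


def build_cnn() -> List[str]:
    return ["conv2d(32, 3x3, relu)", "maxpool(2x2)", "flatten", "dense(10, softmax)"]


# Registry: adding a model means adding one entry here.
MODEL_BUILDERS: Dict[str, Callable[[], List[str]]] = {
    "mlp": build_mlp,
    "cnn": build_cnn,
}


def build_model(params: dict) -> List[str]:
    """Pick a builder using the (assumed) params["model"]["name"] key."""
    name = params["model"]["name"]
    if name not in MODEL_BUILDERS:
        raise ValueError(
            f"unknown model {name!r}; expected one of {sorted(MODEL_BUILDERS)}"
        )
    return MODEL_BUILDERS[name]()


# A parsed params.yaml would look like this dict (assumed layout);
# changing "mlp" to "cnn" is the only edit needed to switch models.
params = {"model": {"name": "mlp"}}
layers = build_model(params)
```

With a registry like this, the training script never needs to change when a model is added; only the registry and params.yaml do.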

Is there an equivalent to what is now in https://dvc.org/doc/start/experiments?

Our Katacoda Experiments scenario uses the new project. It's not as extensive as the document but there is enough room to play with the parameter values for experimentation, so yes, I can update the document quickly.

You mention there's no clear order in 9-12, but I wonder if we should have some order here. I could see an order like 9->12->10->11, where the tutorial first adds checkpoint: true to dvc.yaml, then adds checkpoints to the script in the most manual and language-agnostic way, then consolidates to use make_checkpoint, and then further consolidates/enhances to use dvclive.

It's possible to change the tags to reflect this order. I see the language-agnostic way as a last resort, so I put it at the end. If we decide to merge checkpoints and this project, I can use this order. (BTW, I can create a similar R or Julia (or any other language) project for the signal file demo.)

I think these approaches are alternative to each other and if the user decides on one of them, they will stick to it. Branches like you used in the checkpoints project may make more sense. After we get the full pipeline, we either use checkpoint: true, or make_checkpoint() or a signal file, not all of them.
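For readers unfamiliar with the signal-file alternative, the mechanism can be sketched roughly as follows: the training script creates a file and waits until the tool watching it deletes that file. This is only a sketch under assumptions. The function name `signal_checkpoint` and its parameters are hypothetical, and the signal file location is passed in by the caller rather than hard-coded; as far as I recall the DVC 2.0 docs place it at `.dvc/tmp/DVC_CHECKPOINT` under the repository root, but please verify that against the docs before relying on it.

```python
import time
from pathlib import Path


def signal_checkpoint(signal_file: Path, timeout: float = 10.0, poll: float = 0.05) -> bool:
    """Create the signal file, then wait until something deletes it.

    Returns True if the file disappeared within `timeout` seconds
    (the checkpoint was picked up), False otherwise.
    """
    signal_file.parent.mkdir(parents=True, exist_ok=True)
    signal_file.touch()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not signal_file.exists():
            return True
        time.sleep(poll)
    # One last check in case the file vanished right at the deadline.
    return not signal_file.exists()


# Hypothetical use inside a training loop, after saving the model each epoch:
#   signal_checkpoint(Path(".dvc/tmp/DVC_CHECKPOINT"))   # path is an assumption
```

The same create-then-wait loop is easy to write in R, Julia, or a shell script, which is what makes this approach language-agnostic.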

Thank you @dberenbaum

shcheklein commented 3 years ago

Good stuff @iesahin !

Quick question: does it make sense to reverse the order and start with a simple project, a single script (or even Jupyter?), and go into pipelines after that? To highlight the experiments first and "grow" into more advanced stuff? (It would conflict with branching, though: we would need to keep applying the pipeline changes to every branch?)

Some comments:

The source code and related files will be copied to this repository instead of downloading from S3. I can't think of a use case for putting the files on S3 instead of in a GitHub repository. Why was it done that way?

since we start with git init in the get started flow, we needed a way to quickly "download" files

The new project repository will be https://github.com/iterative/dvc-get-started

is this project enough to completely replace the existing get started? then it makes sense. Otherwise we can rename both - put some prefix/suffix to both?

We can put 0-8 to the main branch and convert other tags to branches as there is no clear progress between these.

if we can do meaningful branches this is even better, we'll have a good features coverage for the DVC Studio. But on the other hand it might complicate the get started experience?

@dberenbaum @flippedcoder @dmpetrov what's your take so far: can this project and this approach replace/cover the existing mnist and be enough to cover the experiments + checkpoints?

iesahin commented 3 years ago

does it make sense to reverse the order and start with a simple project, a single script (or even Jupyter?), and go into pipelines after that? To highlight the experiments first and "grow" into more advanced stuff? (It would conflict with branching, though: we would need to keep applying the pipeline changes to every branch?)

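That's certainly possible. I think we can have multiple branches and multiple tags in each of them, like

* `main` branch same as the above

* `experiments` branch ➡️ `01-exp-run` ➡️ `02-exp-apply` ➡️ `03-exp-branch` ...

* `checkpoints` branch ➡️ `01-checkpoint_true` ➡️ `02-make-checkpoint` ➡️ `03-signal-file` ...

* `data-access` branch ➡️ `01-dvc-get` ➡️ `02-dvc-import-url` ➡️ `03-dvc-import` ...

* `data-versioning` branch ➡️ `01-dvc-add` ➡️ `02-dvc-push` ...

* `cml` branch ...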

(experiments and checkpoints can be merged IMO, but that's not a strong opinion.)

I set the tags similar to the current GS project because it's easier to discuss, and after that we can use the same material for further improvements. This is basically v0, so it seems a better approach to use something we are used to.

It will be a bit more difficult but we can have multiple generate scripts for each branch, using the same source and data with different DVC commands.

since we start with git init in the get started flow, we needed a way to quickly "download" files

I mean, they are already within code/ directory. Why don't we just copy them? :)

is this project enough to completely replace the existing get started? then it makes sense. Otherwise we can rename both - put some prefix/suffix to both?

0-8 is ready now. I'll test the whole GS material with this next week, and also write the modifications for the checkpoints. Whether to actually replace the current docs with this is a matter of discussion and management. I'm in favor of replacing: I can write fully tested GS documentation and put it in CI to check for command changes, but you may think it's not worth it, and I don't feel strongly about this.

I'm putting a dvc- prefix on all DVC-related repositories, because iterative has other projects not directly related to DVC, but I'm completely open to suggestions and changes in that regard. Once decided, it won't change much.

if we can do meaningful branches this is even better, we'll have a good features coverage for the DVC Studio. But on the other hand it might complicate the get started experience?

I don't think having multiple storylines in the same repository makes it difficult to understand. We'll write about certain use cases in certain tutorials and won't mention the rest. Having a single repository with well-defined dataset(s) will reduce the burden of explaining our setup/data/source code every time. For example, if a reader starts from the experiments tutorial but doesn't know how to init the DVC repository, we can put a link there to that particular step in the docs.

BTW, today is 20210428, which I used as the seed in params.yaml. The current project has 20170428 and I think it's an anniversary 🥳 🎉

@dberenbaum @shcheklein @flippedcoder @dmpetrov @jorgeorpinel

iesahin commented 3 years ago

The latest generate.bash version generates the repository and creates a push script. I've used it to generate https://github.com/iterative/dvc-get-started (that doesn't have -mnist as a suffix.)

dberenbaum commented 3 years ago

That's certainly possible. I think we can have multiple branches and multiple tags in each of them, like

* `main` branch same as the above

* `experiments` branch ➡️ `01-exp-run` ➡️ `02-exp-apply` ➡️ `03-exp-branch` ...

* `checkpoints` branch ➡️ `01-checkpoint_true` ➡️ `02-make-checkpoint` ➡️ `03-signal-file` ...

* `data-access` branch ➡️ `01-dvc-get` ➡️ `02-dvc-import-url` ➡️ `03-dvc-import` ...

* `data-versioning` branch ➡️ `01-dvc-add` ➡️ `02-dvc-push` ....

* `cml` branch ...

I like the idea of these different scenarios as different ways to get started, regardless of whether they are in the same repo or not.

For checkpoints, I don't think we need all these branches for getting started. I'd vote to pick one workflow (probably make_checkpoint or live).

iesahin commented 3 years ago

After yesterday's meeting and @dmpetrov 's comments, I thought of some more interesting projects for getting started:

All of these can be provided in Docker containers to try and can employ different DL/ML libraries as well. (I suppose we won't try to run them on Katacoda. :) )

Some of these take less time for me. I did face recognition, video OCR, handwriting/OCR before but all of them are doable.

@shcheklein @dberenbaum @jorgeorpinel @flippedcoder

dberenbaum commented 3 years ago

These are great ideas for projects! My take is that a fairly straightforward project with which most users are already familiar (like MNIST) works best for getting started, so that people can start using dvc as quickly as possible. However, more interesting projects are way better for other docs like in-depth tutorials, blog posts, etc.

jorgeorpinel commented 3 years ago

Hi. This discussion is getting long 🙂 Once it seems mostly resolved could we update the issue description with a summary of decisions? For now I added a subtask (to review docs examples using the resulting example repo).

iesahin commented 3 years ago

My take is that a fairly straightforward project with which most users are already familiar (like MNIST) works best for getting started, so that people can start using dvc as quickly as possible.

This was what we discussed with @shcheklein too. I'm just thinking out loud about some possibilities. 💭 I'll stick to the simpler project for the introduction, then maybe we can write on some more interesting use cases. I agree. @dberenbaum

Once it seems mostly resolved could we update the issue description with a summary of decisions? For now I added a subtask (to review docs examples using the resulting example repo).

Thank you @jorgeorpinel I'm keeping it to sync the progress of the new GS project. For the longer run, it may be better to have a discussion in the docs repository. I'll close this after we decide on checkpoints' status.

iesahin commented 3 years ago

I added checkpoints features to the checkpoints branch. The tags are created sequentially but don't have to be. I can remove the checkpoints branch and have only checkpoints- tags as well. The base branch is supposed to be the basis of all other branches, but I decided to use main-6-evaluate for checkpoints for a cleaner codebase. We can use it for future branches or merge it into main.

You can review the resulting repository at https://github.com/iterative/dvc-get-started

My preliminary tests are ok but I want to create docker containers for each of these tags and test thoroughly while writing READMEs. I'll use these containers also in the new Katacoda checkpoints scenario.

Please let me know if a Fashion-MNIST based experimentation/checkpoints branch is more favorable. I'll add it also to the dataset registry and use it for checkpoints.

Also, any comments about the naming of tags are welcome.

Thank you.

@jorgeorpinel @dberenbaum @shcheklein @flippedcoder

jorgeorpinel commented 3 years ago

I see it has both a main (default) branch, and a base branch.

  1. Why not leave the default branch name as master? As per standard usage
  2. On first glance I'm not getting the difference between the default one and base. Isn't the default branch the "base" of other ones too?

Maybe rename base to basic or common or make it a tag (e.g. 0-basic)

iesahin commented 3 years ago

🙏🏼 @jorgeorpinel

1. Why not leave the default branch name as `master`? As per standard usage

GitHub changed the default branch name to main some time ago. I don't know if anyone really finds the name of the master branch offensive, but I'm not from those parts of the world. My other repositories all default to master, but this is an easy change. We can discuss this in #32 with @dberenbaum and @shcheklein.

On first glance I'm not getting the difference between the default one and base. Isn't the default branch the "base" of other ones too?

You're right. Initially main and checkpoints were both using base. I changed this in 8d2c31dcb6 as:

[Screenshot: Screen Shot 2021-05-04 at 19.15.46]

We can even have a single branch now. Your recommendation is also fine and we already have base-1-dvc-init and base-0-git-init tags. I can move these (including the checkpoints or not) under a single branch now.

jorgeorpinel commented 3 years ago

GitHub changed the default branch name to main some time ago.

Interesting. I haven't created repos on GH in a long time then! Git's default is still master though. But OK whatever is the default in the iterative org should be good, then (p.s. we can change that default).

Agree on using the base tags (instead of a branch).

shcheklein commented 3 years ago

@iesahin it looks good to me!

A few questions/comments:

Bonus things:


main branch: let's adhere to the current GH defaults. It makes sense, and we want to care about those folks for whom this topic is important.

iesahin commented 3 years ago

* `metrics` in `params` looks a bit weird and feels unnecessary, wdyt? If we remove it, we can simplify the code, params files, etc.

Hmm, I couldn't decide what to put to metrics list there and put whatever seems relevant. Can I keep all metrics there?

it makes sense to do an apply and a commit in one of the branches with improved results and final metrics. In this case it's even good to have branches. But ideally it should be a bit more realistic: we complicate the params file in the branch, we add checkpoints, we add live, we improve metrics and capture live metrics, we commit. It will look way better in the Studio.

I can modify src/ and params.yaml per branch/tag. I tried to keep modifications minimal across the tags, but we can have completely different params/pipelines/source in these. I need to wrap my mind around how to make the features more visible.

writing actual Get Started will drive certain requirements, I think we can start drafting it?

First, I would like to update Katacoda scenarios with dvc-get-started. Then I'll update the relevant docs one-by-one, or draft new GS docs.

Our requirements will always change, I can adapt the project to all.

can we make it look realistic - spread commits in time, use two people for different forks

You want me to write a screenplay with Git history? 😆

Would you prefer fictional characters like Raul Owl <raul.owl@dvc.org> and Laurie Owlet <laurie.owlet@dvc.org>, or not so fictional names? 😄

I'll start the history of the main branch from DVC 1.0 release date and the checkpoints from DVC 2.0.
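One way to get a spread-out, two-person history from a generate script is to set Git's author and committer environment variables per commit. The sketch below only builds those environments; it does not run Git. The helper name `commit_envs`, the three-day gap, and the start date are placeholders chosen for illustration (the start date is not the actual DVC release date), while the step names and the two owl characters are taken from this thread.

```python
from datetime import datetime, timedelta
from itertools import cycle
from typing import Dict, List, Sequence, Tuple


def commit_envs(
    start: datetime,
    steps: Sequence[str],
    authors: Sequence[Tuple[str, str]],
    gap: timedelta = timedelta(days=3),
) -> List[Tuple[str, Dict[str, str]]]:
    """Pair each step with environment variables that backdate its commit.

    Authors are assigned round-robin and commits are spaced `gap` apart.
    """
    result = []
    for i, (step, (name, email)) in enumerate(zip(steps, cycle(authors))):
        stamp = (start + i * gap).strftime("%Y-%m-%dT%H:%M:%S")
        env = {
            "GIT_AUTHOR_NAME": name,
            "GIT_AUTHOR_EMAIL": email,
            "GIT_AUTHOR_DATE": stamp,
            "GIT_COMMITTER_NAME": name,
            "GIT_COMMITTER_EMAIL": email,
            "GIT_COMMITTER_DATE": stamp,
        }
        result.append((step, env))
    return result


authors = [
    ("Raul Owl", "raul.owl@dvc.org"),
    ("Laurie Owlet", "laurie.owlet@dvc.org"),
]
# Placeholder start date; replace it with the date the history should begin.
plan = commit_envs(datetime(2021, 1, 1, 9, 0), ["base-0-git-init", "base-1-dvc-init"], authors)
```

Each environment could then be merged into `os.environ` and passed to the `git commit` call made for that step, so both the author and committer dates land where the storyline needs them.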

can we capture more plots - e.g. confusion matrix, we have a template for this

I added the TP/FP/TN etc. metrics because of this. I'll take care of this.
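A confusion-matrix plot generally needs one row per evaluated sample with the true and the predicted label, rather than aggregate TP/FP/TN counts. The following is a minimal sketch of an evaluation step writing such a file; the function name `write_predictions` and the column names `actual` and `predicted` are assumptions, so they should be matched to whatever the plot template actually expects.

```python
import csv
import io
from collections import Counter
from typing import Iterable, TextIO, Tuple


def write_predictions(pairs: Iterable[Tuple[int, int]], out: TextIO) -> Counter:
    """Write one (actual, predicted) row per sample and return the cell counts."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["actual", "predicted"])  # assumed column names
    counts: Counter = Counter()
    for actual, predicted in pairs:
        writer.writerow([actual, predicted])
        counts[(actual, predicted)] += 1
    return counts


# Toy labels for illustration only; real pairs would come from the evaluate stage.
buffer = io.StringIO()
cells = write_predictions([(3, 3), (3, 8), (8, 8), (3, 3)], buffer)
```

The returned counter holds the confusion-matrix cells directly, so the existing TP/FP/TN metrics could be derived from the same pass over the predictions.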

shcheklein commented 3 years ago

Hmm, I couldn't decide what to put to metrics list there and put whatever seems relevant. Can I keep all metrics there?

yep, there are not many of them for now, should be fine to keep all of them

First, I would like to update Katacoda scenarios with dvc-get-started. Then I'll update the relevant docs one-by-one, or draft new GS docs.

👍

shcheklein commented 2 years ago

Looks like this is outdated.