iterative / example-repos-dev

Source code and generator scripts for example DVC projects
https://dvc.org/doc

Adding `example-dvc-checkpoints` #77

Closed iesahin closed 2 years ago

iesahin commented 3 years ago

This replaces #45 and creates a new project named example-dvc-checkpoints instead of get-started-checkpoints. The new project is based on example-dvc-experiments.

iesahin commented 3 years ago

This is ready for review. I pushed the repository to https://github.com/iterative/dvc-example-checkpoints-tensorflow. There are some warnings in the push script, and it may need some polish.

@dberenbaum @shcheklein @casperdcl @jorgeorpinel

casperdcl commented 3 years ago

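Could this possibly also replace the entirety of https://github.com/iterative/dvc-checkpoints-mnist? Considerations:

  • with & without dvclive
  • with & without max num of checkpoints
  • CML: auto-pushing checkpoints for workflow run recovery
  • framework: tensorflow vs pytorch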

iesahin commented 3 years ago
* with & without dvclive

It may be worthwhile to have a separate example-dvclive project @casperdcl @dberenbaum @daavoo .

I can remove the live-keras branch from the current project then.

See #81

iesahin commented 3 years ago
* with & without max num of checkpoints

You mean a fixed number of epochs? In this implementation, `dvc exp run -S train.epochs=0` runs the experiment indefinitely. Otherwise it runs the specified number of epochs without resuming from the previous model.
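The convention described here (0 means "no limit", any other value is a fixed epoch count) can be sketched in plain Python. This is only an illustration under assumptions: `epoch_numbers`, `train`, and `train_one_epoch` are hypothetical names, not code from the project, and the real training script may handle interruption differently.

```python
import itertools
from typing import Callable, Iterator


def epoch_numbers(epochs: int) -> Iterator[int]:
    """Return an iterator of 1-based epoch numbers (hypothetical helper).

    epochs == 0 means "no limit", mirroring `-S train.epochs=0` in the thread.
    """
    if epochs < 0:
        raise ValueError("epochs must be >= 0")
    if epochs == 0:
        # No upper bound: the caller stops the loop, e.g. with Ctrl+C.
        return itertools.count(1)
    return iter(range(1, epochs + 1))


def train(epochs: int, train_one_epoch: Callable[[int], None]) -> int:
    """Run the loop and return the number of completed epochs."""
    completed = 0
    try:
        for epoch in epoch_numbers(epochs):
            train_one_epoch(epoch)
            completed = epoch
    except KeyboardInterrupt:
        # Indefinite runs end here; epochs finished so far still count.
        pass
    return completed
```

With a fixed count the loop ends on its own; with 0 it only ends when interrupted, which is the behavior that a default example would have to explain to new users.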

iesahin commented 3 years ago
* framework: tensorflow vs pytorch

I also like PyTorch more and would prefer it, but it doesn't run on Katacoda (and possibly other virtual servers). It's possible to have a separate example-dvc-checkpoints-pytorch project though.

dberenbaum commented 3 years ago

Also cc @daavoo

We are trying to make this the "happy path," right?

  • with & without dvclive

I think it should include dvclive. This is the easiest for users, and we can bundle it with dvc if we want to avoid asking users to install a separate package. We can create a separate repo for dvclive to show different frameworks, but that shouldn't change that we should use it here by default.

  • with & without max num of checkpoints

I vote for a set number of checkpoints. The indefinite training workflow has caused lots of confusion and isn't what most users are accustomed to. Indefinite training can be touted elsewhere as a nice benefit that checkpoints enable.

  • CML: auto-pushing checkpoints for workflow run recovery

Since this is configured by environment variable, it should be easy to change this. I'm not sure where it would push to if we keep it turned on by default?

  • framework: tensorflow vs pytorch

No strong opinion on this one.

One advantage of tensorflow is that keras makes the examples really easy. On the other hand, the DVCLiveCallback for keras might hide too much basic functionality that users will need to understand for unsupported frameworks.

daavoo commented 3 years ago

I think it should include dvclive. This is the easiest for users, and we can bundle it with dvc if we want to avoid asking users to install a separate package. We can create a separate repo for dvclive to show different frameworks, but that shouldn't change that we should use it here by default.

Agree. Probably just leave the already implemented live-keras here (reasonable default framework) and just make the separate example-dvclive-{ML-Framework} build from that point, similar to how we are building here from the get-started-experiments, right?

No strong opinion on this one.

One advantage of tensorflow is that keras makes the examples really easy. On the other hand, the DVCLiveCallback for keras might hide too much basic functionality that users will need to understand for unsupported frameworks.

But if this repo is "the happy path" that's not really an issue, right? Low-level integrations where basic dvclive functionality is explained could be handled in the dvclive example repo mentioned above.

dberenbaum commented 3 years ago

But if this repo is "the happy path" that's not really an issue, right? Low-level integrations where basic dvclive functionality is explained could be handled in the dvclive example repo mentioned above.

Good point. Maybe we need to define "happy path" 🤣 . I think this example should be the one that provides the most users with sufficient value to get started with checkpoints and feel they are getting some benefit from it.

What proportion of users do we think can leverage one of the dvclive callbacks? Obviously, we aim to provide integrations for the most common frameworks, but there are so many (and so many higher level ones built on top of tensorflow and pytorch) that I don't have a great idea of whether the majority would use the integrations, and bare tensorflow and pytorch don't have callbacks. If most users can get started with one of the callbacks, then keras makes sense. If most users have to work outside of those, then a more manual logging method makes sense.
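To make the two options concrete, here is a framework-free sketch. It assumes a hypothetical `MetricsLogger` that stands in for a logging library; it is not dvclive's API, and `fake_epoch`, `fit`, and `manual_loop` are likewise made up for illustration. In the callback style the per-epoch logging call is hidden inside a framework-owned loop (as with a Keras callback); in the manual style the user writes that call in their own loop.

```python
from typing import Callable, Dict, List


class MetricsLogger:
    """Hypothetical stand-in for a metrics logger; not dvclive's real API."""

    def __init__(self) -> None:
        self.history: List[Dict[str, float]] = []

    def log_epoch(self, metrics: Dict[str, float]) -> None:
        # One record per epoch; a real logger would also write files to disk.
        self.history.append(dict(metrics))


def fake_epoch(epoch: int) -> Dict[str, float]:
    """Placeholder for one epoch of training; returns made-up metrics."""
    return {"epoch": float(epoch), "loss": 1.0 / epoch}


def fit(epochs: int, on_epoch_end: Callable[[Dict[str, float]], None]) -> None:
    """Callback style: a framework-owned loop calls the user's hook."""
    for epoch in range(1, epochs + 1):
        on_epoch_end(fake_epoch(epoch))


def manual_loop(epochs: int, logger: MetricsLogger) -> None:
    """Manual style: the user owns the loop and logs explicitly."""
    for epoch in range(1, epochs + 1):
        metrics = fake_epoch(epoch)
        logger.log_epoch(metrics)


callback_logger = MetricsLogger()
fit(3, on_epoch_end=callback_logger.log_epoch)

manual_logger = MetricsLogger()
manual_loop(3, manual_logger)
```

Both paths record the same metrics. The difference is only in who owns the loop, which is the trade-off raised above: the callback is shorter, while the manual loop shows the logging step that users of unsupported frameworks would need to write themselves.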

iesahin commented 3 years ago

An idea we discussed with @shcheklein is to merge the checkpoints with the experiments project. I created #84 for this.

It's possible to dedicate this project to dvclive with a single branch and merge the python-api branch into the experiments project. We can rename this to example-dvclive-keras to avoid confusion about the other ML frameworks.

We can implement signal-file in an R/Julia project in the future.

The basic use case is not recommended anyway.

WDYT? @dberenbaum @daavoo @casperdcl

daavoo commented 3 years ago

What proportion of users do we think can leverage one of the dvclive callbacks? Obviously, we aim to provide integrations for the most common frameworks, but there are so many (and so many higher level ones built on top of tensorflow and pytorch) that I don't have a great idea of whether the majority would use the integrations, and bare tensorflow and pytorch don't have callbacks. If most users can get started with one of the callbacks, then keras makes sense. If most users have to work outside of those, then a more manual logging method makes sense.

I would expect most people to use high-level frameworks and dvclive callback integrations, so keras makes sense to me. That is my intuition given how some high-level frameworks like https://github.com/huggingface/transformers are taking over from previous frameworks.

I don't really see why an academic researcher would use bare TensorFlow or PyTorch for the training loop nowadays given how in every high-level framework you can customize pretty much every logic block. It feels even less likely for a "production" ml practitioner.

Even if there are "a lot" of high-level framework alternatives, there are only a handful of really used ones, and adding support for each one doesn't take a lot of effort tbh (except for some frameworks where a few gotchas might be hidden).

I would dare to say that the existing list (https://dvc.org/doc/dvclive/user-guide/ml-frameworks) + PyTorch Lightning would already cover the vast majority of practical users. Some task-specific libraries that opt for writing everything without a framework, like YoloV5, might be the exception.

dberenbaum commented 3 years ago

@iesahin I'm a bit lost on the organization of the various repos. If we can merge checkpoints with the experiments repo, that's great, but in that case I would apply all of the ideas from this discussion to that repo. I have been assuming that this discussion applies to the primary checkpoints example, wherever that ends up being.

Edit: I agree that we shouldn't worry about other branches to address non-Python or basic/no-code checkpoints.

iesahin commented 3 years ago

If we can merge checkpoints with the experiments repo, that's great, but in that case I would apply all of the ideas from this discussion to that repo.

Yes, that's what we aim for. Instead of example-dvc-checkpoints with 4 branches, we thought (after I created this issue) that it's possible to select one of these branches (python-api) and add it to the example-dvc-experiments repository in a tag.

The issue where we can discuss the merge is #84.

I can close this one if we agree on following that path.

dberenbaum commented 3 years ago

Yes, that's what we aim for. Instead of example-dvc-checkpoints with 4 branches, we thought (after I created this issue) that it's possible to select one of these branches (python-api) and add it to the example-dvc-experiments repository in a tag.

Sounds good, except that based on the discussion above, can we select live-keras instead of python-api?