iterative / example-repos-dev

Source code and generator scripts for example DVC projects
https://dvc.org/doc
21 stars 13 forks source link

`example-get-started`: cache images #249

Closed dberenbaum closed 9 months ago

dberenbaum commented 10 months ago

It's better to cache images I think. It creates a stronger example on how things could be managed (I got a question today if images could be cached or not). It's better since we cover a bit more advanced scenario + probably ppl will use it this way.

Originally from https://github.com/iterative/example-repos-dev/pull/248#issue-1876465827

dberenbaum commented 10 months ago

The only thing to consider is that we need to add it as a pipeline output, which I don't think is a bad idea. We can either do -o eval for the entire dir or -o eval/plots and we could either add or leave out -O eval/metrics.json. My bias is to track all the outputs rather than leave some out; it's easier to explain and keeps the metrics tied to the model training.

shcheklein commented 10 months ago

Check also if we need to pass report=None https://github.com/iterative/example-repos-dev/pull/248#discussion_r1314899146

dberenbaum commented 10 months ago

Check also if we need to pass report=None #248 (comment)

Since we don't use the context manager and instead explicitly call live.make_dvcyaml(), I don't think it should be an issue (cc @daavoo). live.make_report() isn't being implicitly called anywhere AFAICT. Also https://github.com/iterative/dvclive/pull/684 would make report=None the default so this seems like a very short-term problem.

daavoo commented 10 months ago

Check also if we need to pass report=None #248 (comment)

Since we don't use the context manager and instead explicitly call live.make_dvcyaml(), I don't think it should be an issue (cc @daavoo). live.make_report() isn't being implicitly called anywhere AFAICT.

Context manager is used:

https://github.com/iterative/example-repos-dev/blob/a9c9a652725b5bf43e29d3295a1b5b10073710f1/example-get-started/code/src/evaluate.py#L103

I guess it was just not deployed yet

dberenbaum commented 10 months ago

🤦 I must have looked on my phone or something and completely missed the changes to evaluate.py in that PR. Closing this issue.

We could open a separate one to set report=None, but it's likely a short-term problem as explained above.

dberenbaum commented 10 months ago

Still seems like the pipeline fails to cache the images, though, right?

dberenbaum commented 10 months ago

do -o eval for the entire dir

I increasingly feel this would be the best default for dvclive, both with and without pipelines. It requires a DVC remote, but I think it's needed in almost every real scenario, and Git-only workflows are likely not helping with onboarding since I have found most people are confused by the idea of tracking dvclive with Git. Caching all of dvclive simplifies the workflow. Without pipelines, it's easy to log any size files and know that everything logged by dvclive is saved in the same place. With pipelines, it makes the stage definition simple since you just add dvclive as a stage output.

shcheklein commented 10 months ago

btw, we could try to utilize .dvcignore to "ignore" the dvc.yaml inside the eval / dvclive dir?

shcheklein commented 10 months ago

I haven't had time to react to the PR review and push the repo, folks. Yes, I also saw that CI failed. I'll try to take a look today.

For this repo specifically, I wanted to have a mix tbh - some metrics in Git, some outside (this repo is more than just experiments).

shcheklein commented 10 months ago

But may be you are right, Dave and we should just track the whole directory there.

dberenbaum commented 10 months ago

btw, we could try to utilize .dvcignore to "ignore" the dvc.yaml inside the eval / dvclive dir?

No, it doesn't help unfortunately. I tried it when iterating on https://github.com/iterative/dvclive/issues/456.

For this repo specifically, I wanted to have a mix tbh - some metrics in Git, some outside (this repo is more than just experiments).

What is the benefit at this point? I think seeing the metrics in Git is a very limited benefit these days (mostly I expect people use vs code/studio/cli to see metrics) and some users have complained about the noise it generates.

dberenbaum commented 10 months ago

I haven't had time to react to the PR review and push the repo, folks. Yes, I also saw that CI failed. I'll try to take a look today.

I don't think it's worth spending time on it until we decide on this discussion since it likely has to change to cache the images inside the pipeline.

shcheklein commented 10 months ago

What is the benefit at this point?

Primarily for demoing things.

E.g. A simple file like metrics.json - could use GH/GL interface to diff things. You could create a CI action w/o setting up AWS credentials (annoying and 10x complicated).

dberenbaum commented 10 months ago

We could keep an option to disable the cache and make it apply to the entire dir for those kinds of scenarios

shcheklein commented 10 months ago

So, what's your take? What should we use as a default? All remote, one single dir?

I can try to update the repo today ...

dberenbaum commented 10 months ago

Yes, that's my take. WDYT? I'm wondering if we should also cache everything without pipelines by default in dvclive 3.0.