Model Checkpoint Callback error (torch lightning)

thisiswhereitype commented 1 year ago

Hi, when training a model I get a warning when the model is saved. I am running dvc repro model_train. Could the exception handling be improved here? I had to track the message down in the source.

https://github.com/iterative/dvclive/blob/e8d008eb5f2b2632786afd3b27897e57b7867e43/src/dvclive/live.py#L493-L495

WARNING:dvclive:Failed to dvc add .../DvcLiveLogger/AE_embed_v1/checkpoints: cannot update 'checkpoints': not a data source

from dvclive.lightning import DVCLiveLogger
...
dvclive_logger = DVCLiveLogger(
    f"{model}_embed_v1", prefix=model, dir=f"dvclive/{model}", log_model=True
)

  - stages:
    - train.model:
      metrics:
        - dvclive/AE
      outs:
        - DvcLiveLogger/AE_embed_v1/

dvc doctor:

$ dvc doctor
DVC version: 2.58.2 (conda)
---------------------------
Platform: Python 3.10.11 on Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 0.51.0
        dvc_objects = 0.25.0
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.2.1
Supports:
        http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3)
Config:
        Global: ~/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: 9p on D:\
Caches: local
Remotes: local
Workspace directory: 9p on D:\
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/dd3f8e5f1731970476b1faa6cc8caf7a

dberenbaum commented 1 year ago

How do you run the code? Do you use dvc repro, dvc exp run, invoke the Python script directly?

thisiswhereitype commented 1 year ago

dvc repro with a cmd: python <lib>/module.py

dberenbaum commented 1 year ago

Any reason you use dvc repro over dvc exp run? dvc exp run should work as a drop-in replacement, and not all DVCLive features work as intended (including some of the warnings/errors) with dvc repro, although we will work on fixing that as much as we can.

thisiswhereitype commented 1 year ago

I didn't understand from reading the docs that DVCLive is meant to be used with dvc exp.

thisiswhereitype commented 1 year ago

Although when I was trying with dvc exp and for a grid search I found I couldn't visualise the exp's easily, and the VScode studio was locking the git index

shcheklein commented 1 year ago

I couldn't visualise the exp's easily, and the VScode studio was locking the git index

This part has been significantly improved recently in the VS Code DVC extension. Could you give it another try, please?

thisiswhereitype commented 1 year ago

I realised I misunderstood the Hydra config: the exp commands seemed to run -S sweeps, so maybe a full Hydra conf/ is optional? Either way, I have set up params now that I have returned to exp. My aim is to run experiments based upon pre-existing stages. Please could you sanity check my thinking:

1. Each stage instantiates a Live object.
2. However, the stages already handle outputs such that they can be used in a foreach, but this seemed more appropriate.
   a. The experiment complains that artefacts are already indexed/cached by the initial stage. Should I use --temp/--queue?
   b. Also, what are valid log_artifact names? Seemingly name0200 is not valid?
3. In my first use I set up params.yaml directly, but running exp seems to revert this, even with new Hydra conf that should be generated.

```
$ tree conf
conf
├── config.yaml
├── nn
│   ├── embed
│   │   ├── ae.yaml
│   │   └── vae.yaml
│   └── train
│       └── run.yaml
└── tsne
    └── run.yaml
```

thisiswhereitype commented 1 year ago

I see from https://github.com/iterative/dvclive/issues/670#issuecomment-1705140336 that reproing exps is a no, and that exp run or a standalone Live are the two options.

dberenbaum commented 1 year ago

the exp commands seemed to run -S sweeps, so maybe a full Hydra conf/ is optional?

Correct, hydra is used as dvc's backend for parsing parameters by default, and optionally you can enable it to use a conf/ directory per https://dvc.org/doc/user-guide/experiment-management/hydra-composition.
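As a rough illustration of what a -S sweep with comma-separated choices amounts to, the sketch below expands each override into the Cartesian product of its values, one list of overrides per run. It is not DVC's or Hydra's actual code, the parameter names are hypothetical, and it ignores Hydra's richer syntax (quoting, ranges, lists).

```python
from itertools import product


def expand_sweep(overrides):
    """Expand overrides such as 'lr=0.1,0.01' into one override list per run."""
    keys, choices = [], []
    for item in overrides:
        key, _, values = item.partition("=")
        keys.append(key)
        choices.append(values.split(","))
    # One run per combination of choices, in the order the overrides were given.
    return [
        [f"{key}={value}" for key, value in zip(keys, combo)]
        for combo in product(*choices)
    ]


# Hypothetical parameter names, for illustration only.
runs = expand_sweep(["train.lr=0.1,0.01", "tsne.perplexity=10,30"])
for run in runs:
    print(run)
```

Two overrides with two choices each give four runs, which is why sweeps like this are usually combined with --queue rather than run one by one in the workspace.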

Each stage instantiates a Live object

Makes sense so far.

2. However, the stages already handle outputs such that they can be used in a foreach, but this seemed more appropriate.

2. a. The experiment complains that artefacts are already indexed/cached by the initial stage. Should I use --temp/--queue?

Not sure I follow. What's the message you get?

2. b. Also, what are valid log_artifact names? Seemingly name0200 is not valid?

Hmm, that should be a valid name, and I'm able to log an artifact with that name. Can you show the error you get? The rules for artifact names are explained in https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#artifacts.

3. In my first use I setup params.yaml directly - but running exp seems to revert this even with new Hydra conf that should be generated.

If you use conf/ and set hydra.enabled per https://dvc.org/doc/user-guide/experiment-management/hydra-composition, each experiment will use the conf/ directory values to override params.yaml. If you disable hydra.enabled, you can disable this behavior.

thisiswhereitype commented 1 year ago

Okay, so I had mistyped my conf/config.yaml and worked this out by using -vv, which shows what was being searched when processing ${}s!

If you use conf/ and set hydra.enabled per https://dvc.org/doc/user-guide/experiment-management/hydra-composition, each experiment will use the conf/ directory values to override params.yaml. If you disable hydra.enabled, you can disable this behavior.

So a params.yaml is populated, based upon the defaults, on any run (ignoring any -S) and I should check this into git? Using any dvc exp run <stage> will apply this to head unless --queue or told otherwise.

Looks like I had misunderstood a lot! Just reproing now

dberenbaum commented 1 year ago

So a params.yaml is populated, based upon the defaults, on any run (ignoring any -S) and I should check this into git?

Almost. -S isn't ignored. It will still override the conf/ defaults, just like hydra would normally do. Think of it like using hydra without dvc but you are only using it to write the contents of params.yaml.
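The precedence described here can be sketched in a few lines: start from the conf/ defaults, apply each -S override on top, and the result is what ends up in params.yaml. This is only a simplified model under stated assumptions, not how DVC or Hydra implement it; the parameter names and values are hypothetical.

```python
import copy
import json


def apply_overrides(defaults, overrides):
    """Return a copy of `defaults` with dotted 'key=value' overrides applied."""
    params = copy.deepcopy(defaults)
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = params
        for name in parents:
            node = node.setdefault(name, {})
        try:
            node[leaf] = json.loads(raw)  # numbers, booleans, null
        except json.JSONDecodeError:
            node[leaf] = raw  # anything else stays a plain string
    return params


# Hypothetical defaults standing in for the composed conf/ directory.
conf_defaults = {"tsne": {"perplexity": 30, "n_iter": 1000}}
params = apply_overrides(conf_defaults, ["tsne.perplexity=10"])
print(params)
```

Values that were not overridden keep their conf/ defaults, and the defaults themselves are left unchanged for the next run.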

Using any dvc exp run <stage> will apply this to head unless --queue or told otherwise.

👍

dberenbaum commented 1 year ago

Feel free to suggest how to improve the docs

thisiswhereitype commented 1 year ago

Thanks, so far experiments are running with the parameters I want, and stages can be repro'd and get their parameters from params.yaml accordingly. Hydra manages my parameter generation for experiments.

So my dilemma is that I have stages which run; however, I would now like to experiment with them by tweaking parameters and such.

If I start logging things like the current step and loss via a Live object manipulated through callbacks, these are logged by: Live(dir='dvclive/alg.tsne', save_exp=False).

The script already logs to an outdir in the dvc stage below.

1. What do I tell my dvc.yaml about them?
2. Also if I ever run dvc repro I will just need to manually apply the associated exp result to head?

stages:
  # assume correct conda env
  alg.tsne:
    cmd: python -m energy_ts.alg.tsne.run
    frozen: false
    params:
      - tsne
    deps: # these are defined above to trigger reruns.
      - ${core}
      - ${data}
      - ${etl}
      - energy_ts/alg/tsne/run.py
    outs:
      - data/alg.tsne

dberenbaum commented 1 year ago

What do I tell my dvc.yaml about them?

If you are using dvclive>=3.0, I would recommend you add that dvclive directory to the outs section so it is tracked as an output of that stage. If you want to keep them in Git like they are now instead of in DVC remote storage, you should also include cache: false. See the examples here.
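To make the shape of that stage entry concrete, here is a small sketch that models the stage as a plain Python dict mirroring dvc.yaml and appends the dvclive directory to outs with cache: false. The helper name is hypothetical and nothing here calls DVC; the dvclive/alg.tsne path is taken from the Live(dir=...) call quoted earlier in the thread.

```python
import json


def track_uncached_out(stage, path):
    """Return a copy of the stage with `path` added to outs as `cache: false`."""
    outs = list(stage.get("outs", []))
    # In dvc.yaml, per-output options are written as a mapping under the path.
    outs.append({path: {"cache": False}})
    return {**stage, "outs": outs}


# A trimmed-down copy of the alg.tsne stage from the thread.
stage = {
    "cmd": "python -m energy_ts.alg.tsne.run",
    "params": ["tsne"],
    "outs": ["data/alg.tsne"],
}
updated = track_uncached_out(stage, "dvclive/alg.tsne")
print(json.dumps({"stages": {"alg.tsne": updated}}, indent=2))
```

The printed structure corresponds to an outs list with two entries: the existing data/alg.tsne path, and dvclive/alg.tsne carrying a nested cache: false so that its files stay in Git rather than in the DVC cache.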

2. Also if I ever run dvc repro I will just need to manually apply the associated exp result to head?

The experiment result should be applied to your workspace, but it would be up to you to save it to Git so you can recover it later. I'm not sure there's much benefit to running dvc repro over dvc exp run here, but if you feel it's needed, feel free to explain your scenario.

I'm going to close this issue since it seems the initial question has been resolved, but feel free to comment or open a new one if you still have questions.

thisiswhereitype commented 1 year ago

I'm going to close this issue since it seems the initial question has been resolved, but feel free to comment or open a new one if you still have questions.

Thanks! Yes I think I understood now.

feel free to explain your scenario.

Muscle memory from using DVC this way pre v1 🥲

iterative / dvclive

Model Checkpoint Callback error (torch lightning) #657