iterative / dvclive

📈 Log and track ML metrics, parameters, models with Git and/or DVC
https://dvc.org/doc/dvclive
Apache License 2.0

What is the proper way of enabling intermediate checkpointing when using pytorch lightning+dvc+dvclive? #826

Closed QazyBi closed 1 month ago

QazyBi commented 1 month ago

Hello! I am grateful for the amazing tool you're providing. After reading the documentation, I couldn't figure out the proper way to enable intermediate checkpoint logging with DVC and Lightning. My goal is to save multiple checkpoints in one run.

My dvc.yaml file:

stages:
  train:
    cmd: python src/entry.py --train
    deps:
    - src/entry.py
    outs:
    - DvcLiveLogger
metrics:
- dvclive/metrics.json
plots:
- dvclive/plots/metrics:
    x: step
artifacts:
  best:
    path: dvclive/artifacts/model.ckpt
    type: model

Lightning checkpoint callback that saves a checkpoint every epoch:

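# Imports assume the unified `lightning` package; with the older
# `pytorch_lightning` package, import Trainer and ModelCheckpoint from there.
from dvclive.lightning import DVCLiveLogger
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint

# Save a checkpoint at the end of every epoch and keep all of them.
checkpoint_callback_periodic = ModelCheckpoint(
    dirpath="DvcLiveLogger/",
    filename="model",
    save_top_k=-1,
    every_n_epochs=1,
    auto_insert_metric_name=False,
)

# Log metrics to the dvclive directory and track the model with DVCLive.
dvc_logger = DVCLiveLogger(
    dir="dvclive",
    log_model=True,
)

trainer = Trainer(
    max_epochs=5,
    logger=dvc_logger,
    callbacks=[checkpoint_callback_periodic],
    enable_checkpointing=True,
    log_every_n_steps=1,
)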

Output of dvc exp show: [image]

But I was expecting something like this, which was available in DVC 2.0: [image] (source)

Output of dvc doctor:

DVC version: 3.50.2 (conda)
---------------------------
Platform: Python 3.10.14 on Linux-6.5.0-35-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.5
Supports:
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2024.5.0, boto3 = 1.34.106)
Config:
        Global: /home/qb/.config/dvc
        System: /etc/xdg/xdg-ubuntu/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p6
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/nvme0n1p6
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/da492d0d75e1df64806fd63eb1f23b42

dberenbaum commented 1 month ago

Hi @QazyBi!

Checkpoints as described in that blog post were deprecated in dvc 3 since they were often cumbersome and unnecessary. You can still access all your checkpoints in dvc 3 with dvclive and lightning, but you will not see them displayed like in that example. The checkpoints themselves are still preserved in DvcLiveLogger (and the best ones are copied to dvclive/artifacts), and the metrics of each step are available as plots in dvclive/plots.
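
For anyone who wants to inspect those outputs after a run, here is a minimal sketch using only the Python standard library. It is not part of DVCLive's API: the helper names `list_checkpoints` and `read_metric_history` are hypothetical, and it assumes that the checkpoints are `.ckpt` files under `DvcLiveLogger` (the `dirpath` from the question) and that each file under `dvclive/plots/metrics` is tab-separated with a `step` column plus one value column.

```python
import csv
from pathlib import Path


def list_checkpoints(ckpt_dir="DvcLiveLogger"):
    """Return all *.ckpt files under ckpt_dir, oldest first.

    Returns an empty list if the directory does not exist.
    """
    root = Path(ckpt_dir)
    if not root.is_dir():
        return []
    # Sort by modification time; fall back to the name for identical timestamps.
    return sorted(root.rglob("*.ckpt"), key=lambda p: (p.stat().st_mtime, p.name))


def read_metric_history(tsv_path):
    """Read one per-step metric file into a list of (step, value) tuples.

    Assumption: a tab-separated file with a header row, a `step` column, and
    one value column (an optional `timestamp` column is ignored).
    """
    with open(tsv_path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        value_cols = [c for c in reader.fieldnames if c not in ("step", "timestamp")]
        value_col = value_cols[-1]
        return [(int(row["step"]), float(row[value_col])) for row in reader]


if __name__ == "__main__":
    # Paths below mirror the question's setup and are assumptions.
    for ckpt in list_checkpoints("DvcLiveLogger"):
        print("checkpoint:", ckpt)
    for tsv in sorted(Path("dvclive/plots/metrics").rglob("*.tsv")):
        print(tsv, read_metric_history(tsv)[-1:])  # last logged step, if any
```

Running this from the project root after training should print every saved checkpoint and the last logged value of each metric, provided the layout matches the assumptions above.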

QazyBi commented 1 month ago

Thank you for the quick response! I appreciate it. I've noticed the checkpoints in the DvcLiveLogger folder, which is great.