iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

exp list: does not show new experiments #9260

Open mstrupp opened 1 year ago

mstrupp commented 1 year ago

Bug Report

Description

When the terminal is killed while dvc exp run is executing, the ref .git/refs/exps/exec/EXEC_BASELINE is not removed. When a git commit is made afterwards, git might pack the refs to optimize performance. From then on, dvc exp list is stuck with the list of experiments from before the commit and does not update when new experiments are run.

This also affects the experiments table in the vscode extension.

Reproduce

  1. git init
  2. dvc init
  3. dvc stage add -n prepare -d prepare.py python prepare.py
  4. create file prepare.py and write a program that takes some time (e.g. time.sleep(10))
  5. git add .
  6. git commit -m "commit 1"
  7. dvc exp run
  8. while running: Kill the terminal (not via ctrl+c but by closing the terminal)
  9. edit prepare.py (to make dvc exp run execute the pipeline again)
  10. git add .
  11. git commit -m "commit 2"
  12. git pack-refs --all (when committing, git sometimes packs refs automatically as an optimization, and that can happen at this point; running the command explicitly simulates the automatic packing)
  13. dvc exp run
  14. dvc exp list

Expected

dvc exp list should show the experiment from step 13. Instead, it returns nothing. The experiment only appears with dvc exp list -A.
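
Whether the leftover ref is still a loose file or has already been packed can be checked with a small helper. This is a minimal sketch, not part of DVC: the function name exec_baseline_storage is hypothetical, and it assumes Git's default "files" ref backend, where a loose ref is a plain file under the Git directory and git pack-refs --all moves it into packed-refs.

```shell
#!/bin/sh
# Hypothetical helper (not a DVC or Git command).
# Prints how refs/exps/exec/EXEC_BASELINE is stored: "loose", "packed", or "absent".
# Assumes the default "files" ref backend.
exec_baseline_storage() {
    repo=${1:-.}
    ref=refs/exps/exec/EXEC_BASELINE

    # The ref does not resolve at all: nothing was left behind.
    if ! git -C "$repo" show-ref --verify --quiet "$ref"; then
        echo absent
        return 0
    fi

    # Resolve the shared Git directory (also correct inside linked worktrees).
    common_dir=$(cd "$repo" && cd "$(git rev-parse --git-common-dir)" && pwd)

    # A loose ref exists as a file; otherwise it only lives in packed-refs.
    if [ -f "$common_dir/$ref" ]; then
        echo loose
    else
        echo packed
    fi
}
```

Running exec_baseline_storage in the repository after step 8 and again after step 12 shows whether the ref survived the killed terminal and whether it has been packed.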

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.38.1 (exe)
---------------------------------
Platform: Python 3.10.9 on Windows-10-10.0.19045-SP0
Subprojects:

Supports:
        azure (adlfs = 2022.11.2, knack = 0.10.1, azure-identity = 1.12.0),
        gdrive (pydrive2 = 1.15.0),
        gs (gcsfs = 2022.11.0),
        hdfs (fsspec = 2022.11.0, pyarrow = 10.0.1),
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.1.0),
        s3 (s3fs = 2022.11.0, boto3 = 1.24.59),
        ssh (sshfs = 2022.6.0),
        webdav (webdav4 = 0.9.8),
        webdavs (webdav4 = 0.9.8),
        webhdfs (fsspec = 2022.11.0)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: NTFS on C:\
Repo: dvc, git

daavoo commented 1 year ago

Hi @mstrupp, could you try upgrading to the latest DVC version?

mstrupp commented 1 year ago

Hi @daavoo, thank you for the response. I upgraded dvc but the problem still exists.

$ dvc doctor
DVC version: 2.51.0 (pip)
-------------------------
Platform: Python 3.10.8 on Windows-10-10.0.19045-SP0
Subprojects:
        dvc_data = 0.44.1
        dvc_objects = 0.21.1
        dvc_render = 0.3.1
        dvc_task = 0.2.0
        scmrepo = 0.1.17
Supports:
        http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: NTFS on C:\
Repo: dvc, git
Repo.site_cache_dir: C:\ProgramData\iterative\dvc\Cache\repo\5db899e06b13bbca5a630f6ac0c2cbfd

pmrowla commented 1 year ago

The workaround here would be to remove the exec ref with

git update-ref -d refs/exps/exec/EXEC_BASELINE

The issue is that we have logic to account for when HEAD has moved during experiment execution, where exp show will then show experiments derived from EXEC_BASELINE instead of HEAD. We could consider updating the logic to check and see if there is also an active workspace run (and clean up the ref when there is not), but this would also introduce additional overhead into every dvc command that uses resolve_rev.

dberenbaum commented 1 year ago

@pmrowla Is it needed for anything besides exp list and exp show? Can we do it only in those commands?

pmrowla commented 1 year ago

@dberenbaum it's needed for every DVC command that has any kind of parameter that can be set to (or defaults to) HEAD (so any diff/show command)

Should also note that if we drop checkpoints support we could also consider just dropping this behavior as well. HEAD is still moved for regular experiments but we restore it shortly afterwards when the experiment run ends. The main issue here is that for checkpoints, HEAD is moved to the most recently generated checkpoint commit. (We may not actually be able to drop this entirely though since tools like vscode could still try to run DVC commands before HEAD is restored at the end of a regular exp run)

mstrupp commented 1 year ago

Thanks for the suggested workaround @pmrowla.

Unfortunately, the user has no way of noticing when the problem occurs and that the workaround should be applied. DVC happily shows the experiments before EXEC_BASELINE. The user expects to see the new experiments but never learns why they are not shown.
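
One way to make the situation visible is a check that compares the leftover ref with HEAD. The following is a hedged sketch, not DVC behavior: check_exec_baseline is a hypothetical helper name, and the script cannot tell whether an experiment is actually running (HEAD legitimately differs from EXEC_BASELINE during an active run), so it only prints a hint instead of deleting anything.

```shell
#!/bin/sh
# Hypothetical helper (not a DVC or Git command).
# Returns 1 and prints a hint to stderr when refs/exps/exec/EXEC_BASELINE
# exists and points to a different commit than HEAD; returns 0 otherwise.
check_exec_baseline() {
    repo=${1:-.}
    ref=refs/exps/exec/EXEC_BASELINE

    # No exec ref (loose or packed): nothing to report.
    baseline=$(git -C "$repo" rev-parse --verify --quiet "$ref") || return 0
    # No commits yet: nothing to compare against.
    head=$(git -C "$repo" rev-parse --verify --quiet HEAD) || return 0

    if [ "$baseline" != "$head" ]; then
        echo "warning: $ref points to $baseline but HEAD is $head" >&2
        echo "hint: if no 'dvc exp run' is active, run: git update-ref -d $ref" >&2
        return 1
    fi
    return 0
}
```

It could be run by hand, or from a shell alias before dvc exp list; a nonzero exit status means the ref differs from HEAD and the workaround above may apply.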