iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

queue status: unexpected error - invalid commit #9900

Open aschuh-hf opened 1 year ago

aschuh-hf commented 1 year ago

Bug Report

Description

I queued and started a couple of experiments. Some of them failed for an unknown reason, even though the only difference from the experiments that started successfully is a loss weight parameter. I wanted to check the logs and status of those runs, which is when I encountered the following error.

Reproduce

$ dvc queue status -v
2023-09-01 02:07:01,682 DEBUG: v3.17.0 (conda), CPython 3.10.6 on Linux-3.10.0-1127.8.2.el7.x86_64-x86_64-with-glibc2.17
2023-09-01 02:07:01,683 DEBUG: command: /opt/conda/envs/venv/bin/dvc queue status -v
2023-09-01 02:07:04,867 ERROR: unexpected error - Invalid commit '968fbc500b9115885c29771f3c2999a272cb409b'
Traceback (most recent call last):
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/pygit2/repository.py", line 322, in resolve_refish
    reference = self.lookup_reference_dwim(refish)
KeyError: '968fbc500b9115885c29771f3c2999a272cb409b'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/scmrepo/git/backend/pygit2/__init__.py", line 399, in resolve_commit
    commit, _ref = self._resolve_refish(rev)
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/scmrepo/git/backend/pygit2/__init__.py", line 121, in _resolve_refish
    commit, ref = self.repo.resolve_refish(refish)
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/pygit2/repository.py", line 325, in resolve_refish
    commit = self.revparse_single(refish)
KeyError: '968fbc500b9115885c29771f3c2999a272cb409b'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/dvc/cli/__init__.py", line 209, in main
    ret = cmd.do_run()
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/dvc/cli/command.py", line 26, in do_run
    return self.run()
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/dvc/commands/queue/status.py", line 18, in run
    result = self.repo.experiments.celery_queue.status()
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/dvc/repo/experiments/queue/base.py", line 222, in status
    result.extend(
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/dvc/repo/experiments/queue/base.py", line 223, in <genexpr>
    _format_entry(queue_entry, status="Failed")
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/dvc/repo/experiments/queue/base.py", line 210, in _format_entry
    "timestamp": _get_timestamp(entry.stash_rev),
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/dvc/repo/experiments/queue/base.py", line 194, in _get_timestamp
    commit = self.scm.resolve_commit(rev)
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/scmrepo/git/__init__.py", line 292, in _backend_func
    result = func(*args, **kwargs)
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/scmrepo/git/backend/pygit2/__init__.py", line 401, in resolve_commit
    raise SCMError(f"Invalid commit '{rev}'")
scmrepo.exceptions.SCMError: Invalid commit '968fbc500b9115885c29771f3c2999a272cb409b'

2023-09-01 02:07:07,452 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
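
For readers following the traceback: the failure originates in `_get_timestamp`, where `resolve_commit(rev)` raises `SCMError` for a stash revision that no longer resolves. The sketch below is only an illustration of how a status listing could tolerate such an entry by reporting no timestamp instead of aborting. It is not DVC's code or its fix. `safe_timestamp`, `FakeCommit`, and the local `SCMError` class are hypothetical stand-ins, and the `commit_time` attribute is an assumption about the object that `resolve_commit` returns.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional


class SCMError(Exception):
    """Local stand-in for scmrepo.exceptions.SCMError so the sketch runs without DVC."""


@dataclass
class FakeCommit:
    """Hypothetical commit object; `commit_time` (Unix seconds) is an assumption."""
    commit_time: int


def safe_timestamp(
    resolve_commit: Callable[[str], FakeCommit], rev: str
) -> Optional[datetime]:
    """Return the commit timestamp, or None if the revision cannot be resolved."""
    try:
        commit = resolve_commit(rev)
    except SCMError:
        # The revision is gone (for example, pruned), so report "unknown"
        # rather than failing the whole status listing.
        return None
    return datetime.fromtimestamp(commit.commit_time)
```

With a resolver that raises `SCMError`, the helper returns `None`, so a single stale queue entry would not prevent the remaining entries from being listed.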

Expected

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.17.0 (conda)
---------------------------
Platform: Python 3.10.6 on Linux-3.10.0-1127.8.2.el7.x86_64-x86_64-with-glibc2.17
Subprojects:
        dvc_data = 2.15.4
        dvc_objects = 1.0.1
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.3.1
Supports:
        http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.6.0, boto3 = 1.26.76)
Config:
        Global: /home/aschuh/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: xfs on /dev/sda1
Caches: local
Remotes: s3, s3
Workspace directory: xfs on /dev/sda1
Repo: dvc (subdir), git
Repo.site_cache_dir: /var/tmp/dvc/repo/8d2f5d68bb223da9776a9d6301681efd

Additional Information (if any):

hwong557 commented 1 year ago

I'm getting the same error on version 3.22.1.

NLogSpace commented 3 months ago

Any updates on this? I have the same error on version 3.47.0.

dberenbaum commented 3 months ago

@NLogSpace Are you able to show a reproducible example?

giulatona commented 1 month ago

I had the same issue after queueing and then removing some experiments. I had to manually remove the files .dvc/tmp/exps/celery/broker/processed/.celery.msg and then dvc queue status worked again.
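
A more cautious variant of this workaround is to move the message files aside rather than delete them, so they can be restored if needed. The following is a minimal sketch under assumptions: `quarantine_processed_messages` is a hypothetical helper, the directory is taken from the path quoted above, and matching on the `.celery.msg` suffix is an assumption about how those files are named.

```python
import shutil
from pathlib import Path
from typing import List


def quarantine_processed_messages(
    repo_root: str, backup_dir: str, dry_run: bool = True
) -> List[Path]:
    """List (and optionally move aside) processed broker message files.

    With dry_run=True nothing is changed; the matching paths are only returned.
    """
    processed = (
        Path(repo_root) / ".dvc" / "tmp" / "exps" / "celery" / "broker" / "processed"
    )
    if not processed.is_dir():
        return []

    # Assumption: the files of interest end with ".celery.msg".
    matches = sorted(
        p for p in processed.iterdir()
        if p.is_file() and p.name.endswith(".celery.msg")
    )

    if not dry_run:
        backup = Path(backup_dir)
        backup.mkdir(parents=True, exist_ok=True)
        for path in matches:
            # Move instead of delete so the files can be put back later.
            shutil.move(str(path), str(backup / path.name))

    return matches
```

Calling it first with the default `dry_run=True` shows which files would be affected; passing `dry_run=False` then moves them into `backup_dir`.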

hkariti commented 3 weeks ago

Had the same issue and noticed that removing old experiments using dvc queue remove --success fixed it. Maybe commits of old experiments were removed by git gc, causing this issue?
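
One way to test this hypothesis is to check whether the commit named in the error message still exists in the Git repository. The sketch below shells out to `git cat-file -e`, which exits with a non-zero status when the object is missing. `commit_exists` is a hypothetical helper, and the `run` parameter exists only so the function can be exercised without a real repository.

```python
import subprocess


def commit_exists(rev: str, repo_dir: str = ".", run=subprocess.run) -> bool:
    """Return True if `rev` resolves to a commit object in the repo at `repo_dir`."""
    result = run(
        # "<rev>^{commit}" asks Git to peel the revision to a commit object.
        ["git", "-C", repo_dir, "cat-file", "-e", f"{rev}^{{commit}}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0
```

If `commit_exists("968fbc500b9115885c29771f3c2999a272cb409b")` returns `False` when run inside the affected repository, that would be consistent with the commit having been pruned, although it does not by itself show that `git gc` was the cause.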