iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0
13.87k stars 1.19k forks source link

gc: requires loading parameters files #7585

Open itcarroll opened 2 years ago

itcarroll commented 2 years ago

Bug Report

Description

When "dvc.yaml" references a parameter file in order to reproduce a foreach stage, then dvc gc --all-commits will error if the parameter file does not exist in a commit containing this "dvc.yaml" version. The use case is not too convoluted: it involves a development version of a parameters file that is not committed, but used in "dvc.yaml" during development where it is easy to accidentally commit. @pared Suggested this may be a bug if I could reproduce.

Reproduceo

Initialize git and dvc. Add the following three files:

# dvc.yaml
vars:
- config.yaml.dev
stages:
  echo:
    foreach: ${country}
    do:
      cmd: echo ${item} > ${item}.txt
      outs:
      - ${item}.txt
# config.yaml.dev
country:
- fr
# .gitignore
config.yaml.dev

Run dvc repro then git add . && git commit -m "only commit".

To generate the following error, run dvc gc --all-commits -v and answer "y" at the prompt.

% dvc gc --all-commits -v
2022-04-18 14:23:59,329 WARNING: This will remove all cache except items used in the workspace and all git commits of the current repo.
Are you sure you want to proceed? [y/n]: y
2022-04-18 14:24:01,871 ERROR: failed to parse 'vars' in 'dvc.yaml': 'config.yaml.dev' does not exist: 'config.yaml.dev' does not exist
------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/parsing/__init__.py", line 147, in __init__
    self.context.load_from_vars(*args, default=DEFAULT_PARAMS_FILE)
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/parsing/context.py", line 456, in load_from_vars
    self.merge_from(fs, item, wdir)
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/parsing/context.py", line 408, in merge_from
    ctx = Context.load_from(fs, abspath, select_keys)
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/parsing/context.py", line 364, in load_from
    raise ParamsLoadError(f"'{file}' does not exist")
dvc.parsing.context.ParamsLoadError: 'config.yaml.dev' does not exist

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/cli/__init__.py", line 89, in main
    ret = cmd.do_run()
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/commands/gc.py", line 51, in run
    self.repo.gc(
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/repo/gc.py", line 61, in gc
    for obj_ids in repo.used_objs(
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/repo/__init__.py", line 420, in used_objs
    for odb, objs in self.index.used_objs(
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/repo/index.py", line 218, in used_objs
    for stage, filter_info in pairs:
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/repo/index.py", line 212, in <genexpr>
    self.stage_collector.collect_granular(
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/repo/stage.py", line 401, in collect_granular
    return [StageInfo(stage) for stage in self.repo.index]
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/repo/stage.py", line 401, in <listcomp>
    return [StageInfo(stage) for stage in self.repo.index]
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/repo/index.py", line 92, in __iter__
    yield from self.stages
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/funcy/objects.py", line 28, in __get__
    res = instance.__dict__[self.fget.__name__] = self.fget(instance)
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/repo/index.py", line 74, in stages
    return self.stage_collector.collect_repo(onerror=onerror)
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/repo/stage.py", line 497, in collect_repo
    return list(self._collect_repo(onerror))
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/repo/stage.py", line 480, in _collect_repo
    new_stages = self.load_file(file_path)
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/repo/stage.py", line 304, in load_file
    return self.load_all(path)
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/repo/stage.py", line 284, in load_all
    return [stages[key] for key in keys]
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/repo/stage.py", line 284, in <listcomp>
    return [stages[key] for key in keys]
  File "/usr/local/Cellar/python@3.10/3.10.2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/_collections_abc.py", line 878, in __iter__
    yield from self._mapping
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/stage/loader.py", line 152, in __iter__
    return iter(self.resolver.get_keys())
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/funcy/objects.py", line 28, in __get__
    res = instance.__dict__[self.fget.__name__] = self.fget(instance)
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/stage/loader.py", line 39, in resolver
    return DataResolver(self.repo, wdir, self.data)
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/parsing/__init__.py", line 149, in __init__
    format_and_raise(exc, "'vars'", self.relpath)
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/parsing/__init__.py", line 80, in format_and_raise
    _reraise_err(ResolveError, message, from_exc=exc)
  File "/Users/icarroll/Library/Caches/pypoetry/virtualenvs/tmp-HwFH0Yo_-py3.10/lib/python3.10/site-packages/dvc/parsing/__init__.py", line 88, in _reraise_err
    raise err from from_exc
dvc.parsing.ResolveError: failed to parse 'vars' in 'dvc.yaml': 'config.yaml.dev' does not exist
------------------------------------------------------------
2022-04-18 14:24:01,881 DEBUG: Analytics is enabled.
2022-04-18 14:24:01,984 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/rb/ncv61h4j0c38v61pc_365rx40000gp/T/tmpqm3_769t']'
2022-04-18 14:24:01,986 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/rb/ncv61h4j0c38v61pc_365rx40000gp/T/tmpqm3_769t']'

Expected

It is not reasonable for dvc gc --all-commits to require a parameter file referenced from a committed dvc.yaml, when all it should need is a dvc.lock. It introduces the situation above where an accidental commit of a reference to a temporary parameters file permanently breaks the ability to garbage collect with --all-commits.

Environment information

Output of dvc doctor:

DVC version: 2.10.1 (pip)
---------------------------------
Platform: Python 3.10.2 on macOS-11.6.5-x86_64-i386-64bit
Supports:
    webhdfs (fsspec = 2022.3.0),
    http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
    https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk1s1s1
Repo: dvc, git
karajan1001 commented 2 years ago

I'm just curious why would you leave config.yaml.dev untracked by Git. Because this would break your repository when you share it with others using Git. It makes this file be some kind of local parameter?

In a repo we have Git tracked files(code, repo config), DVC tracked files(model, data...etc), and files neither tracked by Git nor been tracked by DVC ( some sensitive config or local config?).

pared commented 2 years ago

I am not sure I would call it a bug. But the fact that we need to load params to clean up the cache sounds like not something necessary.

skshetry commented 2 years ago

@pared @itcarroll, DVC treats dvc.yaml as a single source of truth. So the file paths that are being tracked are read from dvc.yaml rather than from dvc.lock. Later it just reads the hashes for those file-paths in dvc.lock and builds a index, which it will use to delete files (in case of gc). In a way, dvc.lock is a database for metadata for entries in dvc.yaml.

This allows us to maintain a simpler and unified model where we can treat dvc.yaml and dvc.lock entries as being one, and avoid need for reconciliation.

itcarroll commented 2 years ago

@karajan1001 Yes, maybe ".local" would be a better suffix than ".dev"

itcarroll commented 2 years ago

Given what @skshetry said, I realize there's a more straightforward way to hit this same error. Simply forget to add your parameters file to the commit where you add a reference to it in the dvc.yaml file. Now you've broken gc --all-commits with no way to fix it by adding a new commit. I'd call that a bug, but up to y'all.

dberenbaum commented 2 years ago

Related issues/discussions:

pared commented 2 years ago

@dberenbaum I guess to some extend #6150 ?