iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

`exp run`: doesn't work with submodule as dependency #7186

Open woodshop opened 2 years ago

woodshop commented 2 years ago

Bug Report

Description

dvc exp run fails, whereas dvc repro succeeds, when a cmd is executed from inside a submodule and that submodule is included as a dependency.

Reproduce

Set Up:

mkdir test
cd test

mkdir submodule
cd submodule
git init
echo "cp ../data/a.txt ../models/b.txt" > run.sh
git add run.sh
git commit -m "initial commit"

cd ..
mkdir main
cd main
git init
dvc init
dvc config cache.type hardlink,symlink,copy
git submodule add ../submodule src
mkdir data
echo "test" > data/a.txt
mkdir models
dvc stage add -n cp -w src -d ../src -d ../data/a.txt -o ../models/b.txt bash run.sh
dvc repro
git add .
git commit -m "initial commit"

This works: dvc repro -f

This fails: dvc exp run -f

This also fails: dvc exp run -f --temp

Expected

dvc repro and dvc exp run should both run and succeed.
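
The expectation above could be checked mechanically with something like the following. This is only a sketch: `exit_codes` and `behave_alike` are hypothetical helper names, not part of DVC, and the check compares exit codes only, not the outputs the commands produce.

```python
import subprocess


def exit_codes(commands, cwd=None):
    """Run each command in `cwd` and return {command string: exit code}."""
    results = {}
    for cmd in commands:
        # Output is captured so that only the exit status is compared.
        proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
        results[" ".join(cmd)] = proc.returncode
    return results


def behave_alike(results):
    """True when every command finished with the same exit code."""
    return len(set(results.values())) <= 1
```

Run from the main directory created in the set-up steps, `exit_codes([["dvc", "repro", "-f"], ["dvc", "exp", "run", "-f"]])` would be expected to give two different exit codes while this bug is present, so `behave_alike` would return False.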

Environment information

Output of dvc doctor:

$ dvc doctor

DVC version: 2.9.2 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.11.0-1021-aws-x86_64-with-glibc2.29
Supports:
    hdfs (fsspec = 2021.10.1, pyarrow = 5.0.0),
    webhdfs (fsspec = 2021.10.1),
    http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
    https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
    s3 (s3fs = 2021.10.1, boto3 = 1.17.106)
Cache types: hardlink, symlink
Cache directory: lustre on 172.16.38.30@tcp:/skl3jbmv
Caches: local
Remotes: s3
Workspace directory: lustre on 172.16.38.30@tcp:/skl3jbmv
Repo: dvc, git

Additional Information (if any): The verbose output contains the following error:

ERROR: unexpected error - invalid data in index - invalid entry

woodshop commented 2 years ago

Wondering if this is related to an earlier resolved issue that I posted about. https://github.com/iterative/dvc/issues/5740

daavoo commented 2 years ago

@pmrowla ping

pmrowla commented 2 years ago

This is definitely a bug for the --temp/--queue use case: we don't handle submodules at all there. We probably need to do a submodule pull/update in the temp workspaces that we create.
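
As a rough illustration of what "submodule pull/update in the temp workspaces" could amount to, the sketch below builds and runs a plain `git submodule update --init --recursive` against a workspace directory. The function names are hypothetical and this is not how DVC creates or populates its temp workspaces; it only shows the Git step being discussed.

```python
import subprocess


def submodule_update_command(workspace):
    """Build the git invocation that populates submodules in `workspace`."""
    return [
        "git",
        "-C",
        str(workspace),  # run git as if started inside the workspace
        "submodule",
        "update",
        "--init",
        "--recursive",
    ]


def init_submodules(workspace, run=subprocess.run):
    """Run the update in `workspace`; `run` can be swapped out for a dry run."""
    return run(submodule_update_command(workspace), check=True)
```

Passing a custom `run` callable makes it possible to inspect the command without touching a real repository.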

I'm not sure why exp run does not work for workspace runs though, as we should not have to do anything special there, since the submodule is already set up properly.

@woodshop it would help if you could provide the entire error traceback from the -v command output. The error message alone doesn't give us enough information to debug the issue.

woodshop commented 2 years ago

Apologies for the slow reply!

(tensorflow-stable) asarroff@neu:/fsx/test/main$ dvc exp run -f -v
2022-01-24 16:52:02,910 DEBUG: Adding '/fsx/test/main/.dvc/config.local' to gitignore file.
2022-01-24 16:52:02,921 DEBUG: Adding '/fsx/test/main/.dvc/tmp' to gitignore file.
2022-01-24 16:52:02,922 DEBUG: Adding '/fsx/test/main/.dvc/cache' to gitignore file.
2022-01-24 16:52:03,523 DEBUG: Stashed experiment '3e6d5d6' with baseline '1fab656' for future execution.
2022-01-24 16:52:03,574 DEBUG: Reproducing experiment revs '3e6d5d6'
2022-01-24 16:52:03,711 DEBUG: Init workspace executor in '/fsx/test/main'
2022-01-24 16:52:03,841 DEBUG: Adding '/fsx/test/main/.dvc/config.local' to gitignore file.
2022-01-24 16:52:03,849 DEBUG: Adding '/fsx/test/main/.dvc/tmp' to gitignore file.
2022-01-24 16:52:03,849 DEBUG: Adding '/fsx/test/main/.dvc/cache' to gitignore file.
2022-01-24 16:52:03,853 DEBUG: Running repro in '/fsx/test/main'
2022-01-24 16:52:03,853 DEBUG: Removing '/fsx/test/main/.dvc/tmp/repro.dat'
2022-01-24 16:52:05,455 DEBUG: Removing output 'models/b.txt' of stage: 'cp'.
2022-01-24 16:52:05,455 DEBUG: Removing '/fsx/test/main/models/b.txt'
Running stage 'cp':
> bash run.sh
2022-01-24 16:52:05,551 DEBUG: staged tree 'object md5: 7836100ad7371e5f9125fbeb2b24a8e5.dir'
2022-01-24 16:52:05,552 DEBUG: state save (144115339624539791, 16d176444d6ed86e1a7e908b91b81625, 33) 7836100ad7371e5f9125fbeb2b24a8e5.dir
2022-01-24 16:52:05,557 DEBUG: Adding '/fsx/test/main/models/b.txt' to gitignore file.
2022-01-24 16:52:05,567 DEBUG: state save (144115339624540011, 1643061125000000000, 5) d8e8fca2dc0f896fd7cb4cb0031ba249
2022-01-24 16:52:05,583 DEBUG: state save (144115339624540011, 1643061125000000000, 5) d8e8fca2dc0f896fd7cb4cb0031ba249
2022-01-24 16:52:05,585 DEBUG: Computed stage: 'cp' md5: '44286a707e35ea3bf08062b5fe4b7152'
2022-01-24 16:52:05,595 DEBUG: staged tree 'object md5: 7836100ad7371e5f9125fbeb2b24a8e5.dir'
2022-01-24 16:52:05,596 DEBUG: state save (144115339624539791, 16d176444d6ed86e1a7e908b91b81625, 33) 7836100ad7371e5f9125fbeb2b24a8e5.dir
2022-01-24 16:52:05,607 DEBUG: staged tree 'object md5: 7836100ad7371e5f9125fbeb2b24a8e5.dir'
2022-01-24 16:52:05,607 DEBUG: state save (144115339624539791, 16d176444d6ed86e1a7e908b91b81625, 33) 7836100ad7371e5f9125fbeb2b24a8e5.dir
2022-01-24 16:52:05,622 DEBUG: Preparing to transfer data from '/fsx/test/main/.dvc/cache' to '/fsx/test/main/.dvc/cache'
2022-01-24 16:52:05,627 DEBUG: [Errno 95] no more link types left to try out: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>: [Errno 95] Operation not supported
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/fs/utils.py", line 28, in _link
    func(from_path, to_path)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/fs/local.py", line 148, in reflink
    System.reflink(from_info, to_info)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/system.py", line 112, in reflink
    System._reflink_linux(source, link_name)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/system.py", line 96, in _reflink_linux
    fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
OSError: [Errno 95] Operation not supported

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/fs/utils.py", line 69, in _try_links
    return _link(link, from_fs, from_path, to_fs, to_path)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/fs/utils.py", line 32, in _link
    raise OSError(
OSError: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/fs/utils.py", line 124, in _test_link
    _try_links([link], from_fs, from_file, to_fs, to_file)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/fs/utils.py", line 77, in _try_links
    raise OSError(
OSError: [Errno 95] no more link types left to try out
------------------------------------------------------------
2022-01-24 16:52:05,635 DEBUG: Removing '/fsx/test/main/models/.V7EcLywV6a78oWQAv9fYxa.tmp'
2022-01-24 16:52:05,636 DEBUG: Uploading '/fsx/test/main/.dvc/cache/.FeWKPhhRkQ8TQt3zZ9NyDG.tmp' to '/fsx/test/main/models/.V7EcLywV6a78oWQAv9fYxa.tmp'
2022-01-24 16:52:05,643 DEBUG: Removing '/fsx/test/main/models/.V7EcLywV6a78oWQAv9fYxa.tmp'
2022-01-24 16:52:05,644 DEBUG: Removing '/fsx/test/main/.dvc/cache/.FeWKPhhRkQ8TQt3zZ9NyDG.tmp'
2022-01-24 16:52:05,645 DEBUG: Removing '/fsx/test/main/models/b.txt'
2022-01-24 16:52:05,647 DEBUG: Uploading '/fsx/test/main/.dvc/cache/d8/e8fca2dc0f896fd7cb4cb0031ba249' to '/fsx/test/main/models/b.txt'
2022-01-24 16:52:05,654 DEBUG: state save (144115339624540020, 1643061125000000000, 5) d8e8fca2dc0f896fd7cb4cb0031ba249
2022-01-24 16:52:05,673 DEBUG: state save (144115339624540020, 1643061125000000000, 5) d8e8fca2dc0f896fd7cb4cb0031ba249
2022-01-24 16:52:05,679 DEBUG: stage: 'cp' was reproduced
2022-01-24 16:52:05,706 DEBUG: Staging files: {'dvc.yaml', 'src', 'data/a.txt'}

To track the changes with git, run:

    git add dvc.yaml src data/a.txt

To enable auto staging, run:

    dvc config core.autostage true
2022-01-24 16:52:06,189 DEBUG: Commit to new experiment branch 'refs/exps/1f/ab656a477d19c52d1d99ce1e151191afb74cd9/exp-1c9fb'
2022-01-24 16:52:06,528 DEBUG: Collected experiment '1e1c03b'.
2022-01-24 16:52:06,563 ERROR: unexpected error - invalid data in index - invalid entry
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/main.py", line 55, in main
    ret = cmd.do_run()
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/command/base.py", line 45, in do_run
    return self.run()
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/command/experiments/run.py", line 32, in run
    results = self.repo.experiments.run(
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 812, in run
    return run(self.repo, *args, **kwargs)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/run.py", line 32, in run
    return repo.experiments.reproduce_one(
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 433, in reproduce_one
    results = self._reproduce_revs(
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 51, in wrapper
    return f(exp, *args, **kwargs)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 636, in _reproduce_revs
    exec_results.update(self._executors_repro(manager, **kwargs))
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 62, in wrapper
    ret = f(exp, *args, **kwargs)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 667, in _executors_repro
    return manager.exec_queue(**kwargs)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/executor/manager.py", line 350, in exec_queue
    self.cleanup_executor(exec_name, executor)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/executor/manager.py", line 257, in cleanup_executor
    executor.cleanup()
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/executor/local.py", line 165, in cleanup
    self.scm.set_ref(EXEC_APPLY, checkpoint)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/contextlib.py", line 525, in __exit__
    raise exc_details[1]
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/contextlib.py", line 510, in __exit__
    if cb(*exc_details):
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/scmrepo/git/__init__.py", line 380, in detach_head
    self.reset()
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/scmrepo/git/__init__.py", line 253, in _backend_func
    return func(*args, **kwargs)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/scmrepo/git/backend/pygit2.py", line 484, in reset
    self.repo.index.read(False)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/pygit2/repository.py", line 646, in index
    check_error(err, io=True)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/pygit2/errors.py", line 65, in check_error
    raise GitError(message)
_pygit2.GitError: invalid data in index - invalid entry
------------------------------------------------------------
2022-01-24 16:52:08,665 DEBUG: Adding '/fsx/test/main/.dvc/config.local' to gitignore file.
2022-01-24 16:52:08,673 DEBUG: Adding '/fsx/test/main/.dvc/tmp' to gitignore file.
2022-01-24 16:52:08,673 DEBUG: Adding '/fsx/test/main/.dvc/cache' to gitignore file.
2022-01-24 16:52:08,678 DEBUG: [Errno 95] no more link types left to try out: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>: [Errno 95] Operation not supported
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/main.py", line 55, in main
    ret = cmd.do_run()
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/command/base.py", line 45, in do_run
    return self.run()
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/command/experiments/run.py", line 32, in run
    results = self.repo.experiments.run(
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 812, in run
    return run(self.repo, *args, **kwargs)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/run.py", line 32, in run
    return repo.experiments.reproduce_one(
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 433, in reproduce_one
    results = self._reproduce_revs(
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 51, in wrapper
    return f(exp, *args, **kwargs)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 636, in _reproduce_revs
    exec_results.update(self._executors_repro(manager, **kwargs))
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 62, in wrapper
    ret = f(exp, *args, **kwargs)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 667, in _executors_repro
    return manager.exec_queue(**kwargs)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/executor/manager.py", line 350, in exec_queue
    self.cleanup_executor(exec_name, executor)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/executor/manager.py", line 257, in cleanup_executor
    executor.cleanup()
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/repo/experiments/executor/local.py", line 165, in cleanup
    self.scm.set_ref(EXEC_APPLY, checkpoint)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/contextlib.py", line 525, in __exit__
    raise exc_details[1]
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/contextlib.py", line 510, in __exit__
    if cb(*exc_details):
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/scmrepo/git/__init__.py", line 380, in detach_head
    self.reset()
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/scmrepo/git/__init__.py", line 253, in _backend_func
    return func(*args, **kwargs)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/scmrepo/git/backend/pygit2.py", line 484, in reset
    self.repo.index.read(False)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/pygit2/repository.py", line 646, in index
    check_error(err, io=True)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/pygit2/errors.py", line 65, in check_error
    raise GitError(message)
_pygit2.GitError: invalid data in index - invalid entry

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/fs/utils.py", line 28, in _link
    func(from_path, to_path)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/fs/local.py", line 148, in reflink
    System.reflink(from_info, to_info)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/system.py", line 112, in reflink
    System._reflink_linux(source, link_name)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/system.py", line 96, in _reflink_linux
    fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
OSError: [Errno 95] Operation not supported

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/fs/utils.py", line 69, in _try_links
    return _link(link, from_fs, from_path, to_fs, to_path)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/fs/utils.py", line 32, in _link
    raise OSError(
OSError: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/fs/utils.py", line 124, in _test_link
    _try_links([link], from_fs, from_file, to_fs, to_file)
  File "/home/asarroff/miniconda3/envs/tensorflow-stable/lib/python3.8/site-packages/dvc/fs/utils.py", line 77, in _try_links
    raise OSError(
OSError: [Errno 95] no more link types left to try out
------------------------------------------------------------
2022-01-24 16:52:08,678 DEBUG: Removing '/fsx/test/.jw3rPE3S2u4AbGdwDSuHS7.tmp'
2022-01-24 16:52:08,680 DEBUG: Removing '/fsx/test/.jw3rPE3S2u4AbGdwDSuHS7.tmp'
2022-01-24 16:52:08,680 DEBUG: Removing '/fsx/test/.jw3rPE3S2u4AbGdwDSuHS7.tmp'
2022-01-24 16:52:08,681 DEBUG: Removing '/fsx/test/main/.dvc/cache/.4BUAToT43jGip9MS4pSMZ3.tmp'
2022-01-24 16:52:08,729 DEBUG: Version info for developers:
DVC version: 2.9.3 (pip)
---------------------------------
Platform: Python 3.8.8 on Linux-4.15.0-1065-aws-x86_64-with-glibc2.10
Supports:
    webhdfs (fsspec = 2021.10.1),
    http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
    https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
    s3 (s3fs = 2021.8.1, boto3 = 1.17.106)
Cache types: hardlink, symlink
Cache directory: lustre on 172.16.38.30@tcp:/skl3jbmv
Caches: local
Remotes: None
Workspace directory: lustre on 172.16.38.30@tcp:/skl3jbmv
Repo: dvc, git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-01-24 16:52:08,733 DEBUG: Analytics is disabled.

woodshop commented 2 years ago

@pmrowla this issue is no longer "awaiting response" but I cannot change the label.

woodshop commented 2 years ago

Just wondering @daavoo if this issue is on any roadmap for resolution.

pmrowla commented 2 years ago

@woodshop unfortunately we have not been able to get to this issue yet, and it's not currently planned. There are a few submodule-related dvc exp issues open, but since this is not a very common setup (at least based on the user reports we've received so far), we have had to prioritize other work over addressing the submodule problems.

rick-van-veen commented 2 years ago

I would actually like to see this working as well for the --temp, --queue case.

jnareb commented 1 year ago

I have had the same issue, but it happens even if dvc exp run is run from the top directory of the project and the submodule changes are committed (git status reports a clean working tree).

What is important is that dvc exp run not only fails with a cryptic error message, but also makes a mess of the repository:

$ git status
On branch jn/dvc.

nothing to commit, working tree clean
$ dvc exp run
[...]
ERROR: unexpected error - invalid data in index - invalid entry

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
$ git status
On branch jn/dvc

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        new file:   .git/COMMIT_EDITMSG
        new file:   .git/FETCH_HEAD
        new file:   .git/HEAD
[... lots and lots of files, including all submodule files ...]

Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   .git/HEAD
        modified:   .git/index
        modified:   .git/logs/HEAD
        deleted:    .git/refs/exps/exec/EXEC_BASELINE
        deleted:    .git/refs/exps/exec/EXEC_MERGE

To get rid of this I need to run git reset --hard HEAD in the repository (where I get lots of "error: invalid path" warnings) and in the submodule.

DVC should at least detect that it cannot run dvc exp run, instead of messing up the state of the Git repository.
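
One way such a pre-flight check might look is to scan the index for gitlink entries, which Git records with mode 160000 for submodules. This is a sketch only: `submodule_paths` and `repo_submodules` are hypothetical names, DVC has no such check as far as this thread shows, and paths with unusual characters (which `git ls-files` quotes) are not handled.

```python
import subprocess

GITLINK_MODE = "160000"  # mode git records for a submodule entry in the index


def submodule_paths(ls_files_output):
    """Return paths of gitlink entries found in `git ls-files --stage` text."""
    paths = []
    for line in ls_files_output.splitlines():
        # Each line looks like "<mode> <object> <stage>\t<path>".
        meta, sep, path = line.partition("\t")
        if sep and meta.split()[:1] == [GITLINK_MODE]:
            paths.append(path)
    return paths


def repo_submodules(repo_dir="."):
    """List submodule paths recorded in the index of the repo at `repo_dir`."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "ls-files", "--stage"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return submodule_paths(out)
```

A caller could refuse to start an experiment, with a clear message, whenever `repo_submodules()` returns a non-empty list, rather than failing later with the index error.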

Output of dvc doctor:

$ dvc doctor
DVC version: 3.25.0 (pip)
-------------------------
Platform: Python 3.11.4 on Linux-6.4.0-2-amd64-x86_64-with-glibc2.37
Subprojects:
        dvc_data = 2.18.1
        dvc_objects = 1.0.1
        dvc_render = 0.6.0
        dvc_task = 0.3.0
        scmrepo = 1.3.1
Supports:
        http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.9.2, boto3 = 1.28.17)

dflatow commented 9 months ago

Hey team -- I'm running into this same error message:

ERROR: unexpected error - invalid data in index - invalid entry

Also similarly, dvc repro -R seems to work but dvc exp run -R src/pipelines/ does not.

$ dvc doctor
DVC version: 3.33.4 (pip)
-------------------------
Platform: Python 3.10.13 on macOS-12.6-arm64-arm-64bit
Subprojects:
        dvc_data = 2.24.0
        dvc_objects = 2.0.1
        dvc_render = 1.0.0
        dvc_task = 0.3.0
        scmrepo = 1.6.0
Supports:
        http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.12.2, boto3 = 1.33.13)
Config:
        Global: /Users/dflatow/Library/Application Support/dvc
        System: /Library/Application Support/dvc
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Repo.site_cache_dir: /Library/Caches/dvc/repo/ceb4be7a51a3752fe6157394a646b490

dflatow commented 9 months ago

Not sure if this is a different or a related issue.

dflatow commented 9 months ago

Also unsure if this is related but some strange git things seem to be going on - DVC has added a few files (metric files) outside of my repo:

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        new file:   ../../../../var/folders/2h/2_3k3c317ts84_86kzzcpqjh0000gn/T/tmprlj4wq1v/noise/evaluate/metrics/metrics.json
        new file:   ../../../../var/folders/2h/2_3k3c317ts84_86kzzcpqjh0000gn/T/tmptg4lvlwb/noise/evaluate/metrics/metrics.json
        new file:   ../../../../var/folders/2h/2_3k3c317ts84_86kzzcpqjh0000gn/T/tmpvjg0amiu/noise/evaluate/metrics/metrics.json
        new file:   ../../../../var/folders/2h/2_3k3c317ts84_86kzzcpqjh0000gn/T/tmpvjyepa6l/noise/evaluate/metrics/metrics.json

Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        deleted:    ../../../../var/folders/2h/2_3k3c317ts84_86kzzcpqjh0000gn/T/tmprlj4wq1v/noise/evaluate/metrics/metrics.json
        deleted:    ../../../../var/folders/2h/2_3k3c317ts84_86kzzcpqjh0000gn/T/tmptg4lvlwb/noise/evaluate/metrics/metrics.json
        deleted:    ../../../../var/folders/2h/2_3k3c317ts84_86kzzcpqjh0000gn/T/tmpvjg0amiu/noise/evaluate/metrics/metrics.json
        deleted:    ../../../../var/folders/2h/2_3k3c317ts84_86kzzcpqjh0000gn/T/tmpvjyepa6l/noise/evaluate/metrics/metrics.json

dflatow commented 9 months ago

Hmm, it seems the errant metrics files were created in a tmp directory I was using for some testing but were not properly cleaned up.

dflatow commented 9 months ago

OK, so I think the problem here was some sort of Git issue caused by running DVC Live in a temporary folder via the Python library with save_exp=True.
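
The staged paths listed earlier all begin with ../../../../var/folders, meaning they lie outside the repository. A small guard along the following lines could flag that situation before a logging directory is used. It is a sketch under assumptions: `is_inside` is a hypothetical helper and not part of DVC or DVC Live, and it only compares resolved paths.

```python
import os


def is_inside(path, root):
    """True when `path` resolves to `root` itself or a location beneath it."""
    # realpath collapses ".." segments and follows symlinks where they exist.
    path = os.path.realpath(path)
    root = os.path.realpath(root)
    return os.path.commonpath([path, root]) == root
```

For example, `is_inside(metrics_dir, repo_root)` would be False for a metrics directory created under the system temp folder, which is the case described above.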