Open vitalwarley opened 1 year ago
The error persists after updating DVC.
DVC version: 2.55.0 (pip)
-------------------------
Platform: Python 3.10.10 on Linux-6.2.9-arch1-1-x86_64-with-glibc2.37
Subprojects:
dvc_data = 0.47.1
dvc_objects = 0.21.1
dvc_render = 0.3.1
dvc_task = 0.2.0
scmrepo = 1.0.2
Supports:
azure (adlfs = 2023.1.0, knack = 0.10.1, azure-identity = 1.12.0),
gdrive (pydrive2 = 1.15.3),
gs (gcsfs = 2023.3.0),
hdfs (fsspec = 2023.3.0, pyarrow = 11.0.0),
http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
oss (ossfs = 2023.3.0),
s3 (s3fs = 2023.3.0, boto3 = 1.24.59),
ssh (sshfs = 2023.4.1),
webdav (webdav4 = 0.9.8),
webdavs (webdav4 = 0.9.8),
webhdfs (fsspec = 2023.3.0)
Cache types: reflink, hardlink, symlink
Cache directory: btrfs on /dev/nvme0n1p3
Caches: local
Remotes: gdrive, s3, ssh, local
Workspace directory: btrfs on /dev/nvme0n1p3
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/de05466397256ce7a1821f5910692a5e
It also shows with dvc queue status:
╚═╡(toledo) [12:20] λ dvc queue status
Task Name Created Status
47439bb ratty-sing 11:54 AM Queued
0e73ad7 minim-wire 11:55 AM Queued
d8bece7 heady-half 11:56 AM Queued
96b1b69 mired-cree 11:58 AM Queued
b98d730 awned-mean 11:59 AM Queued
f998fd9 scald-deys 12:00 PM Queued
f8c3a3c azoic-amie 12:01 PM Queued
e8beb49 brood-duff 12:03 PM Queued
dccc2d1 bally-jigs 12:04 PM Queued
d4df116 quasi-rift 12:05 PM Queued
55f3e44 azure-bump 12:07 PM Queued
8e3c5c8 gummy-puku 12:08 PM Queued
6ca9c73 hired-bite 12:09 PM Queued
e36b2ed genal-flex 12:11 PM Queued
1e8a5a7 beamy-ghat 12:12 PM Queued
8f7fa7f sable-line 12:13 PM Queued
aa9e86d alive-ions 12:14 PM Queued
5f5a44f unbid-smut 12:16 PM Queued
6270f03 silty-walk 12:17 PM Queued
82cbedf gammy-auks 12:18 PM Queued
b589436 grade-cree 12:20 PM Queued
ERROR: unexpected error - Extra data: line 1 column 16750 (char 16749)
Have you tried dvc exp clean?
@dberenbaum, I didn't, but after trying it the error continues:
╚═╡(toledo) [9:30] λ dvc exp clean
Cleaning up dvc-task messages...
Done!
╔╡[warley]:[vital-strix]➾[~/dev/toledo] | [on branch container/experiments]
╚═╡(toledo) [9:30] λ dvc queue status
No experiment tasks in the queue.
ERROR: unexpected error - Extra data: line 1 column 16750 (char 16749)
Is there any other info you need to help us better diagnose the problem?
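For context on the message itself: "Extra data: line 1 column N (char M)" is what Python's json module raises when a file contains trailing bytes after a complete JSON document, so one of the queue state files DVC reads (presumably somewhere under .dvc/tmp/exps) likely has a second record appended or is otherwise corrupted. A minimal sketch that reproduces the message; the link to a specific DVC file is an assumption:

# json.loads() reports leftover bytes after the first valid document as "Extra data".
$ python -c 'import json; json.loads("{\"a\": 1}{\"b\": 2}")'
...
json.decoder.JSONDecodeError: Extra data: line 1 column 9 (char 8)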
Is it important to preserve what's in the queue, or would you be okay to regenerate the queue if you can get back to a working state? You could try to remove everything in .dvc/tmp/exps if you don't mind losing the queue.
I regenerated the queue after removing .dvc/tmp/exps. The problem disappeared, but:
╚═╡(toledo) [17:13] λ dvc queue status -v
2023-04-24 17:16:58,937 DEBUG: v2.55.0 (pip), CPython 3.10.10 on Linux-6.2.9-arch1-1-x86_64-with-glibc2.37
2023-04-24 17:16:58,937 DEBUG: command: /home/warley/.virtualenvs/toledo/bin/dvc queue status --verbose
Task Name Created Status
3550d20 massy-fuss 05:10 PM Running
14af6f7 dated-heck 05:11 PM Running
bf39714 osmic-sash 05:12 PM Running
652f215 hexed-doek 05:12 PM Running
1914655 cheek-oast 05:13 PM Queued
0432da2 roman-mete 05:14 PM Queued
4785ea9 naive-pecs 05:14 PM Queued
d680c35 felon-kaka 05:15 PM Queued
be8e557 heigh-kite 05:16 PM Queued
b62df65 sural-line 04:46 PM Failed
1e7798f gouty-loss 04:47 PM Failed
574bc49 filar-vara 04:48 PM Failed
ce7fa81 weird-food 04:48 PM Failed
0fc7236 funny-keys 04:49 PM Failed
c15a77b naive-feel 04:50 PM Failed
cb8e869 dingy-vela 04:50 PM Failed
d56400d toric-stir 04:51 PM Failed
c518839 store-gray 04:52 PM Failed
9534a92 tidal-cart 04:52 PM Failed
95013ee hunky-kine 04:53 PM Failed
c86a049 rival-tyke 04:54 PM Failed
76afb28 drear-skin 04:54 PM Failed
7e2d07f busty-ruin 04:55 PM Failed
b6b8151 boned-fore 04:56 PM Failed
ccee8d0 prime-ankh 04:56 PM Failed
50c17bc owing-rial 04:57 PM Failed
d8b0ec9 volar-mome 04:57 PM Failed
89d40c4 ethic-wads 04:58 PM Failed
5202a01 blown-wart 04:59 PM Failed
fbacc31 quack-zone 04:59 PM Failed
95cdd1e union-pant 05:00 PM Failed
0192f49 wonky-pans 05:01 PM Failed
74dbf02 major-crag 05:01 PM Failed
aa6e440 tippy-coze 05:02 PM Failed
028c2ad riven-razz 05:03 PM Failed
b2f6eb8 mesne-nibs 05:03 PM Failed
2418702 gusty-kibe 05:04 PM Failed
3ae3a99 weepy-loss 05:05 PM Failed
d8c8901 heapy-cyma 05:05 PM Failed
b685c3c shyer-coze 05:06 PM Failed
29ac81e epoxy-dirk 05:07 PM Failed
a57b02f stoic-zest 05:08 PM Failed
f34fca7 runty-poss 05:08 PM Failed
272f182 scrap-amah 05:09 PM Failed
d8cf3a3 misty-airt 05:10 PM Failed
2023-04-24 17:17:27,196 DEBUG: Worker status: {'dvc-exp-fb73ab-7@localhost': [{'id': '5f790ec3-5f6c-40de-94fe-2f82610a2c4c', 'name': 'dvc.repo.experiments.queue.tasks.run_exp', 'args': [{'dvc_root': '/home/warley/dev/toledo', 'scm_root': '/home/warley/dev/toledo', 'stash_ref': 'refs/exps/celery/stash', 'stash_rev': 'bf39714922b45160df7c8effaedf748252c96e4e', 'baseline_rev': 'df9089d56c6695a44a846da80aa188c115eec047', 'branch': None, 'name': 'osmic-sash', 'head_rev': 'df9089d56c6695a44a846da80aa188c115eec047'}], 'kwargs': {'copy_paths': []}, 'type': 'dvc.repo.experiments.queue.tasks.run_exp', 'hostname': 'dvc-exp-fb73ab-7@localhost', 'time_start': 1682367330.5642037, 'acknowledged': False, 'delivery_info': {'exchange': '', 'routing_key': 'celery', 'priority': 0, 'redelivered': None}, 'worker_pid': 1170030}], 'dvc-exp-fb73ab-3@localhost': [{'id': 'abfd86d9-65ea-48cc-bf1a-37a6911cd9b3', 'name': 'dvc.repo.experiments.queue.tasks.run_exp', 'args': [{'dvc_root': '/home/warley/dev/toledo', 'scm_root': '/home/warley/dev/toledo', 'stash_ref': 'refs/exps/celery/stash', 'stash_rev': '14af6f7cdcfdcdfaf75f67f73184fc0e0c3719ad', 'baseline_rev': 'df9089d56c6695a44a846da80aa188c115eec047', 'branch': None, 'name': 'dated-heck', 'head_rev': 'df9089d56c6695a44a846da80aa188c115eec047'}], 'kwargs': {'copy_paths': []}, 'type': 'dvc.repo.experiments.queue.tasks.run_exp', 'hostname': 'dvc-exp-fb73ab-3@localhost', 'time_start': 1682367330.5171926, 'acknowledged': False, 'delivery_info': {'exchange': '', 'routing_key': 'celery', 'priority': 0, 'redelivered': None}, 'worker_pid': 1170022}], 'dvc-exp-fb73ab-1@localhost': [{'id': '9891f441-1198-41f0-accc-56e588003ecd', 'name': 'dvc.repo.experiments.queue.tasks.run_exp', 'args': [{'dvc_root': '/home/warley/dev/toledo', 'scm_root': '/home/warley/dev/toledo', 'stash_ref': 'refs/exps/celery/stash', 'stash_rev': '3550d206c8c3947fe7d6d362e57aa683af592b9c', 'baseline_rev': 'df9089d56c6695a44a846da80aa188c115eec047', 'branch': None, 'name': 'massy-fuss', 'head_rev': 'df9089d56c6695a44a846da80aa188c115eec047'}], 'kwargs': {'copy_paths': []}, 'type': 'dvc.repo.experiments.queue.tasks.run_exp', 'hostname': 'dvc-exp-fb73ab-1@localhost', 'time_start': 1682367308.2422833, 'acknowledged': False, 'delivery_info': {'exchange': '', 'routing_key': 'celery', 'priority': 0, 'redelivered': None}, 'worker_pid': 1169795}], 'dvc-exp-fb73ab-5@localhost': [{'id': 'e9bed033-87ec-408d-a4b3-018baa792729', 'name': 'dvc.repo.experiments.queue.tasks.run_exp', 'args': [{'dvc_root': '/home/warley/dev/toledo', 'scm_root': '/home/warley/dev/toledo', 'stash_ref': 'refs/exps/celery/stash', 'stash_rev': '652f2157afa0b322bb507e6851b46562b4986ae6', 'baseline_rev': 'df9089d56c6695a44a846da80aa188c115eec047', 'branch': None, 'name': 'hexed-doek', 'head_rev': 'df9089d56c6695a44a846da80aa188c115eec047'}], 'kwargs': {'copy_paths': []}, 'type': 'dvc.repo.experiments.queue.tasks.run_exp', 'hostname': 'dvc-exp-fb73ab-5@localhost', 'time_start': 1682367336.297317, 'acknowledged': False, 'delivery_info': {'exchange': '', 'routing_key': 'celery', 'priority': 0, 'redelivered': None}, 'worker_pid': 1170026}]}
Worker status: 4 active, 0 idle
After some time I only get failed experiments. I also can't get the logs to inspect what happened:
╔╡[warley]:[vital-strix]➾[~/dev/toledo] | [on branch exps/exec/EXEC_HEAD]
╚═╡(toledo) [17:22] λ dvc queue logs gusty-kibe -v
2023-04-24 17:22:52,036 DEBUG: v2.55.0 (pip), CPython 3.10.10 on Linux-6.2.9-arch1-1-x86_64-with-glibc2.37
2023-04-24 17:22:52,037 DEBUG: command: /home/warley/.virtualenvs/toledo/bin/dvc queue logs gusty-kibe -v
2023-04-24 17:23:11,829 ERROR: No output logs found for experiment 'gusty-kibe'
Traceback (most recent call last):
File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/funcy/flow.py", line 84, in reraise
yield
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc_task/proc/manager.py", line 50, in __getitem__
return ProcessInfo.load(info_path)
File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc_task/proc/process.py", line 40, in load
with open(filename, encoding="utf-8") as fobj:
FileNotFoundError: [Errno 2] No such file or directory: '/home/warley/dev/toledo/.dvc/tmp/exps/run/24187020c9086b2e8932a6f56f58296771d688c7/24187020c9086b2e8932a6f56f58296771d688c7.json'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc/repo/experiments/queue/celery.py", line 456, in logs
proc_info = self.proc[queue_entry.stash_rev]
File "/usr/lib/python3.10/contextlib.py", line 78, in inner
with self._recreate_cm():
File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/funcy/flow.py", line 88, in reraise
raise into from e
KeyError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc/cli/__init__.py", line 210, in main
ret = cmd.do_run()
File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc/cli/command.py", line 26, in do_run
return self.run()
File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc/commands/queue/logs.py", line 14, in run
self.repo.experiments.celery_queue.logs(
File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc/repo/experiments/queue/celery.py", line 458, in logs
raise DvcException( # noqa: B904
dvc.exceptions.DvcException: No output logs found for experiment 'gusty-kibe'
I managed to get the logs from one experiment:
╚═╡(toledo) [17:34] λ dvc queue logs 00c825
ERROR: failed to reproduce 'benchmark-ocr@validation': [Errno 2] No such file or directory: '/home/warley/dev/toledo/.dvc/tmp/exps/tmpgax00k6m/src/python/cvt/tools'
ERROR: failed to reproduce 'benchmark-ocr@validation': [Errno 2] No such file or directory: '/home/warley/dev/toledo/.dvc/tmp/exps/tmpgax00k6m/src/python/cvt/tools'
src/python/cvt/tools lives inside a nested submodule; that is, src/python is one submodule, and cvt is another submodule inside it. Could this be the problem?
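A hedged way to confirm that nesting (the paths and branch names below are assumptions based on the description above, not actual output from this repo):

# Lists submodules recursively; a nested submodule shows up under its parent's path.
$ git submodule status --recursive
 <commit-sha> src/python (heads/main)
 <commit-sha> src/python/cvt (heads/main)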
In the dvc.yaml this path is specified as a dep:
benchmark-ocr:
  foreach:
    - test
    - validation
  do:
    cmd: >-
      python src/python/cvt/tools/main.py benchmark
      ...
    deps:
      - src/python/cvt/tools
I removed the dep and tried again:
╚═╡(toledo) [18:14] λ dvc queue logs -f sixty-cyst
Following logs for experiment 'sixty-cyst'. Use Ctrl+C to stop following logs (experiment execution will continue).
ERROR: failed to reproduce 'benchmark-ocr@validation': [Errno 2] No such file or directory: '/home/warley/dev/toledo/.dvc/tmp/exps/tmp1mwvc_7b/datasets/container/cd/raw/validation/images'
ERROR: failed to reproduce 'benchmark-ocr@validation': [Errno 2] No such file or directory: '/home/warley/dev/toledo/.dvc/tmp/exps/tmp1mwvc_7b/datasets/container/cd/raw/validation/images'
This path is also listed as a dependency; with item = validation, the first dep below appears to resolve to the missing datasets/container/cd/raw/validation/images path:
deps:
  - ${base.dataset-dir}/${base.scope}/${benchmark.cd-dir}/${item}/images
  - ${base.dataset-dir}/${base.scope}/${benchmark.cr-dir}/${item}/labels
  - ${base.models-dir}/${base.scope}/${benchmark.weights-cd}
  - ${base.models-dir}/${base.scope}/${benchmark.weights-cr}
@vitalwarley submodule dependencies are not supported in queue runs, see #7186
@pmrowla Would there be any unexpected side effect if --copy-paths is used as a workaround for submodules?
@daavoo I think that should be OK as a workaround for now.
Ideally we would handle submodules correctly when we generate the temp git workspace, but #7186 hasn't been addressed yet because it needs some more investigation/research. Basically I'm not currently sure what the correct way to handle unstaged or uncommitted changes in the submodule is, but technically I think we should be trying to include them in the experiment (and not just doing git submodule pull in the temp git workspace).
But copying the entire submodule with --copy-paths should at least give you the expected result in typical cases, with the caveat that we are only copying the state of the submodule at the time the experiment is actually run (which is not necessarily the same as the state when the user did exp run --queue).
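For reference, a sketch of the workaround being discussed, combining the flag with the command from this report; the submodule path to copy is an assumption about this repo's layout:

# Copy the submodule tree into the temporary workspace when queueing the experiment.
$ dvc exp run --queue --copy-paths src/python/cvt \
    -s benchmark-ocr@validation \
    -S 'benchmark.conf-thresh-cd=range(0.1, 0.9, 0.1)'
$ dvc queue start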
@vitalwarley submodule dependencies are not supported in queue runs, see #7186
Ah, I see. Thanks for the info.
@pmrowla Would there be any unexpected side effect if --copy-paths is used as a workaround for submodules?
I reverted the submodule dependency removal and tried this parameter. It seems to solve the first issue (where the submodule was listed as a dep), but the other one (where I had removed the submodule dep) happened again.
Bug Report
Description
I set up multiple experiments with Hydra range sweep, but I can't start them.
Reproduce
dvc exp run -s benchmark-ocr@validation -S 'benchmark.conf-thresh-cd=range(0.1, 0.9, 0.1)' -S 'benchmark.conf-thresh-cr=range(0.1, 0.9, 0.1)' --queue
dvc queue start
Expected
Running experiments.
Environment information
Output of dvc doctor:
Additional Information (if any):
exp run output
queue start verbose output