iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

`dvc queue start`: doesn't start exp, but raises JSONDecodeError #9358

Open vitalwarley opened 1 year ago

vitalwarley commented 1 year ago

Bug Report

Description

I set up multiple experiments with a Hydra range sweep, but I can't start them.

Reproduce

  1. Run a specific stage: `dvc exp run -s benchmark-ocr@validation -S 'benchmark.conf-thresh-cd=range(0.1, 0.9, 0.1)' -S 'benchmark.conf-thresh-cr=range(0.1, 0.9, 0.1)' --queue`
  2. Start queued experiments: `dvc queue start`
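
The two `range(0.1, 0.9, 0.1)` overrides define a grid, and the `exp run` output later in this report shows one experiment being queued per combination. A rough sketch of that expansion, assuming the stop value is excluded as in Python's built-in `range` (the helper `sweep` is illustrative and not part of Hydra or DVC):

```python
from itertools import product


def sweep(start_tenths, stop_tenths, step_tenths=1):
    """Expand a range given in tenths; integer steps avoid float drift."""
    return [i / 10 for i in range(start_tenths, stop_tenths, step_tenths)]


cd_values = sweep(1, 9)  # stands in for range(0.1, 0.9, 0.1)
cr_values = sweep(1, 9)

# One list of overrides per queued experiment.
grid = [
    [f"benchmark.conf-thresh-cd={cd}", f"benchmark.conf-thresh-cr={cr}"]
    for cd, cr in product(cd_values, cr_values)
]
print(len(grid))
print(grid[0])
```

Under that assumption each parameter takes 8 values, so the queue would receive 8 x 8 combinations.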

Expected

The queued experiments start running.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.53.0 (pip)
-------------------------
Platform: Python 3.10.10 on Linux-6.2.9-arch1-1-x86_64-with-glibc2.37
Subprojects:
    dvc_data = 0.47.1
    dvc_objects = 0.21.1
    dvc_render = 0.3.1
    dvc_task = 0.2.0
    scmrepo = 0.2.1
Supports:
    azure (adlfs = 2023.1.0, knack = 0.10.1, azure-identity = 1.12.0),
    gdrive (pydrive2 = 1.15.3),
    gs (gcsfs = 2023.3.0),
    hdfs (fsspec = 2023.3.0, pyarrow = 11.0.0),
    http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
    oss (ossfs = 2023.3.0),
    s3 (s3fs = 2023.3.0, boto3 = 1.24.59),
    ssh (sshfs = 2023.4.1),
    webdav (webdav4 = 0.9.8),
    webdavs (webdav4 = 0.9.8),
    webhdfs (fsspec = 2023.3.0)
Cache types: reflink, hardlink, symlink
Cache directory: btrfs on /dev/nvme0n1p3
Caches: local
Remotes: gdrive, s3, ssh, local
Workspace directory: btrfs on /dev/nvme0n1p3
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/de05466397256ce7a1821f5910692a5e

Additional Information (if any):

exp run output

/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'config': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
Queueing with overrides '{'params.yaml': ['benchmark.conf-thresh-cd=0.1', 'benchmark.conf-thresh-cr=0.1']}'.
Queued experiment 'ratty-sing' for future execution.
...

queue start verbose output

2023-04-22 12:17:39,706 DEBUG: v2.53.0 (pip), CPython 3.10.10 on Linux-6.2.9-arch1-1-x86_64-with-glibc2.37
2023-04-22 12:17:39,707 DEBUG: command: /home/warley/.virtualenvs/toledo/bin/dvc queue start -vvv
2023-04-22 12:17:39,707 TRACE: Namespace(cprofile=False, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, cprofile_dump=None, pdb=False, instrument=False, instrument_open=False, show_stack=False, quiet=0, verbose=3, cd='.', cmd='start', jobs=1, func=<class 'dvc.commands.queue.start.CmdQueueStart'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2023-04-22 12:17:39,953 DEBUG: Spawning 1 exp queue workers
2023-04-22 12:17:39,966 ERROR: unexpected error - Extra data: line 1 column 16750 (char 16749)
Traceback (most recent call last):
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc/cli/__init__.py", line 210, in main
    ret = cmd.do_run()
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc/cli/command.py", line 26, in do_run
    return self.run()
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc/commands/queue/start.py", line 15, in run
    started = self.repo.experiments.celery_queue.start_workers(self.args.jobs)
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc/repo/experiments/queue/celery.py", line 163, in start_workers
    active_worker: Dict = self.worker_status()
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc/repo/experiments/queue/celery.py", line 461, in worker_status
    status = self.celery.control.inspect().active() or {}
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/celery/app/control.py", line 149, in active
    return self._request('active', safe=safe)
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/celery/app/control.py", line 106, in _request
    return self._prepare(self.app.control.broadcast(
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/celery/app/control.py", line 741, in broadcast
    return self.mailbox(conn)._broadcast(
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/kombu/pidbox.py", line 335, in _broadcast
    self._publish(command, arguments, destination=destination,
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/kombu/pidbox.py", line 297, in _publish
    maybe_declare(self.reply_queue(chan))
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/kombu/common.py", line 110, in maybe_declare
    return _maybe_declare(entity, channel)
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/kombu/common.py", line 150, in _maybe_declare
    entity.declare(channel=channel)
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/kombu/entity.py", line 606, in declare
    self._create_queue(nowait=nowait, channel=channel)
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/kombu/entity.py", line 617, in _create_queue
    self.queue_bind(nowait=nowait, channel=channel)
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/kombu/entity.py", line 660, in queue_bind
    return self.bind_to(self.exchange, self.routing_key,
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/kombu/entity.py", line 669, in bind_to
    return (channel or self.channel).queue_bind(
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/kombu/transport/virtual/base.py", line 562, in queue_bind
    self._queue_bind(exchange, *meta)
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc_task/contrib/kombu_filesystem.py", line 93, in _queue_bind
    exchange_table = loads(bytes_to_str(f_obj.read()))
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/kombu/utils/json.py", line 88, in loads
    return _loads(s)
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 16750 (char 16749)

2023-04-22 12:17:39,993 DEBUG: Removing '/home/warley/dev/.2DX8VBApUJ35J7fkWcpLQy.tmp'
2023-04-22 12:17:39,993 DEBUG: Removing '/home/warley/dev/.2DX8VBApUJ35J7fkWcpLQy.tmp'
2023-04-22 12:17:39,993 DEBUG: Removing '/home/warley/dev/.2DX8VBApUJ35J7fkWcpLQy.tmp'
2023-04-22 12:17:39,993 DEBUG: Removing '/home/warley/dev/toledo/.dvc/cache/.jVfzTnMXgLVZMybVUxWeiS.tmp'
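
For context on the message itself: `Extra data` is what Python's `json` module raises when a complete JSON document is followed by more content. The traceback shows the failing read happens in `_queue_bind` of `dvc_task/contrib/kombu_filesystem.py`, while loading an exchange table. The sketch below only reproduces the error class; the payloads are made up, and how the real file ended up in this state is not established in this report.

```python
import json

# Hypothetical payloads, not the real exchange table contents.
first = json.dumps({"celery": [["celery", "", "celery"]]})
second = json.dumps({"reply.celery.pidbox": []})

# Two complete documents back to back in one string.
corrupted = first + second

try:
    json.loads(corrupted)
except json.JSONDecodeError as exc:
    message = exc.msg   # the "Extra data" part of the error
    position = exc.pos  # offset where the first document ends
    print(message, position)

# raw_decode parses only the leading document and reports where it stopped.
recovered, end = json.JSONDecoder().raw_decode(corrupted)
print(recovered == json.loads(first), end == position)
```

The reported offset (char 16749 in the traceback above) marks where the first valid document ends, which can help when inspecting the affected file by hand.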

vitalwarley commented 1 year ago

The error persists after updating DVC.

DVC version: 2.55.0 (pip)
-------------------------
Platform: Python 3.10.10 on Linux-6.2.9-arch1-1-x86_64-with-glibc2.37
Subprojects:
    dvc_data = 0.47.1
    dvc_objects = 0.21.1
    dvc_render = 0.3.1
    dvc_task = 0.2.0
    scmrepo = 1.0.2
Supports:
    azure (adlfs = 2023.1.0, knack = 0.10.1, azure-identity = 1.12.0),
    gdrive (pydrive2 = 1.15.3),
    gs (gcsfs = 2023.3.0),
    hdfs (fsspec = 2023.3.0, pyarrow = 11.0.0),
    http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
    oss (ossfs = 2023.3.0),
    s3 (s3fs = 2023.3.0, boto3 = 1.24.59),
    ssh (sshfs = 2023.4.1),
    webdav (webdav4 = 0.9.8),
    webdavs (webdav4 = 0.9.8),
    webhdfs (fsspec = 2023.3.0)
Cache types: reflink, hardlink, symlink
Cache directory: btrfs on /dev/nvme0n1p3
Caches: local
Remotes: gdrive, s3, ssh, local
Workspace directory: btrfs on /dev/nvme0n1p3
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/de05466397256ce7a1821f5910692a5e

The same error also appears with `dvc queue status`:

╚═╡(toledo) [12:20] λ dvc queue status
Task     Name        Created    Status
47439bb  ratty-sing  11:54 AM   Queued
0e73ad7  minim-wire  11:55 AM   Queued
d8bece7  heady-half  11:56 AM   Queued
96b1b69  mired-cree  11:58 AM   Queued
b98d730  awned-mean  11:59 AM   Queued
f998fd9  scald-deys  12:00 PM   Queued
f8c3a3c  azoic-amie  12:01 PM   Queued
e8beb49  brood-duff  12:03 PM   Queued
dccc2d1  bally-jigs  12:04 PM   Queued
d4df116  quasi-rift  12:05 PM   Queued
55f3e44  azure-bump  12:07 PM   Queued
8e3c5c8  gummy-puku  12:08 PM   Queued
6ca9c73  hired-bite  12:09 PM   Queued
e36b2ed  genal-flex  12:11 PM   Queued
1e8a5a7  beamy-ghat  12:12 PM   Queued
8f7fa7f  sable-line  12:13 PM   Queued
aa9e86d  alive-ions  12:14 PM   Queued
5f5a44f  unbid-smut  12:16 PM   Queued
6270f03  silty-walk  12:17 PM   Queued
82cbedf  gammy-auks  12:18 PM   Queued
b589436  grade-cree  12:20 PM   Queued

ERROR: unexpected error - Extra data: line 1 column 16750 (char 16749)

dberenbaum commented 1 year ago

Have you tried dvc exp clean?

vitalwarley commented 1 year ago

@dberenbaum, I hadn't, but after trying it the error persists:

╚═╡(toledo) [9:30] λ dvc exp clean
Cleaning up dvc-task messages...
Done!
╔╡[warley]:[vital-strix]➾[~/dev/toledo] | [on branch container/experiments] 
╚═╡(toledo) [9:30] λ dvc queue status
No experiment tasks in the queue.

ERROR: unexpected error - Extra data: line 1 column 16750 (char 16749)

Is there any other information I can provide to help diagnose the problem?

dberenbaum commented 1 year ago

Is it important to preserve what's in the queue, or would you be okay to regenerate the queue if you can get back to a working state? You could try to remove everything in .dvc/tmp/exps if you don't mind losing the queue.
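
A cautious way to try this is to move the directory aside rather than delete it outright, so the broken state can still be inspected afterwards. A minimal sketch, assuming no queue workers are running; the function name, the backup location, and the `celery` subdirectory in the demonstration are hypothetical:

```python
import shutil
import tempfile
from pathlib import Path


def reset_exp_tmp(repo_root, backup_dir=None):
    """Remove <repo_root>/.dvc/tmp/exps, optionally moving it aside first.

    Returns the backup location, or None if nothing was kept.
    """
    exps = Path(repo_root) / ".dvc" / "tmp" / "exps"
    if not exps.is_dir():
        return None
    if backup_dir is not None:
        target = Path(backup_dir) / "exps"
        shutil.move(str(exps), str(target))
        return target
    shutil.rmtree(exps)
    return None


# Demonstration on a throwaway directory, not a real repository.
with tempfile.TemporaryDirectory() as scratch:
    repo = Path(scratch) / "repo"
    (repo / ".dvc" / "tmp" / "exps" / "celery").mkdir(parents=True)
    backups = Path(scratch) / "backups"
    backups.mkdir()

    kept = reset_exp_tmp(repo, backup_dir=backups)
    removed = not (repo / ".dvc" / "tmp" / "exps").exists()
    backed_up = kept is not None and (kept / "celery").is_dir()
    print(removed, backed_up)
```

Either way the queued experiments are lost and have to be regenerated with `dvc exp run --queue`.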

vitalwarley commented 1 year ago

I regenerated the queue after removing .dvc/tmp/exps. The JSONDecodeError disappeared, but a different problem shows up:

╚═╡(toledo) [17:13] λ dvc queue status -v
2023-04-24 17:16:58,937 DEBUG: v2.55.0 (pip), CPython 3.10.10 on Linux-6.2.9-arch1-1-x86_64-with-glibc2.37
2023-04-24 17:16:58,937 DEBUG: command: /home/warley/.virtualenvs/toledo/bin/dvc queue status --verbose
Task     Name        Created    Status
3550d20  massy-fuss  05:10 PM   Running
14af6f7  dated-heck  05:11 PM   Running
bf39714  osmic-sash  05:12 PM   Running
652f215  hexed-doek  05:12 PM   Running
1914655  cheek-oast  05:13 PM   Queued
0432da2  roman-mete  05:14 PM   Queued
4785ea9  naive-pecs  05:14 PM   Queued
d680c35  felon-kaka  05:15 PM   Queued
be8e557  heigh-kite  05:16 PM   Queued
b62df65  sural-line  04:46 PM   Failed
1e7798f  gouty-loss  04:47 PM   Failed
574bc49  filar-vara  04:48 PM   Failed
ce7fa81  weird-food  04:48 PM   Failed
0fc7236  funny-keys  04:49 PM   Failed
c15a77b  naive-feel  04:50 PM   Failed
cb8e869  dingy-vela  04:50 PM   Failed
d56400d  toric-stir  04:51 PM   Failed
c518839  store-gray  04:52 PM   Failed
9534a92  tidal-cart  04:52 PM   Failed
95013ee  hunky-kine  04:53 PM   Failed
c86a049  rival-tyke  04:54 PM   Failed
76afb28  drear-skin  04:54 PM   Failed
7e2d07f  busty-ruin  04:55 PM   Failed
b6b8151  boned-fore  04:56 PM   Failed
ccee8d0  prime-ankh  04:56 PM   Failed
50c17bc  owing-rial  04:57 PM   Failed
d8b0ec9  volar-mome  04:57 PM   Failed
89d40c4  ethic-wads  04:58 PM   Failed
5202a01  blown-wart  04:59 PM   Failed
fbacc31  quack-zone  04:59 PM   Failed
95cdd1e  union-pant  05:00 PM   Failed
0192f49  wonky-pans  05:01 PM   Failed
74dbf02  major-crag  05:01 PM   Failed
aa6e440  tippy-coze  05:02 PM   Failed
028c2ad  riven-razz  05:03 PM   Failed
b2f6eb8  mesne-nibs  05:03 PM   Failed
2418702  gusty-kibe  05:04 PM   Failed
3ae3a99  weepy-loss  05:05 PM   Failed
d8c8901  heapy-cyma  05:05 PM   Failed
b685c3c  shyer-coze  05:06 PM   Failed
29ac81e  epoxy-dirk  05:07 PM   Failed
a57b02f  stoic-zest  05:08 PM   Failed
f34fca7  runty-poss  05:08 PM   Failed
272f182  scrap-amah  05:09 PM   Failed
d8cf3a3  misty-airt  05:10 PM   Failed

2023-04-24 17:17:27,196 DEBUG: Worker status: {'dvc-exp-fb73ab-7@localhost': [{'id': '5f790ec3-5f6c-40de-94fe-2f82610a2c4c', 'name': 'dvc.repo.experiments.queue.tasks.run_exp', 'args': [{'dvc_root': '/home/warley/dev/toledo', 'scm_root': '/home/warley/dev/toledo', 'stash_ref': 'refs/exps/celery/stash', 'stash_rev': 'bf39714922b45160df7c8effaedf748252c96e4e', 'baseline_rev': 'df9089d56c6695a44a846da80aa188c115eec047', 'branch': None, 'name': 'osmic-sash', 'head_rev': 'df9089d56c6695a44a846da80aa188c115eec047'}], 'kwargs': {'copy_paths': []}, 'type': 'dvc.repo.experiments.queue.tasks.run_exp', 'hostname': 'dvc-exp-fb73ab-7@localhost', 'time_start': 1682367330.5642037, 'acknowledged': False, 'delivery_info': {'exchange': '', 'routing_key': 'celery', 'priority': 0, 'redelivered': None}, 'worker_pid': 1170030}], 'dvc-exp-fb73ab-3@localhost': [{'id': 'abfd86d9-65ea-48cc-bf1a-37a6911cd9b3', 'name': 'dvc.repo.experiments.queue.tasks.run_exp', 'args': [{'dvc_root': '/home/warley/dev/toledo', 'scm_root': '/home/warley/dev/toledo', 'stash_ref': 'refs/exps/celery/stash', 'stash_rev': '14af6f7cdcfdcdfaf75f67f73184fc0e0c3719ad', 'baseline_rev': 'df9089d56c6695a44a846da80aa188c115eec047', 'branch': None, 'name': 'dated-heck', 'head_rev': 'df9089d56c6695a44a846da80aa188c115eec047'}], 'kwargs': {'copy_paths': []}, 'type': 'dvc.repo.experiments.queue.tasks.run_exp', 'hostname': 'dvc-exp-fb73ab-3@localhost', 'time_start': 1682367330.5171926, 'acknowledged': False, 'delivery_info': {'exchange': '', 'routing_key': 'celery', 'priority': 0, 'redelivered': None}, 'worker_pid': 1170022}], 'dvc-exp-fb73ab-1@localhost': [{'id': '9891f441-1198-41f0-accc-56e588003ecd', 'name': 'dvc.repo.experiments.queue.tasks.run_exp', 'args': [{'dvc_root': '/home/warley/dev/toledo', 'scm_root': '/home/warley/dev/toledo', 'stash_ref': 'refs/exps/celery/stash', 'stash_rev': '3550d206c8c3947fe7d6d362e57aa683af592b9c', 'baseline_rev': 'df9089d56c6695a44a846da80aa188c115eec047', 'branch': None, 'name': 
'massy-fuss', 'head_rev': 'df9089d56c6695a44a846da80aa188c115eec047'}], 'kwargs': {'copy_paths': []}, 'type': 'dvc.repo.experiments.queue.tasks.run_exp', 'hostname': 'dvc-exp-fb73ab-1@localhost', 'time_start': 1682367308.2422833, 'acknowledged': False, 'delivery_info': {'exchange': '', 'routing_key': 'celery', 'priority': 0, 'redelivered': None}, 'worker_pid': 1169795}], 'dvc-exp-fb73ab-5@localhost': [{'id': 'e9bed033-87ec-408d-a4b3-018baa792729', 'name': 'dvc.repo.experiments.queue.tasks.run_exp', 'args': [{'dvc_root': '/home/warley/dev/toledo', 'scm_root': '/home/warley/dev/toledo', 'stash_ref': 'refs/exps/celery/stash', 'stash_rev': '652f2157afa0b322bb507e6851b46562b4986ae6', 'baseline_rev': 'df9089d56c6695a44a846da80aa188c115eec047', 'branch': None, 'name': 'hexed-doek', 'head_rev': 'df9089d56c6695a44a846da80aa188c115eec047'}], 'kwargs': {'copy_paths': []}, 'type': 'dvc.repo.experiments.queue.tasks.run_exp', 'hostname': 'dvc-exp-fb73ab-5@localhost', 'time_start': 1682367336.297317, 'acknowledged': False, 'delivery_info': {'exchange': '', 'routing_key': 'celery', 'priority': 0, 'redelivered': None}, 'worker_pid': 1170026}]}
Worker status: 4 active, 0 idle

After some time, all I get is failed experiments. I also can't get the logs to inspect what happened:

╔╡[warley]:[vital-strix]➾[~/dev/toledo] | [on branch exps/exec/EXEC_HEAD] 
╚═╡(toledo) [17:22] λ dvc queue logs gusty-kibe -v 
2023-04-24 17:22:52,036 DEBUG: v2.55.0 (pip), CPython 3.10.10 on Linux-6.2.9-arch1-1-x86_64-with-glibc2.37
2023-04-24 17:22:52,037 DEBUG: command: /home/warley/.virtualenvs/toledo/bin/dvc queue logs gusty-kibe -v
2023-04-24 17:23:11,829 ERROR: No output logs found for experiment 'gusty-kibe'
Traceback (most recent call last):
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/funcy/flow.py", line 84, in reraise
    yield
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc_task/proc/manager.py", line 50, in __getitem__
    return ProcessInfo.load(info_path)
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc_task/proc/process.py", line 40, in load
    with open(filename, encoding="utf-8") as fobj:
FileNotFoundError: [Errno 2] No such file or directory: '/home/warley/dev/toledo/.dvc/tmp/exps/run/24187020c9086b2e8932a6f56f58296771d688c7/24187020c9086b2e8932a6f56f58296771d688c7.json'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc/repo/experiments/queue/celery.py", line 456, in logs
    proc_info = self.proc[queue_entry.stash_rev]
  File "/usr/lib/python3.10/contextlib.py", line 78, in inner
    with self._recreate_cm():
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/funcy/flow.py", line 88, in reraise
    raise into from e
KeyError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc/cli/__init__.py", line 210, in main
    ret = cmd.do_run()
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc/cli/command.py", line 26, in do_run
    return self.run()
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc/commands/queue/logs.py", line 14, in run
    self.repo.experiments.celery_queue.logs(
  File "/home/warley/.virtualenvs/toledo/lib/python3.10/site-packages/dvc/repo/experiments/queue/celery.py", line 458, in logs
    raise DvcException(  # noqa: B904
dvc.exceptions.DvcException: No output logs found for experiment 'gusty-kibe'
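
The `FileNotFoundError` in the first traceback points at a per-experiment process info file. A small sketch for checking whether that file exists for a given experiment, with the layout inferred only from the path in the traceback (an internal detail that may differ between versions; `proc_info_path` is a hypothetical helper, not a DVC API):

```python
from pathlib import Path


def proc_info_path(dvc_root, stash_rev):
    """Build the path seen in the traceback: .dvc/tmp/exps/run/<rev>/<rev>.json"""
    return (
        Path(dvc_root) / ".dvc" / "tmp" / "exps" / "run"
        / stash_rev / f"{stash_rev}.json"
    )


path = proc_info_path(
    "/home/warley/dev/toledo",
    "24187020c9086b2e8932a6f56f58296771d688c7",
)
# exists() is False whenever the worker never wrote the file.
print(path.as_posix(), path.exists())
```

The full revision is the stash revision of the experiment, whose first seven characters (`2418702` for `gusty-kibe`) appear in the Task column of `dvc queue status`.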

vitalwarley commented 1 year ago

I managed to get the logs from one experiment:

╚═╡(toledo) [17:34] λ dvc queue logs 00c825        
ERROR: failed to reproduce 'benchmark-ocr@validation': [Errno 2] No such file or directory: '/home/warley/dev/toledo/.dvc/tmp/exps/tmpgax00k6m/src/python/cvt/tools'
ERROR: failed to reproduce 'benchmark-ocr@validation': [Errno 2] No such file or directory: '/home/warley/dev/toledo/.dvc/tmp/exps/tmpgax00k6m/src/python/cvt/tools'

src/python/cvt/tools is a submodule inside another submodule; that is, src/python is one, while cvt is another. Could this be the problem?

In dvc.yaml, this path is specified as a dep:

  benchmark-ocr:
    foreach:
      - test
      - validation
    do:
      cmd: >-
        python src/python/cvt/tools/main.py benchmark 
...
      deps:
        - src/python/cvt/tools

vitalwarley commented 1 year ago

I removed the dep and tried again:

╚═╡(toledo) [18:14] λ dvc queue logs -f sixty-cyst
Following logs for experiment 'sixty-cyst'. Use Ctrl+C to stop following logs (experiment execution will continue).

ERROR: failed to reproduce 'benchmark-ocr@validation': [Errno 2] No such file or directory: '/home/warley/dev/toledo/.dvc/tmp/exps/tmp1mwvc_7b/datasets/container/cd/raw/validation/images'
ERROR: failed to reproduce 'benchmark-ocr@validation': [Errno 2] No such file or directory: '/home/warley/dev/toledo/.dvc/tmp/exps/tmp1mwvc_7b/datasets/container/cd/raw/validation/images'

This path is also listed as a dependency:

      deps:
        - ${base.dataset-dir}/${base.scope}/${benchmark.cd-dir}/${item}/images
        - ${base.dataset-dir}/${base.scope}/${benchmark.cr-dir}/${item}/labels
        - ${base.models-dir}/${base.scope}/${benchmark.weights-cd}
        - ${base.models-dir}/${base.scope}/${benchmark.weights-cr}

pmrowla commented 1 year ago

@vitalwarley submodule dependencies are not supported in queue runs, see https://github.com/iterative/dvc/issues/7186

daavoo commented 1 year ago

> @vitalwarley submodule dependencies are not supported in queue runs, see #7186

@pmrowla Would there be any unexpected side effect if --copy-paths is used as a workaround for submodules?

pmrowla commented 1 year ago

@daavoo I think that should be OK as a workaround for now.

Ideally we would handle submodules correctly when we generate the temp git workspace, but #7186 hasn't been addressed yet because it needs some more investigation/research. Basically I'm not currently sure what the correct way to handle unstaged or uncommitted changes in the submodule is, but technically I think we should be trying to include them in the experiment (and not just doing git submodule pull in the temp git workspace).

But copying the entire submodule with --copy-paths should at least give you the expected result in typical cases, with the caveat that we only copy the state of the submodule at the time the experiment is actually run (which is not necessarily the same as its state when the user ran exp run --queue).

vitalwarley commented 1 year ago

> @vitalwarley submodule dependencies are not supported in queue runs, see #7186

Ah, I see. Thanks for the info.

> @pmrowla Would there be any unexpected side effect if --copy-paths is used as a workaround for submodules?

I reverted the removal of the submodule dependency and tried this option. It seems to solve the first issue (the missing src/python/cvt/tools path, with the submodule listed as a dep), but the second one (the missing dataset path, seen after removing the submodule dep) happened again.