iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

`dvc repro --dry --allow-missing`: fails on missing data #9818

Closed Otterpatsch closed 1 year ago

Otterpatsch commented 1 year ago

I tried to update our DVC CI pipeline.

Currently we use the following commands (among others):

dvc pull, to check that everything has been pushed, and dvc status, to check that the DVC status is clean. In other words, no stage would be run if one ran dvc repro.

But pulling takes a long time, and with the new --allow-missing feature I thought I could skip it with:

dvc data status --not-in-remote --json | grep -v not_in_remote
dvc repro --allow-missing --dry

The first works as expected: it fails if data was forgotten to be pushed and succeeds if it was. But the latter just fails on missing data.
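
Put together, the intended CI check is roughly the following (a sketch assuming a bash job with strict error handling; the commands are just the two above):

set -euo pipefail
# fail if any DVC-tracked data has not been pushed to the remote
dvc data status --not-in-remote --json | grep -v not_in_remote
# fail if any stage would need to run, without pulling the data first
dvc repro --allow-missing --dry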

Reproduce

Example: the failure/success on Machines Two and Three should be consistent (a consolidated shell sketch follows the steps below).

Machine One:

  1. dvc repro -f
  2. git add . && git commit -m "repro" && dvc push && git push
  3. dvc repro --allow-missing --dry --> doesn't fail, nothing changed (as expected)

Machine Two:

  1. dvc data status --not-in-remote --json | grep -v not_in_remote --> does not fail, everything is pushed and would be pulled
  2. dvc repro --allow-missing --dry --> fails on missing data (unexpected)

Machine Three

  1. dvc pull
  2. dvc status --> succeeds
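
The same steps as a rough shell sketch (assuming each machine has a clean checkout of the same commit; comments mark the observed outcomes):

# Machine One: produce and push everything
dvc repro -f
git add . && git commit -m "repro" && dvc push && git push
dvc repro --allow-missing --dry    # succeeds, nothing changed (expected)

# Machine Two: no dvc pull
dvc data status --not-in-remote --json | grep -v not_in_remote    # succeeds, everything is pushed
dvc repro --allow-missing --dry    # fails on missing data (unexpected)

# Machine Three: with dvc pull
dvc pull
dvc status    # succeeds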

Expected

On a machine where I did not run dvc pull, with a clean git state and a clean dvc data status --not-in-remote --json | grep -v not_in_remote, I would expect dvc repro --allow-missing --dry to succeed and show me that no stage has to run.

Environment information

Linux

Output of dvc doctor:

$ dvc doctor
09:16:47  DVC version: 3.13.2 (pip)
09:16:47  -------------------------
09:16:47  Platform: Python 3.10.11 on Linux-5.9.0-0.bpo.5-amd64-x86_64-with-glibc2.35
09:16:47  Subprojects:
09:16:47    dvc_data = 2.12.1
09:16:47    dvc_objects = 0.24.1
09:16:47    dvc_render = 0.5.3
09:16:47    dvc_task = 0.3.0
09:16:47    scmrepo = 1.1.0
09:16:47  Supports:
09:16:47    azure (adlfs = 2023.4.0, knack = 0.11.0, azure-identity = 1.13.0),
09:16:47    gdrive (pydrive2 = 1.16.1),
09:16:47    gs (gcsfs = 2023.6.0),
09:16:47    hdfs (fsspec = 2023.6.0, pyarrow = 12.0.1),
09:16:47    http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
09:16:47    https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
09:16:47    oss (ossfs = 2021.8.0),
09:16:47    s3 (s3fs = 2023.6.0, boto3 = 1.28.17),
09:16:47    ssh (sshfs = 2023.7.0),
09:16:47    webdav (webdav4 = 0.9.8),
09:16:47    webdavs (webdav4 = 0.9.8),
09:16:47    webhdfs (fsspec = 2023.6.0)
09:16:47  Config:
09:16:47    Global: /home/runner/.config/dvc
09:16:47    System: /etc/xdg/dvc
09:16:47  Cache types: <https://error.dvc.org/no-dvc-cache>
09:16:47  Caches: local
09:16:47  Remotes: ssh
09:16:47  Workspace directory: ext4 on /dev/nvme0n1p2
09:16:47  Repo: dvc, git
dberenbaum commented 1 year ago

@Otterpatsch I see you provided some verbose output in https://discord.com/channels/485586884165107732/1138144206473396304/1138162073705128148, but I don't see any error there. Are you able to post the full output, including the error you hit?

Otterpatsch commented 1 year ago

I don't hit any "error", just the notification (due to --dry) that stages would run, and a further notification that some DVC-tracked files are missing. But maybe my assumption is wrong that dvc repro --allow-missing --dry should not fail and should report that everything is fine and up to date when I use those flags, given that the repro was done and pushed successfully from some other machine. I'm very much confused by now.

Just to clarify: if I run dvc pull and then dvc status, everything is reported as fine.

dvc repro --allow-missing --dry
11:18:32  'datasets/benchmark-sets/customer0/2020_11_02.dvc' didn't change, skipping
...
11:18:32  'datasets/training-sets/customer/customerN/customerN_empty_consignment_field_faxified.dvc' didn't change, skipping
11:18:32  Running stage 'training':
11:18:32  > conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
11:18:32  > conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
11:18:32  > conda run --no-capture --prefix .conda-envs/training/ python dependencies/scripts/rename.py model/
11:18:32  > cp -r stages/training/charsets model/
11:18:32  
11:18:32  Stage 'extract@customer0/2020_11_02/Formularmerkmal_Ansprechpartner' didn't change, skipping
11:18:32  Stage 'extract@customer0/2020_11_02/Formularmerkmal_Beinstueck' didn't change, skipping
11:18:32  Stage 'extract@customer0/2020_11_02/Formularmerkmal_Kommission' didn't change, skipping
11:18:32  Stage 'extract@customer0/2020_11_02/Formularmerkmal_Kundenname' didn't change, skipping
11:18:32  'datasets/benchmark-sets/company/emails_2021-03-22.dvc' didn't change, skipping
11:18:32  ERROR: failed to reproduce 'extract@company/emails_2021-03-22': [Errno 2] No such file or directory: '/var/jenkins_home/workspace/repo_namecompany_MR-20/datasets/benchmark-sets/company/emails_2021-03-22'
dberenbaum commented 1 year ago

It seems like this happens when there is a dependency on data that was tracked via dvc add. I can reproduce:

git clone https://github.com/iterative/example-get-started-experiments.git
cd example-get-started-experiments
dvc repro --allow-missing --dry

Verbose output:

$ dvc repro -v --allow-missing --dry
2023-08-10 11:15:25,325 DEBUG: v3.14.1.dev2+g04e891cef, CPython 3.11.4 on macOS-13.4.1-arm64-arm-64bit
2023-08-10 11:15:25,325 DEBUG: command: /Users/dave/micromamba/envs/dvc/bin/dvc repro -v --allow-missing --dry
2023-08-10 11:15:25,709 DEBUG: Computed stage: 'data/pool_data.dvc' md5: 'None'
'data/pool_data.dvc' didn't change, skipping
2023-08-10 11:15:25,711 DEBUG: Dependency 'data/pool_data' of stage: 'data_split' changed because it is 'modified'.
2023-08-10 11:15:25,712 DEBUG: stage: 'data_split' changed.
2023-08-10 11:15:25,714 ERROR: failed to reproduce 'data_split': [Errno 2] No such file or directory: '/private/tmp/example-get-started-experiments/data/pool_data'
Traceback (most recent call last):
  File "/Users/dave/Code/dvc/dvc/repo/reproduce.py", line 199, in _reproduce
    ret = repro_fn(stage, upstream=upstream, force=force_stage, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/repo/reproduce.py", line 129, in _reproduce_stage
    ret = stage.reproduce(**kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 47, in wrapper
    return deco(call, *dargs, **dkwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/decorators.py", line 43, in rwlocked
    return call()
           ^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 68, in __call__
    return self._func(*self._args, **self._kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/__init__.py", line 433, in reproduce
    self.run(**kwargs)
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 47, in wrapper
    return deco(call, *dargs, **dkwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/decorators.py", line 43, in rwlocked
    return call()
           ^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 68, in __call__
    return self._func(*self._args, **self._kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/__init__.py", line 599, in run
    self._run_stage(dry, force, allow_missing=allow_missing, **kwargs)
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 47, in wrapper
    return deco(call, *dargs, **dkwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/decorators.py", line 43, in rwlocked
    return call()
           ^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 68, in __call__
    return self._func(*self._args, **self._kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/__init__.py", line 630, in _run_stage
    return run_stage(self, dry, force, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/run.py", line 134, in run_stage
    stage.repo.stage_cache.restore(stage, dry=dry, **kwargs)
  File "/Users/dave/Code/dvc/dvc/stage/cache.py", line 188, in restore
    if not _can_hash(stage):
           ^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/cache.py", line 43, in _can_hash
    if not (dep.protocol == "local" and dep.def_path and dep.get_hash()):
                                                         ^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/output.py", line 553, in get_hash
    _, hash_info = self._get_hash_meta()
                   ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/output.py", line 573, in _get_hash_meta
    _, meta, obj = self._build(
                   ^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/output.py", line 566, in _build
    return build(*args, callback=pb.as_callback(), **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/hashfile/build.py", line 233, in build
    details = fs.info(path)
              ^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc-objects/src/dvc_objects/fs/base.py", line 495, in info
    return self.fs.info(path, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc-objects/src/dvc_objects/fs/local.py", line 42, in info
    return self.fs.info(path)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/implementations/local.py", line 87, in info
    out = os.stat(path, follow_symlinks=False)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/private/tmp/example-get-started-experiments/data/pool_data'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/dave/Code/dvc/dvc/cli/__init__.py", line 209, in main
    ret = cmd.do_run()
          ^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/cli/command.py", line 26, in do_run
    return self.run()
           ^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/commands/repro.py", line 13, in run
    stages = self.repo.reproduce(**self._common_kwargs, **self._repro_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/repo/__init__.py", line 64, in wrapper
    return f(repo, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/repo/scm_context.py", line 151, in run
    return method(repo, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/repo/reproduce.py", line 260, in reproduce
    return _reproduce(steps, graph=graph, on_error=on_error or "fail", **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/repo/reproduce.py", line 203, in _reproduce
    _raise_error(exc, stage)
  File "/Users/dave/Code/dvc/dvc/repo/reproduce.py", line 167, in _raise_error
    raise ReproductionError(f"failed to reproduce{segment} {names}") from exc
dvc.exceptions.ReproductionError: failed to reproduce 'data_split'

2023-08-10 11:15:25,721 DEBUG: Analytics is disabled.
dberenbaum commented 1 year ago

Looks like it is failing in my example because data/pool_data.dvc is in legacy 2.x format, so the hash info doesn't match the stage dep here:

https://github.com/iterative/dvc/blob/04e891cef929567794ade4e0c2a1bf399666f66e/dvc/stage/__init__.py#L315-L321

The hashes are the same, but debugging shows that the different hash names make it fail:

(Pdb) out.hash_info
HashInfo(name='md5-dos2unix', value='14d187e749ee5614e105741c719fa185.dir', obj_name=None)
(Pdb) dep.hash_info
HashInfo(name='md5', value='14d187e749ee5614e105741c719fa185.dir', obj_name=None)

@Otterpatsch Does datasets/benchmark-sets/customer0/2020_11_02.dvc contain the line hash: md5 (that line is only present in 3.x files)? Also, could you try to delete the site cache dir?
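
A quick way to check a single file (a sketch, using the path from this thread):

grep -n 'hash: md5' datasets/benchmark-sets/customer0/2020_11_02.dvc \
  && echo "3.x format" || echo "legacy 2.x format"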

dberenbaum commented 1 year ago

Looks like it is failing in my example because data/pool_data.dvc is in legacy 2.x format, so the hash info doesn't match the stage dep here:

@iterative/dvc Thoughts on how we should treat this? Is it modified or not?

daavoo commented 1 year ago

Looks like it is failing in my example because data/pool_data.dvc is in legacy 2.x format, so the hash info doesn't match the stage dep here:

@iterative/dvc Thoughts on how we should treat this? Is it modified or not?

IMO, it was an oversight for this scenario.

dberenbaum commented 1 year ago

@daavoo What does that mean? Do you think we should only compare the hash value and not all hash info?

daavoo commented 1 year ago

@daavoo What does that mean?

I mean that we should not consider it modified in the example-get-started-experiments scenario.

Do you think we should only compare the hash value and not all hash info?

Can't say off the top of my head. I would need to take a closer look to see what makes sense.

Otterpatsch commented 1 year ago

Does datasets/benchmark-sets/customer0/2020_11_02.dvc contain the line hash: md5 (that line is only present in 3.x files)?

outs:
- md5: f4eb1691cb23a5160a958274b9b9fb41.dir
  size: 55860614
  nfiles: 5491
  path: '2020_11_02'

seems it does

Also, could you try to delete the site cache dir?

After deleting /var/tmp/dvc (which did exist), the error persists.

daavoo commented 1 year ago

So, to give context, the problem appears if there is a .dvc file in 2.X format:

https://github.com/iterative/example-get-started-experiments/blob/9dba21cbffb0caad939c63db427eea7f16f3c269/data/pool_data.dvc#L1-L5

That is referenced in a dvc.lock in 3.X format as a dependency:

https://github.com/iterative/example-get-started-experiments/blob/9dba21cbffb0caad939c63db427eea7f16f3c269/dvc.lock#L6-L10

As soon as the contents associated with the .dvc are updated, the file will be updated to 3.X format so the problem would disappear.

Do you think we should only compare the hash value and not all hash info? Can't say off the top of my head. I would need to take a closer look to see what makes sense.

Strictly speaking, I guess there could be a collision where we would be misidentifying 2 different things as being the same 🤷

efiop commented 1 year ago

As soon as the contents associated with the .dvc are updated, the file will be updated to 3.X format so the problem would disappear.

@Otterpatsch Is it possible for you to just force-commit to upgrade those hashes? We can't really compare those without computing both, which is undesirable. Just upgrading the old lock files seems like an easy long-term fix.

Otterpatsch commented 1 year ago

How do I upgrade the hashes?

dberenbaum commented 1 year ago

@Otterpatsch You can do dvc commit -f to upgrade the hashes.

dberenbaum commented 1 year ago

@daavoo Are you planning a PR to fix the dvc commit -f behavior?

@Otterpatsch Are you still working through this problem? It turns out that dvc commit -f won't fix it for you currently. The best workaround for now would be to do dvc remove datasets/benchmark-sets/customer0/2020_11_02.dvc followed by dvc add datasets/benchmark-sets/customer0/2020_11_02 (re-adding the data directory itself).

daavoo commented 1 year ago

@daavoo Are you planning a PR to fix the dvc commit -f behavior?

yes

Otterpatsch commented 1 year ago

@daavoo Are you planning a PR to fix the dvc commit -f behavior?

@Otterpatsch Are you still working through this problem? It turns out that dvc commit -f won't fix it for you currently. The best workaround for now would be to do dvc remove datasets/benchmark-sets/customer0/2020_11_02.dvc followed by dvc add datasets/benchmark-sets/customer0/2020_11_02 (re-adding the data directory itself).

Alright, we will test that. But for now we have just rolled back to using dvc pull and dvc status (close to an hour). dvc commit -f did do something, but the pipeline was still failing, and I wasn't sure whether we had some other issues, so I tried to find those. Once the dvc commit -f fix is implemented, should this in theory also fix this issue (when dvc commit -f is run and the result committed, of course)?

dberenbaum commented 1 year ago

Once datasets/benchmark-sets/customer0/2020_11_02.dvc is updated to use the 3.0 cache (you should see the field hash: md5 in that file), it should fix this issue. If you do dvc remove datasets/benchmark-sets/customer0/2020_11_02.dvc; dvc add datasets/benchmark-sets/customer0/2020_11_02 (removing the old .dvc file and re-adding the data directory), it should work now.
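
A rough sketch of doing that for every remaining legacy file at once (this assumes GNU grep, paths without spaces, and that each .dvc file sits next to the directory it tracks; it is not something DVC does for you):

for f in $(grep -rL --include='*.dvc' 'hash: md5' datasets/); do
    dvc remove "$f"         # drops the legacy .dvc file, keeps the data in place
    dvc add "${f%.dvc}"     # re-adds the data, writing a 3.x-format .dvc file
done
git add datasets/ && git commit -m "re-add DVC 2.x data in 3.x format"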

Otterpatsch commented 1 year ago

So I fixed the issue (I think) on our side. I basically ran dvc repro --allow-missing --dry a couple of times, each run surfacing one of the datasets that was still in DVC 2.x format. Then I re-added those, and it no longer crashes.

But now the pipeline succeeds even though I get the following lines in the output, which makes sense because I changed a lot of .dvc files which are also in that path.

13:57:33  2023-08-21 11:57:24,369 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,370 DEBUG: Dependency 'datasets/training-sets' of stage: 'training' changed because it is 'modified'.
13:57:33  2023-08-21 11:57:24,371 DEBUG: stage: 'training' changed.
13:57:33  2023-08-21 11:57:24,384 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,386 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
13:57:33  2023-08-21 11:57:24,397 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,397 DEBUG: {'datasets/training-sets': 'modified'}
13:57:33  2023-08-21 11:57:24,408 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,409 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
13:57:33  Running stage 'training':
13:57:33  > conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
13:57:33  > conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
13:57:33  > conda run --no-capture --prefix .conda-envs/training/ python dependencies/scripts/rename.py model/
13:57:33  > cp -r stages/training/charsets model/
13:57:33  2023-08-21 11:57:24,412 DEBUG: stage: 'training' was reproduced

How can I fix this? It seems that I'm not using the correct command for my pipeline. The command succeeds, but it should fail in a pipeline sense, because a repro would be run if I just used dvc repro.

I believe I'm missing something similar to the dvc data status check (dvc data status --not-in-remote --json | grep -v not_in_remote), which has the grep, but I'm not sure how to do the equivalent for dvc repro --allow-missing --dry so that it fails for all kinds of dependencies.

So I tried: dvc repro --dry --allow-missing | grep -v "Running stage ". But it still succeeds, even though if I just use grep "Running stage " I get some output:

> dvc repro --dry --allow-missing | grep "Running stage "
Running stage 'training':
Running stage 'collect_benchmarks':
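
grep -v is probably the wrong test here: it exits 0 as long as at least one output line does not match, which is practically always the case, so the pipeline never fails. A sketch of a check that fails whenever any stage would run (assuming a bash runner):

out="$(dvc repro --dry --allow-missing)"
echo "$out"
if echo "$out" | grep -q "Running stage"; then
    echo "pipeline is not up to date" >&2
    exit 1
fi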
dberenbaum commented 1 year ago

dvc commit -f also seems like it would be useful after running dvc cache migrate to ensure that all dvc files reference the migrated 3.x cache. See https://discord.com/channels/485586884165107732/563406153334128681/1149449470480760982.

Otterpatsch commented 1 year ago

dvc cache migrate reports that no file changed in the cache. dvc commit -f did something again, but I also rebased the branch which introduces those pipeline changes. Even after committing those changes, the pipeline still fails.

I also ran dvc cache migrate on the CI machine. It doesn't apply any changes. Also, after each run everything is cleared (but I wanted to check anyway).

Sadly, the pipeline still fails on dvc repro --dry --allow-missing | grep -vz "Running", whereas with a dvc pull it doesn't fail.

So the following output confused me a lot, as there are .dvc files with no md5sum even though I ran the commands you mentioned. So how do I get rid of the DVC 2.x .dvc files and replace them with their DVC 3.x counterparts?

13:44:20  2023-09-11 11:44:12,198 DEBUG: Computed stage: 'datasets/training-sets/customer/CustomerName/training-2023_08_08-LYD-consignments.dvc' md5: 'a9c8f1cf1840f743123f169bba789ac1'
13:44:20  'datasets/training-sets/customer/CustomerName/training-2023_08_08-LYD-consignments.dvc' didn't change, skipping
13:44:20  2023-09-11 11:44:12,201 DEBUG: Computed stage: 'datasets/training-sets/customer/CustomerName/CustomerName_trendcolor_field.dvc' md5: 'None'
13:44:20  'datasets/training-sets/customer/CustomerName/CustomerName_trendcolor_field.dvc' didn't change, skipping
13:44:20  2023-09-11 11:44:12,206 DEBUG: Computed stage: 'datasets/training-sets/customer/CustomerName/training-2020_08_22-alpha-referenceOrder.dvc' md5: 'None'
13:44:20  'datasets/training-sets/customer/CustomerName/training-2020_08_22-alpha-referenceOrder.dvc' didn't change, skipping
13:44:20  2023-09-11 11:44:12,210 DEBUG: Computed stage: 'datasets/training-sets/customer/CustomerName/CustomerName_trendcolor_field_2020-07-14.dvc' md5: 'None'
13:44:20  'datasets/training-sets/customer/CustomerName/CustomerName_trendcolor_field_2020-07-14.dvc' didn't change, skipping
13:44:20  2023-09-11 11:44:12,214 DEBUG: Computed stage: 'datasets/training-sets/customer/CustomerName/CustomerName_empty_consignment_field_faxified.dvc' md5: 'None'
13:44:20  'datasets/training-sets/customer/CustomerName/CustomerName_empty_consignment_field_faxified.dvc' didn't change, skipping
13:44:20  2023-09-11 11:44:12,255 DEBUG: built tree 'object d612b2a946d81ab74f8dfeeea7e41a8a.dir'
13:44:20  2023-09-11 11:44:12,256 DEBUG: Dependency 'datasets/training-sets' of stage: 'training' changed because it is 'modified'.
13:44:20  2023-09-11 11:44:12,257 DEBUG: stage: 'training' changed.
13:44:20  2023-09-11 11:44:12,271 DEBUG: built tree 'object d612b2a946d81ab74f8dfeeea7e41a8a.dir'
13:44:20  2023-09-11 11:44:12,273 DEBUG: built tree 'object f203ea8d0a44649090eb4d3debd6ed8d.dir'
13:44:20  2023-09-11 11:44:12,286 DEBUG: built tree 'object d612b2a946d81ab74f8dfeeea7e41a8a.dir'
13:44:20  2023-09-11 11:44:12,286 DEBUG: {'datasets/training-sets': 'modified'}
13:44:20  2023-09-11 11:44:12,298 DEBUG: built tree 'object d612b2a946d81ab74f8dfeeea7e41a8a.dir'
13:44:20  2023-09-11 11:44:12,300 DEBUG: built tree 'object f203ea8d0a44649090eb4d3debd6ed8d.dir'
13:44:20  Running stage 'training':
13:44:20  > conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
13:44:20  > conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
13:44:20  > conda run --no-capture --prefix .conda-envs/training/ python dependencies/scripts/rename.py model/
13:44:20  > cp -r stages/training/charsets model/
13:44:20  2023-09-11 11:44:12,302 DEBUG: stage: 'training' was reproduced
dberenbaum commented 1 year ago

So how do I get rid of the DVC 2.x .dvc files and replace them with their DVC 3.x counterparts?

Yes, sorry for the confusion @Otterpatsch. I initially thought dvc commit -f would achieve that, but it doesn't do that today. We are looking into changing that, but for now you would need to do this yourself.

Otterpatsch commented 1 year ago

So how do I get rid of the DVC 2.x .dvc files and replace them with their DVC 3.x counterparts?

Yes, sorry for the confusion @Otterpatsch. I initially thought dvc commit -f would achieve that, but it doesn't do that today. We are looking into changing that, but for now you would need to do this yourself.

So I just tried to do that (with version 3.22.0):

I ran dvc repro --dry --allow-missing to detect all "bad" .dvc files, e.g. datasets/benchmark-sets/SomeCompanyName/2020_11_02.dvc. Then I ran rm datasets/benchmark-sets/SomeCompanyName/2020_11_02.dvc to get rid of the bad file and dvc add datasets/benchmark-sets/SomeCompanyName/2020_11_02 to re-add the directory. See below the git diff (the line hash: md5 was inserted):

datasets/benchmark-sets/SomeCompanyName/2020_11_02.dvc
@@ -2,4 +2,5 @@ outs:
 - md5: f4eb1691cb23a5160a958274b9b9fb41.dir
   size: 55860614
   nfiles: 5491
+  hash: md5
   path: '2020_11_02'

Now I expected that, if I ran dvc repro --dry --allow-missing, I would no longer get the output md5: 'None' for that one specific file. But I still get the same output as earlier:

> dvc repro --dry --allow-missing --verbose | grep -P "datasets/benchmark-sets/SomeCompanyName/2020_11_02.dvc"  
2023-09-20 12:34:13,083 DEBUG: Computed stage: 'datasets/benchmark-sets/SomeCompanyName/2020_11_02.dvc' md5: 'None'
dberenbaum commented 1 year ago

That debug call comes from here:

https://github.com/iterative/dvc/blob/b85608121f21e53298aa3c03dae9bf091174b150/dvc/stage/__init__.py#L466-L473

It will only show a non-empty md5 for an actual stage, not a .dvc-tracked data source. The check for --allow-missing is separate and comes later, so this is expected.

Is dvc repro --dry --allow-missing skipping the stage/working as expected?

dberenbaum commented 1 year ago

Closing since I haven't heard back, but feel free to reopen if you still have issues @Otterpatsch.