Closed Otterpatsch closed 1 year ago
@Otterpatsch I see you provided some verbose output in https://discord.com/channels/485586884165107732/1138144206473396304/1138162073705128148, but I don't see any error there. Are you able to post the full output, including the error you hit?
I don't hit any "error", just the notification (due to --dry) that stages would run, and a further notification that some files are missing (DVC-tracked).
But maybe my assumption is wrong that dvc repro --allow-missing --dry should not fail and should report everything as fine and up to date when I use those flags, provided the repro was done and pushed successfully from some other machine.
I'm very confused by now.
Just to clarify: if I run dvc pull and then dvc status, everything is reported as fine.
dvc repro --allow-missing --dry
11:18:32 'datasets/benchmark-sets/customer0/2020_11_02.dvc' didn't change, skipping
...
11:18:32 'datasets/training-sets/customer/customerN/customerN_empty_consignment_field_faxified.dvc' didn't change, skipping
11:18:32 Running stage 'training':
11:18:32 > conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
11:18:32 > conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
11:18:32 > conda run --no-capture --prefix .conda-envs/training/ python dependencies/scripts/rename.py model/
11:18:32 > cp -r stages/training/charsets model/
11:18:32
11:18:32 Stage 'extract@customer0/2020_11_02/Formularmerkmal_Ansprechpartner' didn't change, skipping
11:18:32 Stage 'extract@customer0/2020_11_02/Formularmerkmal_Beinstueck' didn't change, skipping
11:18:32 Stage 'extract@customer0/2020_11_02/Formularmerkmal_Kommission' didn't change, skipping
11:18:32 Stage 'extract@customer0/2020_11_02/Formularmerkmal_Kundenname' didn't change, skipping
11:18:32 'datasets/benchmark-sets/company/emails_2021-03-22.dvc' didn't change, skipping
11:18:32 ERROR: failed to reproduce 'extract@company/emails_2021-03-22': [Errno 2] No such file or directory: '/var/jenkins_home/workspace/repo_namecompany_MR-20/datasets/benchmark-sets/company/emails_2021-03-22'
It seems like this happens when there is a dependency on data that was tracked via dvc add. I can reproduce:
git clone https://github.com/iterative/example-get-started-experiments.git
cd example-get-started-experiments
dvc repro --allow-missing --dry
Verbose output:
$ dvc repro -v --allow-missing --dry
2023-08-10 11:15:25,325 DEBUG: v3.14.1.dev2+g04e891cef, CPython 3.11.4 on macOS-13.4.1-arm64-arm-64bit
2023-08-10 11:15:25,325 DEBUG: command: /Users/dave/micromamba/envs/dvc/bin/dvc repro -v --allow-missing --dry
2023-08-10 11:15:25,709 DEBUG: Computed stage: 'data/pool_data.dvc' md5: 'None'
'data/pool_data.dvc' didn't change, skipping
2023-08-10 11:15:25,711 DEBUG: Dependency 'data/pool_data' of stage: 'data_split' changed because it is 'modified'.
2023-08-10 11:15:25,712 DEBUG: stage: 'data_split' changed.
2023-08-10 11:15:25,714 ERROR: failed to reproduce 'data_split': [Errno 2] No such file or directory: '/private/tmp/example-get-started-experiments/data/pool_data'
Traceback (most recent call last):
File "/Users/dave/Code/dvc/dvc/repo/reproduce.py", line 199, in _reproduce
ret = repro_fn(stage, upstream=upstream, force=force_stage, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/repo/reproduce.py", line 129, in _reproduce_stage
ret = stage.reproduce(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 47, in wrapper
return deco(call, *dargs, **dkwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/stage/decorators.py", line 43, in rwlocked
return call()
^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 68, in __call__
return self._func(*self._args, **self._kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/stage/__init__.py", line 433, in reproduce
self.run(**kwargs)
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 47, in wrapper
return deco(call, *dargs, **dkwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/stage/decorators.py", line 43, in rwlocked
return call()
^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 68, in __call__
return self._func(*self._args, **self._kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/stage/__init__.py", line 599, in run
self._run_stage(dry, force, allow_missing=allow_missing, **kwargs)
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 47, in wrapper
return deco(call, *dargs, **dkwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/stage/decorators.py", line 43, in rwlocked
return call()
^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 68, in __call__
return self._func(*self._args, **self._kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/stage/__init__.py", line 630, in _run_stage
return run_stage(self, dry, force, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/stage/run.py", line 134, in run_stage
stage.repo.stage_cache.restore(stage, dry=dry, **kwargs)
File "/Users/dave/Code/dvc/dvc/stage/cache.py", line 188, in restore
if not _can_hash(stage):
^^^^^^^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/stage/cache.py", line 43, in _can_hash
if not (dep.protocol == "local" and dep.def_path and dep.get_hash()):
^^^^^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/output.py", line 553, in get_hash
_, hash_info = self._get_hash_meta()
^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/output.py", line 573, in _get_hash_meta
_, meta, obj = self._build(
^^^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/output.py", line 566, in _build
return build(*args, callback=pb.as_callback(), **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/hashfile/build.py", line 233, in build
details = fs.info(path)
^^^^^^^^^^^^^
File "/Users/dave/Code/dvc-objects/src/dvc_objects/fs/base.py", line 495, in info
return self.fs.info(path, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/Code/dvc-objects/src/dvc_objects/fs/local.py", line 42, in info
return self.fs.info(path)
^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/implementations/local.py", line 87, in info
out = os.stat(path, follow_symlinks=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/private/tmp/example-get-started-experiments/data/pool_data'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/dave/Code/dvc/dvc/cli/__init__.py", line 209, in main
ret = cmd.do_run()
^^^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/cli/command.py", line 26, in do_run
return self.run()
^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/commands/repro.py", line 13, in run
stages = self.repo.reproduce(**self._common_kwargs, **self._repro_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/repo/__init__.py", line 64, in wrapper
return f(repo, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/repo/scm_context.py", line 151, in run
return method(repo, *args, **kw)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/repo/reproduce.py", line 260, in reproduce
return _reproduce(steps, graph=graph, on_error=on_error or "fail", **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/Code/dvc/dvc/repo/reproduce.py", line 203, in _reproduce
_raise_error(exc, stage)
File "/Users/dave/Code/dvc/dvc/repo/reproduce.py", line 167, in _raise_error
raise ReproductionError(f"failed to reproduce{segment} {names}") from exc
dvc.exceptions.ReproductionError: failed to reproduce 'data_split'
2023-08-10 11:15:25,721 DEBUG: Analytics is disabled.
Looks like it is failing in my example because data/pool_data.dvc is in legacy 2.x format, so the hash info doesn't match the stage dep here. The hashes are the same, but debugging shows that the differing hash names make it fail:
(Pdb) out.hash_info
HashInfo(name='md5-dos2unix', value='14d187e749ee5614e105741c719fa185.dir', obj_name=None)
(Pdb) dep.hash_info
HashInfo(name='md5', value='14d187e749ee5614e105741c719fa185.dir', obj_name=None)
@Otterpatsch Does datasets/benchmark-sets/customer0/2020_11_02.dvc
contain the line hash: md5
(that line is only present in 3.x files)? Also, could you try to delete the site cache dir?
> Looks like it is failing in my example because data/pool_data.dvc is in legacy 2.x format, so the hash info doesn't match the stage dep here.

@iterative/dvc Thoughts on how we should treat this? Is it modified or not?
IMO, this scenario was an oversight.
@daavoo What does that mean? Do you think we should only compare the hash value and not all hash info?
> @daavoo What does that mean?

I mean that we should not consider it modified in the example-get-started-experiments scenario.

> Do you think we should only compare the hash value and not all hash info?

Can't say off the top of my head. Would need to take a closer look to see what makes sense.
> Does datasets/benchmark-sets/customer0/2020_11_02.dvc contain the line hash: md5 (that line is only present in 3.x files)?

outs:
- md5: f4eb1691cb23a5160a958274b9b9fb41.dir
  size: 55860614
  nfiles: 5491
  path: '2020_11_02'

Seems it does.
> Also, could you try to delete the site cache dir?

After deleting /var/tmp/dvc (it did exist), the error persists.
So, to give context: the problem appears if there is a .dvc file in 2.x format that is referenced as a dependency in a dvc.lock file in 3.x format. As soon as the contents associated with the .dvc file are updated, the file will be rewritten in the 3.x format, so the problem would disappear. (A rough way to spot the affected files is sketched below.)
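A sketch for listing .dvc files that are still in the 2.x format, i.e. that lack the hash: md5 line mentioned above (assumes GNU grep; not an official DVC command):

# list tracked .dvc files that do NOT contain a "hash: md5" line,
# i.e. files most likely still written by DVC 2.x
grep -rL "hash: md5" --include="*.dvc" .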
> Do you think we should only compare the hash value and not all hash info?
> Can't say off the top of my head. Would need to take a closer look to see what makes sense.

Strictly speaking, I guess there could be a collision where we would be misidentifying 2 different things as being the same 🤷
> As soon as the contents associated with the .dvc are updated, the file will be rewritten in the 3.x format, so the problem would disappear.

@Otterpatsch Is it possible for you to just force-commit to upgrade those hashes? We can't really compare those without computing both, which is undesirable. Just upgrading the old lock files seems like an easy long-term fix.
How do I upgrade the hashes?
@Otterpatsch You can do dvc commit -f to upgrade the hashes.
@daavoo Are you planning a PR to fix the dvc commit -f behavior?
@Otterpatsch Are you still working through this problem? It turns out that dvc commit -f won't fix it for you currently. The best workaround for now would be to do dvc remove datasets/benchmark-sets/customer0/2020_11_02.dvc followed by dvc add datasets/benchmark-sets/customer0/2020_11_02.
> @daavoo Are you planning a PR to fix the dvc commit -f behavior?

Yes.
Alright, we will test that. But currently we just rolled back to using dvc pull and dvc status (close to an hour).
Yeah, dvc commit -f did some things, but the pipeline was still failing; I wasn't sure if we had some other issues, so I tried to find those.
Once the dvc commit -f fix is implemented, should that in theory also fix this issue (when dvc commit -f is run and committed, of course)?
Once datasets/benchmark-sets/customer0/2020_11_02.dvc is updated to use the 3.0 cache (you should see the field hash: md5 in that file), that should fix this issue. If you do dvc remove datasets/benchmark-sets/customer0/2020_11_02.dvc followed by dvc add datasets/benchmark-sets/customer0/2020_11_02, it should work now.
So I fixed the issue (I think) on our side. I basically ran dvc repro --allow-missing --dry a couple of times, each time finding one of the datasets which was still DVC 2.x. Then I re-added those, and it no longer crashes.
But now the pipeline succeeds even though I get the following lines in the command output, which makes sense, because I changed a lot of .dvc files which are also in that path.
13:57:33 2023-08-21 11:57:24,369 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33 2023-08-21 11:57:24,370 DEBUG: Dependency 'datasets/training-sets' of stage: 'training' changed because it is 'modified'.
13:57:33 2023-08-21 11:57:24,371 DEBUG: stage: 'training' changed.
13:57:33 2023-08-21 11:57:24,384 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33 2023-08-21 11:57:24,386 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
13:57:33 2023-08-21 11:57:24,397 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33 2023-08-21 11:57:24,397 DEBUG: {'datasets/training-sets': 'modified'}
13:57:33 2023-08-21 11:57:24,408 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33 2023-08-21 11:57:24,409 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
13:57:33 Running stage 'training':
13:57:33 > conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
13:57:33 > conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
13:57:33 > conda run --no-capture --prefix .conda-envs/training/ python dependencies/scripts/rename.py model/
13:57:33 > cp -r stages/training/charsets model/
13:57:33 2023-08-21 11:57:24,412 DEBUG: stage: 'training' was reproduced
How can I fix this? It seems that I'm not using the correct command for my pipeline. I mean the command succeeds, but it should fail in a pipeline sense, because a repro would be run if I just used dvc repro.
I believe I'm missing something similar to the dvc data status one, dvc data status --not-in-remote --json | grep -v not_in_remote, which has the grep, but I'm not sure how to do it for dvc repro --allow-missing --dry so that it fails for all kinds of dependencies.
So i tried:
dvc repro --dry --allow-missing | grep -v "Running stage "
But it still succeeds, even though if I just use grep "Running stage " I get some output:
> dvc repro --dry --allow-missing | grep "Running stage "
Running stage 'training':
Running stage 'collect_benchmarks':
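One reason the grep -v variant never fails: grep -v exits with status 0 as long as at least one non-matching line remains (which is essentially always the case here), and the pipeline's exit code is that of the last command. A minimal sketch of a check that does fail when any stage would run; this is not an official DVC feature, just a shell wrapper around the dry run:

#!/bin/sh
# Fail the CI job if the dry run reports that any stage would be (re)run.
set -eu

out="$(dvc repro --dry --allow-missing)"     # a dvc error also aborts here, via set -e

if printf '%s\n' "$out" | grep -q "Running stage"; then
    echo "Pipeline is not up to date: some stages would run." >&2
    exit 1
fi
echo "Pipeline is up to date."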
dvc commit -f also seems like it would be useful after running dvc cache migrate, to ensure that all dvc files reference the migrated 3.x cache. See https://discord.com/channels/485586884165107732/563406153334128681/1149449470480760982.
dvc cache migrate reports that no file changed in the cache. dvc commit -f did something again, but I also rebased the branch which introduces those pipeline changes. Even after committing those changes, the pipeline still fails.
I also ran dvc cache migrate on the CI machine; it doesn't apply any changes. Also, after each run everything is cleared there (but I wanted to check anyway).
Sadly, the pipeline still fails on dvc repro --dry --allow-missing | grep -vz "Running", while with a dvc pull it doesn't fail.
The following output confused me a lot, as there are .dvc files with no md5 sum, even though I ran the commands you mentioned. So do I get rid of the DVC 2.x .dvc files and replace them with their DVC 3.x counterparts?
13:44:20 2023-09-11 11:44:12,198 DEBUG: Computed stage: 'datasets/training-sets/customer/CustomerName/training-2023_08_08-LYD-consignments.dvc' md5: 'a9c8f1cf1840f743123f169bba789ac1'
13:44:20 'datasets/training-sets/customer/CustomerName/training-2023_08_08-LYD-consignments.dvc' didn't change, skipping
13:44:20 2023-09-11 11:44:12,201 DEBUG: Computed stage: 'datasets/training-sets/customer/CustomerName/CustomerName_trendcolor_field.dvc' md5: 'None'
13:44:20 'datasets/training-sets/customer/CustomerName/CustomerName_trendcolor_field.dvc' didn't change, skipping
13:44:20 2023-09-11 11:44:12,206 DEBUG: Computed stage: 'datasets/training-sets/customer/CustomerName/training-2020_08_22-alpha-referenceOrder.dvc' md5: 'None'
13:44:20 'datasets/training-sets/customer/CustomerName/training-2020_08_22-alpha-referenceOrder.dvc' didn't change, skipping
13:44:20 2023-09-11 11:44:12,210 DEBUG: Computed stage: 'datasets/training-sets/customer/CustomerName/CustomerName_trendcolor_field_2020-07-14.dvc' md5: 'None'
13:44:20 'datasets/training-sets/customer/CustomerName/CustomerName_trendcolor_field_2020-07-14.dvc' didn't change, skipping
13:44:20 2023-09-11 11:44:12,214 DEBUG: Computed stage: 'datasets/training-sets/customer/CustomerName/CustomerName_empty_consignment_field_faxified.dvc' md5: 'None'
13:44:20 'datasets/training-sets/customer/CustomerName/CustomerName_empty_consignment_field_faxified.dvc' didn't change, skipping
13:44:20 2023-09-11 11:44:12,255 DEBUG: built tree 'object d612b2a946d81ab74f8dfeeea7e41a8a.dir'
13:44:20 2023-09-11 11:44:12,256 DEBUG: Dependency 'datasets/training-sets' of stage: 'training' changed because it is 'modified'.
13:44:20 2023-09-11 11:44:12,257 DEBUG: stage: 'training' changed.
13:44:20 2023-09-11 11:44:12,271 DEBUG: built tree 'object d612b2a946d81ab74f8dfeeea7e41a8a.dir'
13:44:20 2023-09-11 11:44:12,273 DEBUG: built tree 'object f203ea8d0a44649090eb4d3debd6ed8d.dir'
13:44:20 2023-09-11 11:44:12,286 DEBUG: built tree 'object d612b2a946d81ab74f8dfeeea7e41a8a.dir'
13:44:20 2023-09-11 11:44:12,286 DEBUG: {'datasets/training-sets': 'modified'}
13:44:20 2023-09-11 11:44:12,298 DEBUG: built tree 'object d612b2a946d81ab74f8dfeeea7e41a8a.dir'
13:44:20 2023-09-11 11:44:12,300 DEBUG: built tree 'object f203ea8d0a44649090eb4d3debd6ed8d.dir'
13:44:20 Running stage 'training':
13:44:20 > conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
13:44:20 > conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
13:44:20 > conda run --no-capture --prefix .conda-envs/training/ python dependencies/scripts/rename.py model/
13:44:20 > cp -r stages/training/charsets model/
13:44:20 2023-09-11 11:44:12,302 DEBUG: stage: 'training' was reproduced
> So do I get rid of the DVC 2.x .dvc files and replace them with their DVC 3.x counterparts?

Yes, sorry for the confusion @Otterpatsch. I initially thought dvc commit -f would achieve that, but it doesn't do that today. We are looking into changing that, but for now you would need to do this yourself (a rough sketch of how that could be scripted follows below).
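For reference, a rough sketch of how the remove-and-re-add workaround could be scripted for all remaining 2.x files at once. The hash: md5 marker and the dvc remove / dvc add steps are the ones discussed above, but the path handling is an assumption, so review the resulting git diff before committing:

#!/bin/sh
# Re-add every dvc-add-tracked dataset whose .dvc file still lacks a "hash: md5" line.
# Assumes each <data>.dvc file sits next to the <data> directory/file it tracks.
set -eu

grep -rL "hash: md5" --include="*.dvc" . | while read -r dvcfile; do
    data_path="${dvcfile%.dvc}"      # e.g. ./datasets/benchmark-sets/customer0/2020_11_02
    echo "Upgrading $dvcfile"
    dvc remove "$dvcfile"            # drops the 2.x .dvc file, keeps the data in place
    dvc add "$data_path"             # re-adds it, writing a 3.x .dvc file
done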
So I just tried to do that (with version 3.22.0). I ran dvc repro --dry --allow-missing to detect all "bad" .dvc files, e.g. datasets/benchmark-sets/SomeCompanyName/2020_11_02.dvc. Then I ran rm datasets/benchmark-sets/SomeCompanyName/2020_11_02.dvc to get rid of the bad file, and dvc add datasets/benchmark-sets/SomeCompanyName/2020_11_02 to re-add the directory. See below the git diff (the line hash: md5 was inserted).
datasets/benchmark-sets/SomeCompanyName/2020_11_02.dvc
@@ -2,4 +2,5 @@ outs:
 - md5: f4eb1691cb23a5160a958274b9b9fb41.dir
   size: 55860614
   nfiles: 5491
+  hash: md5
   path: '2020_11_02'
Now I expected that if I run dvc repro --dry --allow-missing, I would no longer see the output md5: 'None' for that one specific file. But I still get the same output as earlier:
> dvc repro --dry --allow-missing --verbose | grep -P "datasets/benchmark-sets/SomeCompanyName/2020_11_02.dvc"
2023-09-20 12:34:13,083 DEBUG: Computed stage: 'datasets/benchmark-sets/SomeCompanyName/2020_11_02.dvc' md5: 'None'
That debug call comes from here:
It will only show a non-empty md5 for an actual stage, not a .dvc-tracked data source. The check for --allow-missing is separate and comes later, so this is expected.
Is dvc repro --dry --allow-missing
skipping the stage/working as expected?
Closing since I haven't heard back, but feel free to reopen if you still have issues @Otterpatsch.
I tried to update our DVC CI pipeline. Currently we have the following commands (among others):
- dvc pull, to check if everything is pushed
- dvc status, to check if the DVC status is clean, in other words that no repro would be run if one ran dvc repro.

But pulling takes a long time, and with the new --allow-missing feature I thought I could skip that with dvc data status --not-in-remote --json | grep -v not_in_remote and dvc repro --allow-missing --dry (see the sketch below). The first works as expected: it fails if data was forgotten to be pushed and succeeds if it was. But the latter just fails on missing data.
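Roughly, the intended change (a sketch reconstructed from the commands mentioned in this report; the exact CI wiring is project-specific):

# old check: needs all the data locally, hence the slow pull
dvc pull
dvc status

# intended pull-free check
dvc data status --not-in-remote --json | grep -v not_in_remote   # is everything pushed?
dvc repro --allow-missing --dry                                   # would any stage re-run?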
Reproduce
Example: Failure/Success on Machine Two and Three should be synced
Machine One:
Machine Two:
Machine Three:
Expected

On a machine where I did not dvc pull, I would expect, given a clean git state and a clean dvc data status --not-in-remote --json | grep -v not_in_remote state, that dvc repro --allow-missing --dry would succeed and show me that no stage had to be run.

Environment information
Linux

Output of dvc doctor: