iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

dvc move: loses deps when moving imported .dvc files. #7847

Open johnyaku opened 2 years ago

johnyaku commented 2 years ago

Bug Report

Description

When dvc move is applied to data imported via dvc import, the data files are moved successfully and a new .dvc file is created in the target location, but the new .dvc file no longer contains the deps: section (repo url, etc).

Reproduce

dvc import <repo url> directory
dvc move directory new_path

Expected

New .dvc file should retain deps: information.
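
A quick way to check for the symptom, as a minimal sketch: the helper name dvc_has_deps is hypothetical (it is not a DVC command), and it only looks for a top-level deps: key in the given .dvc file.

```bash
# Hypothetical helper, not part of DVC: succeed if the .dvc file
# still has a top-level deps: section.
dvc_has_deps () {
    grep -q '^deps:' "$1"
}
```

For example, running dvc_has_deps new_path.dvc || echo "deps: section missing" after the move would flag the problem described above.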

Environment information

Output of dvc doctor:

DVC version: 2.9.3 (conda)
---------------------------------
Platform: Python 3.10.2 on Linux-3.10.0-1160.42.2.el7.x86_64-x86_64-with-glibc2.17
Supports:
        webhdfs (fsspec = 2022.1.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        ssh (sshfs = 2021.11.2)
Cache types: symlink
Cache directory: panfs on panfs://...
Caches: local
Remotes: ssh, ssh, ssh
Workspace directory: panfs on panfs://...
Repo: dvc, git

johnyaku commented 2 years ago

Very keen to have this fixed. Might even delve into the code and make a PR. In the meantime, I use the following function instead of dvc move:

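function dvc_mv () {
    # Keep a copy of the original .dvc file, which still has its deps: section.
    local source=$1
    local destn=$2
    cp "$source.dvc" "$source.dvc.bk" || return 1
    # If the move fails, discard the backup and stop.
    dvc move "$source" "$destn" || { rm -f "$source.dvc.bk"; return 1; }
    if [[ -f $destn ]]; then
        # File destination: the backup replaces the .dvc file written by dvc move.
        mv "$source.dvc.bk" "$destn.dvc"
    elif [[ -d $destn ]]; then
        # Directory destination: the backup keeps the original .dvc file name.
        mv "$source.dvc.bk" "$destn/$(basename "$source").dvc"
    fi
}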

By putting this in a script file inside the etc/conda/activate.d subdirectory of the environment prefix, it is available whenever the environment is activated. Not sure how robust it is, or whether it covers all cases, but it seems to be working for now.
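
Installing the script could look roughly like this. It is only a sketch: install_activate_script is a hypothetical helper name, and it assumes the function is saved in a file ending in .sh and that $CONDA_PREFIX points at the active environment when no prefix is given.

```bash
# Hypothetical helper: copy a script into the activate.d directory of a
# conda environment. Falls back to the active environment ($CONDA_PREFIX)
# when no prefix is passed as the second argument.
install_activate_script () {
    local script=$1
    local prefix=${2:-${CONDA_PREFIX:-}}
    [ -n "$prefix" ] || { echo "no conda environment prefix given" >&2; return 1; }
    mkdir -p "$prefix/etc/conda/activate.d" &&
        cp "$script" "$prefix/etc/conda/activate.d/"
}
```

With the environment active, install_activate_script dvc_mv.sh would then make dvc_mv available the next time the environment is activated.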

12michi34 commented 1 year ago

Just hit the same issue: I was also expecting the deps entries to be preserved. @johnyaku thanks for the function. The only thing I noticed is that the "outs" section in the renamed .dvc file still contains the original file name, whereas a standard "dvc move" command would also update that field.

johnyaku commented 1 year ago

Thanks @12michi34. I guess the function could be a bit more robust, and I'm open to suggestions, but I've actually found it quicker to simply delete the original .dvc file and import again, especially when importing a large directory. If you take this approach then you may have to manually manage .gitignore files. Specifically, the changes created as a result of the new import should do what you want, but any leftover entries in .gitignore files might cause headaches later (or not). DVC creates a lot of .gitignore files, so it is best to check and clean up immediately if you take this approach.
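
To make the .gitignore clean-up less manual, something like the following could list candidate leftovers. It is a rough sketch under assumptions: stale_gitignore_entries is a hypothetical helper name, it only considers entries of the form /name (the form DVC is assumed to write for tracked outputs), and it does not understand wildcard patterns, so review the output before deleting anything.

```bash
# Hypothetical helper: print .gitignore entries of the form /name whose
# target no longer exists next to the .gitignore file that lists them.
# Entries containing wildcards are not interpreted and may be reported.
stale_gitignore_entries () {
    local root=${1:-.}
    find "$root" -name .gitignore ! -path '*/.dvc/*' | while IFS= read -r gi; do
        dir=$(dirname "$gi")
        while IFS= read -r entry || [ -n "$entry" ]; do
            case $entry in
                /*) [ -e "$dir$entry" ] || printf '%s: %s\n' "$gi" "$entry" ;;
            esac
        done < "$gi"
    done
}
```

Running stale_gitignore_entries from the repository root prints one line per suspicious entry, prefixed with the .gitignore file it came from.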

If you want to use my function, then I'd better clarify the semantics of the second parameter (the "destination"), which is named $destn inside the function. This can be either a file or a directory. If you specify a file, then the name of the new .dvc file will be based on this parameter. In your case, this sounds like exactly what you want. On the other hand, if you specify a directory then a file with the same name as the original .dvc file will be created in that directory.

Of course, there might also be bugs or scenarios that I didn't consider. Please reach out if so, although it would be better if this was fixed within DVC.

12michi34 commented 1 year ago

Hi @johnyaku. I just quickly tried adding a sed -i "s/$source/$destn/g" $destn.dvc to replace the string in the "file" case, which works for me. But yes, it would be great to get a proper DVC fix.
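
One caveat with the sed approach is that it replaces every occurrence of the old name, including any inside the deps: section. A narrower variant might look like the sketch below. The helper name dvc_fix_outs_path is hypothetical, and it assumes the usual .dvc layout where outs: is a top-level key followed by list items that each carry a path: field.

```bash
# Hypothetical helper, not part of DVC: rewrite the path: field(s) under
# the top-level outs: key of a .dvc file, leaving deps: untouched.
dvc_fix_outs_path () {
    local dvcfile=$1
    local newname=$2
    local tmp
    tmp=$(mktemp) || return 1
    awk -v new="$newname" '
        # A line starting at column 0 that is not a list item is a top-level key;
        # it opens the outs: section if it is exactly "outs:", otherwise closes it.
        /^[^ -]/ { in_outs = ($0 == "outs:") }
        # Inside outs:, replace the value of any path: field.
        in_outs && /^[ -]+path:/ { sub(/path:.*/, "path: " new) }
        { print }
    ' "$dvcfile" > "$tmp" && mv "$tmp" "$dvcfile"
}
```

In the "file" case this would be called as dvc_fix_outs_path "$destn.dvc" "$(basename "$destn")". Names containing & or backslashes would need escaping first, since awk treats them specially in the replacement text.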