iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0
13.96k stars 1.19k forks source link

add: Repeated `dvc add` clears executable flag from files (`cache.type = copy`) #7619

Open meowcat opened 2 years ago

meowcat commented 2 years ago

Bug Report

Description

With cache.type = copy, executing dvc add -R twice clears the executable flags from the added files. The first time, isexec is correctly added to the corresponding .dvc file; the second time, the files lose their executable flag and isexec is removed from the .dvc file.

Reproduce

#!/bin/bash

# create a repo with two files
mkdir dvc-reprex
cd dvc-reprex
mkdir dvcfiles
cd dvcfiles
echo 123 > f1
echo 234 > f2
cd ..
# Set the two files to executable
chmod a+x dvcfiles/*
# Verify
ls -lap dvcfiles

# Init git and dvc repo with cache.type copy
git init
dvc init
dvc config cache.type copy
git add .dvc/config 
git commit -m "setup dvc"

# add the dvcfiles dir
dvc add -R dvcfiles
git add dvcfiles
git commit -m "committed dvcfiles"

# Check that files are still executable:
ls -lap dvcfiles
# All is fine:
# total 20
# drwxr-xr-x 1 zzz zzz 52 Apr 22 15:11 ./
# drwxr-xr-x 1 zzz zzz 52 Apr 22 15:01 ../
# -rwxr--r-- 1 zzz zzz  4 Apr 22 14:59 f1
# -rw-r--r-- 1 zzz zzz 82 Apr 22 15:07 f1.dvc
# -rwxr--r-- 1 zzz zzz  4 Apr 22 15:00 f2
# -rw-r--r-- 1 zzz zzz 82 Apr 22 15:07 f2.dvc
# -rw-r--r-- 1 zzz zzz  8 Apr 22 15:07 .gitignore

# DVC is clean:
dvc status
# Data and pipelines are up to date.

# dvc add again, which should do nothing IMO:
dvc add -R dvcfiles

# However files are not executable now:
ls -lap dvcfiles
# total 20
# drwxr-xr-x 1 zzz zzz 52 Apr 22 15:11 ./
# drwxr-xr-x 1 zzz zzz 52 Apr 22 15:01 ../
# -rw-r--r-- 1 zzz zzz  4 Apr 22 14:59 f1
# -rw-r--r-- 1 zzz zzz 67 Apr 22 15:11 f1.dvc
# -rw-r--r-- 1 zzz zzz  4 Apr 22 15:00 f2
# -rw-r--r-- 1 zzz zzz 67 Apr 22 15:11 f2.dvc
# -rw-r--r-- 1 zzz zzz  8 Apr 22 15:07 .gitignore

git status
# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git restore <file>..." to discard changes in working directory)
#         modified:   dvcfiles/f1.dvc
#         modified:   dvcfiles/f2.dvc
git diff
# diff --git a/dvcfiles/f1.dvc b/dvcfiles/f1.dvc
# index 169c330..2d76ec6 100644
# --- a/dvcfiles/f1.dvc
# +++ b/dvcfiles/f1.dvc
# @@ -1,5 +1,4 @@
#  outs:
#  - md5: ba1f2511fc30423bdbb183fe33f3dd0f
#    size: 4
# -  isexec: true
#    path: f1
# diff --git a/dvcfiles/f2.dvc b/dvcfiles/f2.dvc
# index 22c3317..7c42519 100644
# --- a/dvcfiles/f2.dvc
# +++ b/dvcfiles/f2.dvc
# @@ -1,5 +1,4 @@
#  outs:
#  - md5: e42bb897d0afcdb1f1c46fb5e0c1ad22
#    size: 4
# -  isexec: true
#    path: f2

dvc version
# DVC version: 2.10.1 (pip)
# ---------------------------------
# Platform: Python 3.9.7 on Linux-5.10.0-11-amd64-x86_64-with-glibc2.31
# Supports:
#         webhdfs (fsspec = 2022.3.0),
#         http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
#         https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
# Cache types: reflink, hardlink, symlink
# Cache directory: btrfs on /dev/sda1
# Caches: local
# Remotes: None
# Workspace directory: btrfs on /dev/sda1
# Repo: dvc, git

git --version
# git version 2.30.2

Expected

One of two things:

Environment information

Debian 11, both btrfs and ext4

Output of dvc doctor:

$ dvc doctor
DVC version: 2.10.1 (pip)
---------------------------------
Platform: Python 3.9.7 on Linux-5.10.0-11-amd64-x86_64-with-glibc2.31
Supports:
        webhdfs (fsspec = 2022.3.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: reflink, hardlink, symlink
Cache directory: btrfs on /dev/sda1
Caches: local
Remotes: None
Workspace directory: btrfs on /dev/sda1
Repo: dvc, git

Additional Information (if any):

dberenbaum commented 2 years ago

@skshetry @efiop In the dvc add -R dvcfiles scenario, I'm not understanding why DVC doesn't preserve isexec since each individual file has its own .dvc file. In the case of adding the entire directory with dvc add dvcfiles, it might be trickier since there's no way currently to save the isexec info. WDYT?

It seems like it would also be nice to preserve the executable flag at the group/other levels, but let's leave that for a separate issue.

karajan1001 commented 2 years ago

I can reproduce it on the latest DVC version.

dberenbaum commented 2 years ago

It looks like this is a regression introduced in #7313.

Prior to that change, after running the script above, I get:

$ ll dvc-reprex/dvcfiles
total 40
drwxr-xr-x  7 dave  wheel  224 May  3 20:12 .
drwxr-xr-x  6 dave  wheel  192 May  3 20:12 ..
-rw-r--r--  1 dave  wheel    8 May  3 20:12 .gitignore
-rwxr--r--  1 dave  wheel    4 May  3 20:12 f1
-rw-r--r--  1 dave  wheel   82 May  3 20:12 f1.dvc
-rwxr--r--  1 dave  wheel    4 May  3 20:12 f2
-rw-r--r--  1 dave  wheel   82 May  3 20:12 f2.dvc

Running the same script after that change, I get:

$ ll dvc-reprex/dvcfiles
total 40
drwxr-xr-x  7 dave  wheel  224 May  3 20:12 .
drwxr-xr-x  6 dave  wheel  192 May  3 20:12 ..
-rw-r--r--  1 dave  wheel    8 May  3 20:12 .gitignore
-rw-r--r--  1 dave  wheel    4 May  3 20:12 f1
-rw-r--r--  1 dave  wheel   67 May  3 20:12 f1.dvc
-rw-r--r--  1 dave  wheel    4 May  3 20:12 f2
-rw-r--r--  1 dave  wheel   67 May  3 20:12 f2.dvc

cc @efiop

karajan1001 commented 2 years ago

I'm curious how we handle this kind of metadata for now. An example is the tar in Unix which is good at handling Unix permissions, hard links, softlinks, different times associated with files. To achieve this, it stores this metadata as a header to the original file.