iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

Large memory consumption and slow hashing on `dvc status` after `dvc import` of parent of dvc-tracked folder #6640

Closed. elevaitleo closed this issue 4 weeks ago.

elevaitleo commented 3 years ago

Bug Report

status: slow performance and high memory consumption after importing parent of a dvc-tracked folder

Description

I have a dvc repository with the following structure

parent/subfolder/[many files]

I track the subfolder.

In a different repo, I want to import the parent folder. After importing it, running dvc status is very slow: hashing proceeds at around 13 md5/s, whereas the original dvc add ran at more than 800 md5/s. Moreover, memory consumption grows steadily over time, exceeding 8 GB for around 10k files (each containing only a few characters).

Reproduce

# Create first dvc repo from which to import later
mkdir dvc-source
cd dvc-source
git init
dvc init
git commit -m "empty dataset"
# create files
mkdir -p parent/subfolder
for n in $(seq 10000); do echo $n > parent/subfolder/$n.txt; done
dvc add parent/subfolder  #  ~800 md5/s
git add parent/.gitignore parent/subfolder.dvc
git commit -m "add dataset"

# Create importing repo
cd ..
mkdir importing-repo
cd importing-repo
git init
dvc init
git commit -m "empty dataset"
dvc import ../dvc-source/ parent
dvc status  # slow <20 md5/s and memory allocation goes up
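
A rough way to quantify both symptoms (a sketch, assuming GNU time is available as /usr/bin/time; exact figures will vary by machine):

/usr/bin/time -v dvc status   # "Maximum resident set size" reports peak memory; the progress bar shows the md5/s rate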

Expected

If I dvc import parent/subfolder instead, then dvc status is fast and memory consumption is low. I would expect the same behavior after importing parent.
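
For comparison, a minimal sketch of the fast path described above (assuming a fresh importing repo created next to dvc-source):

dvc import ../dvc-source/ parent/subfolder   # import the dvc-tracked subfolder directly
dvc status                                   # fast hashing, low memory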

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.7.3 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.11.0-34-generic-x86_64-with-glibc2.29
Supports:
    azure (adlfs = 2021.9.1, knack = 0.8.2, azure-identity = 1.6.1),
    gdrive (pydrive2 = 1.9.3),
    gs (gcsfs = 2021.8.1),
    hdfs (pyarrow = 5.0.0),
    webhdfs (hdfs = 2.6.0),
    http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
    https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
    s3 (s3fs = 2021.8.1, boto3 = 1.17.106),
    ssh (sshfs = 2021.8.1),
    oss (ossfs = 2021.8.0),
    webdav (webdav4 = 0.9.1),
    webdavs (webdav4 = 0.9.1)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/vgubuntu-root
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/mapper/vgubuntu-root
Repo: dvc, git

Additional Information (if any):

karajan1001 commented 3 years ago

Thanks for your report, especially the detailed reproduce script. I tried it on my computer: it creates an external repo for every single file, which I guess is why it is so slow.

2021-09-17 22:10:38,983 DEBUG: Creating external repo ../dvc-source/@0c305a625d9eaef3d7b3cd3c4b70cf34b87da1f6
2021-09-17 22:10:38,993 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,028 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,028 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,028 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,029 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,029 DEBUG: Creating external repo ../dvc-source/@0c305a625d9eaef3d7b3cd3c4b70cf34b87da1f6
2021-09-17 22:10:39,042 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,079 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,080 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,080 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,080 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
elevaitleo commented 3 years ago

@karajan1001 thanks for confirming. I'm ready to help fix it if you can point me to the relevant part of the code. In my use case, I have different registries separating different sources, and they are imported by the training repository. So I import large datasets composed of many small files, which leads to a dvc status that can take several hours...

karajan1001 commented 2 years ago

Sorry for the late reply, I've been on vacation for the past week.

This part of the code is currently being refactored, so even I'm not very familiar with it. It looks like

https://github.com/iterative/dvc/blob/48c05277556eacbdf523cb6fc358c326f4f96d64/dvc/objects/reference.py#L96-L101

can't get the fs_cache and keeps creating a new external repo each time.

elevaitleo commented 2 years ago

@karajan1001, thanks for the hint. If I find some time, I will try to take a look at it.

weidenka commented 2 years ago

Is there any news on this? After a discussion on Discord, I think I'm facing the same issue. I have a repo into which I imported a larger dataset (20 GB, 10k files) via dvc import (when the data is copied and added directly, everything is fine), and dvc status and dvc commit are very slow.

I get this for every(?) file:

2022-04-19 10:22:57,772 DEBUG: Creating external repo ssh://git@my-git-remote:9999/rp/dvc-data-registry.git@43ea6a39915fcaf0a07e2535279142529bd10408

So it seems the repo is cloned for every file. I also noticed that fs_cache is empty (in the code snippet from @karajan1001).
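
A rough way to confirm how many times the external repo gets recreated (a sketch, assuming verbose logging; the exact debug message may differ between DVC versions):

dvc status -v 2>&1 | grep -c "Creating external repo"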

DVC version: 2.9.5 (also 2.8.3 and 2.10.1)
---------------------------------
Platform: Python 3.9.0 on Linux-5.8.18-100.fc31.x86_64-x86_64-with-glibc2.30
Supports:
        webhdfs (fsspec = 2022.3.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/fedora_localhost--live-home
Caches: local
Remotes: local
weidenka commented 2 years ago

It seems the clone of the data registry repo is cached, but the cached repo is still queried for every single file. Why is this necessary? Essentially the repo only stores a single .dvc file. Couldn't this be kept in memory?

dberenbaum commented 2 years ago

@skshetry Do you think we can include this scenario in dvc status improvements, or should we keep it separate since it relates to imports and external repo cloning?

dberenbaum commented 4 weeks ago

Unable to reproduce in 3.54.0