Closed elevaitleo closed 4 weeks ago
Thanks for your report, especially for the detailed reproduction script. I tried it on my computer, and it creates an external repo for every single file. I guess that's why it was so slow.
2021-09-17 22:10:38,983 DEBUG: Creating external repo ../dvc-source/@0c305a625d9eaef3d7b3cd3c4b70cf34b87da1f6
2021-09-17 22:10:38,993 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,028 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,028 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,028 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,029 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,029 DEBUG: Creating external repo ../dvc-source/@0c305a625d9eaef3d7b3cd3c4b70cf34b87da1f6
2021-09-17 22:10:39,042 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,079 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,080 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,080 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
2021-09-17 22:10:39,080 TRACE: Assuming '/Users/gao/Code/test/6640/importing-repo/.dvc/cache/69/d7c1bb0f6a2993b3625bf165ec19f0.dir' is unchanged since it is read-only
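To check whether the external repo really is created once per file, the DEBUG lines in output like the log above can be tallied. The following is only a minimal sketch: it assumes the verbose output has been captured as text, and the helper name `count_external_repo_creations` is hypothetical, not part of DVC. The sample is abbreviated from the log in this thread.

```python
import re
from collections import Counter

# Matches the DEBUG message quoted in this thread and captures the repo target.
PATTERN = re.compile(r"DEBUG: Creating external repo (\S+)")


def count_external_repo_creations(log_text: str) -> Counter:
    """Count how often each external repo target is created in the log text."""
    return Counter(match.group(1) for match in PATTERN.finditer(log_text))


# Abbreviated sample: two DEBUG lines copied from the log, other lines shortened.
sample = """\
2021-09-17 22:10:38,983 DEBUG: Creating external repo ../dvc-source/@0c305a625d9eaef3d7b3cd3c4b70cf34b87da1f6
2021-09-17 22:10:38,993 TRACE: (other output, ignored by the pattern)
2021-09-17 22:10:39,029 DEBUG: Creating external repo ../dvc-source/@0c305a625d9eaef3d7b3cd3c4b70cf34b87da1f6
"""

counts = count_external_repo_creations(sample)
for target, n in counts.items():
    print(f"{n} x {target}")
```

If the count for a single target is close to the number of imported files, that would support the "one external repo per file" reading.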
@karajan1001 thanks for confirming. I'd be happy to help fix it if you can point me to the relevant part of the code.
In my use case, I have different registries separating different sources, and they are imported by the training repository. As a result, I import large datasets composed of many small files, and this leads to a dvc status that can take several hours.
Sorry for the late reply, I was on vacation for the past week.
Because this part of the code is currently being refactored, I'm not familiar with it myself. Looks like
it can't get fs_cache and keeps creating the external repo.
@karajan1001, thanks for the hint. If I have some time, I will try to have a look at it.
Is there any news on this? After a discussion on Discord, I think I'm facing the same issue. I have a repo into which I have imported a larger dataset (20G, 10k files) via dvc import (when the data is copied and added instead, everything is fine). Both dvc status and dvc commit are very slow.
I get this for every(?) file:
2022-04-19 10:22:57,772 DEBUG: Creating external repo ssh://git@my-git-remote:9999/rp/dvc-data-registry.git@43ea6a39915fcaf0a07e2535279142529bd10408
So it seems that the repo is cloned for every file. I also found that fs_cache is empty (in the code snippet from @karajan1001).
DVC version: 2.9.5 (also 2.8.3 and 2.10.1)
---------------------------------
Platform: Python 3.9.0 on Linux-5.8.18-100.fc31.x86_64-x86_64-with-glibc2.30
Supports:
webhdfs (fsspec = 2022.3.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/fedora_localhost--live-home
Caches: local
Remotes: local
It seems the data registry repo clone is cached, but the cached repo is queried for each file. Why is this necessary? Essentially, the repo stores only a single .dvc file. Couldn't this be kept in memory?
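The "keep it in memory" idea can be illustrated with plain memoization keyed by (url, rev). This is a hedged sketch of the general technique, not DVC's actual code: `FakeExternalRepo`, `get_external_repo`, `check_files`, the shortened revision, and the file paths are all hypothetical stand-ins.

```python
from functools import lru_cache

open_calls = []  # records each expensive "open", for demonstration only


class FakeExternalRepo:
    """Hypothetical stand-in for an opened external repo; not a DVC class."""

    def __init__(self, url: str, rev: str):
        self.url = url
        self.rev = rev


@lru_cache(maxsize=None)
def get_external_repo(url: str, rev: str) -> FakeExternalRepo:
    """Open the repo once per (url, rev); later calls reuse the cached object."""
    open_calls.append((url, rev))
    return FakeExternalRepo(url, rev)


def check_files(url: str, rev: str, paths: list) -> dict:
    """Look up every file through the same cached repo instance."""
    results = {}
    for path in paths:
        repo = get_external_repo(url, rev)  # cache hit after the first file
        results[path] = (repo.url, repo.rev)
    return results


# Hypothetical paths and shortened revision, chosen only for the demo.
paths = [f"data/file_{i}.txt" for i in range(10_000)]
results = check_files("../dvc-source/", "0c305a6", paths)
print(f"{len(results)} files checked, {len(open_calls)} repo open(s)")
```

With this pattern the expensive step runs once per distinct (url, rev) pair rather than once per file. Whether DVC can share a repo instance this way depends on internals not described in this thread.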
@skshetry Do you think we can include this scenario in the dvc status improvements, or should we keep it separate since it relates to imports and external repo cloning?
Unable to reproduce in 3.54.0
Bug Report
status: slow performance and high memory consumption after importing parent of a dvc-tracked folder
Description
I have a dvc repository with the following structure
I track the subfolder.
In a different repo, I want to import the parent folder. After importing it, if I run dvc status, the hash computation is really slow (around 13/s, while the dvc add in the first place was faster than 800/s). Moreover, the memory consumption increases steadily over time and is more than 8 GB for around 10k files (each containing only a few characters).
Reproduce
Expected
If I dvc import parent/subfolder, then dvc status is fast and memory consumption is low. I would expect this behavior also after importing parent.
Environment information
Output of dvc doctor:
Additional Information (if any):