Closed dberenbaum closed 2 months ago
checkout-noop
benchmarks are also slower:
This must have started happening from 2.8.0 and is expected, see https://github.com/iterative/dvc/pull/6648#issuecomment-924456165. '
Do you mean for the checkout
behavior specifically? Is it expected that it will also impact add-noop
? And by that much?
That change affected any command that internally used checkout
, including add
/checkout
/push
/pull
. Before we used to be intelligent, and used modifiedTime
of the directory to skip on noop
. With that above change, we try to collect the directory contents anyway (@efiop, please correct me if I'm wrong here). So the regression could be well 1000x for a directory with a million of files and hence not very meaningful here.
@skshetry Do we have any plan to address this yet?
@skshetry Do we have any plan to address this yet?
yes, we are hoping to improve this by optimizing diff and staging.
Before we used to be intelligent, and used
modifiedTime
of the directory to skip onnoop
.
Is there any plan to get back to this? Optimizing diff and staging will presumably still be slower than the previous behavior, right?
Is there any plan to get back to this?
I think I was wrong about that, last time I tried it was always trying to stage, so I thought it was the state changes. Or maybe something has changed.
Right now, the time is spent on unnecessary file checkouts. DVC is correctly loading the cached .dir
file /Tree
from the state. You can see profiling data in the following image:
* Total runtime is 2.57s for comparison.
The plan is to optimize diff
and staging
so that we can avoid those unnecessary checkouts. And, in the long term, introduce parallel checkouts.
dvc add
-ing 1 million files (with 1KB
each):
./dump.prof Took 3h20m38s in total.
Interesting, thanks @skshetry. Not sure what's going on with the benchmarks in that case. I'm not seeing the reported regression when testing with an unzipped http://images.cocodataset.org/zips/test2014.zip. In fact, it appears to have improved significantly.
It took over a minute to add in 2.6.3:
$ dvc add test2014; time dvc add test2014; dvc doctor
100% Adding...|████████████████████████████████████████████████|1/1 [00:28, 28.25s/file]
To track the changes with git, run:
git add test2014.dvc
100% Adding...|████████████████████████████████████████████████|1/1 [01:03, 63.93s/file]
To track the changes with git, run:
git add test2014.dvc
dvc add test2014 27.49s user 14.15s system 64% cpu 1:04.72 total
DVC version: 2.6.3
---------------------------------
Platform: Python 3.10.2 on macOS-12.3.1-arm64-arm-64bit
Supports:
azure (adlfs = 2021.7.1, knack = 0.9.0, azure-identity = 1.7.1),
gdrive (pydrive2 = 1.10.1),
gs (gcsfs = 2021.7.0),
hdfs (pyarrow = 7.0.0),
webhdfs (hdfs = 2.5.8),
http (requests = 2.27.1),
https (requests = 2.27.1),
s3 (s3fs = 2021.7.0, boto3 = 1.21.21),
ssh (sshfs = 2021.7.1),
oss (ossfs = 2021.7.5),
webdav (webdav4 = 0.9.4),
webdavs (webdav4 = 0.9.4)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
You are using dvc version 2.6.3; however, version 2.10.2 is available.
Find the latest release at https://github.com/iterative/dvc/releases/latest.
And it only took 22 seconds in 2.10.0:
$ dvc add test2014; time dvc add test2014; dvc doctor
100% Adding...|████████████████████████████████████████████████|1/1 [00:25, 25.18s/file]
To track the changes with git, run:
git add test2014.dvc
To enable auto staging, run:
dvc config core.autostage true
100% Adding...|████████████████████████████████████████████████|1/1 [00:22, 22.50s/file]
To track the changes with git, run:
git add test2014.dvc
To enable auto staging, run:
dvc config core.autostage true
dvc add test2014 6.01s user 10.39s system 71% cpu 22.907 total
DVC version: 2.10.0
---------------------------------
Platform: Python 3.10.2 on macOS-12.3.1-arm64-arm-64bit
Supports:
azure (adlfs = 2022.4.0, knack = 0.9.0, azure-identity = 1.7.1),
gdrive (pydrive2 = 1.10.1),
gs (gcsfs = 2022.3.0),
hdfs (fsspec = 2022.3.0, pyarrow = 7.0.0),
webhdfs (fsspec = 2022.3.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.5),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.5),
s3 (s3fs = 2022.3.0, boto3 = 1.21.21),
ssh (sshfs = 2022.3.1),
oss (ossfs = 2021.8.0),
webdav (webdav4 = 0.9.4),
webdavs (webdav4 = 0.9.4)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
You are using dvc version 2.10.0; however, version 2.10.2 is available.
Find the latest release at https://github.com/iterative/dvc/releases/latest.
I'm not seeing the reported regression
need to investigate, but what differs between your setup and in the benchmark is the link type: reflink vs copy.
Thanks. Using copy link type, I'm seeing a slight regression, but also it's way faster than it was for reflink. Is that expected?
2.6.3:
$ dvc config cache.type copy; echo foo >> test2014/foo; dvc add test2014; time dvc add test2014; dvc doctor
100% Adding...|████████████████████████████████████████████████|1/1 [00:26, 26.46s/file]
To track the changes with git, run:
git add test2014.dvc
100% Adding...|████████████████████████████████████████████████|1/1 [00:04, 4.62s/file]
To track the changes with git, run:
git add test2014.dvc
dvc add test2014 3.76s user 1.42s system 97% cpu 5.322 total
DVC version: 2.6.3
---------------------------------
Platform: Python 3.10.2 on macOS-12.3.1-arm64-arm-64bit
Supports:
azure (adlfs = 2021.7.1, knack = 0.9.0, azure-identity = 1.7.1),
gdrive (pydrive2 = 1.10.1),
gs (gcsfs = 2021.7.0),
hdfs (pyarrow = 7.0.0),
webhdfs (hdfs = 2.5.8),
http (requests = 2.27.1),
https (requests = 2.27.1),
s3 (s3fs = 2021.7.0, boto3 = 1.21.21),
ssh (sshfs = 2021.7.1),
oss (ossfs = 2021.7.5),
webdav (webdav4 = 0.9.4),
webdavs (webdav4 = 0.9.4)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
You are using dvc version 2.6.3; however, version 2.10.2 is available.
Find the latest release at https://github.com/iterative/dvc/releases/latest.
2.10.0:
$ dvc config cache.type copy; echo foo >> test2014/foo; dvc add test2014; time dvc add test2014; dvc doctor
100% Adding...|████████████████████████████████████████████████|1/1 [00:25, 25.22s/file]
To track the changes with git, run:
git add test2014.dvc
To enable auto staging, run:
dvc config core.autostage true
100% Adding...|████████████████████████████████████████████████|1/1 [00:07, 7.97s/file]
To track the changes with git, run:
git add test2014.dvc
To enable auto staging, run:
dvc config core.autostage true
dvc add test2014 4.70s user 3.15s system 91% cpu 8.582 total
DVC version: 2.10.0
---------------------------------
Platform: Python 3.10.2 on macOS-12.3.1-arm64-arm-64bit
Supports:
azure (adlfs = 2021.7.1, knack = 0.9.0, azure-identity = 1.7.1),
gdrive (pydrive2 = 1.10.1),
gs (gcsfs = 2021.7.0),
hdfs (fsspec = 2021.7.0, pyarrow = 7.0.0),
webhdfs (fsspec = 2021.7.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.5),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.5),
s3 (s3fs = 2021.7.0, boto3 = 1.21.21),
ssh (sshfs = 2021.7.1),
oss (ossfs = 2021.7.5),
webdav (webdav4 = 0.9.4),
webdavs (webdav4 = 0.9.4)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
You are using dvc version 2.10.0; however, version 2.10.2 is available.
Find the latest release at https://github.com/iterative/dvc/releases/latest.
Thanks. Using copy link type, I'm seeing a slight regression, but also it's way faster than it was for reflink. Is that expected?
Even in copy linktype, dvc still tries to reflink first and fallbacks to copy.
Even in copy linktype, dvc still tries to reflink first and fallbacks to copy.
Any idea why the operation was so much faster?
Between 2.9.1 and 2.6.3, there was a 6.68x regression in the
test_sharing-add-noop
benchmark: