iterative / dvc-bench

Benchmarks for DVC
http://bench.dvc.org/
Apache License 2.0
20 stars 10 forks source link

6.68x regression in `test_sharing-add-noop` #315

Closed dberenbaum closed 2 months ago

dberenbaum commented 2 years ago

Between 2.9.1 and 2.6.3, there was a 6.68x regression in the test_sharing-add-noop benchmark:

Screen Shot 2022-01-20 at 9 18 21 AM
dberenbaum commented 2 years ago

checkout-noop benchmarks are also slower:

Screen Shot 2022-01-20 at 9 22 08 AM Screen Shot 2022-01-20 at 9 21 50 AM
skshetry commented 2 years ago

This must have started happening from 2.8.0 and is expected, see https://github.com/iterative/dvc/pull/6648#issuecomment-924456165. '

dberenbaum commented 2 years ago

Do you mean for the checkout behavior specifically? Is it expected that it will also impact add-noop? And by that much?

skshetry commented 2 years ago

That change affected any command that internally used checkout, including add/checkout/push/pull. Before we used to be intelligent, and used modifiedTime of the directory to skip on noop. With that above change, we try to collect the directory contents anyway (@efiop, please correct me if I'm wrong here). So the regression could be well 1000x for a directory with a million of files and hence not very meaningful here.

dberenbaum commented 2 years ago

@skshetry Do we have any plan to address this yet?

skshetry commented 2 years ago

@skshetry Do we have any plan to address this yet?

yes, we are hoping to improve this by optimizing diff and staging.

dberenbaum commented 2 years ago

Before we used to be intelligent, and used modifiedTime of the directory to skip on noop.

Is there any plan to get back to this? Optimizing diff and staging will presumably still be slower than the previous behavior, right?

skshetry commented 2 years ago

Is there any plan to get back to this?

I think I was wrong about that, last time I tried it was always trying to stage, so I thought it was the state changes. Or maybe something has changed.

Right now, the time is spent on unnecessary file checkouts. DVC is correctly loading the cached .dir file /Tree from the state. You can see profiling data in the following image:

Screen Shot 2022-05-14 at 12 18 01

* Total runtime is 2.57s for comparison.

The plan is to optimize diff and staging so that we can avoid those unnecessary checkouts. And, in the long term, introduce parallel checkouts.

skshetry commented 2 years ago

dvc add-ing 1 million files (with 1KB each):

Screen Shot 2022-05-15 at 18 01 22

./dump.prof Took 3h20m38s in total.

dberenbaum commented 2 years ago

Interesting, thanks @skshetry. Not sure what's going on with the benchmarks in that case. I'm not seeing the reported regression when testing with an unzipped http://images.cocodataset.org/zips/test2014.zip. In fact, it appears to have improved significantly.

It took over a minute to add in 2.6.3:

$ dvc add test2014; time dvc add test2014; dvc doctor
100% Adding...|████████████████████████████████████████████████|1/1 [00:28, 28.25s/file]

To track the changes with git, run:

        git add test2014.dvc
100% Adding...|████████████████████████████████████████████████|1/1 [01:03, 63.93s/file]

To track the changes with git, run:

        git add test2014.dvc
dvc add test2014  27.49s user 14.15s system 64% cpu 1:04.72 total
DVC version: 2.6.3
---------------------------------
Platform: Python 3.10.2 on macOS-12.3.1-arm64-arm-64bit
Supports:
        azure (adlfs = 2021.7.1, knack = 0.9.0, azure-identity = 1.7.1),
        gdrive (pydrive2 = 1.10.1),
        gs (gcsfs = 2021.7.0),
        hdfs (pyarrow = 7.0.0),
        webhdfs (hdfs = 2.5.8),
        http (requests = 2.27.1),
        https (requests = 2.27.1),
        s3 (s3fs = 2021.7.0, boto3 = 1.21.21),
        ssh (sshfs = 2021.7.1),
        oss (ossfs = 2021.7.5),
        webdav (webdav4 = 0.9.4),
        webdavs (webdav4 = 0.9.4)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

You are using dvc version 2.6.3; however, version 2.10.2 is available.
Find the latest release at https://github.com/iterative/dvc/releases/latest.

And it only took 22 seconds in 2.10.0:

$ dvc add test2014; time dvc add test2014; dvc doctor
100% Adding...|████████████████████████████████████████████████|1/1 [00:25, 25.18s/file]

To track the changes with git, run:

    git add test2014.dvc

To enable auto staging, run:

        dvc config core.autostage true
100% Adding...|████████████████████████████████████████████████|1/1 [00:22, 22.50s/file]

To track the changes with git, run:

    git add test2014.dvc

To enable auto staging, run:

        dvc config core.autostage true
dvc add test2014  6.01s user 10.39s system 71% cpu 22.907 total
DVC version: 2.10.0
---------------------------------
Platform: Python 3.10.2 on macOS-12.3.1-arm64-arm-64bit
Supports:
        azure (adlfs = 2022.4.0, knack = 0.9.0, azure-identity = 1.7.1),
        gdrive (pydrive2 = 1.10.1),
        gs (gcsfs = 2022.3.0),
        hdfs (fsspec = 2022.3.0, pyarrow = 7.0.0),
        webhdfs (fsspec = 2022.3.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.5),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.5),
        s3 (s3fs = 2022.3.0, boto3 = 1.21.21),
        ssh (sshfs = 2022.3.1),
        oss (ossfs = 2021.8.0),
        webdav (webdav4 = 0.9.4),
        webdavs (webdav4 = 0.9.4)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

You are using dvc version 2.10.0; however, version 2.10.2 is available.
Find the latest release at https://github.com/iterative/dvc/releases/latest.
skshetry commented 2 years ago

I'm not seeing the reported regression

need to investigate, but what differs between your setup and in the benchmark is the link type: reflink vs copy.

dberenbaum commented 2 years ago

Thanks. Using copy link type, I'm seeing a slight regression, but also it's way faster than it was for reflink. Is that expected?

2.6.3:

$ dvc config cache.type copy; echo foo >> test2014/foo; dvc add test2014; time dvc add test2014; dvc doctor
100% Adding...|████████████████████████████████████████████████|1/1 [00:26, 26.46s/file]

To track the changes with git, run:

        git add test2014.dvc
100% Adding...|████████████████████████████████████████████████|1/1 [00:04,  4.62s/file]

To track the changes with git, run:

        git add test2014.dvc
dvc add test2014  3.76s user 1.42s system 97% cpu 5.322 total
DVC version: 2.6.3
---------------------------------
Platform: Python 3.10.2 on macOS-12.3.1-arm64-arm-64bit
Supports:
        azure (adlfs = 2021.7.1, knack = 0.9.0, azure-identity = 1.7.1),
        gdrive (pydrive2 = 1.10.1),
        gs (gcsfs = 2021.7.0),
        hdfs (pyarrow = 7.0.0),
        webhdfs (hdfs = 2.5.8),
        http (requests = 2.27.1),
        https (requests = 2.27.1),
        s3 (s3fs = 2021.7.0, boto3 = 1.21.21),
        ssh (sshfs = 2021.7.1),
        oss (ossfs = 2021.7.5),
        webdav (webdav4 = 0.9.4),
        webdavs (webdav4 = 0.9.4)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

You are using dvc version 2.6.3; however, version 2.10.2 is available.
Find the latest release at https://github.com/iterative/dvc/releases/latest.

2.10.0:

$ dvc config cache.type copy; echo foo >> test2014/foo; dvc add test2014; time dvc add test2014; dvc doctor
100% Adding...|████████████████████████████████████████████████|1/1 [00:25, 25.22s/file]

To track the changes with git, run:

    git add test2014.dvc

To enable auto staging, run:

        dvc config core.autostage true
100% Adding...|████████████████████████████████████████████████|1/1 [00:07,  7.97s/file]

To track the changes with git, run:

    git add test2014.dvc

To enable auto staging, run:

        dvc config core.autostage true
dvc add test2014  4.70s user 3.15s system 91% cpu 8.582 total
DVC version: 2.10.0
---------------------------------
Platform: Python 3.10.2 on macOS-12.3.1-arm64-arm-64bit
Supports:
        azure (adlfs = 2021.7.1, knack = 0.9.0, azure-identity = 1.7.1),
        gdrive (pydrive2 = 1.10.1),
        gs (gcsfs = 2021.7.0),
        hdfs (fsspec = 2021.7.0, pyarrow = 7.0.0),
        webhdfs (fsspec = 2021.7.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.5),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.5),
        s3 (s3fs = 2021.7.0, boto3 = 1.21.21),
        ssh (sshfs = 2021.7.1),
        oss (ossfs = 2021.7.5),
        webdav (webdav4 = 0.9.4),
        webdavs (webdav4 = 0.9.4)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

You are using dvc version 2.10.0; however, version 2.10.2 is available.
Find the latest release at https://github.com/iterative/dvc/releases/latest.
skshetry commented 2 years ago

Thanks. Using copy link type, I'm seeing a slight regression, but also it's way faster than it was for reflink. Is that expected?

Even in copy linktype, dvc still tries to reflink first and fallbacks to copy.

dberenbaum commented 2 years ago

Even in copy linktype, dvc still tries to reflink first and fallbacks to copy.

Any idea why the operation was so much faster?