iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

benchmark regressions #9914

Closed dberenbaum closed 10 months ago

dberenbaum commented 1 year ago

Investigate regressions in https://github.com/iterative/dvc-bench for:

skshetry commented 1 year ago

For exp show and plots, it looks like the repository used for benchmarking is example-get-started, which is incompatible with 2.x due to the artifacts section in dvc.yaml and the md5/hash changes in dvc.lock, so those benchmarks are failing.

Plots were never in the sub-second range.

dberenbaum commented 1 year ago

Makes sense, so we only need to investigate push.

dberenbaum commented 1 year ago

Added pull, based on issues since 3.20.

pmrowla commented 1 year ago

It looks like the issue for both fetch and push is with index collection - specifically the problem is that sqltrie traverse is slow.

This is a local fetch of the cats_dogs dataset (25k files) on main; using 3.10.0 (without the index collection), the fetch takes only 10.3s with the profiler enabled:

[Screenshot 2023-11-15 at 5:46:47 PM]

Caching the sqltrie._traverse() results in memory cuts the fetch time in half (exactly in half, because we call traverse twice per file), but the real issue is that the underlying script that generates the temporary steps tables for traversal has to be run once per file, and that temp table generation is the main performance blocker.
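For illustration only, the caching experiment can be sketched roughly like this (the wrapper class is hypothetical; only the `_traverse` name comes from sqltrie):

```python
from functools import lru_cache


class CachedTraversalTrie:
    """Hypothetical wrapper that memoizes expensive traversal calls.

    Memoizing only helps when the same key is traversed more than once
    (here, twice per file); the temp-table generation per unique key
    remains the real bottleneck.
    """

    def __init__(self, trie):
        self._trie = trie
        # Cache results keyed by the (hashable) node key.
        self._cached_traverse = lru_cache(maxsize=None)(self._trie._traverse)

    def traverse(self, key):
        return self._cached_traverse(key)
```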

So fixing this will require revisiting how the sqltrie traversal queries work. Or maybe we need to consider only using sqltrie for dump/load and then do all of our index operations in memory. cc @efiop

pmrowla commented 1 year ago

Actually, it looks like we can probably just get away with building an in-memory pygtrie for sqltrie views, since views are the only place where we heavily use traverse right now.
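A rough sketch of that idea, assuming the view can be iterated as (key, value) pairs (pygtrie's public API is real; everything else here is hypothetical):

```python
import pygtrie


def build_view_trie(items):
    """Materialize a view as an in-memory pygtrie.

    `items` is assumed to yield (key, value) pairs where key is a tuple of
    path parts, e.g. (("data", "cats", "1.jpg"), entry).
    """
    trie = pygtrie.Trie()
    for key, value in items:
        trie[key] = value
    return trie


# Traversal then happens entirely in memory instead of hitting SQLite:
# for key, value in build_view_trie(view_items).iteritems(prefix=("data",)):
#     ...
```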

dberenbaum commented 11 months ago

I think closing was an accident @efiop?

efiop commented 11 months ago

Yeah, sorry about that.

It looks like we only have push left to check.

dberenbaum commented 11 months ago

Does https://github.com/iterative/sqltrie/pull/24 fix all pull regressions? Also, I opened https://github.com/iterative/dvc-bench/issues/478 since I noticed the benchmarks are failing.

pmrowla commented 11 months ago

Yes, all the pull regressions should be resolved after the sqltrie changes.

pmrowla commented 11 months ago

bench.dvc.org is up to date now. Pull regressions have been resolved, but it looks like push has still regressed since 3.10:

[charts: pull, push]

skshetry commented 11 months ago

pull still seems to have regressed, no? The index migration for pull/fetch/checkout happened in https://github.com/iterative/dvc/pull/9424 in 3.0.

efiop commented 11 months ago

Switched pull -> fetch in https://github.com/iterative/dvc/pull/10117 and there is no huge regression for local:

---------------------------------------------------------------------------- benchmark 'test_fetch-fetch': 5 tests ----------------------------------------------------------------------------
Name (time in s)                  Min                 Max                Mean            StdDev              Median               IQR            Outliers     OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_fetch-fetch[2.58.2]     112.4179 (1.0)      112.4179 (1.0)      112.4179 (1.0)      0.0000 (1.0)      112.4179 (1.0)      0.0000 (1.0)           0;0  0.0089 (1.0)           1           1
test_fetch-fetch[3.0.0]      118.5058 (1.05)     118.5058 (1.05)     118.5058 (1.05)     0.0000 (1.0)      118.5058 (1.05)     0.0000 (1.0)           0;0  0.0084 (0.95)          1           1
test_fetch-fetch[main]       118.6735 (1.06)     118.6735 (1.06)     118.6735 (1.06)     0.0000 (1.0)      118.6735 (1.06)     0.0000 (1.0)           0;0  0.0084 (0.95)          1           1
test_fetch-fetch[3.10.0]     119.2171 (1.06)     119.2171 (1.06)     119.2171 (1.06)     0.0000 (1.0)      119.2171 (1.06)     0.0000 (1.0)           0;0  0.0084 (0.94)          1           1
test_fetch-fetch[3.20.0]     119.2340 (1.06)     119.2340 (1.06)     119.2340 (1.06)     0.0000 (1.0)      119.2340 (1.06)     0.0000 (1.0)           0;0  0.0084 (0.94)          1           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

but there is for checkout, which we already knew.

efiop commented 11 months ago

Same for real clouds:

[Screenshots 2023-11-29 at 09:46:20, 09:46:31, 09:46:47]

So it is not pull, it is checkout. Let's leave checkout out of this issue; only push is left to check here.

dberenbaum commented 11 months ago

Why not add checkout to the list? It looks like that is a significant regression.

efiop commented 11 months ago

@dberenbaum Ah, let me clarify: I meant to leave it out of the current optimization effort, considering the schedule. Unlike fetch/push, which migrated fully to index, checkout also relies quite a bit on what we do with state and other things, so even though I don't oppose taking another look there right now, we will absolutely need to revisit it once state is removed. At least things like noop and update haven't regressed much.

[Screenshot 2023-11-29 at 16:32:01]

efiop commented 11 months ago

For the record: I'm taking a look at checkout.

dberenbaum commented 11 months ago

Discussed with @efiop and summarizing here. Let's do our best to finalize the major changes needed around index so dvc 3.x can be in a better place and so it doesn't drag into next year as a major project. That means prioritizing both push and checkout here, and more generally removing state and migrating checkout and add to use index (@efiop please correct me if I'm wrong).

efiop commented 11 months ago

Another thing to note: only the copy link type degraded; symlink and hardlink are actually better than before 3.0, and they are pretty stable:

[Screenshots 2023-11-30 at 00:52:43, 00:52:55]

and locally on macOS I can also see checkout with reflink being almost twice as fast as it was in 2.58.2. I'm looking into copy, but will probably post a summary on it in the morning.

efiop commented 11 months ago

Ok, so for the record: with small files (especially the ones in mnist), the overhead of callback creation in https://github.com/iterative/dvc-objects/blob/231e5273cf929cea2424c9d6e885f72bb5d90750/src/dvc_objects/fs/utils.py#L167 is just too wasteful and unnecessary. There are a few other things we can do there as well (like using a file-sized buffer for small files, using a larger buffer for large files, or even just using a shutil copy function that can leverage things like sendfile if necessary), but even that alone speeds up link creation for me by about 50x. reflink/hardlink/symlink bypass that code path and thus don't suffer from the overhead, so our goal here is for copy to take about the same time as the other link types when dealing with small files. The whole copyfile callback situation is rather messy, and I'm somewhat keen on revisiting it a bit more. I'll send some patches in the morning.
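A minimal sketch of the buffer-size idea, with made-up thresholds (the real dvc-objects code also has to thread callbacks and filesystem abstractions through):

```python
import os
import shutil

SMALL_FILE_BUF = 1024 * 1024        # hypothetical: read small files in one go
LARGE_FILE_BUF = 16 * 1024 * 1024   # hypothetical: bigger chunks for big files


def copy_with_adaptive_buffer(src, dst):
    size = os.path.getsize(src)
    if size <= SMALL_FILE_BUF:
        # One read/write per small file; no per-chunk callback overhead.
        with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
            fdst.write(fsrc.read())
    else:
        # shutil.copyfileobj lets us pick a larger buffer; stdlib shutil.copyfile
        # can also use fast paths like sendfile on some platforms.
        with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
            shutil.copyfileobj(fsrc, fdst, length=LARGE_FILE_BUF)
```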

EDIT: Merged https://github.com/iterative/dvc-objects/pull/229 and triggered https://github.com/iterative/dvc-bench/actions/runs/7040080174 , let's see...

skshetry commented 11 months ago

What changed between 2.58.2 and 3.0.0? The callback was still there before that, no?

efiop commented 11 months ago

The main thing that changed is https://github.com/iterative/dvc/pull/9444 . Before it we were processing files one-by-one and passing them to generic.transfer(), while after it, among other things, we pass all files that we need to create to generic.transfer(), which adds a bit more overhead in different places down the line.

This is a bit early, but I've sent a few more fixes, and now, at least on my Linux machine, it looks like we are not only back to 2.58.2 but actually faster too (~30sec vs ~37sec for mnist checkout). Waiting for proper confirmation in dvc-bench https://github.com/iterative/dvc-bench/actions/runs/7066555049/job/19239201268

efiop commented 11 months ago

Ok, so looks promising so far.

[Screenshot 2023-12-02 at 03:11:27]

3.20 also improved because it also installed the new dvc-objects version. So indeed, checkout-copy is now faster than it was in 2.x.

But because we've unified all transfer-like logic in the past year, we also get these improvements in other commands that use the same code. E.g. fetch from a local remote is now 2.5 times faster than it was:

[Screenshot 2023-12-02 at 03:13:21]

and push to a local remote is now ~2 times faster than before and over 4 times faster than in 2.x:

[Screenshot 2023-12-02 at 03:16:20]

and dvc get-url

[Screenshot 2023-12-02 at 03:18:42]

and dvc import-url

[Screenshot 2023-12-02 at 03:19:17]

and a few more. We'll see on bench.dvc.org once it's ready.

PS: Hopefully not a fluke...

skshetry commented 11 months ago

Great work, @efiop. The only thing I want to mention is that https://github.com/iterative/dvc-objects/pull/231 is going to make things slower for users with large files, so we should benchmark on that kind of dataset to be sure. ThreadPoolExecutor has high overhead for small files, but that overhead is noise for larger files, which take a lot of time anyway. Parallelizing also helps SSDs, since they have a lot of bandwidth, and helps HDDs plan their shortest seek route.
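To illustrate the trade-off, a sketch with hypothetical thresholds and helpers, not the actual dvc-objects logic:

```python
import os
from concurrent.futures import ThreadPoolExecutor

SIZE_THRESHOLD = 10 * 1024 * 1024  # hypothetical cutoff: 10 MiB


def transfer_files(pairs, copy_one, jobs=4):
    """Copy (src, dst) pairs, parallelizing only the large files.

    The per-task overhead of the executor dominates for tiny files, so they
    go through a plain loop; for large files the copy itself dwarfs that
    overhead and parallelism helps (e.g. SSDs with spare bandwidth).
    """
    small, large = [], []
    for src, dst in pairs:
        bucket = large if os.path.getsize(src) >= SIZE_THRESHOLD else small
        bucket.append((src, dst))

    for src, dst in small:
        copy_one(src, dst)

    if large:
        with ThreadPoolExecutor(max_workers=jobs) as pool:
            # list() forces evaluation so any copy errors propagate here
            list(pool.map(lambda p: copy_one(*p), large))
```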

skshetry commented 11 months ago

Maybe it's time to write checkout in rust? 😉

dberenbaum commented 11 months ago

Can we test on some dataset like https://www.kaggle.com/datasets/thedevastator/open-orca-augmented-flan-dataset? We don't need to put it into our regular benchmarks, but it would be good to test it out this time to see how these changes perform on a large file.

efiop commented 11 months ago

Sorry for the delay, was AFK.

Indeed, we should benchmark other datasets as well, no doubt about that. And we should perform well in both cases; mnist is a very good edge case for that. We can pick smarter strategies based on what we are dealing with.

OpenOrca seems pretty good for testing a dataset with one big-ish file. We should definitely add it to the dvc-bench datasets so that we have it at our disposal; I'll upload it later today 👍. I was also looking into some famous/well-known video/sound/other catalogue-like datasets to test many (but not too many, maybe 10-100) big-ish files (typical classification datasets won't do), to go a bit beyond that. Those videos have to be non-copyrighted, so YouTube-based ones won't do. If you have ideas, let me know.

dberenbaum commented 11 months ago

Thanks for adding it, @efiop! To be clear, I didn't mean we have to add it to dvc-bench, although I'm fine with doing so. Sometimes just doing a one-off test is fine if we have something particular to test.

skshetry commented 11 months ago

Reminder: we have a gentree command that can randomly generate a dataset.

dvc-data gentree data 10 1Gb --depth 1
1.0G data
   - ├── dir00
 86M │  └── 0.txt
   - ├── dir01
111M │  └── 0.txt
   - ├── dir02
 79M │  └── 0.txt
   - ├── dir03
129M │  └── 0.txt
   - ├── dir04
100M │  └── 0.txt
   - ├── dir05
 95M │  └── 0.txt
   - ├── dir06
159M │  └── 0.txt
   - ├── dir07
124M │  └── 0.txt
   - ├── dir08
 95M │  └── 0.txt
 77M └── 0.txt

efiop commented 11 months ago

@dberenbaum It is very nice to have it in the toolset generally, but it also allows us to trigger dispatch runs in CI. I've also added a smaller used-cars dataset, and it is even faster to run than mnist (I mean total time) https://github.com/iterative/dvc-bench/actions/runs/7087497084 , so we can probably run something chunkier on a regular basis; we just need to figure out how/where to post. I'll also add a dataset like the one I was talking about above later.

efiop commented 11 months ago

For the record, @skshetry and I did some optimizations on reflink creation as part of https://github.com/iterative/dql/pull/1007; it is approx 2-3 times faster now than it was before.

One thing that is obvious from the benchmarks is that fetch/push-noop are kinda slow, but we are not yet caching the compacted/deduplicated list of files we get from going through all revisions, so I'll get back to that after that caching is done.
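For reference, the caching idea is roughly the following (all names here are hypothetical; the real index entries and loading code differ):

```python
def collect_unique_files(revisions, load_index):
    """Gather the deduplicated set of files referenced by all revisions.

    `load_index(rev)` is assumed to yield objects with `.path` and
    `.hash_value`; caching this compacted set would let repeated push/fetch
    no-ops skip re-walking identical indexes.
    """
    seen = set()
    for rev in revisions:
        for entry in load_index(rev):
            seen.add((entry.path, entry.hash_value))
    return seen
```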

efiop commented 11 months ago

For the record, some benchmarks with different datasets:

mnist (70K+ ~300B images)

https://bench.dvc.org/schedule/7147613411_2.html

xray (6K ~30KB images)

https://bench.dvc.org/workflow_dispatch/7151367965_3.html

used-cars (1.4G single file)

https://bench.dvc.org/workflow_dispatch/7146419624_1.html

dberenbaum commented 11 months ago

Worth noting that these are not "real" benchmarks for older versions though since those may be benefitting from updates to dvc-objects and other libs, right?

dberenbaum commented 11 months ago

So overall, it looks like it's a great improvement but we are trading off some performance on big files to get improvements on many small files. Is that right?

dberenbaum commented 10 months ago

Any updates here? What about s3? Any idea why it doesn't show the same improvements as other clouds in the benchmarks?

efiop commented 10 months ago

Sorry, I was convinced I responded before, but probably forgot to post.

Worth noting that these are not "real" benchmarks for older versions though since those may be benefitting from updates to dvc-objects and other libs, right?

Yes, but mostly for recent versions, as very old ones have stricter requirements.

So overall, it looks like it's a great improvement but we are trading off some performance on big files to get improvements on many small files. Is that right?

Marginally in used-cars, but there is no benchmark with many big files, which would be more affected.

Any updates here? What about s3? Any idea why it doesn't show the same improvements as other clouds in the benchmarks?

No updates, I haven't looked into this since last time 🙁 s3 is likely because of chunking configuration (similar to the improvements that @pmrowla made in adlfs in the past), but I haven't looked into it.

dberenbaum commented 10 months ago

@pmrowla Any thoughts on the s3 performance? I'd like to look into that either as part of this issue or a new one.

Is there anything else left before closing this issue?

pmrowla commented 10 months ago

@dberenbaum this can be closed, we have a separate issue for the s3 performance https://github.com/iterative/dvc/issues/9893