For `exp show` and `plots`, it looks like the repository used for benchmarking is example-get-started, which is incompatible with 2.x due to the `artifacts` section in dvc.yaml and the md5/hash changes in dvc.lock, so they are failing.
Plots were never in sub-second range.
Makes sense, so we only need to investigate `push`.
Added `pull` based on issues since 3.20.
It looks like the issue for both fetch and push is with index collection - specifically the problem is that sqltrie traverse is slow.
This is a local fetch of the cats_dogs dataset (25k files) in main; for comparison, in 3.10.0 (which does not do the index collection) the fetch takes only 10.3s with the profiler enabled:
Caching the `sqltrie._traverse()` results in memory cuts the fetch time in half, but the real issue is that the underlying SQL script that generates the temporary steps tables for traversal has to be run once per file, and that temp-table generation is the main performance blocker. (The caching cuts the runtime exactly in half because we call traverse twice per file.)
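To illustrate the caching idea, here is a minimal sketch; `traverse(key)` is a hypothetical stand-in for the real `sqltrie._traverse()` signature, which differs:

```python
class CachedTraversal:
    """Minimal sketch of memoizing per-key traversal results.

    ``trie.traverse(key)`` is a hypothetical stand-in for sqltrie's
    internal traversal; the point is only that the second identical
    call per file can be answered from memory.
    """

    def __init__(self, trie):
        self._trie = trie
        self._cache = {}

    def traverse(self, key):
        if key not in self._cache:
            # Only the first call per key pays for the temp-table query.
            self._cache[key] = self._trie.traverse(key)
        return self._cache[key]
```

This halves the fetch time in the profile above because each file currently triggers two identical traversals, but the first call per key still pays the temp-table cost.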
So fixing this will require revisiting how the sqltrie traversal queries work. Or maybe we need to consider only using sqltrie for dump/load and then do all of our index operations in memory. cc @efiop
Actually it looks like we can probably just get away with building an in-memory pygtrie for sqltrie views, since views are the only place we heavily use traverse right now
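A rough sketch of that direction, assuming the on-disk trie can iterate `(key, value)` pairs under a prefix; the method names are illustrative, not the actual sqltrie view API:

```python
import pygtrie


def build_view(sql_trie, prefix=()):
    """Materialize a subtree of an on-disk trie into an in-memory pygtrie.

    ``sql_trie.items(prefix)`` is assumed to yield ``(key, value)`` pairs
    where ``key`` is a tuple of path parts; the real sqltrie API may differ.
    """
    view = pygtrie.Trie()
    for key, value in sql_trie.items(prefix):
        view[key] = value
    return view


# Traversals on the view now run against memory instead of issuing a
# temp-table query per file, e.g.:
#   for key, value in view.items(prefix=("data",)):
#       ...
```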
I think closing was an accident @efiop?
Yeah, sorry about that.
We only have `push` left to check, it looks like.
Does https://github.com/iterative/sqltrie/pull/24 fix all pull regressions? Also I opened https://github.com/iterative/dvc-bench/issues/478 since I notice the benchmarks are failing.
Yes, all the pull regressions should be resolved after the sqltrie changes.
bench.dvc.org is up to date now; pull regressions have been resolved, but it looks like push has still regressed since 3.10.
`pull` still seems to have regressed, no? The index migration for pull/fetch/checkout happened in https://github.com/iterative/dvc/pull/9424 in 3.0.
Switched pull -> fetch in https://github.com/iterative/dvc/pull/10117 and there is no huge regression for local:
---------------------------------------------- benchmark 'test_fetch-fetch': 5 tests ----------------------------------------------
Name (time in s)            Min              Max              Mean             StdDev          Median           IQR            Outliers  OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------
test_fetch-fetch[2.58.2]    112.4179 (1.0)   112.4179 (1.0)   112.4179 (1.0)   0.0000 (1.0)    112.4179 (1.0)   0.0000 (1.0)   0;0       0.0089 (1.0)   1       1
test_fetch-fetch[3.0.0]     118.5058 (1.05)  118.5058 (1.05)  118.5058 (1.05)  0.0000 (1.0)    118.5058 (1.05)  0.0000 (1.0)   0;0       0.0084 (0.95)  1       1
test_fetch-fetch[main]      118.6735 (1.06)  118.6735 (1.06)  118.6735 (1.06)  0.0000 (1.0)    118.6735 (1.06)  0.0000 (1.0)   0;0       0.0084 (0.95)  1       1
test_fetch-fetch[3.10.0]    119.2171 (1.06)  119.2171 (1.06)  119.2171 (1.06)  0.0000 (1.0)    119.2171 (1.06)  0.0000 (1.0)   0;0       0.0084 (0.94)  1       1
test_fetch-fetch[3.20.0]    119.2340 (1.06)  119.2340 (1.06)  119.2340 (1.06)  0.0000 (1.0)    119.2340 (1.06)  0.0000 (1.0)   0;0       0.0084 (0.94)  1       1
--------------------------------------------------------------------------------------------------------------------------------------
but there is for checkout, which we already knew.
Same for real clouds:
So it is not `pull`, it is `checkout`. Let's leave `checkout` out of this issue. So only `push` is left to check here.
Why not add `checkout` to the list? It looks like that is a significant regression.
@dberenbaum Ah, let me just clarify that I meant to leave it out of the current optimization effort considering the schedule. Unlike fetch/push, which migrated fully to the index, `checkout` also relies quite a bit on what we do with `state` and other things, so even though I don't oppose taking another look there right now, we will absolutely need to revisit it once `state` is removed. At least things like noop and update haven't regressed much.
For the record: I'm taking a look at checkout.
Discussed with @efiop and summarizing here. Let's do our best to finalize the major changes needed around index so dvc 3.x can be in a better place and so it doesn't drag into next year as a major project. That means prioritizing both push and checkout here, and more generally removing state and migrating checkout and add to use index (@efiop please correct me if I'm wrong).
Another thing to note: only the `copy` link type degraded; `symlink` and `hardlink` are actually better than before 3.0 and they are pretty stable:
and locally on macOS I can also see `checkout` with `reflink` being almost twice as fast as 2.58.2 was. Looking into `copy`, but will probably post a summary on it in the morning.
Ok, so for the record: with small files (especially the ones in mnist), the overhead of callback creation in https://github.com/iterative/dvc-objects/blob/231e5273cf929cea2424c9d6e885f72bb5d90750/src/dvc_objects/fs/utils.py#L167 is just too wasteful and unnecessary. There are a few other things we can do there as well (like using a file-sized buffer for small files, using a larger buffer for large files, or even just using shutil copy functions that can leverage things like sendfile where available), but that change alone speeds up link creation for me by something like 50x. reflink/hardlink/symlink bypass that code path and thus don't suffer from the overhead, so our goal here is for `copy` to take about the same time as the other link types when dealing with small files. The whole `copyfile` callback situation is rather messy, and I'm somewhat keen on revisiting it a bit more. I'll send some patches in the morning.
EDIT: Merged https://github.com/iterative/dvc-objects/pull/229 and triggered https://github.com/iterative/dvc-bench/actions/runs/7040080174 , let's see...
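For reference, the general shape of the fix discussed above looks roughly like this; it is only a sketch (the callback interface and thresholds are assumptions, not the dvc-objects implementation):

```python
import os
import shutil

# Threshold below which per-chunk progress reporting is not worth the
# overhead; the exact cutoff here is an assumption for illustration.
SMALL_FILE = 1024 * 1024  # 1 MiB


def copy_with_progress(src, dst, callback=None):
    """Sketch of a size-aware copy, not the dvc-objects implementation.

    Small files are copied in one shot via shutil.copyfile (which can use
    fast paths such as sendfile/copy_file_range on Linux); only large files
    pay for chunked copying with progress reporting.
    """
    size = os.path.getsize(src)
    if callback is None or size <= SMALL_FILE:
        shutil.copyfile(src, dst)
        if callback is not None:
            callback(size)
        return

    # For large files, use a bigger buffer and report progress per chunk.
    bufsize = min(size, 64 * 1024 * 1024)
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        while True:
            chunk = fsrc.read(bufsize)
            if not chunk:
                break
            fdst.write(chunk)
            callback(len(chunk))
```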
What changed between 2.58.2 and 3.0.0? The callback was still there before that, no?
The main thing that changed is https://github.com/iterative/dvc/pull/9444. Before it, we were processing files one by one and passing them to `generic.transfer()`, while after it, among other things, we pass all the files we need to create to `generic.transfer()`, which adds a bit more overhead in different places down the line.
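Roughly, the shape of the change was as follows; `generic.transfer` here is shorthand, and the real dvc-objects signature takes more arguments:

```python
# Before: one generic.transfer() call per file, so only a little generic
# machinery is involved for each file.
for path in paths:
    generic.transfer(src_fs, [path], dest_fs)

# After: a single batched call for all files, which unifies the code paths
# but routes every file through more of the generic transfer overhead.
generic.transfer(src_fs, paths, dest_fs)
```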
This is a bit early, but I've sent a few more fixes and now, at least on my linux machine, it looks like we are not only back to 2.58.2, but actually faster too (~30sec vs ~37sec for mnist checkout). Waiting for proper confirmation in dvc-bench https://github.com/iterative/dvc-bench/actions/runs/7066555049/job/19239201268
Ok, so looks promising so far.
3.20 also improved because it also installs the new dvc-objects version. So indeed, checkout-copy is now faster than it was in 2.x -
But because we've unified all transfer-like logic in the past year, we also get these improvements in other commands that use this same code. E.g. fetch from local remote is now 2.5 times faster than it was.
and push to local remote is now ~2 times faster than before and over 4 times faster than in 2.x
and `dvc get-url`, `dvc import-url`, and a few more. We'll see on bench.dvc.org once it's ready.
PS: Hopefully not a fluke...
Great work, @efiop. The only thing I want to mention is that https://github.com/iterative/dvc-objects/pull/231 is going to make things slower for users with large files, so we should benchmark on that kind of dataset to be sure. ThreadPoolExecutor has high overhead for small files, but that overhead is noise for larger files, which take a lot of time anyway. Parallelizing also helps SSDs, since they have a lot of bandwidth, and HDDs, since the scheduler can plan the shortest route.
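One way to balance the two cases, as a sketch (the threshold and the `copy_one` helper are hypothetical, not DVC's actual transfer code): keep small files on the calling thread and only hand large files to a pool.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Illustrative threshold: below this, ThreadPoolExecutor overhead dominates.
PARALLEL_THRESHOLD = 8 * 1024 * 1024  # 8 MiB


def transfer_all(pairs, copy_one, max_workers=4):
    """Sketch: copy small files inline, large files in a thread pool.

    ``pairs`` is an iterable of (src, dst); ``copy_one(src, dst)`` is any
    single-file copy function. This is not the actual DVC transfer code.
    """
    small, large = [], []
    for src, dst in pairs:
        bucket = large if os.path.getsize(src) >= PARALLEL_THRESHOLD else small
        bucket.append((src, dst))

    # Small files: sequential, no executor overhead.
    for src, dst in small:
        copy_one(src, dst)

    # Large files: parallel, where the per-task overhead is negligible.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(copy_one, src, dst) for src, dst in large]
        for fut in futures:
            fut.result()  # propagate any copy errors
```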
Maybe it's time to write checkout in rust? 😉
Can we test on some dataset like https://www.kaggle.com/datasets/thedevastator/open-orca-augmented-flan-dataset? We don't need to put them into our regular benchmarks but would be good to test it out this time to see how these changes perform on a large file.
Sorry for the delay, was AFK.
Indeed, we should benchmark other datasets as well, no doubt about that. And we should perform well in both cases, mnist is a very good edge case for that. We can pick smarter strategies based on what we are dealing with.
OpenOrca seems pretty good for testing a dataset that has one big-ish file; we should definitely add it to the dvc-bench datasets so that we have it at our disposal, and I'll upload it later today 👍. I was also looking into some well-known video/sound/other catalogue-style datasets to test with many (but not too many, maybe 10-100) big-ish files (so typical classification datasets won't do), to test a bit beyond that. The videos have to be non-copyrighted, so YouTube-based ones won't do. If you have ideas, let me know.
Thanks for adding @efiop! To be clear, I didn't mean we have to add it to dvc-bench, although I'm fine with doing so. Sometimes just doing a one-off test is fine if we have something particular to test.
Reminder: we have a gentree command that can randomly generate a dataset.
dvc-data gentree data 10 1Gb --depth 1
1.0G data
- ├── dir00
86M │ └── 0.txt
- ├── dir01
111M │ └── 0.txt
- ├── dir02
79M │ └── 0.txt
- ├── dir03
129M │ └── 0.txt
- ├── dir04
100M │ └── 0.txt
- ├── dir05
95M │ └── 0.txt
- ├── dir06
159M │ └── 0.txt
- ├── dir07
124M │ └── 0.txt
- ├── dir08
95M │ └── 0.txt
77M └── 0.txt
@dberenbaum It is very nice to have it in the toolset generally, but it also allows us to trigger `dispatch` runs in CI. I've also added a smaller `used-cars` dataset, and it is even faster to run than mnist (I mean total time) https://github.com/iterative/dvc-bench/actions/runs/7087497084, so we can probably run something chunkier on a regular basis; we just need to figure out how/where to post the results. I'll also add a dataset like the one I was talking about above later.
For the record, @skshetry and I did some optimizations on reflink creation as part of https://github.com/iterative/dql/pull/1007; it is approximately 2-3 times faster now than it was before.
One thing that is obvious from the benchmarks is that fetch/push-noop are kinda slow, but we are not yet caching the compacted/deduplicated list of files we get from going through all revisions, so I'll get back to that after it is done.
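The caching being referred to is roughly this shape (illustrative only; the real index structures differ):

```python
def used_objects(index_per_rev):
    """Sketch: flatten per-revision file lists into one deduplicated set.

    ``index_per_rev`` is assumed to map revision -> iterable of
    (path, hash) entries; the real index structures differ.
    """
    seen = {}
    for rev, entries in index_per_rev.items():
        for path, hash_ in entries:
            # Many revisions reference the same object; keep it only once,
            # so the noop status check does not re-walk every revision.
            seen.setdefault(hash_, path)
    return seen
```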
For the record, some benchmarks with different datasets:
mnist (70K+ ~300B images)
https://bench.dvc.org/schedule/7147613411_2.html
xray (6K ~30KB images)
https://bench.dvc.org/workflow_dispatch/7151367965_3.html
used-cars (1.4G single file)
Worth noting that these are not "real" benchmarks for older versions though since those may be benefitting from updates to dvc-objects and other libs, right?
So overall, it looks like it's a great improvement but we are trading off some performance on big files to get improvements on many small files. Is that right?
Any updates here? What about s3? Any idea why it doesn't show the same improvements as other clouds in the benchmarks?
Sorry, I was convinced I responded before, but probably forgot to post.
Worth noting that these are not "real" benchmarks for older versions though since those may be benefitting from updates to dvc-objects and other libs, right?
Yes, mostly for recent versions, since very old ones have stricter requirements.
So overall, it looks like it's a great improvement but we are trading off some performance on big files to get improvements on many small files. Is that right?
Marginally in `used-cars`, but there is no benchmark with many big files, which would be more affected.
Any updates here? What about s3? Any idea why it doesn't show the same improvements as other clouds in the benchmarks?
No updates, haven't looked into this since the last time 🙁 s3 is likely because of chunking configuration (similar to improvements that @pmrowla did in adlfs in the past), but haven't looked into it.
@pmrowla Any thoughts on the s3 performance? I'd like to look into that either as part of this issue or a new one.
Is there anything else left before closing this issue?
@dberenbaum this can be closed, we have a separate issue for the s3 performance https://github.com/iterative/dvc/issues/9893
Investigate regressions in https://github.com/iterative/dvc-bench for: