Closed dberenbaum closed 1 year ago
Discussed before that they should be part of `dvc.testing`
@dberenbaum @efiop - Can we please add this for Q2 as a roadmap item? It's also possible to get some help (from ex-CML?) with setting up automation for multi-cloud infra setup
Still need to prioritize, but cloud performance has been a priority we were hoping to get to even in Q1, so I think it makes sense to keep it for Q2 and include benchmarking.
Though dvc-bench has all the clouds hot-swappable by default as well (by design), so we might shove that there. I guess this goes back to the same discussion we had before about whether dvc-bench should be part of dvc or a standalone project.
To clarify the requirements, the most important thing is to benchmark a realistic cloud remote (S3), more than to individually test each remote (which is needed but not as high a priority IMO).
Discussed with @pmrowla that we will prioritize Azure and GCS but not cloud-versioned remotes (which are considerably more work and at some point should not work much differently from other remotes).
@dberenbaum the partial add/remove test cases for real S3 take >4 hours to run (and currently time out) https://github.com/iterative/dvc-bench/actions/runs/4931640632/jobs/8813915046
Thinking about it some more, I'm not convinced that it is actually useful to benchmark those tests for real remotes, and I'm also not sure that the existing `test_sharing` use case is useful for real remotes either.

It seems to me that all we actually want is separate, simple real-remote test cases for `push`, `pull`, and `gc`, which are the only parts of the operation affected by the underlying S3 filesystem. Testing the full "use case" benchmarks is not actually useful when it comes to real remotes.

Running the `test_modify_data` (partial add/remove) use cases against S3 is essentially just benchmarking `push`/`pull` with a smaller number of files. Any actual performance regression in those use cases would also show up in the local remote benchmarks (i.e. because we introduced a bug that makes us push the entire dataset instead of only the modifications).
> @dberenbaum the partial add/remove test cases for real S3 take >4 hours to run (and currently time out) https://github.com/iterative/dvc-bench/actions/runs/4931640632/jobs/8813915046

This is with mnist data? Do you know how many total iterations are run?
> I'm also not sure that the existing `test_sharing` use case is useful for real remotes either.

Looks like this also timed out, right?
> It seems to me that all we actually want is separate simple real-remote test cases for `push`, `pull` and `gc`, which are the only parts of the operation affected by the underlying S3 filesystem. Testing the full "use case" benchmarks is not actually useful when it comes to real remotes.

Does `test_sharing` do much besides this anyway? I'm wondering if adding a simple test case would really help with the length of time. Maybe we should keep a historical daily record for these instead of recomputing every old version daily?
> This is with mnist data? Do you know how many total iterations are run?

This is with mnist; we only run a single iteration per DVC revision we are trying to bench.
> > I'm also not sure that the existing `test_sharing` use case is useful for real remotes either.
>
> Looks like this also timed out, right?

It timed out that day, and did not the day before.
> > It seems to me that all we actually want is separate simple real-remote test cases for `push`, `pull` and `gc`, which are the only parts of the operation affected by the underlying S3 filesystem. Testing the full "use case" benchmarks is not actually useful when it comes to real remotes.
>
> Does `test_sharing` do much besides this anyway? I'm wondering if adding a simple test case would really help with the length of time.

The issue is that `test_sharing` does them in sequence within the same GHA job: it pushes to the bucket and then pulls from it. With separated tests we can run the push and pull cases in separate jobs.
> Maybe we should keep a historical daily record for these instead of recomputing every old version daily?

This won't work for this type of benchmark. The actual runtime is dependent on things that will vary from day to day, so you cannot compare runtimes across different days. Benchmarks are only useful as a relative comparison, and they are only useful when the conditions used to generate each point in the comparison were consistent (or at least as consistent as possible).
> The issue is that `test_sharing` does them in sequence within the same GHA job: it pushes to the bucket and then pulls from it. With separated tests we can run the push and pull cases in separate jobs.

Got it. That approach sounds good. It's more important that we can test a larger dataset against real clouds, so a single push and pull per job makes sense.

By the way, why are we timing out at only 4 hours? I thought the default was 6 hours (https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepstimeout-minutes).
> By the way, why are we timing out at only 4 hours? I thought the default was 6 hours (https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepstimeout-minutes).

The default expiration for the AWS token was 1 hour; the next configurable interval in the AWS console UI was 4 hours. We can extend this if needed, but I'd prefer to optimize the test jobs first.
Thinking about this some more, with the current dvc-bench architecture we can't actually separate push/pull for real clouds. The dataset has to be pushed to the real remote in order for it to be pulled in the first place, so separating them won't actually save us anything over the existing `test_sharing` workflow right now. We also have the overhead of needing to `dvc pull` the base dataset from the public bucket (using the default read-only HTTP remote, not S3) during the overall setup phase.

What we probably need to do is set up buckets containing the mnist dataset for each cloud type we want to benchmark, and then have specific tests that only do a single `pull` from the appropriate bucket and a single `push` to a temp directory in the appropriate bucket. This would need to be separate from the existing `remote` and `dataset` fixtures in `dvc.testing`.
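As a rough sketch of that layout (the bucket names and helpers below are made up for illustration, not real dvc-bench infrastructure): each cloud gets a fixed, pre-seeded dataset location that the `pull` benchmark reads from, while the `push` benchmark writes to a unique temp prefix so concurrent runs don't collide.

```python
import uuid

# Hypothetical pre-seeded buckets, one per cloud type we want to benchmark.
SEED_REMOTES = {
    "s3": "s3://dvc-bench-seed/mnist",
    "azure": "azure://dvc-bench-seed/mnist",
    "gs": "gs://dvc-bench-seed/mnist",
}


def pull_url(cloud):
    """Pull benchmarks read from the fixed, pre-seeded dataset location."""
    return SEED_REMOTES[cloud]


def push_url(cloud):
    """Push benchmarks write to a unique temp prefix in the same bucket,
    so parallel jobs (and reruns) never clobber each other."""
    bucket = SEED_REMOTES[cloud].rsplit("/", 1)[0]
    return f"{bucket}/tmp-{uuid.uuid4().hex}"
```

The temp prefixes would still need periodic cleanup (e.g. a bucket lifecycle rule), which is one more reason these fixtures would have to live apart from the generic ones in `dvc.testing`.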
@pmrowla So should we create a separate issue for that and extend the timeout for now?
Do we benchmark cloud versioning?

It could serve to pseudo-test cloud versioning in Azure by running against a real bucket, because Azurite doesn't support it.
@daavoo no, but https://github.com/iterative/dvc-bench/issues/408 is open and still on the list of potential follow ups for this
We need a way to benchmark in at least AWS, and ideally also in Azure and GCS.
Too often, we are missing realistic benchmarks for cloud-centric operations (for example, https://github.com/iterative/dvc/issues/9098).
Related: