iterative / dvc


cloud benchmarks #9108

Closed. dberenbaum closed this issue 1 year ago.

dberenbaum commented 1 year ago

We need a way to run benchmarks against at least AWS, and ideally also Azure and GCS.

Too often, we are missing realistic benchmarks for cloud-centric operations (for example, https://github.com/iterative/dvc/issues/9098).

Related:

efiop commented 1 year ago

We discussed before that these benchmarks should be part of dvc.testing.

omesser commented 1 year ago

@dberenbaum @efiop - Can we please add this to Q2 as a roadmap item? It's also possible to get some help (from ex-CML?) with setting up automation for multi-cloud infra set-up.

dberenbaum commented 1 year ago

We still need to prioritize, but cloud performance has been a priority we were hoping to get to even in Q1, so I think it makes sense to keep it for Q2 and include benchmarking.

efiop commented 1 year ago

Though dvc-bench already has all the clouds hot-swappable by design, so we might put this there. I guess this gets into the same discussion we had before about whether dvc-bench should be part of dvc or a standalone project.

dberenbaum commented 1 year ago

To clarify the requirements: the most important thing is to benchmark a realistic cloud remote (S3), more than to test each remote individually (which is needed but not as high a priority IMO).

dberenbaum commented 1 year ago

Discussed with @pmrowla that we will prioritize Azure and GCS but not cloud-versioned remotes (which are considerably more work and at some point should not work much differently from other remotes).

pmrowla commented 1 year ago

@dberenbaum the partial add/remove test cases for real S3 take >4 hours to run (and currently time out) https://github.com/iterative/dvc-bench/actions/runs/4931640632/jobs/8813915046

Thinking about it some more, I'm not convinced that it is actually useful to benchmark those tests for real remotes, and I'm also not sure that the existing test_sharing use case is useful for real remotes either.

It seems to me that all we actually want is separate simple real-remote test cases for push, pull and gc, which are the only part of the operation that is affected by the underlying S3 filesystem. Testing the full "use case" benchmarks is not actually useful when it comes to real remotes.

Running the test_modify_data (partial add/remove) use cases against S3 is essentially just benchmarking push/pull with a smaller number of files. Any actual performance regression in those use cases would also show up in the local remote benchmarks (e.g. because we introduced a bug that makes us push the entire dataset instead of only the modifications).
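
For reference, a minimal sketch of what separate real-remote cases could look like, assuming a pytest-benchmark `benchmark` fixture and a project where the dataset is already tracked and the real remote is configured as the project default. The marker and the `dvc` helper below are illustrative, not existing dvc-bench or dvc.testing fixtures:

```python
import subprocess

import pytest


def dvc(*args):
    # Shell out to the dvc CLI; fail the test on a non-zero exit code.
    subprocess.run(["dvc", *args], check=True)


@pytest.mark.real_remote  # hypothetical marker for cloud-only CI jobs
def test_push(benchmark):
    # Single push of the prepared dataset to the real (default) remote.
    benchmark.pedantic(dvc, args=("push",), rounds=1, iterations=1)


@pytest.mark.real_remote
def test_pull(benchmark):
    # Single pull from the real remote into an empty cache/workspace.
    benchmark.pedantic(dvc, args=("pull",), rounds=1, iterations=1)


@pytest.mark.real_remote
def test_gc(benchmark):
    # Garbage-collect unreferenced objects on the real remote.
    benchmark.pedantic(dvc, args=("gc", "--workspace", "--cloud", "--force"), rounds=1, iterations=1)
```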

dberenbaum commented 1 year ago

> @dberenbaum the partial add/remove test cases for real S3 take >4 hours to run (and currently time out) https://github.com/iterative/dvc-bench/actions/runs/4931640632/jobs/8813915046

This is with mnist data? Do you know how many total iterations are run?

> I'm also not sure that the existing test_sharing use case is useful for real remotes either.

Looks like this also timed out, right?

> It seems to me that all we actually want is separate simple real-remote test cases for push, pull and gc, which are the only part of the operation that is affected by the underlying S3 filesystem. Testing the full "use case" benchmarks is not actually useful when it comes to real remotes.

Does test_sharing do much besides this anyway? I'm wondering if adding a simple test case would really help with the length of time. Maybe we should keep a historical daily record for these instead of recomputing every old version daily?

pmrowla commented 1 year ago

> This is with mnist data? Do you know how many total iterations are run?

This is with mnist; we only run a single iteration per DVC revision we are trying to bench.

> I'm also not sure that the existing test_sharing use case is useful for real remotes either.

> Looks like this also timed out, right?

It timed out that day, and did not the day before.

> It seems to me that all we actually want is separate simple real-remote test cases for push, pull and gc, which are the only part of the operation that is affected by the underlying S3 filesystem. Testing the full "use case" benchmarks is not actually useful when it comes to real remotes.

> Does test_sharing do much besides this anyway? I'm wondering if adding a simple test case would really help with the length of time.

The issue is that test_sharing does them in sequence within the same GHA job: it pushes to the bucket and then pulls from it. With separated tests we can run the push and pull cases in separate jobs.

> Maybe we should keep a historical daily record for these instead of recomputing every old version daily?

This won't work for this type of benchmark. The actual runtime is dependent on things that will vary from day to day, so you cannot compare runtimes across different days.

Benchmarks are only useful as a relative comparison, and they are only useful when the conditions used to generate each point in the comparison were consistent (or at least as consistent as possible).

dberenbaum commented 1 year ago

> The issue is that test_sharing does them in sequence within the same GHA job: it pushes to the bucket and then pulls from it. With separated tests we can run the push and pull cases in separate jobs.

Got it. That approach sounds good. It's more important that we can test a larger dataset against real clouds, so a single push and pull per job makes sense.
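
If it helps, one purely illustrative way to split them so that each GHA job runs exactly one case is marker-based selection; the marker names here are assumptions, not the current dvc-bench setup:

```python
# conftest.py (sketch): register markers so each CI job selects one case, e.g.
#   push job:  pytest -m "real_remote and push_case" ...
#   pull job:  pytest -m "real_remote and pull_case" ...
def pytest_configure(config):
    config.addinivalue_line("markers", "real_remote: benchmark against a real cloud remote")
    config.addinivalue_line("markers", "push_case: push-only benchmark case")
    config.addinivalue_line("markers", "pull_case: pull-only benchmark case")
```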

By the way, why are we timing out at only 4 hours? I thought the default was 6 hours (https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepstimeout-minutes).

pmrowla commented 1 year ago

> By the way, why are we timing out at only 4 hours? I thought the default was 6 hours (https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepstimeout-minutes).

The default expiration duration for the AWS token was 1 hour, and the next configurable interval in the AWS console UI was 4 hours. We can extend this if needed, but I'd prefer optimizing the test jobs first.

pmrowla commented 1 year ago

Thinking about this some more, with the current dvc-bench architecture we can't actually separate push/pull for real clouds. The dataset has to be pushed to the real remote in order for it to be pulled in the first place, so separating them won't actually save us anything over the existing test_sharing workflow right now. We also have the overhead of needing to dvc pull the base dataset from the public bucket (using the default read-only HTTP remote and not S3) during the overall setup phase.

What we probably need to do is set up buckets containing the mnist dataset for each cloud type we want to benchmark, and then have specific tests that only do a single pull from the appropriate bucket and a single push to a temp directory in the appropriate bucket. This would need to be separate from the existing remote and dataset fixtures in dvc.testing.
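
Very roughly, something like this: the bucket URLs, remote names, and parametrization are hypothetical, just to illustrate the prepopulated-bucket idea, and it assumes the project's .dvc files reference objects that already exist under each `mnist` prefix:

```python
import subprocess
import uuid

import pytest

# Hypothetical per-cloud buckets that already hold the mnist dataset's objects.
CLOUDS = {
    "s3": "s3://dvc-bench-datasets",
    "azure": "azure://dvc-bench-datasets",
    "gs": "gs://dvc-bench-datasets",
}


def dvc(*args):
    # Shell out to the dvc CLI; fail the test on a non-zero exit code.
    subprocess.run(["dvc", *args], check=True)


@pytest.fixture(params=CLOUDS)
def cloud_url(request):
    return CLOUDS[request.param]


def test_pull_real_remote(cloud_url, benchmark):
    # Pull straight from the prepopulated bucket; no prior push in the same job.
    dvc("remote", "add", "-f", "bench", f"{cloud_url}/mnist")
    benchmark.pedantic(dvc, args=("pull", "-r", "bench"), rounds=1, iterations=1)


def test_push_real_remote(cloud_url, benchmark):
    # Push into a unique temp prefix so concurrent runs don't collide.
    dvc("remote", "add", "-f", "bench-tmp", f"{cloud_url}/tmp/{uuid.uuid4().hex}")
    benchmark.pedantic(dvc, args=("push", "-r", "bench-tmp"), rounds=1, iterations=1)
```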

dberenbaum commented 1 year ago

@pmrowla So should we create a separate issue for that and extend the timeout for now?

daavoo commented 1 year ago

Do we benchmark cloud versioning?

It could serve as a pseudo-test of cloud versioning in Azure by running against a real bucket, since Azurite doesn't support it.

pmrowla commented 1 year ago

@daavoo no, but https://github.com/iterative/dvc-bench/issues/408 is open and still on the list of potential follow-ups for this.