pangeo-data / storage-benchmarks

testing performance of different storage layers
Apache License 2.0

Infrastructure Requirements and Architecture #10

Closed kaipak closed 6 years ago

kaipak commented 6 years ago

With all the benchmarks that are going to be run, I think it'll be helpful to define where all this stuff will live and what the infrastructure (ephemeral as it may be) will look like. I'm thinking we run these tests in some consistent manner and store results somewhere persistently. For example, have the same test data live in S3, Google, and Azure buckets, have a virtual machine spin up, check out this repo, then run tests. It would be great to set up some kind of standard server as well so we can get results from more conventional server/local storage configurations.

kaipak commented 6 years ago

I notice this is wrapped around issue #5 so perhaps we can kill two birds with one stone.

rabernat commented 6 years ago

I think the way to proceed is for the benchmark suite to be runnable from any location. One can connect to S3 / GCS from anywhere, not just from within AWS / GCP, provided the proper credentials are available. This will make it easiest to develop. If certain things are not available in a given environment (e.g. FUSE), the benchmark should simply be skipped. (Skipping can be triggered by raising a NotImplementedError in the setup method; http://asv.readthedocs.io/en/latest/writing_benchmarks.html#setup-and-teardown-functions)
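A minimal sketch of that skip-on-setup pattern. The FUSE check, mount point, and class names here are hypothetical placeholders; the relevant asv behavior is that a `NotImplementedError` raised in `setup()` marks the benchmark as skipped rather than failed:

```python
import os


def fuse_mount_available(path="/gcs"):
    """Return True if a (hypothetical) FUSE mount point exists."""
    return os.path.ismount(path)


class FuseReadSuite:
    """Benchmarks that only make sense when a FUSE mount is present."""

    def setup(self):
        if not fuse_mount_available():
            # asv interprets this as a skipped benchmark, not a failure
            raise NotImplementedError("FUSE mount not available here")
        self.path = "/gcs/test-data"

    def time_open_file(self):
        # time reading the first 1 KB of a (hypothetical) test file
        with open(os.path.join(self.path, "sample.bin"), "rb") as f:
            f.read(1024)
```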

For local file access, perhaps we can use an environment variable to point to the location of the data files.

When it comes time to actually run within the cloud, we can automate that process, possibly using docker. Perhaps we should create a docker image from which the tests are run (even on a local machine), just to ensure uniformity.

kaipak commented 6 years ago

I suppose we could even make the cloud buckets public read-only. Not sure if there would be any objection to this.

jreadey commented 6 years ago

I don't think there's much value in running tests from outside the data center where the bucket is located. So I think tests will need a set of file paths particular to where the test is running.

Shouldn't be too hard to store the file paths outside the test itself, so that, say, a test running on Google gets a key to a GCS bucket and a test running in AWS gets a path to an S3 bucket.
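A sketch of that per-platform lookup. The environment variable name and bucket URLs below are placeholders, not real project buckets; the point is just that the tests stay path-agnostic and resolve their data location from a mapping keyed by where they run:

```python
import os

# hypothetical locations of the same test data on each platform
DATA_PATHS = {
    "gcp": "gcs://example-pangeo-data/test",
    "aws": "s3://example-pangeo-data/test",
    "local": os.environ.get("STORAGE_BENCHMARKS_DATA", "/tmp/test-data"),
}


def data_path(platform=None):
    """Resolve the data path for the platform the test is running on."""
    platform = platform or os.environ.get("STORAGE_BENCHMARKS_PLATFORM", "local")
    try:
        return DATA_PATHS[platform]
    except KeyError:
        # fits the skip convention: unknown platform means skip, not fail
        raise NotImplementedError(f"no test data configured for {platform!r}")
```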