pangeo-data / storage-benchmarks

testing performance of different storage layers
Apache License 2.0

Define test cases. #9

Open kaipak opened 6 years ago

kaipak commented 6 years ago

What are the combinations of backends, formats, and infrastructure that will be used for the test cases? e.g., Zarr -> GCSFS, hdf5 -> POSIX file, etc.

jakebolewski commented 6 years ago

In determining test cases for cloud vendors, one thing to keep in mind (if there is interest in a native C / C++ storage library backend) is that GCS's native library support for cloud storage is far behind AWS or Azure. Tensorflow (for example) uses libcurl directly for all storage backend interactions. This makes it much more difficult and time consuming to support GCS than Azure and / or AWS (and any interface will probably not be as optimized).

jreadey commented 6 years ago

Doesn't GCS support Amazon's S3 API? So it would seem test cases could just use the AWS SDK.

I don't know if there's a performance hit going that route.
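For reference, this is roughly what that route looks like with boto3 pointed at GCS's S3-compatible XML endpoint (a sketch only; the bucket name and HMAC interoperability keys are placeholders):

```python
import boto3

# Sketch: reuse the AWS SDK against GCS via its S3-compatible XML API.
# Requires GCS "interoperability" HMAC keys; all values here are placeholders.
gcs = boto3.client(
    "s3",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id="GOOG_HMAC_ACCESS_ID",
    aws_secret_access_key="GOOG_HMAC_SECRET",
)

# Basic bucket/object operations generally translate...
gcs.put_object(Bucket="some-bucket", Key="test.bin", Body=b"hello")
print(gcs.list_objects(Bucket="some-bucket")["Contents"])

# ...but S3-specific features (e.g. multi-part upload) are not guaranteed
# to behave identically on GCS, which is where a performance hit could hide.
```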

stavrospapadopoulos commented 6 years ago

Compatibility between the AWS S3 SDK and GCS is not guaranteed 100%. I tried the AWS S3 to GCS migration approach today. It works for the most part (e.g., basic bucket/file/directory management), but GCS does not support S3's multi-part upload functionality, which is important to TileDB for achieving high performance. This is not a showstopper - we just need to do a bit more work to implement this specific functionality (probably via libcurl, like TensorFlow).
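For what it's worth, the workaround looks something like the sketch below in Python (TileDB's real implementation would be C++ via libcurl; the bucket and object names are placeholders): upload the parts as separate objects, then stitch them together with the GCS compose call.

```python
from google.cloud import storage

# Sketch of emulating multi-part upload on GCS: upload parts as separate
# objects, then merge them with the compose API. Names are placeholders.
part_bytes = [b"a" * 1024, b"b" * 1024]         # stand-in for the real parts

client = storage.Client()
bucket = client.bucket("some-bucket")

parts = []
for i, chunk in enumerate(part_bytes):
    blob = bucket.blob(f"big-object.part{i}")
    blob.upload_from_string(chunk)
    parts.append(blob)

# GCS compose accepts at most 32 sources per call, so very large uploads
# would need to compose hierarchically.
final = bucket.blob("big-object")
final.compose(parts)

for blob in parts:                              # clean up the intermediate parts
    blob.delete()
```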

rabernat commented 6 years ago

Is there an issue for this within https://github.com/GoogleCloudPlatform somewhere we can link to?

stavrospapadopoulos commented 6 years ago

Not that I know of. However, the GCS docs mention (https://cloud.google.com/storage/docs/migrating#methods-comparison) that AWS S3's multi-part upload is essentially reduced to uploading separate objects and then calling the composition API (https://cloud.google.com/storage/docs/xml-api/put-object-compose). There is no equivalent functionality, so it is not surprising to me that S3's multi-part upload may not work on GCS. We (TileDB) will have to write some extra code to (efficiently) mimic it.

rabernat commented 6 years ago

FYI, we are not married to GCS. It sounds like it would make more sense to do the initial work on AWS. That's what NASA is committed to anyway.


kaipak commented 6 years ago

Here's the broad list of cases I've got so far. Please add more or correct my admittedly inexperienced ordering of things if I've got some of this wrong!

  1. netCDF -> POSIX -> local storage
  2. netCDF -> POSIX -> some sort of disk presentation layer (e.g., FUSE) -> cloud bucket
  3. Zarr -> POSIX -> local storage
  4. Zarr -> cloud bucket
  5. h5netcdf -> POSIX -> local storage
  6. h5netcdf -> HSDS -> cloud bucket
  7. TileDB (???)

For cloud, I mainly mean S3, Google, and Azure. I'm a little unsure as to where the tests will originate from (perhaps my wording is imprecise). There seems to be some debate around running tests from numpy, xarray, or dask objects, which was mentioned in issue #6, although I like the approach that @jhamman and @rabernat suggest, which is to test everything in a tiered manner.
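As a sketch of the tiered idea (plain zarr/numpy, then xarray, then dask on top of the same store; the store path and variable name below are made up):

```python
import xarray as xr
import zarr

STORE = "gs://some-bucket/test.zarr"   # placeholder target

def read_raw(store=STORE):
    """Tier 1: plain zarr -> numpy, no higher-level wrappers."""
    group = zarr.open_group(store, mode="r")
    return group["data"][:]

def read_xarray(store=STORE):
    """Tier 2: xarray on top of the same store, eager load."""
    return xr.open_zarr(store)["data"].load()

def read_dask(store=STORE):
    """Tier 3: dask-backed xarray; a reduction forces the parallel read."""
    return xr.open_zarr(store)["data"].mean().compute()
```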

stavrospapadopoulos commented 6 years ago

TileDB works seamlessly with local, distributed and cloud storage. So we can have:

  1. TileDB -> cloud (currently only S3)
  2. TileDB -> POSIX (local, Lustre, etc.)
  3. TileDB -> HDFS
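If it helps, in Python the backend switch is basically just the URI prefix; a rough sketch (bucket names and the S3 config value are placeholders, and the exact API may differ slightly):

```python
import numpy as np
import tiledb

# Same write, three storage backends; only the URI (and a bit of config) changes.
data = np.random.rand(1000, 1000)

cfg = tiledb.Config({"vfs.s3.region": "us-east-1"})   # placeholder region
ctx = tiledb.Ctx(cfg)

tiledb.from_numpy("/local/scratch/array", data)                 # POSIX
tiledb.from_numpy("s3://some-bucket/array", data, ctx=ctx)      # S3
tiledb.from_numpy("hdfs:///user/someone/array", data)           # HDFS
```
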
jreadey commented 6 years ago

Should we add "h5netcdf -> POSIX -> fuse"?

Does 1 & 2 (or 3 & 4) need different test code or can it be the same?

Another type of test to consider would be something like: a) copy file from cloud storage to local disk b) do local posix c) remove file

I suspect in some cases this will prove faster than using FUSE.
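Sketched out, the copy + local POSIX pattern could be as simple as the following (gcsfs here, s3fs would be analogous; the paths and dataset name are placeholders):

```python
import os
import tempfile

import gcsfs
import h5py

fs = gcsfs.GCSFileSystem()

def copy_then_read(remote_path, dataset="data"):
    local_path = os.path.join(tempfile.mkdtemp(), "copy.nc")
    fs.get(remote_path, local_path)           # (a) copy object to local disk
    with h5py.File(local_path, "r") as f:     # (b) ordinary local POSIX reads
        values = f[dataset][:]
    os.remove(local_path)                     # (c) remove the local copy
    return values
```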

kaipak commented 6 years ago

Should we add "h5netcdf -> POSIX -> fuse"?

Yep, I will add that.

Does 1 & 2 (or 3 & 4) need different test code or can it be the same?

I'm working on a prototype now, and thinking we should separate the guts of setting up the backends and storage API calls from the actual tests themselves (broadly speaking). I think these cases can share roughly the same code; we'd only have to change the target, which could be a parameter.
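Roughly what I have in mind, in airspeed-velocity style (the helper and target names below are placeholders, not actual code in the repo yet):

```python
import numpy as np
import xarray as xr

def setup_target(target):
    """All backend/format plumbing lives here; returns an opened dataset."""
    if target == "zarr-gcs":
        import gcsfs
        store = gcsfs.GCSFileSystem().get_mapper("some-bucket/test.zarr")
        return xr.open_zarr(store)
    elif target == "netcdf-posix":
        return xr.open_dataset("/local/scratch/test.nc")
    raise ValueError(f"unknown target: {target}")

class ReadSuite:
    """The benchmark body is identical; only the target parameter changes."""
    params = ["zarr-gcs", "netcdf-posix"]
    param_names = ["target"]

    def setup(self, target):
        self.ds = setup_target(target)

    def time_full_read(self, target):
        np.asarray(self.ds["data"])
```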

Another type of test to consider would be something like: a) copy file from cloud storage to local disk b) do local posix c) remove file

I was thinking along these lines too. A consideration I've had with moving things around across networks is that bandwidth and latency would be major factors in performance, and they could be hard to interpret depending on how the results are presented.

I suspect in some cases this will prove faster than using FUSE.

From my limited knowledge of FUSE, I would think this is a reasonable guess.

kaipak commented 6 years ago

TileDB works seamlessly with local, distributed and cloud storage. So we can have:

TileDB -> cloud (currently only S3)
TileDB -> POSIX (local, Lustre, etc)
TileDB -> HDFS

@stavrospapadopoulos appreciate the input and will add to the list. I'll document all this in a more complete form sometime this week.

mrocklin commented 6 years ago

I suspect in some cases this will prove faster than using FUSE. From my limited knowledge of FUSE, I would think this is a reasonable guess.

I'll push back on this. Do we know of any fundamental limitations of why FUSE would perform worse, or are we judging the idea of FUSE based on current shoddy implementations? My guess is that the performance of a FUSE solution will strongly depend on how much effort we put into it. My main concern about FUSE is that it is common enough that no company has a vested interest in selling the technology.

mrocklin commented 6 years ago

cc @llllllllll who might find this repository of interest

mrocklin commented 6 years ago

To turn my previous statement into a question: "Are there fundamental limitations to building FUSE systems that would stop them from reaching network capacity on analytic workloads?"

jreadey commented 6 years ago

With FUSE each read op is relatively high latency (vs reading from local disk). With enough reads you may end up spending more time than if you had just copied the file in the first place.

In any case, it would seem reasonable to evaluate FUSE vs. copy+local access.

rabernat commented 6 years ago

It's great to hear this discussion! I'm excited to get some concrete results to test some of these ideas.

mrocklin commented 6 years ago

With FUSE each read op is relatively high latency (vs reading from local disk). With enough reads you may end up spending more time than if you had just copied the file in the first place.

I understand. Network access is certainly higher latency than local disk access (at least for the kinds of networks we're talking about). But all of these approaches involve the network; I don't think this is unique to FUSE. I imagine that any quality solution here will be tuning block size parameters, caching, read-ahead, etc. Again, I want to make it clear that I'm not asking how fast a naive FUSE system would behave, but rather about the fundamental performance limitations of a well-engineered FUSE system.
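To make the "tuning knobs" point concrete: even at the Python file layer, gcsfs already exposes the relevant parameters, and a serious FUSE layer would need the same ones (the path below is a placeholder):

```python
import gcsfs

fs = gcsfs.GCSFileSystem()

# Larger blocks plus a read-ahead cache favour big sequential scans;
# smaller blocks favour scattered metadata/chunk reads. A well-engineered
# FUSE mount would be tuning exactly these kinds of parameters.
with fs.open("some-bucket/data.nc",
             block_size=16 * 2**20,       # 16 MiB requests
             cache_type="readahead") as f:
    header = f.read(4096)
```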

guillaumeeb commented 6 years ago

Doesn't it mainly depend on the kind of I/O, i.e., the read and write operations you will perform? If you do big-block sequential access over FUSE it will be okay, but NetCDF kind of access with random small access will be bad. I believe this is @jreadey's point. We also observe this on HPC systems with centralized storage (e.g., GPFS or Lustre): we often recommend that users copy the data to compute-node local storage before accessing it, depending on their I/O profile.

mrocklin commented 6 years ago

but NetCDF kind of access with random small access will be bad

How we chunk datasets is, I think, a problem that all data formats will have to deal with. I would hope that NetCDF files stored on the cloud would, like any other format, be chunked in such a way so that we could avoid lots of small random access.
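e.g., something along these lines when writing the data out (the dataset, chunk sizes, and target path are all placeholders):

```python
import xarray as xr

ds = xr.open_dataset("input.nc")                        # placeholder source file

# Rechunk so a typical read touches a few large objects rather than
# many tiny ones, e.g. full spatial fields, many timesteps per chunk.
ds = ds.chunk({"time": 100, "lat": -1, "lon": -1})

ds.to_zarr("gs://some-bucket/output.zarr", mode="w")    # placeholder target
```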

guillaumeeb commented 6 years ago

be chunked in such a way so that we could avoid lots of small random access

I don't know enough about NetCDF and chunking to be really relevant here, but I believe good chunking also depends on the algorithm using the data, so that it is difficult to achieve optimal chunking for all science use cases. It may also be necessary to optimize the scientific algorithm so it is adapted to the cloud storage system and FUSE, not only the NetCDF format and chunking. Maybe only those non-optimized legacy algorithms will be worse using FUSE instead of download + POSIX.

jreadey commented 6 years ago

The potential problem is that you do a bunch of metadata reads before you get to the actual chunks. This issue is exacerbated when you are doing lots of file opens (say pulling out a time series from hundreds of NetCDF files).
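One mitigation on the zarr side, as a sketch (the store path is a placeholder): consolidate the store's metadata so that opening a dataset costs a single request rather than one per array/group.

```python
import gcsfs
import xarray as xr
import zarr

store = gcsfs.GCSFileSystem().get_mapper("some-bucket/test.zarr")

# One-time step at write time: gather all .zarray/.zattrs/.zgroup metadata
# into a single .zmetadata object.
zarr.consolidate_metadata(store)

# Readers then open with a single metadata request instead of many small ones.
ds = xr.open_zarr(store, consolidated=True)
```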

In any case, let's get some numbers to serve as a baseline.

mrocklin commented 6 years ago

but I believe good chunking also depends on the algorithm using the data, so that it is difficult to achieve optimal chunking for all science use cases.

Definitely. All file formats will have to think about this problem.

The potential problem is that you do a bunch of metadata reads before you get to the actual chunks. This issue is exacerbated when you are doing lots of file opens (say pulling out a time series from hundreds of NetCDF files).

@jreadey as you know I'm already very aware of these problems. I think we've had this discussion. You may want to take a look at http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud again?

jreadey commented 6 years ago

@mrocklin - yes, I'm just saying that we should add the file copy + local posix to the backend approaches.

mrocklin commented 6 years ago

No objection to that statement at all. My objection was to the following statement:

I suspect in some cases this will prove faster than using FUSE. From my limited knowledge of FUSE, I would think this is a reasonable guess.

My goal is to have people understand that no decent FUSE solution exists today, and that we should not judge this approach solely based on a naive implementation.

jreadey commented 6 years ago

What are the FUSE implementations we should be looking at? On AWS, the only one I'm familiar with is: https://github.com/s3fs-fuse/s3fs-fuse.

Not really FUSE, but AWS EFS serves the purpose of allowing HDF5Lib to work with files that are globally accessible. EFS costs 10x what S3 does, so it may be best to use EFS as a sort of cache.

Yet another approach would be HDF5Lib with a VFD. There are a couple of projects exploring this now, but they are pretty nascent.

mrocklin commented 6 years ago

I don't think that anyone has seriously invested in developing a good FUSE system for cloud storage that is well optimized for our workloads.

mrocklin commented 6 years ago

To be clear, I'm not suggesting any concrete action. My goal is to correct a view that FUSE is a bad idea because in the past FUSE has been a bad idea.

rabernat commented 6 years ago

My original idea when we first got started on GCP was to use persistent disk to mount a shared directory of netCDF files (see pangeo-data/pangeo#19). I guess this is analogous to AWS EFS. The folks who actually set up the cluster ended up not going that route. Now that I see the parallels to FUSE, I am curious how it would have performed.

A big downside is that, if files are stored on persistent disk, they can't be accessed passively via the regular cloud storage API. In order to provide general access, one has to spin up a virtual machine, mount the filesystem, and serve the files to the outside world.

mrocklin commented 6 years ago

Persistent disk has a few serious problems:

  1. It is expensive
  2. It doesn't allow for concurrent write access
  3. As @rabernat says, it requires lots of special hooks to get at, and so is not easily accessible to drive-by users
jreadey commented 6 years ago

No, the persistent disk equivalent on AWS is EBS (Elastic Block Store), which has the problems that @mrocklin lists.

EFS is more like an NFS for the cloud, and should not have any issues with concurrent read access for NetCDF4 files.