pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Initial storage-benchmark results #247

Closed · kaipak closed this issue 6 years ago

kaipak commented 6 years ago

I owe a final report for the semester to @rabernat so I'm going to have to keep this a bit short. More results and a complete report will be coming soon along with additional benchmarks. Also, apologies for the formatting as I haven't had the time to pretty up the plots yet.

To echo some of the promising results noted in #244, the storage speeds are pretty impressive so far. All tests were run using Airspeed Velocity along with some custom code to parse the results and generate plots; everything is maintained in this repo. Each test was run five times and the mean was taken. Future reports will have error bars. Real-world read-speed tests were done on output from an ocean general circulation model simulation (LLC4320). Times are in seconds.
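
For context, the tests follow the usual Airspeed Velocity layout: a `setup` method builds the data handles, the `time_*` methods are what get timed, and `repeat = 5` gives the five samples averaged above. A minimal sketch of that structure (the class name and the synthetic array are placeholders, not the real suite in the repo):

```python
# Hypothetical ASV-style benchmark: times a mean over a chunked dask array.
# The real suite works on LLC4320 output; this synthetic array is a stand-in.
import dask.array as da


class SyntheticMeanSuite:
    repeat = 5   # five samples per benchmark, matching the report
    number = 1   # one call per sample

    def setup(self):
        # Stand-in workload: a chunked random array instead of model output.
        self.data = da.random.random((90, 2160, 2160), chunks=(10, 2160, 2160))

    def time_mean(self):
        self.data.mean().compute()
```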

I'll start off with read speeds for NetCDF on a FUSE mount. Note that on the x-axis, the label is something like FUSE-10-40, where the first number refers to the chunk size and the second to the number of Dask workers (basically a concatenation of a few columns for plotting purposes). Without getting too much into the dataset, the chunk dimensions are essentially (X, Y, Z, T), where the chunk size along Z is varied (out of 90 total levels) and there is one chunk per time slice (35 in total). This test reads in the data and computes a mean.
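
Concretely, each FUSE read test boils down to something like the following xarray/dask pattern (the mount path, the 'k' dimension name, and the 'Theta' variable are illustrative assumptions, not the exact benchmark code):

```python
# Sketch of the FUSE read test: open the NetCDF files from the FUSE mount with
# dask chunks, then reduce with a mean. Path, dimension, and variable names are
# assumptions for illustration.
import xarray as xr
from dask.distributed import Client

client = Client()  # attach to the cluster; n_workers (40/80/120) is set on the cluster side

ds = xr.open_mfdataset(
    '/gcs/llc4320_netcdf/*.nc',   # one file per time slice (35 total)
    chunks={'k': 10},             # Z chunk size: the "10" or "90" in FUSE-10-40, FUSE-90-40, ...
)
result = ds['Theta'].mean().compute()  # read everything and compute the mean
```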

| backend | chunks | n_workers | GBytes | time (s) | throughput (GB/s) | config |
| --- | --- | --- | --- | --- | --- | --- |
| FUSE | 10 | 40 | 54.7 | 43.51 | 1.26 | FUSE-10-40 |
| FUSE | 10 | 80 | 54.7 | 25.29 | 2.16 | FUSE-10-80 |
| FUSE | 10 | 120 | 54.7 | 17.58 | 3.11 | FUSE-10-120 |
| FUSE | 90 | 40 | 54.7 | 60.22 | 0.91 | FUSE-90-40 |
| FUSE | 90 | 80 | 54.7 | 41.65 | 1.31 | FUSE-90-80 |
| FUSE | 90 | 120 | 54.7 | 29.67 | 1.84 | FUSE-90-120 |

[Plot: NetCDF-on-FUSE read throughput by configuration]

Not too shabby! This is VERY sensitive to chunk configuration, and chunk sizes smaller than 10 perform pretty badly, which may work against what would otherwise be better for Dask. This seems consistent with what we've seen with FUSE performance in that it tends to do better with larger reads/writes.

On to Zarr on the Google Cloud Storage object store. This reads 652 GB of data and, again, computes a mean. Note that there is a fairly significant amount of time spent waiting for Dask to stage everything before the reads start. I'll try to quantify this in the next round of results.

| backend | chunks | n_workers | GBytes | time (s) | throughput (GB/s) | config |
| --- | --- | --- | --- | --- | --- | --- |
| GCS | 1 | 20 | 652.30 | 229.55 | 2.84 | GCS-1-20 |
| GCS | 1 | 40 | 652.30 | 130.23 | 5.01 | GCS-1-40 |
| GCS | 1 | 80 | 652.30 | 83.51 | 7.81 | GCS-1-80 |
| GCS | 1 | 120 | 652.30 | 84.76 | 7.70 | GCS-1-120 |

[Plot: Zarr-on-GCS read throughput by configuration]

Seeing close to 8 GB/sec reads is quite impressive. I am not entirely sure why performance apparently got slightly worse at 120 workers; there are 417 time slices * 90 Z slices = ~38k chunks, which is plenty to saturate the workers. One thing I have noticed, especially with larger numbers of workers, is that at the end of reading/writing a large amount of data the collection process can often hang for very long periods of time (in some cases, several minutes), and that may have skewed the results here.
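
For reference, the Zarr-on-GCS read amounts to roughly the following (the bucket path, anonymous-access token, and variable name are illustrative, not the exact benchmark code):

```python
# Sketch of the Zarr/GCS read test: map the bucket prefix with gcsfs, open it
# as a dask-backed xarray dataset, and compute a mean.
import gcsfs
import xarray as xr
from dask.distributed import Client

client = Client()

fs = gcsfs.GCSFileSystem(token='anon')   # assumes anonymous read access to the bucket
store = gcsfs.mapping.GCSMap('pangeo-data/storage-benchmarks/llc4320_zarr', gcs=fs)
ds = xr.open_zarr(store)
result = ds['Theta'].mean().compute()
```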

Finally, on to some writes. This test writes a random Dask array of shape (1000, 3000, 3000), which comes out to about 67 GB, to a GCS/Zarr store. Of all the tests, this is the flakiest and tends to fail a lot (hence the missing bars). Still, the results are again pretty impressive: I was getting around 6-7 GB/sec write speeds.
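
Roughly, the write test looks like this (the target bucket path and chunk shape are placeholders; only the array shape and the general zarr-on-GCS approach come from the report):

```python
# Sketch of the GCS/Zarr write test: build a ~67 GB random dask array and store
# it into a zarr array backed by a GCSMap. Bucket path and chunks are placeholders.
import dask.array as da
import gcsfs
import zarr

# 1000 * 3000 * 3000 * 8 bytes = 72e9 bytes, i.e. the ~67 GiB reported below
arr = da.random.random((1000, 3000, 3000), chunks=(10, 3000, 3000))

fs = gcsfs.GCSFileSystem()
store = gcsfs.mapping.GCSMap('some-bucket/zarr-write-benchmark', gcs=fs)
z = zarr.create(shape=arr.shape, chunks=(10, 3000, 3000), dtype=arr.dtype, store=store)
da.store(arr, z)   # the timed operation
```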

| backend | chunks | n_workers | GBytes | time (s) | throughput (GB/s) | config |
| --- | --- | --- | --- | --- | --- | --- |
| GCS | 1 | 20 | 67.06 | 58.37 | 1.15 | GCS-1-20 |
| GCS | 1 | 40 | 67.06 | 31.39 | 2.14 | GCS-1-40 |
| GCS | 1 | 80 | 67.06 | 18.47 | 3.63 | GCS-1-80 |
| GCS | 1 | 120 | 67.06 | None | NaN | GCS-1-120 |
| GCS | 5 | 20 | 67.06 | 49.94 | 1.34 | GCS-5-20 |
| GCS | 5 | 40 | 67.06 | 32.22 | 2.08 | GCS-5-40 |
| GCS | 5 | 80 | 67.06 | 21.71 | 3.09 | GCS-5-80 |
| GCS | 5 | 120 | 67.06 | 17.05 | 3.93 | GCS-5-120 |
| GCS | 10 | 20 | 67.06 | None | NaN | GCS-10-20 |
| GCS | 10 | 40 | 67.06 | 20.45 | 3.28 | GCS-10-40 |
| GCS | 10 | 80 | 67.06 | 23.77 | 2.82 | GCS-10-80 |
| GCS | 10 | 120 | 67.06 | 10.21 | 6.57 | GCS-10-120 |

[Plot: Zarr-on-GCS write throughput by configuration]

The hardest thing about running these tests is dealing with frequent failures, although this has gotten a lot better as time has progressed and we've refined the Pangeo environment.

I also have a bunch of Xarray- and NumPy-centric single-process tests that I still need to write up plots for. Those will come out in the next round!

jhamman commented 6 years ago

@kaipak - this is great. These results are quite promising and I'm excited to see more!

mrocklin commented 6 years ago

Indeed! I'm glad to see these as well.

Do you have a sense of the source of the errors? Is this coming from within gcsfs? I'm curious if the retries= keyword within Dask might be able to help with this (arguably we should document this better).

Was HSDS also measured?

alimanfoo commented 6 years ago

This is very cool to see!

For each benchmark I'd be very interested to know the dimensions (shape) of the array, the shape of the chunks, the uncompressed size of each chunk (should be product of chunk shape times dtype itemsize), the typical compressed size of a chunk, and the compressor and compression options that were used.

kaipak commented 6 years ago

> Do you have a sense of the source of the errors? Is this coming from within gcsfs? I'm curious if the retries= keyword within Dask might be able to help with this (arguably we should document this better).

I listed a couple of the errors I was running into in #245. I started writing a notebook to better capture these errors and provide some statistics, which I will share in a more detailed preliminary report shortly.

Didn't know about the retries= keyword! I'll give that a shot, as I also plan on having the tests do more sample runs so the results are more statistically robust.
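
From what I understand, with the distributed scheduler that looks something like the sketch below (the retry count is arbitrary, and `ds` is assumed to be a dask-backed xarray Dataset as in the read tests):

```python
# Retry failed tasks a few times instead of failing the whole graph; helpful
# when intermittent gcsfs/GCS errors kill individual reads or writes.
from dask.distributed import Client

client = Client()                                  # attach to the existing cluster
result = ds['Theta'].mean().compute(retries=3)     # retry each failed task up to 3 times
# equivalently: client.compute(ds['Theta'].mean(), retries=3)
```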

> Was HSDS also measured?

@jreadey and I are currently working on this. I am hoping to have this included in the next set of tests.

rabernat commented 6 years ago

@kaipak thanks so much for your hard work on this! Having some hard numbers is extremely valuable.

alimanfoo commented 6 years ago

One more thought: I guess you are measuring throughput as the uncompressed size of the data being read or written? It would be good to know the overall compression ratio of the data, so you could also measure the actual throughput of data from GCS to the workers.

rabernat commented 6 years ago

@alimanfoo these are great questions. For now, I can give you the zarr info on one of the arrays:

Name               : /Theta
Type               : zarr.core.Array
Data type          : float32
Shape              : (6, 90, 2160, 2160)
Chunk shape        : (1, 1, 2160, 2160)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : gcsfs.mapping.GCSMap
No. bytes          : 10077696000 (9.4G)
Chunks initialized : 540/540
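
(As a quick sanity check on those numbers: each (1, 1, 2160, 2160) float32 chunk is 2160 * 2160 * 4 = 18,662,400 bytes, roughly 18.7 MB uncompressed, and 540 such chunks give 540 * 18,662,400 = 10,077,696,000 bytes, which matches the "No. bytes" figure above, i.e. ~9.4 GiB.)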

I don't know why it doesn't report the storage ratio (as in the zarr docs).

rabernat commented 6 years ago

@kaipak which dataset were you using for the 652 GB read? I can't find it in the repo.

alimanfoo commented 6 years ago

@rabernat the compression ratio is only available if the store class implements the getsize() method (e.g., DirectoryStore.getsize() implementation). I don't know if it's possible or practical to implement something like this for GCSMap? Alternatively I guess it should be possible to use something like gsutil du?
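
One rough way to get at the compression ratio without a store-level getsize() would be to total the object sizes under the store prefix and divide the uncompressed nbytes by that, e.g. something like the sketch below (the path is a placeholder, and the `du` call reflects my understanding of the gcsfs API):

```python
# Rough compression-ratio estimate: uncompressed nbytes from the zarr metadata
# divided by the total size of the objects stored on GCS under the array prefix.
import gcsfs
import zarr

fs = gcsfs.GCSFileSystem()
path = 'some-bucket/path/to/llc4320_zarr'          # placeholder store location
store = gcsfs.mapping.GCSMap(path, gcs=fs)
theta = zarr.open_array(store, mode='r', path='Theta')

compressed_bytes = fs.du(path + '/Theta', total=True)   # sum of object sizes on GCS
print('compression ratio ~', theta.nbytes / compressed_bytes)
```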

rabernat commented 6 years ago

The bigger dataset is at 'pangeo-data/storage-benchmarks/llc4320_zarr'

Name               : /Theta
Type               : zarr.core.Array
Data type          : float32
Shape              : (417, 90, 2160, 2160)
Chunk shape        : (1, 1, 2160, 2160)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : gcsfs.mapping.GCSMap
No. bytes          : 700399872000 (652.3G)
Chunks initialized : 10621/37530

Strangely it looks like it has not been fully initialized. That could be affecting the results.

> I don't know if it's possible or practical to implement something like this for GCSMap?

Just FYI, I did implement getsize() in my experimental zarr GCS backend: https://github.com/zarr-developers/zarr/pull/252/files#diff-31d15042dbeedbf2942ace2ad4b9b2e2R2010

alimanfoo commented 6 years ago

> Just FYI, I did implement getsize() in my experimental zarr GCS backend: https://github.com/zarr-developers/zarr/pull/252/files#diff-31d15042dbeedbf2942ace2ad4b9b2e2R2010

Ah yes, of course, sorry, I forgot about that. If you use that store implementation, calling array.info should report the compression ratio.

rabernat commented 6 years ago

But these examples most definitely do not use that backend. They use the regular gcsfs approach.

kaipak commented 6 years ago

Well shoot. I am re-uploading the dataset now and will rerun the Zarr tests.

alimanfoo commented 6 years ago

> But these examples most definitely do not use that backend. They use the regular gcsfs approach.

Yep, sorry for the noise. I meant that if you wanted a quick way to get diagnostics on the compression ratio, using your GCS store implementation would do it.

kaipak commented 6 years ago

I reuploaded the data and verified it's all there now:

[Screenshot confirming the re-uploaded dataset is complete]

I wanted to re-run the full set of tests this weekend, but the cluster was unusually busy and unfortunately I was not able to get 120 workers for long enough. I did manage to get a couple of test runs in with 116 nodes and still got just a little over 8 GB/sec. I'll try to rerun everything when things calm down a bit.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 6 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.