Hero workload brainstorm

jrbourbeau commented 3 years ago

Yesterday in the group meeting it was mentioned that we should come up with a so-called "hero workload" that both

Produces a useful result that folks will find interesting as well as
Really shows off the power of what we can do with the PyData stack today

I'm opening this issue as a place for us to brainstorm about what such a workload might look like. @rabernat @jbusecke do you have any initial thoughts about huge datasets and/or computation-heavy workloads the geoscience community would be excited to see us work on?

jrbourbeau commented 3 years ago

@rabernat @jbusecke there's the Dask Distributed Summit coming up this May. From the summit's landing page, it's a virtual conference "where users, contributors, and newcomers can share experiences to learn from one another and grow together".

This would be a great space for Pangeo folks to share their experience using Dask. For example, we could share the results of a hero workload, or if we're not quite there yet, discuss some of the successes and pain points that we've encountered so far as part of such a workload.

The CFP closes in a couple of days (March 28) and the talks themselves (30 minutes) are May 19-21. Do either of you, or other Pangeo folks, have an interest in submitting a talk?

rabernat commented 3 years ago

Hi @jrbourbeau! I'm a bit confused about the Dask Summit. There are two "workshop proposals" that I know about: Pangeo and Xarray. But I haven't seen an actual call for talk proposals. Did we miss that deadline?

rabernat commented 3 years ago

Regarding the hero workload, I have an idea coming together. The core of the method is described in this recent paper by @dhruvbalwada. We want to compute the vorticity / strain / divergence joint distribution from ocean surface velocities. However, we want to do it using the dataset described here - https://medium.com/pangeo/petabytes-of-ocean-data-part-1-nasa-ecco-data-portal-81e3c5e077be - which is mirrored on google cloud here - https://catalog.pangeo.io/browse/master/ocean/LLC4320/. We could also introduce some scale dependence into the calculation by doing filtering with the new software package coming together here - https://github.com/ocean-eddy-cpt/gcm-filters/.

Our default way of implementing these calculations - using xgcm - would not be able to leverage GPUs. However, we could easily hand code some of the routines using cupy.
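As a rough illustration of what "hand coding" such a routine might look like, here is a minimal sketch of relative vorticity written against a generic array module. It assumes a uniform Cartesian grid with periodic boundaries, which is a big simplification (LLC4320 is curvilinear, so a real version would need the grid metrics that xgcm handles). The function name and the `xp` parameter are hypothetical, not part of xgcm or cupy.

```python
import numpy as np

def relative_vorticity(u, v, dx, dy, xp=np):
    """zeta = dv/dx - du/dy using centered differences.

    Assumes a uniform, periodic grid with y on axis -2 and x on axis -1.
    `xp` is the array module: numpy by default, or cupy for GPU arrays.
    """
    dvdx = (xp.roll(v, -1, axis=-1) - xp.roll(v, 1, axis=-1)) / (2 * dx)
    dudy = (xp.roll(u, -1, axis=-2) - xp.roll(u, 1, axis=-2)) / (2 * dy)
    return dvdx - dudy

# Solid-body rotation (u = -omega*y, v = omega*x) has vorticity 2*omega.
ny, nx = 6, 8
dx = dy = 1000.0
omega = 1e-5
y, x = np.meshgrid(np.arange(ny) * dy, np.arange(nx) * dx, indexing="ij")
zeta = relative_vorticity(-omega * y, omega * x, dx, dy)
# Only the interior is meaningful here: the test field is not periodic,
# so the wrapped differences at the edges are wrong by construction.
interior = zeta[1:-1, 1:-1]
```

If cupy is installed, the same function should run on the GPU by passing cupy arrays and `xp=cupy`, since `cupy.roll` mirrors `numpy.roll`.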

One big challenge is that the data live in Google Cloud US-Central region. Would it be possible to get a Coiled GPU dask cluster going there?

jrbourbeau commented 3 years ago

In addition to workshops there are also talks and tutorials at the Dask summit. Apologies, I probably should have posted something earlier here about the summit. Fortunately, I submitted an aspirational talk about processing a petabyte of data with Dask in the cloud. The hero workload sounds like it would be a good fit for that talk : )

> Would it be possible to get a Coiled GPU dask cluster going there?

We're in the process of rolling out GCP support in US-Central at the moment -- I can let you know when things are up and running

Also cc'ing @gjoseph92 for visibility

rabernat commented 3 years ago

Following up on the conversation at today's meeting. The idea is to do the following

1. Split the domain into roughly 5x5-degree blocks. (Possibly use rechunker to write a copy of the data with an optimized chunk structure. Or try without rechunking to disk and see what happens...)
2. Calculate vorticity, strain, and divergence from the surface velocity fields using xgcm. (This will provide a chance to optimize these operations in terms of their dask graph.)
3. Calculate the 3D vorticity / strain / divergence histogram for each box.
4. Persist the results for further analysis.
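To make the shape of this workflow concrete, here is a small NumPy-only sketch of the vorticity / strain / divergence and per-box histogram steps. It is not the xgcm implementation: it assumes a uniform Cartesian grid, uses `np.gradient` instead of proper C-grid operators, and defines boxes as fixed-size index blocks rather than 5x5-degree regions. The function names are made up for illustration, and the definitions used are the standard ones (vorticity = v_x - u_y, divergence = u_x + v_y, strain = sqrt((u_x - v_y)^2 + (v_x + u_y)^2)).

```python
import numpy as np

def vort_strain_div(u, v, dx, dy):
    """Vorticity, strain magnitude, and divergence on a uniform grid (y, x)."""
    du_dy, du_dx = np.gradient(u, dy, dx)
    dv_dy, dv_dx = np.gradient(v, dy, dx)
    vort = dv_dx - du_dy
    div = du_dx + dv_dy
    strain = np.hypot(du_dx - dv_dy, dv_dx + du_dy)
    return vort, strain, div

def block_histograms(vort, strain, div, block, bins, ranges):
    """3D joint histogram for each (by, bx) block of the domain."""
    by, bx = block
    ny, nx = vort.shape
    out = {}
    for j in range(0, ny, by):
        for i in range(0, nx, bx):
            sl = (slice(j, j + by), slice(i, i + bx))
            # One row per grid point, columns = (vorticity, strain, divergence)
            sample = np.stack(
                [vort[sl].ravel(), strain[sl].ravel(), div[sl].ravel()], axis=1
            )
            counts, _ = np.histogramdd(sample, bins=bins, range=ranges)
            out[(j // by, i // bx)] = counts
    return out

# Synthetic velocities standing in for the real surface fields.
rng = np.random.default_rng(0)
u = rng.standard_normal((20, 30))
v = rng.standard_normal((20, 30))
vort, strain, div = vort_strain_div(u, v, dx=1.0, dy=1.0)

# Shared bin ranges so histograms from different boxes are comparable.
ranges = [(f.min(), f.max()) for f in (vort, strain, div)]
hists = block_histograms(vort, strain, div, block=(10, 10), bins=8, ranges=ranges)
```

Using the same bin edges for every box keeps the per-box histograms additive, so regional or global distributions can be built later by summing the persisted counts.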

We could also consider an interactive version of this, where a user can draw a bounding box on a map and then have the histogram generated on the fly! 😱 🤩

I propose that anyone interested (cc @ocean-transport/members + @dhruvbalwada + @jrbourbeau) have a meeting dedicated to this topic.

I am available this Friday 12-1pm and 2:45-3:30pm (EST). If those times don't work, we can look at next week.

dhruvbalwada commented 3 years ago

I can do both of those times this week, with a slight preference for the second slot.

jbusecke commented 3 years ago

I am very interested and can do both slots.

jbusecke commented 3 years ago

I combed through the xgcm issue tracker and I think it might be helpful to fast-track this issue.

My thinking was that there might be opportunities to optimize these higher level operators without calling the lower level ones (diff/interp) in a chained manner? But for that to work it might be helpful to have this functionality implemented and properly tested, so that we have a "ground truth" before messing with the internals. Just a thought; we can chat about it during the meeting.

hscannell commented 3 years ago

I would like to attend and both time slots work for me as well. I will read Dhruv's paper before then.

TomNicholas commented 3 years ago

I am also interested, free both times, but need to read Dhruv's paper to understand!

jrbourbeau commented 3 years ago

Count me in too -- both time slots this Friday work for me

rabernat commented 3 years ago

Ok invite sent for 2:45.

@dhruvbalwada - perhaps you could walk us through the code you used in your paper? No need to prepare anything fancy though...

cspencerjones commented 3 years ago

Seems like there might be an overlap with what Qiyu is already doing? At least maybe worth inviting him...

dhruvbalwada commented 3 years ago

Good idea, he just started looking at the vorticity-strain histograms in llc4320 to compare against idealized simulations with seasonality.

@qyxiao are you available tomorrow at 2:45pm eastern time?

qyxiao commented 3 years ago

Hi Dhruv, Yes I am, what is this about?

On Thu, Apr 15, 2021, 4:59 PM Dhruv Balwada @.***> wrote:

Good idea, he just started looking at the vorticity-strain histograms in llc4320 to compare against idealized simulations with seasonality.

@qyxiao https://github.com/qyxiao are you available tomorrow at 2:45pm eastern time?


dhruvbalwada commented 3 years ago

I have shared the meeting invite with Qiyu, so he can join in as well.

rabernat commented 3 years ago

> Seems like there might be an overlap with what Qiyu is already doing? At least maybe worth inviting him...

Thanks so much @cspencerjones for making this connection. It's actually perfect because we are looking for a scientific lead for this project. It sounds like @qyxiao is a natural fit for this. Just have to make sure all our goals are aligned.

rabernat commented 3 years ago

So in #12, I provided a simple example of how xgcm's operations can be optimized. (This was my TODO item from our last meeting.) I'm trying to figure out the next step: how to actually get such optimizations to work via xgcm. I see a few different paths we could take:

1. Modify xgcm to use some kind of manual inlining / compiling of consecutive operations
2. Modify xgcm to use a "kernel" based approach, like gcm-filters, plus map_blocks
3. Don't change xgcm (much) and instead rely on dask to make these optimizations
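For readers who have not seen the notebook in #12, the idea behind the first two options can be sketched with a toy 1D periodic example: an interp followed by a diff is algebraically a single three-point stencil, so the two passes (and the intermediate array) can be collapsed into one kernel. This is only an illustration under those assumptions. The function names are hypothetical and this is not how xgcm is implemented.

```python
import numpy as np

def interp_then_diff(f):
    """Chained version: interpolate to cell edges, then difference back."""
    f_mid = 0.5 * (f + np.roll(f, -1, axis=-1))   # value at i + 1/2
    return f_mid - np.roll(f_mid, 1, axis=-1)     # (i + 1/2) minus (i - 1/2)

def fused_kernel(f):
    """Same result in one pass: 0.5 * (f[i+1] - f[i-1])."""
    return 0.5 * (np.roll(f, -1, axis=-1) - np.roll(f, 1, axis=-1))

# The chained version acts as the "ground truth" for the fused one.
rng = np.random.default_rng(42)
f = rng.standard_normal((4, 16))
chained = interp_then_diff(f)
fused = fused_kernel(f)
max_abs_diff = float(np.max(np.abs(chained - fused)))
```

On chunked dask arrays, a fused kernel like this would presumably be applied per block with a one-cell halo (for example via `dask.array.map_overlap`), which is roughly what the kernel-based option amounts to. Option 3 would instead leave the chained form in place and rely on dask's graph optimization to fuse the tasks.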

I'd be curious to hear the Coiled folks' views on the feasibility of 3, using my example notebook as a reference problem.

rabernat commented 3 years ago

FYI, I also just shared an example of how to do an "interp with land" operation using the kernel + apply_ufunc approach here: https://github.com/xgcm/xgcm/issues/324

ocean-transport / coiled_collaboration

Hero workload brainstorm #5