pangeo-data / ml-workflow-examples

Simple examples of data pipelines from xarray to ML training
Apache License 2.0

Updates to rasp-data-loading.ipynb #6

Open TomAugspurger opened 5 years ago

TomAugspurger commented 5 years ago

I'm playing with the example from https://github.com/pangeo-data/ml-workflow-examples/pull/2. See https://nbviewer.jupyter.org/gist/TomAugspurger/f23c5342bef938a120b83a11d1cae077 for the updates.

On this subset, the dask + xarray overhead relative to h5py is about 2x, which I think is pretty encouraging. It seems like it'll be common to make a pre-processing pass over the data, applying a series of transformations before writing it back to disk in a form that's friendly to the deep learning framework. In this case the 2x overhead is for a single sample; with a little effort we'll be able to process batches of samples at once, which I suspect will give us better parallelism.
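For concreteness, a minimal sketch of what such a pre-processing pass might look like (the paths, chunk size, and normalization are placeholders, not from the notebook):

import xarray as xr

# Hypothetical pre-processing pass: open lazily with dask, normalize,
# then rewrite in a chunked, training-friendly layout.
ds = xr.open_dataset("raw_data.nc", chunks={"time": 128})  # placeholder path
ds = (ds - ds.mean("time")) / ds.std("time")               # example transform
ds.to_zarr("preprocessed.zarr", mode="w")                  # writing triggers the compute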

Before I get too much further, can an xarray user check my work in https://nbviewer.jupyter.org/gist/TomAugspurger/f23c5342bef938a120b83a11d1cae077#XArray-based-Generator?

class DataGenerator2(DataGenerator):  # DataGenerator comes from Stephan's original notebook

    def __getitem__(self, index):
        time, lat, lon = self.get_indices(index)
        # Pointwise ("vectorized") indexing: indexers that share the "z"
        # dimension select one (time, lat, lon) point per element of "z".
        subset = self.ds.isel(time=xr.DataArray(time, dims='z'),
                              lat=xr.DataArray(lat, dims='z'),
                              lon=xr.DataArray(lon, dims='z'))
        # Stack the input/output variables along a new "lev" dimension.
        X = xr.concat(subset[self.input_vars].to_array(), dim='lev')
        y = xr.concat(subset[self.output_vars].to_array(), dim='lev')

        return X, y
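
For anyone sanity-checking the indexing above, here's a toy illustration (made-up dataset and values, not from the notebook) of what the shared-dimension DataArray indexers do:

import numpy as np
import xarray as xr

# A tiny stand-in for self.ds.
ds = xr.Dataset({"T": (("time", "lat", "lon"), np.arange(60).reshape(4, 3, 5))})

# Indexers sharing the "z" dimension select one (time, lat, lon) point
# per element of "z": pointwise selection, not the outer product.
points = ds.isel(time=xr.DataArray([0, 2], dims="z"),
                 lat=xr.DataArray([1, 1], dims="z"),
                 lon=xr.DataArray([4, 0], dims="z"))
print(points["T"].values)  # [ 9 35], i.e. T[0, 1, 4] and T[2, 1, 0]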

I also haven't done any real profiling yet beyond glancing at the scheduler dashboard. We're getting good parallel reading, and computation overlaps nicely with reading. But since we're only processing a single sample right now, there isn't much room for parallelism yet.

Thanks for the very clear examples @raspstephan.

raspstephan commented 5 years ago

Hi Tom,

thanks so much for looking at the example. I am a little busy at the moment preparing for my PhD defense in a week; after that I will have more time to look at things.

I just wanted to ask: would it be helpful to have a larger sample of data?

TomAugspurger commented 5 years ago

No worries, my update isn't really close to being done yet. I'm going to run through the entire training next (hopefully this weekend).

A larger dataset (something that doesn't fit in memory on a single machine) would be interesting, but no rush on that.

raspstephan commented 5 years ago

Is there a convenient way for me to share the dataset with you? It is several hundred GB, and I currently do not have a good option for hosting it.

nbren12 commented 5 years ago

Maybe this is something that Pangeo would consider hosting. What do you think @jhamman @rabernat?

Otherwise, you could write a function that makes a mock dataset with the same variable names, shapes, etc.
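
Something along these lines might work (the variable names, shapes, and coordinates below are placeholders and would need to match the real data):

import numpy as np
import xarray as xr

def make_mock_dataset(n_time=100, n_lat=64, n_lon=128,
                      var_names=("QBP", "TBP", "PHQ")):
    """Random data shaped like the real dataset (all names here are placeholders)."""
    dims = ("time", "lat", "lon")
    shape = (n_time, n_lat, n_lon)
    return xr.Dataset(
        {name: (dims, np.random.rand(*shape).astype("float32")) for name in var_names},
        coords={
            "time": np.arange(n_time),
            "lat": np.linspace(-90, 90, n_lat),
            "lon": np.linspace(0, 360, n_lon, endpoint=False),
        },
    )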

rabernat commented 5 years ago

Absolutely we can host the data!
