TomAugspurger opened this issue 5 years ago
Hi Tom,
thanks so much for looking at the example. I am a little busy at the moment preparing for my PhD defense in a week; afterwards I will have more time to look at things.
I just wanted to ask whether it would be helpful to have a larger sample of data?
No worries, my update isn't really close to being done yet. I'm going to run through the entire training next (hopefully this weekend).
A larger dataset (something that doesn't fit in memory on a single machine) would be interesting, but no rush on that.
Is there a convenient way for me to share the dataset with you (several hundred GB)? I currently do not have a good option.
Maybe this is something that Pangeo would consider hosting. What do you think @jhamman @rabernat?
Otherwise, you could write a function to make a mock dataset with the same variable names and shapes etc.
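For reference, a mock-data helper could be as simple as the sketch below. The variable names (`features`, `targets`) and the `sample`/`lev` dimensions are placeholders, since the real dataset's schema isn't shown in this thread; swap in the actual names and shapes.

```python
import numpy as np
import xarray as xr


def make_mock_dataset(n_samples=1024, n_levels=30, seed=0):
    """Build a small synthetic dataset with the same structure as the real one
    (variable names, dims, dtypes) but random values.

    The variable names and shapes here are placeholders for the real schema.
    """
    rng = np.random.default_rng(seed)
    shape = (n_samples, n_levels)
    return xr.Dataset(
        data_vars={
            "features": (("sample", "lev"), rng.standard_normal(shape, dtype="float32")),
            "targets": (("sample", "lev"), rng.standard_normal(shape, dtype="float32")),
        },
        coords={"sample": np.arange(n_samples), "lev": np.arange(n_levels)},
    )
```

A helper like this would also make it easy to scale the mock data up (or back it with dask arrays) to exercise the out-of-memory path without moving the real dataset around.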
Absolutely we can host the data!
I'm playing with the example from https://github.com/pangeo-data/ml-workflow-examples/pull/2. See https://nbviewer.jupyter.org/gist/TomAugspurger/f23c5342bef938a120b83a11d1cae077 for the updates.
On this subset, the dask + xarray overhead over h5py is about 2x, which I think is pretty encouraging. It seems like it'll be common to make a pre-processing pass over the data, applying whatever transformations are needed, before writing it back to disk in a form that's friendly to the deep learning framework. In this case the 2x overhead is for a single sample; with a little effort we'll be able to process batches of samples at once, which I suspect will give us better parallelism.
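Roughly, the batching could look something like the sketch below. It assumes a leading `sample` dimension and the placeholder variable names `features`/`targets`, which aren't necessarily what the real dataset uses.

```python
import xarray as xr


def batch_generator(ds: xr.Dataset, batch_size: int = 1024):
    """Yield (features, targets) batches as NumPy arrays.

    Slicing batch_size samples at a time means each dask graph covers a whole
    batch, so I/O and computation for one batch can overlap with the next
    instead of paying the per-sample overhead.
    """
    n = ds.sizes["sample"]  # assumes a leading "sample" dimension
    for start in range(0, n, batch_size):
        chunk = ds.isel(sample=slice(start, start + batch_size))
        # one compute per batch instead of one per sample
        yield chunk["features"].values, chunk["targets"].values
```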
Before I get too much further, can an xarray user check my work in https://nbviewer.jupyter.org/gist/TomAugspurger/f23c5342bef938a120b83a11d1cae077#XArray-based-Generator?
I also haven't done any real profiling yet, beyond glancing at the scheduler dashboard. We're getting good parallel reading, and computation overlaps with reading. But since we're only processing a single sample right now, there isn't much room for parallelism yet.
Thanks for the very clear examples @raspstephan.