ioos / APIRUS

API for Regular, Unstructured and Staggered model output (or API R US)
Creative Commons Zero v1.0 Universal

Should we use Dask or Biggus for our lazy evaluations? #5

Closed: ocefpaf closed this issue 9 years ago

ocefpaf commented 9 years ago

I tried to be smarter about dask chunking by breaking 3D arrays, like eta, from (855, 82, 130) to (1, 82, 130). That reduced the dask time from several minutes to 6.4 s when computing the vertical coordinates. The same task with biggus took 5.35 s.
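For reference, the chunking described above can be sketched with dask.array. This is a minimal sketch, not the actual project code: the array is a zero-filled stand-in for `eta`, only the shape `(855, 82, 130)` and the chunk shape `(1, 82, 130)` come from the comment.

```python
import numpy as np
import dask.array as da

# Stand-in for the model field "eta": 855 time steps on an 82 x 130 grid.
eta = np.zeros((855, 82, 130))

# One chunk per time step, as described above: (855, 82, 130) -> (1, 82, 130),
# so each task touches a single 2D slice instead of the whole 3D array.
lazy_eta = da.from_array(eta, chunks=(1, 82, 130))

print(lazy_eta.numblocks)                     # (855, 1, 1): one block per step
print(lazy_eta.mean(axis=0).compute().shape)  # (82, 130)
```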

Now that the times are virtually the same, we have to ask which one we should use. The pros are:

Pinging @pelson here to see if he has any corrections, additions, or comments.

ChrisBarker-NOAA commented 9 years ago

biggus is designed pretty much to do what we want, yes? dask is really a different, more complex use-case. I'm all for the simpler the better.

It seems dask only has the "cool" factor as an advantage -- so if biggus is well supported enough, which I think it is, I say go for it.

pelson commented 9 years ago

Dask, and Dask.array, are awesome. Dask.array is absolutely the right abstraction for the problem it is trying to solve, and it is the full control that dask gives that has made it easy for the scikit-image folks to pick it up and run with it for one of their image-processing kernels. However, I do not think it is the right abstraction if all you want to think about is doing some transformation on your array, without having to understand how these operations occur out of core.

Obviously, we've spoken a lot with the dask guys, and are pretty confident that it is trivial to make a simple (read: dumb) translation from biggus to dask.array (indeed, we have a proof of concept of that which we wrote when Matt Rocklin came to visit us in the UK). I'm also fairly confident that we can take some of the automatic chunking logic within biggus (which currently isn't very smart, incidentally) and apply it to that conversion.

Essentially, what I'm trying to say is that, if you want to not have to think (or know) about how your data is best chunked, biggus gives you the simpler interface. The key though is that in the future, you will be able to write biggus code and have that convert itself to dask graphs for execution (to give things such as distributed execution) - essentially it is the best of both worlds.

ocefpaf commented 9 years ago

> I do not think it is the right abstraction if all you want to think about is doing some transformation on your array, and not have to understand how these operations occur out of core.

That is our problem here. We want to focus on clear code and simple APIs. If we had to decide the chunking, where to call `.compute()`, etc., we would be better off using iris or xray directly. Our goal is not to re-invent the whole wheel! Maybe just some bolts :wink:
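To make the trade-off concrete, here is a hypothetical sketch (not from the project) of the decisions a dask caller has to make explicitly, which a biggus-style NumPy-like interface would hide:

```python
import numpy as np
import dask.array as da

temp = np.full((855, 82, 130), 293.15)  # stand-in model field, in kelvin

# With dask the caller picks the chunking and must remember to
# trigger evaluation explicitly:
lazy = da.from_array(temp, chunks=(1, 82, 130))  # chunking decision
result = (lazy - 273.15).mean(axis=0).compute()  # explicit .compute()

# A biggus-style API exposes the same lazy array behind a plain
# NumPy-like interface, which is the appeal described above.
print(result.shape)  # (82, 130)
```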

> Essentially, what I'm trying to say is that, if you want to not have to think (or know) about how your data is best chunked, biggus gives you the simpler interface.

Yep, that is pretty much it. Right now biggus works as a drop-in replacement for NumPy arrays without modifying the underlying code. That is good enough for us.

> The key though is that in the future, you will be able to write biggus code and have that convert itself to dask graphs for execution (to give things such as distributed execution) - essentially it is the best of both worlds.

That will be awesome. When we are going from "big to small" the tools we have are OK. However, when scientists in our field are performing "big to big" operations they usually drop Python and use cdo, nco, etc. I guess that is a case where dask's distributed execution will make a difference.
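The distinction above can be illustrated with a small dask.array sketch (shapes and variable names are made up for the example):

```python
import numpy as np
import dask.array as da

field = da.from_array(np.ones((855, 82, 130)), chunks=(1, 82, 130))

# "Big to small": a reduction whose result fits comfortably in memory,
# which is the case today's tools already handle well.
climatology = field.mean(axis=0).compute()  # shape (82, 130)

# "Big to big": an elementwise transform the same size as the input.
# Materializing it with .compute() would load the full array; streaming
# it chunk by chunk to disk (e.g. da.store / da.to_hdf5) keeps it out
# of core, and distributed execution would let it scale further.
anomaly = field - field.mean(axis=0)  # still lazy, same shape as field

print(climatology.shape, anomaly.shape)
```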

@pelson thanks for the explanation! We always learn a lot from your comments.

ocefpaf commented 9 years ago

I am going with biggus and closing this. If someone has a different opinion, just re-open the issue and let us know.