biggus is designed pretty much to do what we want, yes? dask is really a different, more complex use-case. I'm all for the simpler the better.
It seems dask only has the "cool" factor as an advantage -- so if biggus is well supported enough, which I think it is, I say go for it.
Dask, and Dask.array, are awesome. They are absolutely the right abstraction for the problems they are trying to solve, and it is the full control that dask gives that has made it easy for the skimage guys to pick it up and run with it for one of their image-processing kernels. However, I do not think it is the right abstraction if all you want to think about is doing some transformation on your array, and not have to understand how these operations occur out of core.
Obviously, we've spoken a lot with the dask guys, and are pretty confident that it is trivial to make a simple (read: dumb) translation from biggus to dask.array (indeed, we have a proof of concept of that which we wrote when Matt Rocklin came to visit us in the UK). I'm also fairly confident that we can take some of the automatic chunking logic within biggus (which currently isn't very smart, incidentally) and apply it to that conversion.
Essentially, what I'm trying to say is that, if you want to not have to think (or know) about how your data is best chunked, biggus gives you the simpler interface. The key though is that in the future, you will be able to write biggus code and have that convert itself to dask graphs for execution (to give things such as distributed execution) - essentially it is the best of both worlds.
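To make the interface difference concrete, here is a minimal sketch, not code from this project, of the same reduction written both ways; the array is random placeholder data, and it assumes biggus's `NumpyArrayAdapter`, `biggus.mean`, and `.ndarray()` API:

```python
import numpy as np
import biggus
import dask.array as da

data = np.random.rand(855, 82, 130)

# biggus: wrap the source once; chunking and evaluation order are worked
# out for you when the result is finally realised.
lazy = biggus.NumpyArrayAdapter(data)
mean_lazy = biggus.mean(lazy, axis=0)        # still lazy
result_biggus = mean_lazy.ndarray()          # evaluated out of core here

# dask.array: you pick the chunk shape up front and decide where to
# trigger execution with .compute().
chunked = da.from_array(data, chunks=(1, 82, 130))
mean_graph = chunked.mean(axis=0)            # a task graph
result_dask = mean_graph.compute()           # executed here
```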
> I do not think it is the right abstraction if all you want to think about is doing some transformation on your array, and not have to understand how these operations occur out of core.
That is our problem here. We want to focus on clear coding and simple APIs. If we have to decide the chunking, where to call `.compute()`, etc., we would be better off using iris or xray directly. Our goal is not to re-invent the whole wheel! Maybe just some bolts :wink:
> Essentially, what I'm trying to say is that, if you want to not have to think (or know) about how your data is best chunked, biggus gives you the simpler interface.
Yep, that is pretty much it. Right now biggus is working as a drop-in replacement for the NumPy array without modifying the underlying code. That is good enough for us.
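As a rough, hypothetical illustration of that drop-in behaviour (not our actual code; the helper and array names are made up), a function written against plain NumPy-style indexing can be handed a biggus array unchanged:

```python
import numpy as np
import biggus

def first_time_steps(eta, n=10):
    """Return the first n time steps of a (time, y, x) array."""
    return eta[:n]

eta_np = np.random.rand(855, 82, 130)
eta_big = biggus.NumpyArrayAdapter(eta_np)

eager = first_time_steps(eta_np)   # plain ndarray, computed immediately
lazy = first_time_steps(eta_big)   # lazy biggus array, same code path
np.testing.assert_array_equal(eager, lazy.ndarray())
```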
> The key though is that in the future, you will be able to write biggus code and have that convert itself to dask graphs for execution (to give things such as distributed execution) - essentially it is the best of both worlds.
That will be awesome. When we are going from "big to small" the tools we have are OK. However, when scientists in our field are performing "big to big" operations they usually drop Python and use cdo, nco, etc. I guess that is a case where dask's distributed execution will make a difference.
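A minimal sketch, assuming dask.distributed's `Client` API (the local cluster and the random array below are just placeholders), of what a "big to big" computation on a distributed scheduler could look like:

```python
import dask.array as da
from dask.distributed import Client

client = Client()  # local workers; point this at a real scheduler for a cluster

x = da.random.random((855, 82, 130), chunks=(1, 82, 130))
anomaly = x - x.mean(axis=0)   # "big to big": output as large as the input
result = anomaly.compute()     # executed on the workers
```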
@pelson thanks for the explanation! We always learn a lot from your comments.
I am going with biggus and closing this. If someone has a different opinion, just re-open the issue and let us know.
I tried to be smarter about dask chunking by breaking 3D arrays, like `eta`, from (855, 82, 130) to (1, 82, 130). That reduced the dask time from several minutes to 6.4 s when computing the vertical coordinates. The same task with biggus took 5.35 s.

Now that the times are virtually the same we have to ask which one we should use. The pros are (a sketch of the chunking and the `.compute()` work-around is at the end of this comment):
- Biggus: Dask.array does not support any operation where the resulting shape depends on the values of the array. That means with dask we have to call `.compute()` in some places, while with biggus we do not need to consolidate the array beforehand.
- Dask: it also gives us `Bag` and `DataFrame` objects. We will probably never use those in APIRUS, but they are nice things to have.

Pinging @pelson here to see if he has any corrections, additions, or comments.
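For reference, here is the hedged sketch mentioned above of the (1, 82, 130) chunking and of the `.compute()` work-around from the first bullet; `eta` here is random data standing in for the real variable:

```python
import numpy as np
import dask.array as da

eta = np.random.rand(855, 82, 130)

# One chunk per time step, i.e. the (1, 82, 130) chunking described above.
eta_d = da.from_array(eta, chunks=(1, 82, 130))

# An operation whose output shape depends on the values (boolean indexing)
# is where dask.array falls short, so we consolidate the array first.
eta_np = eta_d.compute()
positive = eta_np[eta_np > 0]  # value-dependent shape, done on the ndarray
```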