scikit-hep / histbook

Versatile, high-performance histogram toolkit for Numpy.
BSD 3-Clause "New" or "Revised" License

does it work with dask arrays? #8

Open rabernat opened 6 years ago

rabernat commented 6 years ago

I just discovered this very cool-looking package thanks to your tweet. This package could potentially have great value in the climate / atmosphere / ocean space, where our data tends to live in xarray / dask data structures.

Quick question: does it work with dask arrays? My naive reading of the code is that you coerce inputs explicitly to numpy. If not, is that on your roadmap?

jpivarski commented 6 years ago

Oh, that's a good point. I can imagine wanting to do that.

The calculation layer was designed with the assumption of Numpy arrays throughout, though it's supposed to be modular enough to swap the entire thing with, say, pycuda.GPUArrays (replacing all function definitions) or PySpark Column expressions, etc.

It might make more sense for me to identify that they're Dask arrays early, then iterate through slices of them at the chunksize, so that all the internal calculations can be performed on Numpy slices. This would turn the Dask case into something resembling the original intention: calling fill in a loop over data chunks, but within the Dask abstraction where the user doesn't have to write the explicit loop.
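That chunk loop can be sketched without any Dask machinery; here `fill_chunked` and its `fill` callback are hypothetical stand-ins, with `np.array_split` playing the role of Dask's chunking:

```python
import numpy as np

def fill_chunked(data, fill, chunksize=2500):
    """Hypothetical driver: feed a large array to a fill() callback
    one numpy-sized chunk at a time, the way a Dask-aware histbook
    might walk a dask.array's chunks."""
    n_chunks = max(1, -(-len(data) // chunksize))  # ceiling division
    for chunk in np.array_split(np.asarray(data), n_chunks):
        fill(chunk)  # all internal calculations see plain numpy slices

# stand-in for histogram filling: accumulate per-chunk sums
partial_sums = []
fill_chunked(np.arange(10_000), lambda c: partial_sums.append(c.sum()))
```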

rabernat commented 6 years ago

Thanks for your quick reply!

> It might make more sense for me to identify that they're Dask arrays early, then iterate through slices of them at the chunksize, so that all the internal calculations can be performed on Numpy slices.

All internal calculations in dask are already performed on numpy slices; that's how dask works! In general, much numpy code "just works" with dask arrays, since dask supports most of the numpy API.

For example, dask already has a histogram function that works on dask arrays: http://dask.pydata.org/en/latest/array-api.html#dask.array.histogram
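For reference, a minimal use of that function (note that, unlike `numpy.histogram`, the Dask version needs `bins` and `range` given explicitly, because it can't inspect the data eagerly):

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(100.0), chunks=25)  # four 25-element chunks

# bins and range must be specified up front; the counts are themselves lazy
hist, edges = da.histogram(x, bins=10, range=(0, 100))
counts = hist.compute()  # ten bins, each holding 10 of the 100 values
```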

It might be easier than you think.

Not an urgent issue! Just something to think about.

jpivarski commented 6 years ago

Oh, I was imagining that explicitly Numpy functions, e.g. `numpy.digitize(...)`, wouldn't actually work on a Dask array and that I'd need the Dask versions. If they do work, that's great!

My choice might boil down to whether I pass any object through, trusting that it will work via duck typing, or help the user with early checks (Numpy or Dask). At a minimum, I have to distinguish arrays from scalar numbers.

jpivarski commented 6 years ago

I looked into it with that goal in mind. Dask arrays do indeed recognize Numpy functions, including things like

```python
numpy.cumsum(daskarray)
```

(I'm not sure how Dask convinced a Numpy function to return a new Dask array.)

Nevertheless, it's going to take some thought to do this right, even though Dask can handle a lot of the calculations transparently: at the very least, I need to be sure it's doing the right thing, even if it technically works out of the box.

At some point in a Dask calculation, we have to call `.compute()`. The option I described above would be to do this before passing through the histbook compute layer: it would pull slices of the Dask array down from wherever they are and do a sequential calculation locally. That could be terrible: sometimes the whole reason you have a Dask array is to distribute calculations over networked machines.

The key line in filling a histogram is

https://github.com/diana-hep/histbook/blob/master/histbook/hist.py#L562

where all independent variables have been calculated, discretized, and collected into a broadcast for `bincontents.add.at(...)`. Perhaps Dask's `.compute()` could go right before that broadcast. In that case, Dask handles most of the calculation, possibly distributing it, definitely chunking it, but then I'd be asking for all of the integer indexes to be retrieved for a local broadcast. That doesn't scale to huge arrays.
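That final step is an unbuffered scatter-add over precomputed integer bin indexes; a numpy-only illustration of what `add.at` does there:

```python
import numpy as np

# integer bin indexes for each entry, as the compute layer would
# produce them after discretizing the independent variables
indexes = np.array([0, 2, 2, 1, 0, 2])
bincontents = np.zeros(3)

# unbuffered in-place add: repeated indexes accumulate correctly,
# unlike the buffered bincontents[indexes] += 1
np.add.at(bincontents, indexes, 1)
```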

What I'd really want, if I had a distributed Dask array of data, is for the histogram-filling to be parallel and combine histograms at the end (sending filled histograms over the network, rather than data that scale with the number of events). In that case, though, I still can't use Dask transparently, since the bin contents array filling and reduction tree need to be Daskified.
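That fill-locally-and-combine strategy is easy to sketch with plain numpy standing in for the distributed workers; only the fixed-size bin contents would ever cross the network:

```python
import numpy as np

data = np.random.default_rng(0).uniform(0, 1, 10_000)
edges = np.linspace(0, 1, 11)

# each "worker" histograms only its own chunk of the data...
partials = [np.histogram(chunk, bins=edges)[0]
            for chunk in np.array_split(data, 4)]

# ...and the small per-worker histograms are reduced at the end
combined = np.sum(partials, axis=0)
```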

This is a good use-case and I'll consider it "on the list" to the same extent as PySpark Column expressions, which I also have a request for from Histogrammar users. The bottom line is that histbook's calculation layer is modular and different backends are foreseen. Permitting Dask is not as simple as removing the guard against non-Numpy arrays, though.

rabernat commented 6 years ago

Just pinging a few dask developers, who might have better advice than I do: @mrocklin, @jakirkham, @jcrist.

mrocklin commented 6 years ago

This NEP and discussion are probably relevant

http://numpy-discussion.10968.n7.nabble.com/NEP-Dispatch-Mechanism-for-NumPy-s-high-level-API-td45638.html

Also blogpost here: http://matthewrocklin.com/blog/work/2018/05/27/beyond-numpy

jpivarski commented 6 years ago

Thanks! It must be working via `__array_ufunc__`; I didn't know about that. I hope the enhancement to Numpy (`__array_function__`) works out: this is a great tool for making arrayish libraries transparent.
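The dispatch can be demonstrated with a toy wrapper (not Dask's actual implementation) that opts into the `__array_function__` protocol, enabled by default since NumPy 1.17; this is what lets `numpy.cumsum` hand back a wrapped array instead of coercing to an ndarray:

```python
import numpy as np

class TrackedArray:
    """Toy array-ish class: any dispatched numpy function called on it
    returns another TrackedArray instead of a plain ndarray."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # unwrap any TrackedArray arguments, run the real numpy
        # implementation, and re-wrap the result
        unwrapped = [a.data if isinstance(a, TrackedArray) else a
                     for a in args]
        return TrackedArray(func(*unwrapped, **kwargs))

result = np.cumsum(TrackedArray([1, 2, 3]))  # stays a TrackedArray
```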

I don't have a formal roadmap for histbook yet, but it will include handling Dask arrays. histbook was designed for out-of-memory analytics (for HEP, which means ROOT files). As part of today's tests, I finished uproot's Dask functionality (to generate test cases).