pydata / sparse

Sparse multi-dimensional arrays for the PyData ecosystem
https://sparse.pydata.org
BSD 3-Clause "New" or "Revised" License

Support Everything that XArray Expects #1

Open mrocklin opened 7 years ago

mrocklin commented 7 years ago

https://github.com/bolt-project/bolt/issues/58

shoyer commented 6 years ago

I updated the check-list (apparently I'm an "Owner" for pydata) to drop "insert/fancy indexing" and add a three-item checklist for indexing instead.

hameerabbasi commented 6 years ago

Do we have partial XArray compatibility or does XArray fail with sparse arrays outright?

shoyer commented 6 years ago

@hameerabbasi Not yet, see https://github.com/pydata/xarray/issues/1375. But sparse is close enough that it might be worth trying soon.

From a usability perspective, the most helpful additional feature would be nan-skipping aggregations. I know sparse will probably rarely be used with NaNs, but xarray's aggregation methods default to skipna=True so this would be valuable for consistency.

In addition to the direct uses, xarray uses where and outer/vectorized indexing internally primarily for alignment with join='outer' (e.g., in concat or merge). This sort of alignment would require making sparse arrays dense, unless they have a fill value of NaN rather than 0. So these operations are less essential than they might otherwise seem.
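To make the alignment point concrete, here is a tiny example (assumed, not from the thread) of how join='outer' introduces NaN:

import xarray as xr

a = xr.DataArray([1.0, 2.0], coords={"x": [0, 1]}, dims="x")
b = xr.DataArray([3.0, 4.0], coords={"x": [1, 2]}, dims="x")

aligned_a, aligned_b = xr.align(a, b, join="outer")
print(aligned_a.values)  # [ 1.  2. nan]: the position missing from a is filled with NaN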

hameerabbasi commented 6 years ago

I just tested with your example code: NaN-skipping aggregations use ufuncs with reduce, so they should already be supported in theory.

>>> import numpy as np
>>> class A(np.ndarray):
...     def __array_ufunc__(self, *args, **kwargs):
...         print('ufunc', args, kwargs)
...         
>>> np.nansum(np.arange(5).view(A))
ufunc (<ufunc 'add'>, 'reduce', A([0, 1, 2, 3, 4])) {'axis': None, 'dtype': None, 'keepdims': False}

Although I have absolutely no idea how to discern that this was NaN-skipping from the args/kwargs.

hameerabbasi commented 6 years ago

Ah, it was a dtype issue.

>>> np.nansum(np.arange(5, dtype=np.float).view(A))
ufunc (<ufunc 'isnan'>, '__call__', A([0., 1., 2., 3., 4.])) {}
ufunc (<ufunc 'add'>, 'reduce', A([0., 1., 2., 3., 4.])) {'axis': None, 'dtype': None, 'keepdims': False}

So we're good to go on those already. :-)

Edit: Concrete example:

>>> ar = sparse.DOK((2, 2))
>>> ar[1, 1] = np.nan
>>> s = sparse.COO(ar)
>>> np.nansum(s)
0.0
mrocklin commented 6 years ago

FWIW this is a great demonstration of the value of NumPy's use of protocols. It would be great if we could get an alternative ndarray implementation, sparse, to work with a complex downstream user of NumPy code, XArray, without either library having to explicitly know about the other. cc @njsmith

hameerabbasi commented 6 years ago

I was also considering sparse arrays with different fill values. However, some operations (such as dot) blow up in complexity, while others are fairly easy to support.

shoyer commented 6 years ago

NumPy's nan-aggregations work, but only by converting sparse arrays into numpy arrays, inside np.lib.nanfunctions._replace_nan:

import numpy as np
import sparse

class NoncoercibleCOO(sparse.COO):
  def __array__(self, *args, **kwargs):
    raise Exception('cannot convert to numpy')

ar = sparse.DOK((2, 2))
ar[1, 1] = np.nan
s = NoncoercibleCOO(ar)
np.nansum(s)

This results in:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-13-481c3bc38f6d> in <module>()
      9 ar[1, 1] = np.nan
     10 s = NoncoercibleCOO(ar)
---> 11 np.nansum(s)

/usr/local/lib/python3.6/dist-packages/numpy/lib/nanfunctions.py in nansum(a, axis, dtype, out, keepdims)
    579 
    580     """
--> 581     a, mask = _replace_nan(a, 0)
    582     return np.sum(a, axis=axis, dtype=dtype, out=out, keepdims=keepdims)
    583 

/usr/local/lib/python3.6/dist-packages/numpy/lib/nanfunctions.py in _replace_nan(a, val)
     62 
     63     """
---> 64     a = np.array(a, subok=True, copy=True)
     65 
     66     if a.dtype == np.object_:

<ipython-input-13-481c3bc38f6d> in __array__(self, *args, **kwargs)
      4 class NoncoercibleCOO(sparse.COO):
      5   def __array__(self, *args, **kwargs):
----> 6     raise Exception('cannot convert to numpy')
      7 
      8 ar = sparse.DOK((2, 2))

Exception: cannot convert to numpy

So this really isn't a good solution yet :).

You might actually view this as the flip side of implementing protocols (like __array__): you get default implementations, but they aren't always good ones and it's not always obvious when that's the case.

mrocklin commented 6 years ago

Is this already reported upstream in NumPy?

njsmith commented 6 years ago

Hmm. So for this what we really want is something like: implement nansum using where=, and make reductions support where=?
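For concreteness, a sketch of that idea with the where= keyword that NumPy reductions now accept (1.17+); this is illustrative only, not how np.nansum is actually implemented:

import numpy as np

def nansum_via_where(a, axis=None):
    # mask NaNs out at reduction time instead of copying the array first;
    # np.add has an identity (0), so no initial= is needed
    return np.add.reduce(a, axis=axis, where=~np.isnan(a))

print(nansum_via_where(np.array([1.0, np.nan, 2.0])))  # 3.0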

shoyer commented 6 years ago

I was also considering sparse arrays with different fill values. However, some operations (such as dot) blow up in complexity, and on others, it's fairly easy to support.

I think this could be quite interesting indeed. NaN (which pandas's SparseArray uses) is the most obvious other default value to support. Even dot could be made to work if you make a nan-skipping version (nandot, nanmatmul?).

Is this already reported upstream in NumPy?

I don't know if this really qualifies as a bug, so much as an unintended consequence. I think the easiest fix would be to replace np.copyto() in _replace_nan with a protocol-dispatching version of the three-argument variant of np.where().
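A minimal sketch of what that could look like, assuming np.where dispatches to the duck array (as it now does via __array_function__) and that the array exposes a dtype attribute; this mirrors the shape of NumPy's private _replace_nan but is not the actual implementation:

import numpy as np

def _replace_nan(a, val):
    # no np.array(...)/np.copyto() coercion: every call here can be
    # intercepted by a duck array such as sparse.COO
    if not np.issubdtype(a.dtype, np.inexact):
        return a, None
    mask = np.isnan(a)                     # ufunc, dispatches via __array_ufunc__
    return np.where(mask, val, a), mask    # dispatches via __array_function__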

hameerabbasi commented 6 years ago

A simpler option would be to add skipnan= in ufunc reduce and accumulate methods.

Edit: And reduceat.

Edit: Or maybe in ufuncs themselves: ufunc(NaN, NaN) = NaN; otherwise, return the non-NaN item.

hameerabbasi commented 6 years ago

I think this could be quite interesting indeed. NaN (which pandas's SparseArray uses) is the most obvious other default value to support. Even dot could be make to work if you make a nan-skipping version (nandot, nanmatmul?).

I meant supporting arbitrary fill values. This would make many dense operations automatically possible, e.g. y = x + 5, y would have a fill value of 5.

You might, for example, do x[x == 0] = np.nan (note that multidimensional boolean indexing is not currently supported) and it would work.

shoyer commented 6 years ago

I meant supporting arbitrary fill values. This would make many dense operations automatically possible, e.g. y = x + 5, y would have a fill value of 5.

Yes, I see the elegance in that.

NaN has a nice property (like zero) that NaN * anything -> NaN. This could make some operations easier to implement if the general fill value is not viable.

hameerabbasi commented 6 years ago

NaN has a nice property (like zero) that NaN * anything -> NaN. This could make some operations easier to implement if the general fill value is not viable.

The fill value is viable, and pretty easy to implement. I would just need a decorator to guard operations that require a zero fill value; it could be used like this:

@zero_fill_value
def dot(a, b):
    ...
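A minimal sketch of such a decorator, assuming arrays expose a fill_value attribute (the names here are illustrative, not sparse internals):

import functools

def zero_fill_value(func):
    @functools.wraps(func)
    def wrapper(*arrays, **kwargs):
        # reject any operand whose fill value is not zero
        for arr in arrays:
            if getattr(arr, "fill_value", 0) != 0:
                raise ValueError(
                    f"{func.__name__} only supports a fill value of zero, "
                    f"got {arr.fill_value!r}"
                )
        return func(*arrays, **kwargs)
    return wrapper
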
fujiisoup commented 6 years ago

I recently raised an issue of nan-aggregation in numpy/numpy#10456, but it looks a little hard to implement nan-aggregations without copy.

hameerabbasi commented 6 years ago

The main issue I see with the three-argument version of where is three-way broadcasting of sparse arrays. You have to match 7 combinations of "Is it in this one's dimensions but not the other one's?"

If someone can come up with an easier way to do N-way broadcasting of sparse arrays rather than just repeating them over and over... I'd be all for it. Broadcasts in Numpy are views... Here they're separate objects (repeats), with all the memory and performance losses that come with that.

Edit: Also, the performance speedups are significant when the operator returns zero whenever one operand is zero (like logical_and and multiply).

Edit 2: Currently, I've optimised _elemwise_binary so that it doesn't have to repeat them. That's hard to do for N-way broadcasts, which is the main issue here.

hameerabbasi commented 6 years ago

I just realized there is a complicated (yet hiding in plain sight) way to recursively reduce an N-way broadcast down into a 2-way/1-way one (without losing any of the performance benefits). It will also simplify the code hugely. I'll try it today and tomorrow; maybe I'll come up with something genius.

shoyer commented 6 years ago

The main issue I see with the three-argument version of where is three-way broadcasting of sparse arrays. You have to match 7 combinations of "Is it in this one's dimensions but not the other one's?"

In practice broadcasting for where is often trivial, with one or both of the second and third arguments being scalars, e.g., where(x < 0, np.nan, x). But it's nice to support broadcasting for completeness, even if it will be slower.

hameerabbasi commented 6 years ago

The broadcasting is turning out to be more complex than I anticipated (it took the most of last weekend).

shoyer commented 6 years ago

@hameerabbasi what exactly is involved in the "match" you're proposing?

My initial thought is that an algorithm for this could simultaneously iterate over all input arguments exactly once each. Assuming sorted coordinates, this looks something like:

# warning: untested!
iter_coords = [iter(input.coords) for input in inputs]
iter_datas = [iter(input.data) for input in inputs]
indices = [next(it, None) for it in iter_coords]
result_coords = []
result_data = []
while any(index is not None for index in indices):
    current_index = min(index for index in indices if index is not None)
    current_data = [next(it_data) if index == current_index else input.fill_value
                    for it_data, index, input in zip(iter_datas, indices, inputs)]
    result_coords.append(current_index)

    # in reality, we probably save an explicit indexer array in each input's data,
    # using -1 for fill-values, and then compute the result data once by applying
    # it to 1D vectors
    result_data.append(func(*current_data))

    # advance only the iterators whose coordinate was just consumed
    indices = [next(it, None) if index == current_index else index
               for it, index in zip(iter_coords, indices)]
This would run in O(NK) time, where N = sum(input.coords.size for input in inputs) and K = len(inputs). Of course, it would need to be implemented in something fast (e.g., Cython, Numba, C++) to be viable.

hameerabbasi commented 6 years ago

Since my last comment was a mishmash of incoherent half-formed thoughts, I'm re-commenting.

This is much simpler than what I had in mind, and can be easily extended to a broadcasting scenario. The main problem I see with something like this is the following:

Consider two arrays a and b. Here, a.shape = (10**5,), and b.shape = (10**5, 10**5). Also, a.nnz = b.nnz = 1. Now suppose I'm computing a * b. We currently optimize this by noticing that func(0, b.data) = 0 and func(a.data, 0) = 0. Therefore, we compute the "full matches", and don't compute the "partial matches". If you were to calculate a * b and a + b in this situation, you would see the speed difference.

If this optimization can somehow be ported to your method (I don't see how, unfortunately, while keeping broadcasting), I'd be perfectly willing to write up a Cython equivalent of it.

To answer your original question, a "match" is a calculation of where input 1 is nonzero, input 2 is zero, and so on, for every possible combination.

shoyer commented 6 years ago

Consider two arrays a and b. Here, a.shape = (10**5,), and b.shape = (10**5, 10**5). Also, a.nnz = b.nnz = 1. Now suppose I'm computing a * b. We currently optimize this by noticing that func(0, b.data) = 0 and func(a.data, 0) = 0. Therefore, we compute the "full matches", and don't compute the "partial matches". If you were to calculate a * b and a + b in this situation, you would see the speed difference.

It seems like the solution we need here is a "sparse broadcast" function, which only repeats non-zero indices in the result. The algorithm would look something like:

"sparse" vs "regular" broadcasting could be chosen based on the nature of the operator being applied (e.g., * vs +)

hameerabbasi commented 6 years ago

That's (roughly) what I'm doing with my matching setup. I do the following (in pseudocode):

iterate through all possible combinations of zero/nonzero inputs:
    match nonzero inputs' coordinates (find coordinates that match)
    find func(corresponding_data, zeros)
    filter zeros from func (and corresponding coordinates) (this does the optimization I describe)
    match matched coordinates with every zero-input and filter out those
    Broadcast leftover coords/data
    add coords, data to a list

concatenate all coords and data lists

Edit: The matching setup is on a branch on my fork, it isn't in the main codebase, and it isn't in working condition yet. It needs more work.

Edit 2: Choosing it based on the operator is still troublesome for some edge cases involving np.inf or np.nan.

shoyer commented 6 years ago

I don't think it's necessary to iterate through every possible combination of zero/nonzero inputs. Maybe something like:

If you are concerned about the (constant factor) overhead for explicit broadcasting in cases like 2 * x, you could add a shortcut that converts these operators into "unary" operators that are only applied to a single sparse array.

hameerabbasi commented 6 years ago

What I'm more concerned about is:

Some things that could be made independent of the implementation:

To be perfectly clear, I would prefer your approach much more (it is undeniably more performant in the "dense" broadcasting case) if these edge-cases etc. could be nicely handled automatically.

shoyer commented 6 years ago

For example, fill-values will be affected by this (it won't generalize to arbitrary fill values).

The property you need here is op(fill_value, x) == fill_value for all x (or op(x, fill_value) == fill_value). In general this only holds for particular combinations of op and fill_value, e.g.,

So sure, the way this works is fill-value specific, but you could store the list of these identities and operations in a table.
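A sketch of what such a table could look like; the names below are illustrative, not existing sparse internals:

import numpy as np

# (op, fill_value) pairs for which op(fill_value, x) == fill_value for all finite x
ABSORBING = {
    (np.multiply, 0),         # 0 * x == 0
    (np.logical_and, False),  # False & x == False
    (np.minimum, -np.inf),    # min(-inf, x) == -inf
    (np.maximum, np.inf),     # max(inf, x) == inf
}

def is_absorbing(op, fill_value):
    return (op, fill_value) in ABSORBING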

If there's a nan or inf in there, the values won't match Numpy (because np.nan * 0 = np.nan and np.inf * 0 = np.nan), so we need "dense" broadcasting in this case. To fix this, we would have to make the "switch" between the two approaches more complex.

This feels like a separate issue to me. How do you handle this in your current approach?

One option is to keep track of the finiteness of sparse arrays, e.g., by adding a (possibly cached) all_finite property.
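A rough sketch of such a cached property, assuming .data and .fill_value attributes (illustrative only):

import numpy as np

class FinitenessMixin:
    _all_finite = None

    @property
    def all_finite(self):
        # cache the check; a real implementation would invalidate it on mutation
        if self._all_finite is None:
            self._all_finite = bool(
                np.isfinite(self.data).all() and np.isfinite(self.fill_value)
            )
        return self._all_finite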

It is guaranteed to not kick in for user-defined functions. For example, sparse.elemwise(lambda x,y:x*y, x, y).

I think I finally understand your approach: you are actually computing func(inputs[0].data, 0), func(0, inputs[1].data), etc., alongside non-zero elements in both arrays. This feels a little crazy to me if you can know it statically :).

I would suggest adding an Operation class for wrapping functions applied to sparse arrays that could add information like the types of identities it obeys. Builtin operations and NumPy ufuncs could automatically be mapped to the appropriate Operation. You basically need this anyway if you want to let users write their own versions of functions like np.sin().
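A rough sketch of what such a wrapper might look like; the class and attribute names are invented for illustration:

import numpy as np

class Operation:
    def __init__(self, func, zero_preserving=False):
        self.func = func
        # True if the result is zero whenever any input is zero (ignoring
        # nan/inf), which is what allows skipping partial matches
        self.zero_preserving = zero_preserving

    def __call__(self, *args, **kwargs):
        return self.func(*args, **kwargs)

# known ufuncs map to pre-built Operations; unknown callables get conservative defaults
KNOWN = {
    np.multiply: Operation(np.multiply, zero_preserving=True),
    np.add: Operation(np.add, zero_preserving=False),
}

def as_operation(func):
    return KNOWN.get(func, Operation(func))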

hameerabbasi commented 6 years ago

This feels a little crazy to me if you can know it statically :).

The reasoning behind this was exactly that: It can be determined dynamically, instead of writing out identities like the ones you suggest for practically every ufunc (which would add a LOT of overhead). Sure, there's a performance cost there, but it's automated. :-)

Maintainability and performance have to be balanced. We would need to maintain a huge list of such identities, and still won't hit every case.

If we were to adopt this approach, we would need to list out a lot of identities and come up with ways to check them while deciding on sparse vs dense broadcasting. In this case, we would need to check for absence of nan and inf. I guess some sort of automatic checker could be made.

For users, it would also be harder to define operations. Rather than defining a simple callable, they would need to go through the process of implementing Operation, defining identities.

The extension to fill values is simple. You calculate something like out_fill_value = func(all_fill_values), then check whether func(some_fill_values, some_other_values) == out_fill_value.
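A sketch of that dynamic check (a single probe value is only a heuristic, and NaN fill values would need special handling; the helper names are made up):

import numpy as np

def output_fill_value(func, *fill_values):
    # the output's fill value is func applied to all of the inputs' fill values
    return func(*fill_values)

def can_skip_partial_match(func, fill_value, sample=np.float64(1.5)):
    # does func(fill_value, x) stay equal to the output fill value when one
    # operand holds a stored (non-fill) value?
    return func(fill_value, sample) == output_fill_value(func, fill_value, fill_value)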

Implementing Operation feels a bit out of scope to me for this library, that just aims to mimic Numpy for sparse arrays.

To be clear: My approach is only exponential in the number of inputs, but linear in the number of output nonzeros (Well, n log n right now, but that can be fixed with Cython).

shoyer commented 6 years ago

@hameerabbasi I can see the merits in dynamically discovering which operations preserve zeros. Exponential in the number of inputs should be perfectly fine, so I don't really have an objection here :).

Some of the edge cases in the algorithm you've proposed don't quite make sense to me yet, but maybe it will be clear when you write it in code. In particular, how would you handle a three argument function f(x, y, z) when only one argument is 0? To evaluate f(x, y, 0), you need to know already how to "broadcast" x and y against each other. I guess you already evaluated f(x, 0, 0) and f(0, y, 0), so you could build up this knowledge in a loop (like dynamic programming).

hameerabbasi commented 6 years ago

@shoyer In the case f(x, y, 0), we do the following:

In general, it's more like this:

You can see the actual code here, in particular this function which does the actual "freezing". I don't recommend diving into the other broadcasting code, as it's a bit of a mess and needs some spring cleaning and a bit of algorithmic fixing.

It's not final (I hope to make it so this weekend), but you'll get the gist of how I filter out scalars by "freezing" them into the function before running the actual algorithm.

hameerabbasi commented 6 years ago

If you're asking how we "match" for a particular combination of zero/nonzero inputs, we:

At the end, we simply concatenate all possible data/coords and return the result.

hameerabbasi commented 6 years ago

I've implemented N-ary broadcasting following the algorithm here over in #98. Feedback welcome. :)

mrocklin commented 6 years ago

OK, I'm hearing the following thoughts (paraphrasing):

These both seem like reasonable positions to me. I wonder if we can accomplish both by having an optional parameter to elemwise that specifies which inputs propagate zeros:

elemwise(operation.mul, x, y, z, zero_prop=[0, 1, 2])
or
elemwise(operation.add, x, y, z, zero_prop=[])

With perhaps values like True -> range(len(args)) and False -> [] (also, please don't use the name zero_prop; there are certainly better options).
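A tiny sketch of normalizing that shorthand (zero_prop is only the placeholder name used above):

def normalize_zero_prop(zero_prop, num_args):
    # True means every input propagates zeros; False/None means none do
    if zero_prop is True:
        return list(range(num_args))
    if not zero_prop:
        return []
    return list(zero_prop)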

This is only an example of such an argument. There are many more complex situations that might arise, but I suspect that this handles the common case.

I do think that some static information like this is likely to be desirable when we start profiling. Also, I suspect that most use of elemwise will be by expert users (like us when we're using elemwise internally), and so providing this information might be easy.

mrocklin commented 6 years ago

Alternatively, what is our current support for operations that have non-zero return values when one of the inputs is zero? My guess was that we erred. If this is the case then can we just ignore the situation where we produce values on zero-valued inputs? Am I understanding things correctly?

hameerabbasi commented 6 years ago

I do think that some static information like this is likely to be desirable when we start profiling. Also, I suspect that most use of elemwise will be by expert users (like us when we're using elemwise internally), and so providing this information might be easy.

I agree 100% with this. But (at the risk of repeating myself), this ignores the NaN/inf problem with mul. While doing this, we would have to filter/check for those, or risk inconsistency in these cases as compared to Numpy. Also, it doesn't work for arbitrary fill values. Also that refactor I did in #84 will need to be partially reversed.

Alternativley, what is our current support for operations that have non-zero return values when one of the inputs is zero? My guess was that we erred. If this is the case then can we just ignore the situation where we produce values on zero-valued inputs? Am I understanding things correctly?

Currently, we err iff func(all_zeros) will produce a nonzero. If zeros/nonzeros are mixed (like in add), we don't err but compute func(zero, other.data) etc. Here, I've just generalized that behavior to multiple operands. The plan is to make it so that func(all_fillvalues) becomes a fill-value for the return array.

mrocklin commented 6 years ago

@shoyer I don't suppose you happen to have a TestCase for what an XArray container class should support, do you? If so it would be good to import that into our test suite to track progress between the projects in the future.

shoyer commented 6 years ago

No, not yet, unfortunately. Historically, the API for xarray container classes has evolved in a dynamic way based on what we needed for numpy + dask. We'll need to do some internal clean-up before we can support sparse arrays well.


hameerabbasi commented 6 years ago

Feel free to ping me or involve me in the project. :-)

hameerabbasi commented 6 years ago

Three-argument version of where and NaN-skipping aggregations are in!

mrocklin commented 6 years ago

@shoyer what is the right thing for folks to push on here if they have time? The multipledispatch VarArgs conversation? Internal fixes to XArray? A test suite?

hameerabbasi commented 6 years ago

My intuition would be the multipledispatch VarArgs conversation. We should do things properly. 😄 I don't mind it taking time, it'll save time and headache down the road. Of course, if the consensus is on another solution, I'd happily work on that, too.

shoyer commented 6 years ago

VarArgs will be needed for finishing this in xarray, but really only for a few functions (concatenate and stack). So I think we could simultaneously work on:

  1. Getting VarArgs of some form into multipledispatch.
  2. Porting as much of the existing xarray dispatch machinery for NumPy and Dask to use multipledispatch. This has at least two components: the wrappers in xarray/core/duck_array_ops.py, and implementing __array_ufunc__ (https://github.com/pydata/xarray/issues/1617).

hameerabbasi commented 6 years ago

Since the final review deadline for the papers in SciPy 2018 is June 18, I thought I'd revive this thread. I'd love to show off Dask/Sparse and XArray/Sparse integration in the paper and at SciPy 2018.

cc @mrocklin @shoyer If you need anything done on my end, or anything at all you think I'm capable of that'd speed this along (Including contributing to XArray or Dask, I'd love to dip my feet in), don't hesitate to say so.

I looked into multipledispatch but VarArgs seems like something better done by someone else. arrayish is already up, if any changes are required there, let me know.

mrocklin commented 6 years ago

Experiments with dask.array and sparse would be welcome. I think that this should work today (we do include some tests for this in dask.array's CI). It might be interesting to profile a few workflows. I'll think a bit about what these might be.

For XArray my guess is that the right thing to do is to follow @shoyer's two paths listed just above. Looking at duck_array_ops.py and how that functionality gets used within XArray seems like a sensible place to start, though @shoyer may have other suggestions (I don't know this codebase as well). My guess is that there is some work to get through here, but probably nothing that is too controversial.

mrocklin commented 6 years ago

It also looks like the conversation in https://github.com/mrocklin/multipledispatch/issues/72 ended in a nice place. Again, it seems like consensus was reached and now this requires some hopefully straightforward work.

hameerabbasi commented 6 years ago

I’ve implemented outer indexing for a single array, but not multiple. When NumPy implements oindex and vindex, we’ll follow suit. I somehow don’t feel like implementing every single fancy index case with its weird behaviour.

khaeru commented 3 years ago

Not sure if this is the right place to mention this, but it doesn't appear elsewhere in issues/PRs on the repo.

I ran into the following trying to call .shift() on a sparse.COO-backed DataArray:

../../.local/lib/python3.8/site-packages/xarray/core/dataarray.py:3077: in shift
    variable = self.variable.shift(
../../.local/lib/python3.8/site-packages/xarray/core/variable.py:1205: in shift
    result = result._shift_one_dim(dim, count, fill_value=fill_value)
../../.local/lib/python3.8/site-packages/xarray/core/variable.py:1166: in _shift_one_dim
    data = duck_array_ops.pad(
../../.local/lib/python3.8/site-packages/xarray/core/duck_array_ops.py:56: in f
    return wrapped(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = (<COO: shape=(2, 5), dtype=float64, nnz=4, fill_value=nan>, [(1, 0), (0, 0)]), kwargs = {'constant_values': nan, 'mode': 'constant'}
relevant_args = (<COO: shape=(2, 5), dtype=float64, nnz=4, fill_value=nan>,)

>   ???
E   TypeError: no implementation found for 'numpy.pad' on types that implement __array_function__: [<class 'sparse._coo.core.COO'>]

<__array_function__ internals>:5: TypeError

There were other similar cases. So far I've noticed:

hameerabbasi commented 3 years ago

@khaeru Do you have use-cases for these for sparse arrays? If so, please raise separate issues for each missing function.

In a lot of cases, one can’t achieve better performance on these than dense arrays.

khaeru commented 3 years ago

@hameerabbasi the concern is less performance than being able to use the DataArray API as it is defined without a bunch of extra wrapper code—what I understood as the topic of this issue, given the title.

My current mitigation is to have a subclass that allows some of these calls by converting to a dense/numpy.ndarray-backed DataArray, performing the operation, then re-converting to sparse.COO-backed. I know that subclassing the xarray classes is not encouraged; I would be very happy not to have to do that :)
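For reference, a minimal sketch of that densify / operate / re-sparsify wrapper for the .shift() case; the helper name and the NaN fill value are assumptions, not part of either library's API:

import numpy as np
import sparse
import xarray as xr

def shift_sparse(da: xr.DataArray, **shifts) -> xr.DataArray:
    dense = da.copy(data=da.data.todense())   # numpy-backed copy
    shifted = dense.shift(**shifts)           # the regular DataArray API now works
    # shift pads with NaN, so keep NaN as the fill value to stay sparse
    return shifted.copy(data=sparse.COO.from_numpy(shifted.data, fill_value=np.nan))

Something like shift_sparse(da, x=1) then returns a COO-backed DataArray again.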

I'll do as you suggest and open an issue for one example.