
"Sparsified" mathematical operations #1369

Open ezyang opened 7 years ago

ezyang commented 7 years ago

I had a discussion with @soumith about this on Friday, but I wanted to record things down here so that I don't forget.

Background. Imagine that you are @martinraison and you are implementing sparse Adagrad. You end up writing this code:

            if p.grad.data.is_sparse:
                grad_indices = grad.indices()
                grad_values = grad.values()
                size = torch.Size([x for x in grad.size()])

                def make_sparse(values):
                    constructor = type(p.grad.data)
                    if grad_indices.dim() == 0 or values.dim() == 0:
                        return constructor()
                    return constructor(grad_indices, values, size)
                state['sum'].add_(make_sparse(grad_values.pow(2)))
                std = state['sum'].sparse_mask(grad)
                std_values = std.values().sqrt_().add_(1e-10)
                p.data.add_(-clr, make_sparse(grad_values / std_values))
            else:
                state['sum'].addcmul_(1, grad, grad)
                std = state['sum'].sqrt().add_(1e-10)
                p.data.addcdiv_(-clr, grad, std)

So, as you can see, the sparse version of this code is four times as long as the non-sparse version. Why? There are a few contributory factors:

  1. addcmul_ doesn't understand how to handle sparse arguments, so we have to compute the sparse tensor we want to add, and then add it in. In the code above, this is done by projecting out the underlying values, performing the necessary operations, and then constructing a new sparse tensor (using make_sparse).

  2. The line state['sum'].sqrt().add_(1e-10) works in both the sparse and dense settings, because state['sum'] is always a dense tensor. However, we can see that we will subsequently use this to divide the gradient, which means we don't actually need to compute all of the entries, only those which will divide non-zero coefficients of gradient. Once again, this is done by pulling out the values you need, doing the necessary ops, and then turning it back into a sparse vector.

OK, so what can we do about this?

Proposal 1: Nothing is wrong, carry on. The code above is wordy, but some of it can be simplified by clone()'ing the sparse tensor, and then performing the updates on values in place. (@soumith suggested this to me.) For example, it's an easy matter to rewrite state['sum'].add_(make_sparse(grad_values.pow(2))) to:

s = grad.clone()
s.values().pow_(2)
state['sum'].add_(s)

This avoids making use of the make_sparse function.

There are some downsides to this approach:

  1. You have to be careful to make sure your sparse tensor is coalesced; an uncoalesced tensor may have multiple entries for an index in its values array, so naively applying an operation to the values tensor of an uncoalesced tensor gives the wrong result. Indeed, the adagrad code initially got this wrong, and it was fixed in 1018b238acf2ecfc836484f3153c2864e9f4e963. (A short example of this pitfall follows this list.)

  2. There are two minor performance hits to doing things this way. By running grad.clone() you are forced to make a copy of values and indexes. The values copy is wasted, because you're going to immediately do an inplace update; you should have just directly written in the updated values. Depending on what you do to the sparse tensor, the index copy may also be wasted, because you may not ever actually change the shape of the tensor (as far as I can tell, we don't have any copy on write.)

  3. I get really nervous when performing mutations on tensors which are part of the internal representations of others, e.g., as occurs in s.values().pow_(). Yes, it is pretty clear what the intent is in this code snippet, but I think when there is an indirection, it's easier to accidentally end up sharing state when you don't want to. Then eek!
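
Here is a minimal sketch of downside 1, using the sparse constructor and the values()/coalesce() accessors as they appear in this thread (the tensor contents are made up): squaring the values of an uncoalesced tensor is not the same as squaring the coalesced tensor.

import torch

# Hypothetical toy tensor: index 0 appears twice, so the tensor is uncoalesced.
i = torch.LongTensor([[0, 0]])
v = torch.FloatTensor([1, 2])
t = torch.sparse.FloatTensor(i, v, torch.Size([3]))

a = t.clone()
a.values().pow_(2)             # squares the duplicate entries separately: 1, 4
print(a.coalesce().values())   # entry at index 0 becomes 1^2 + 2^2 = 5 (wrong)

b = t.coalesce()
b.values().pow_(2)             # coalesce first: the single entry 3 becomes 9
print(b.values())              # entry at index 0 is 9 (what we meant)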

Proposal 2: Add lots of methods. If we want to get users to avoid mucking about with values() manually, the next obvious thing to do is add methods to the sparse tensor to let them do the things they want to do. This is a clear win for well-defined, easy-to-compute operations on sparse tensors like pow(scalar) and sqrt. We could implement them in Python or hand-write kernels for them (my personal inclination is to implement them in Python, since that reduces the TCB, but @soumith tells me there are good reasons to implement kernels for each of them.)
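
As a very rough sketch (the helper name sparse_sqrt is hypothetical; it just follows the constructor pattern that make_sparse uses above), such a Python-level method could coalesce and then delegate the pointwise op to the values tensor:

import torch

def sparse_sqrt(t):
    # Coalesce first so duplicate entries are merged before the pointwise op,
    # then rebuild a sparse tensor over the same indices. sqrt(0) == 0, so the
    # support (and hence the sparsity) is preserved.
    t = t.coalesce()
    size = torch.Size([x for x in t.size()])
    return type(t)(t.indices(), t.values().sqrt(), size)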

However, there are some pitfalls.

  1. Consider std_values = std.values().sqrt_().add_(1e-10). It is tempting to say that we should replace this line with std.sqrt_().add_(1e-10), implementing appropriate sparse versions of sqrt and add. But now there is a problem: the "sparse" version of add... isn't an add at all! If you add a scalar to a sparse tensor, the result isn't going to be sparse anymore! The operation you really wanted here was "add scalar if we actually have an entry for it in the sparse vector", which is a kind of weird operation, and in any case really shouldn't be called add. (To add insult to injury, it's not even "add scalar if entry is nonzero" because, prior to coalescing, we might very well have redundant zero entries in our sparse tensor!) So really, if we add a method for this operation, we should not call it add; perhaps we should name it sadd ("sparse add"). This highlights a merit of Proposal 1, which is that we didn't have to come up with new names for the operations. (A sketch of such a sadd follows this list.)

  2. Let's look more closely at grad_values / std_values. We can do this division because we know that the indexes of grad and std are the same. If we are defining a (sparse!) division between two sparse tensors, we do not necessarily know if this is the case: the sparse tensors might be on completely different indexes. (Actually, I don't even know what the intended semantics of division should be in this case.) Nor can I think of a cheap way of testing that this is the case, since if you were using clone() object equality no longer holds between the indexes.
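
For concreteness, here is a rough sketch of what a sadd might look like (the name and implementation are hypothetical, again reusing the make_sparse-style constructor):

import torch

def sadd(t, scalar):
    # Add the scalar only to entries the tensor actually stores. Coalesce
    # first, so redundant duplicate entries are merged before the scalar is
    # added (otherwise each duplicate would receive its own copy of it).
    t = t.coalesce()
    size = torch.Size([x for x in t.size()])
    return type(t)(t.indices(), t.values().add(scalar), size)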

Proposal 3: A lifting operator (or two). One final possibility is to define a lifting function, which takes a method invocation on dense tensors, and lifts it into a sparse variant. For example, we might rewrite std_values = std.values().sqrt_().add_(1e-10) as std.lift_sparse(lambda x: x.sqrt_().add_(1e-10)). Similarly, torch.lift_sparse(torch.div, std, grad) would take division on dense tensors and lift it to apply on sparse vectors, handling the necessary unwrapping and wrapping, and ensuring that the inputs are coalesced.
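
A rough sketch of such a lifting operator (lift_sparse here is hypothetical, and the support check is only illustrative):

import torch

def lift_sparse(fn, *tensors):
    # Coalesce the operands, check that they share the same support, apply the
    # dense op to the values, and wrap the result back up as a sparse tensor.
    tensors = [t.coalesce() for t in tensors]
    indices = tensors[0].indices()
    for other in tensors[1:]:
        if not indices.equal(other.indices()):
            raise ValueError("sparse operands do not share the same indices")
    values = fn(*[t.values() for t in tensors])
    size = torch.Size([x for x in tensors[0].size()])
    return type(tensors[0])(indices, values, size)

With something like this, lift_sparse(torch.div, std, grad) divides the values elementwise once the supports are known to match.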

@soumith didn't really like this idea, because when you provide an operator like this, users may end up using it in ways that you don't expect, leading to strange errors on their end. I know lifting operators like this work quite nicely in Haskell, but it is certainly true that Python is not Haskell.

Thanks for reading. Interested to hear your thoughts.

martinraison commented 7 years ago

Thanks @ezyang for opening the discussion.

Proposal 1: I'm wondering if it would be ok to automatically coalesce sparse tensors when calling indices() or values() in Python. This way the user always has a consistent view of the tensor, and doing s.values().pow_() is not that much of an issue anymore (note: __repr__ also coalesces sparse tensors for that reason). As long as those functions are not called, tensors would still remain uncoalesced for efficiency reasons: for example in cadd when aggregating gradients. Oh and you're right that my adagrad code can be simplified, not sure why I did it this way in the end :)

Proposal 2: you're right that we have to be careful with sparse operations, because many of them don't really make sense (like addition with a scalar), or are ill-defined. This is why I chose to rely on the existing well-defined operations as much as possible (Proposal 1), but if some patterns come up often we can add them to the library.

Proposal 3: how would you solve the grad_values / std_values problem? If we wanted to do std.lift_sparse2(lambda x, y: x / y) we'd need to "upgrade" the values tensors of x and y so that they have the same support.

It looks like #1285 can be used as a good reference for our needs in the immediate future.

soumith commented 7 years ago

Coalescing them on access to indices or values sounds quite reasonable.

ezyang commented 7 years ago

@martinraison @soumith Are you suggesting an in-place coalesce? @adamlerer considered this in #1302 but he came down on the side of not doing in-place coalesce. The key situation he was worried about was coalescing-related bugs which went away when you called print on a tensor, since print calls indices/values (which would induce a coalesce). I suppose we could add raw_indices/raw_values, which don't call coalesce, and make sure debugging operators call those instead.

However, even if indices/values always coalesce, a user can still end up in a situation where they have a pointer to a tensor which is not coalesced. The simplest way is to grab t.values(), and then do an in-place cadd on t. Because t.values() refers to the same underlying tensor as t, when t becomes uncoalesced, the change becomes visible. I suppose we could solve this problem by changing the semantics of values/indices to clone (or deciding we don't care) but I think it's worth thinking about.

Proposal 3: how would you solve the grad_values / std_values problem? If we wanted to do std.lift_sparse2(lambda x, y: x / y) we'd need to "upgrade" the values tensors of x and y so that they have the same support.

Yeah, it's the same situation as in Proposal 2. I think we don't actually want to upgrade the values tensors so they have the same support; we just want to check that they have the same support and bug out if they don't, which hopefully is cheaper. Having copy-on-write index arrays would go a long way towards this. Perhaps we should consider adding those?

ebetica commented 7 years ago

For Proposal 2, wouldn't both be solved by simply overloading sparse div to be x / (y + eps) in all cases? That way we would not add new names, and it would simplify the code in one go. The only difference is that the user has to remember this detail (but it seems like this would be the only sane way to do element-wise division anyway). pow can also be implemented the same way.

The sparse route now becomes:

if p.grad.data.is_sparse:
  grad.contiguous()  # or coalesce
  state['sum'].add_(grad.pow(2))  # This just pow's on the values
  std = state['sum'].sparse_mask(grad)
  p.data.add_(-clr, grad.cdiv(std.sqrt_(), 1e-10))

I think the extra copy is sort of "unavoidable", since it's the same as trying to implement kernel fusion in a sense.

apaszke commented 7 years ago

I don't think adding a lift operator is a good idea. Lifting works well in Haskell because the compiler can inline the function (calling a Python function for each element would be super costly; see map_), and because it has a strong type system, so it can check that the function has the right signature. It seems to me that it's going to be very easy to give it an incompatible one.

martinraison commented 7 years ago

@ezyang I didn't see #1302 - cool stuff! If we're changing coalesce to operate out-of-place everywhere then sure, we probably don't want to do anything behind the scenes in indices and values, that would defeat the purpose.

@ebetica my concern is that mathematically speaking, x / y is infinite or nan when y has zero values. I feel that changing that semantic would be confusing, and doing x.values().div_(y.values()) when you know that your values are compatible doesn't seem too bad.

martinraison commented 7 years ago

@apaszke I don't think the suggestion was to make a call for each element, but only to check that the values tensors have the same support before feeding them to the lambda (@ezyang correct me if I'm wrong). But it seems like an expensive operation with a small benefit, so I don't really think it is the best idea either.

apaszke commented 7 years ago

@martinraison ah, I must have misunderstood something; in fact that seems like a nice solution. I think a mix of 2 and 3 would be ok: implement all methods that retain sparsity (e.g. sqrt_(), but not add_), and use lifting in situations like the one in Adagrad.

I'm also curious how much having uncoalesced tensors wins us 😕

adamlerer commented 7 years ago

Here are my thoughts:

  1. torch.sparse.XXXTensor should accept empty index/value tensors, so make_sparse should not be necessary.
  2. We should add sparse versions of unary pointwise ops, that delegate to values().
  3. Performing * or / directly on the values is just an optimization to avoid comparing indices; it would work even if you didn't do that. Maybe we need an unsafe direct divide (assumes that indices are equal), but I don't think so.
  4. The add_ call here directly on the values is a special case; it only makes sense here because we're then going to multiply through by grad, so it's okay to only add to the non-zero entries. I think we should leave this to the user to sort out.
     4a. I think the add_ is only here to prevent divide by zero, so I don't think we need it in the sparse case.
  5. I'm not sure this kernel is safe unless you call p.grad.coalesce first (yikes!). Does sparse_mask have a contract about whether it will coalesce its indices?

Here's how the code would look with these minor changes (1 and 2). Still a bit longer, but mostly just because of the "unusual" parts (optimized add_ and div) that should be done by hand:
if p.grad.data.is_sparse:
    grad = p.grad
    state['sum'].add_(grad.pow(2))
    std_values = state['sum'].sparse_mask(grad).values().sqrt_()
    # warning! this is not safe because `sparse_mask` or `sqrt` may have coalesced the gradient!
    p.data.add_(-clr, grad.data.new(grad.indices(), grad.values() / std_values))
else:
    state['sum'].addcmul_(1, grad, grad)
    std = state['sum'].sqrt().add_(1e-10)
    p.data.addcdiv_(-clr, grad, std)

P.S. You might want to measure the performance of adagrad with my benchmark script (https://gist.github.com/adamlerer/4159ddb4bd19e87de1b4bb3215a1db6f); if it does coalescing under the hood it could be much slower.

adamlerer commented 7 years ago

Re @martinraison: #1302 makes coalesce out-of-place, and so removes the coalesce from __repr__. coalesce is quite slow (it requires a host sync), so if you call it before returning indices, users can't do any efficient operations directly on indices/values (as shown here) without paying that performance penalty.

I wonder if we should look deeply into making coalesce really fast so that we can guarantee all tensors to always be coalesced without a performance hit. This will require some work to investigate and might not be possible.

martinraison commented 7 years ago

@adamlerer the idea of storing nnz on the GPU crossed my mind, but it seems like most operations would need to synchronously fetch it anyway. For example, if you do a cadd, you'll want to allocate a new values tensor with first dimension nnz1 + nnz2, which only the GPU knows about. So either we allocate this tensor asynchronously from the device (does that even make sense? I have no idea), or we have to synchronously fetch nnz1 and nnz2 before doing anything else.
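
To make the allocation point concrete, here is a sketch (not the real kernel, and the helper name is made up) of an uncoalesced cadd that simply concatenates the operands, which is why the output values tensor has first dimension nnz1 + nnz2:

import torch

def uncoalesced_add(a, b):
    # Concatenate along the nnz dimension and defer deduplication to a later
    # coalesce; the output needs room for nnz1 + nnz2 entries.
    indices = torch.cat([a.indices(), b.indices()], dim=1)
    values = torch.cat([a.values(), b.values()], dim=0)
    size = torch.Size([x for x in a.size()])
    return type(a)(indices, values, size)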

ebetica commented 7 years ago

@martinraison I think x.div(y) with y sparse should just not work, but you could add x.div(y, 1e-10) for the non-nan style of division. In fact, you could do it for dense tensors too, since this seems like quite a common operation.
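
For the dense case, the suggestion amounts to something like this (div_eps is a hypothetical standalone helper, not an existing method):

def div_eps(x, y, eps=1e-10):
    # "Non-nan" division: add eps to the denominator so zero entries of y
    # don't produce inf or nan.
    return x / (y + eps)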

ezyang commented 7 years ago

@adamlerer

Does sparse_mask have a contract about whether it will coalesce its indices?

It does not, and in fact does not work correctly in this case:

>>> z
SparseFloatTensor of size 2 with indices:

 0  0
[torch.LongTensor of size 1x2]
and values:

 3
 2
[torch.FloatTensor of size 2]

>>> y

 3
 2
[torch.FloatTensor of size 2]

>>> y.sparse_mask(z)
SparseFloatTensor of size 2 with indices:

 0  0
[torch.LongTensor of size 1x2]
and values:

 3
 3
[torch.FloatTensor of size 2]

adamlerer commented 7 years ago

Yikes! Can you do the following to get this in working order:

(unless you have another plan for rationalizing all this stuff)

ezyang commented 7 years ago

Funnily enough, the current behavior of sparse_mask is documented in the test suite (in the hard-coded input/output test cases), which I noticed when I "fixed" this bug. That led to a second problem: the output of sparse_mask isn't really well defined if we internally coalesce the indices before doing the mask, because usually sparse_mask is intended to be used in conjunction with s.indices(), which would still be in uncoalesced form.

It seems to me a better solution is to assert, as a precondition, that the input to sparse_mask is coalesced, and raise an error if it is not.
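
A minimal sketch of that precondition (the wrapper name is hypothetical; only is_coalesced and sparse_mask are existing methods):

def sparse_mask_checked(dense, sparse):
    # Refuse to mask against an uncoalesced sparse tensor instead of silently
    # producing wrong values, as in the example above.
    if not sparse.is_coalesced():
        raise RuntimeError("sparse_mask requires a coalesced sparse tensor")
    return dense.sparse_mask(sparse)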

EDIT: But this doesn't work either, because is_coalesced is not a very accurate indicator of coalescedness:

>>> x = torch.sparse.FloatTensor(i,v)
>>> x
FloatTensor with indices:

 3  5
[torch.LongTensor of size 1x2]
and values:

 0
 2
[torch.FloatTensor of size 2]

>>> x.is_coalesced()
False

>>> y = x.coalesce()
>>> y.indices().fill_(0)

 0  0
[torch.LongTensor of size 1x2]

>>> y.is_coalesced()
True
>>> y
FloatTensor with indices:

 0  0
[torch.LongTensor of size 1x2]
and values:

 0
 2
[torch.FloatTensor of size 2]