Open ezyang opened 7 years ago
Thanks @ezyang for opening the discussion.
Proposal 1: I'm wondering if it would be OK to automatically coalesce sparse tensors when calling `indices()` or `values()` in Python. This way the user always has a consistent view of the tensor, and doing `s.values().pow_()` is not that much of an issue anymore (note: `__repr__` also coalesces sparse tensors for that reason). As long as those functions are not called, tensors would still remain uncoalesced for efficiency reasons: for example in `cadd` when aggregating gradients. Oh, and you're right that my adagrad code can be simplified; not sure why I did it this way in the end :)
Proposal 2: you're right that we have to be careful with sparse operations, because many of them don't really make sense (like addition with a scalar), or are ill-defined. This is why I chose to rely on the existing well-defined operations as much as possible (Proposal 1), but if some patterns come up often we can add them to the library.
Proposal 3: how would you solve the `grad_values / std_values` problem? If we wanted to do `std.lift_sparse2(lambda x, y: x / y)`, we'd need to "upgrade" the values tensors of `x` and `y` so that they have the same support.
It looks like #1285 can be used as a good reference for our needs in the immediate future. Coalescing them on access to `indices` or `values` sounds quite reasonable.
@martinraison @soumith Are you suggesting an in-place coalesce? @adamlerer considered this in #1302 but he came down on the side of not doing in-place coalesce. The key situation he was worried about was coalescing-related bugs which went away when you called print on a tensor, since print calls indices/values (which would induce a coalesce). I suppose we could add a "raw_indices/raw_values" which don't call coalesce and make sure debugging operators call those instead.
However, even if indices/values always coalesce, a user can still end up in a situation where they have a pointer to a tensor which is not coalesced. The simplest way is to grab `t.values()`, and then do an in-place `cadd` on `t`. Because `t.values()` refers to the same underlying tensor as `t`, when `t` becomes uncoalesced, the change becomes visible. I suppose we could solve this problem by changing the semantics of values/indices to clone (or deciding we don't care), but I think it's worth thinking about.
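To make the aliasing hazard concrete, here is a minimal stdlib sketch; `ToySparse` and its methods are hypothetical stand-ins for the real torch API, modeling only the storage-sharing behavior described above:

```python
# Toy model of the aliasing hazard: values() hands back the live storage,
# so a later in-place accumulation that leaves the tensor uncoalesced is
# visible through the earlier reference. ToySparse is hypothetical.
class ToySparse:
    def __init__(self, indices, values):
        self._indices = list(indices)  # flat indices for a 1-d tensor
        self._values = list(values)

    def values(self):
        # Returns the underlying storage, like t.values().
        return self._values

    def add_(self, other):
        # cadd-style accumulation: just concatenates entries, leaving
        # the tensor uncoalesced (duplicate indices allowed).
        self._indices += other._indices
        self._values += other._values
        return self

t = ToySparse([0, 2], [1.0, 2.0])
v = t.values()                  # grab a reference to the values
t.add_(ToySparse([0], [5.0]))   # in-place add makes t uncoalesced
print(v)                        # the earlier reference sees the new entry
```

Making `values()` clone would break this sharing, at the cost of the extra copy discussed below.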
> Proposal 3: how would you solve the `grad_values / std_values` problem? If we wanted to do `std.lift_sparse2(lambda x, y: x / y)` we'd need to "upgrade" the values tensors of `x` and `y` so that they have the same support.
Yeah, it's the same situation as in Proposal 2. I think we don't actually want to upgrade the values tensors so they have the same support; we just want to check that they have the same support and bug out if they don't, which hopefully is cheaper. Having copy-on-write index arrays would go a long way towards this. Perhaps we should consider adding those?
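A same-support check could indeed be cheap: compare the two index arrays, with a pointer comparison as the fast path once copy-on-write index storage exists. A stdlib sketch, with `same_support` as a hypothetical helper name:

```python
# Hypothetical helper: two coalesced sparse tensors have the same support
# iff their (sorted, deduplicated) index arrays are identical. With
# copy-on-write index storage, the common case becomes a pointer
# comparison before falling back to an element-wise check.
def same_support(indices_a, indices_b):
    if indices_a is indices_b:      # cheap path: shared (COW) storage
        return True
    return indices_a == indices_b   # element-wise fallback

a = [0, 3, 7]
b = [0, 3, 7]
c = [0, 3, 8]
print(same_support(a, a))  # True (pointer-equal)
print(same_support(a, b))  # True (element-wise)
print(same_support(a, c))  # False
```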
For proposal 2, wouldn't both be solved by simply overloading sparse `div` to be `x / (y + eps)` in all cases? This way we would not add new names, and it would simplify the code in one go. The only difference is that the user has to remember this detail (but it seems like this would be the only sane way to do element-wise division anyway). `pow` can also be implemented the same way.
The sparse route now becomes:

```python
if p.grad.data.is_sparse:
    grad.contiguous()  # or coalesce
    state['sum'].add_(grad.pow(2))  # this just pow's on the values
    std = state['sum'].sparse_mask(grad)
    p.data.add_(-clr, grad.cdiv(std.sqrt_(), 1e-10))
```
I think the extra copy is sort of "unavoidable", since it's the same as trying to implement kernel fusion in a sense.
I don't think adding a lift operator is a good idea. They work well in Haskell because the compiler can inline the function (calling a Python function for each element will be super costly; see `map_`), and because Haskell has a strong type system, so it can check that the function has the right signature. It seems to me that it's going to be very easy to give it an incompatible one.
@ezyang I didn't see #1302, cool stuff! If we're changing `coalesce` to operate out-of-place everywhere, then sure, we probably don't want to do anything behind the scenes in `indices` and `values`; that would defeat the purpose.
@ebetica my concern is that mathematically speaking, `x / y` is infinite or NaN when `y` has zero values. I feel that changing that semantic would be confusing, and doing `x.values().div_(y.values())` when you know that your values are compatible doesn't seem too bad.
@apaszke I don't think the suggestion was to make a call for each element, but only to check that the values tensors have the same support before feeding them to the lambda (@ezyang correct me if I'm wrong). But it seems like an expensive operation with a small benefit, so I don't really think it is the best idea either.
@martinraison ah, I must have misunderstood something; in fact that seems like a nice solution. I think a mix of 2 and 3 would be OK: implement all methods that retain sparsity (e.g. `sqrt_()`, but not `add_`), and use lifting in situations like the one in Adagrad.
I'm also curious how much having uncoalesced tensors wins us 😕
Here are my thoughts:

1. `make_sparse` should not be necessary.
2. The `values().add_` call here directly on the values is a special case; it only makes sense here because we're then going to multiply through by grad, so it's okay to only add to the non-zero entries. I think we should leave this to the user to sort out.

4a. I think the `add_` is only here to prevent divide by zero, so I don't think we need it in the sparse case.

Here's how the code would look with these minor changes (1 and 2). Still a bit longer, but mostly just because of the "unusual" parts (optimized `add_` and `div`) that should be done by hand. It requires calling `p.grad.coalesce` first (yikes!). Does `sparse_mask` have a contract about whether it will coalesce its indices?

```python
if p.grad.data.is_sparse:
    grad = p.grad
    state['sum'].add_(grad.pow(2))
    std_values = state['sum'].sparse_mask(grad).values().sqrt_()
    # warning! this is not safe because `sparse_mask` or `sqrt` may have coalesced the gradient!
    p.data.add_(-clr, grad.data.new(grad.indices(), grad.values() / std_values))
else:
    state['sum'].addcmul_(1, grad, grad)
    std = state['sum'].sqrt().add_(1e-10)
    p.data.addcdiv_(-clr, grad, std)
```
P.S. You might want to measure the performance of adagrad with my benchmark script (https://gist.github.com/adamlerer/4159ddb4bd19e87de1b4bb3215a1db6f) ; if it does coalescing under the hood it could be much slower.
Re @martinraison: #1302 makes `coalesce` out-of-place, and so removes the coalesce from `__repr__`. `coalesce` is quite slow (it requires a host sync), so if you're going to call it before doing `indices`, it means that users can't do any efficient operations directly on indices/values (as shown here) without paying that performance penalty.
I wonder if we should look deeply into making coalesce really fast so that we can guarantee all tensors to always be coalesced without a performance hit. This will require some work to investigate and might not be possible.
@adamlerer the idea of storing `nnz` on the GPU crossed my mind, but it seems like most operations would need to synchronously fetch it anyway. For example, if you do a `cadd`, you'll want to allocate a new values tensor with first dimension `nnz1 + nnz2`, which only the GPU knows about. So either we can allocate this tensor asynchronously from the device (does that even make sense? I have no idea), or we have to synchronously fetch `nnz1` and `nnz2` first before doing anything else.
@martinraison I think `x.div(y)` with `y` sparse should just not work, but you can add `x.div(y, 1e-10)` for the non-NaN style of division. In fact, you could do it for dense tensors too, since this seems to be a fairly common operation.
@adamlerer

> Does `sparse_mask` have a contract about whether it will coalesce its indices?

It does not, and in fact does not work correctly in this case:
```
>>> z
SparseFloatTensor of size 2 with indices:
 0  0
[torch.LongTensor of size 1x2]
and values:
 3
 2
[torch.FloatTensor of size 2]
>>> y
 3
 2
[torch.FloatTensor of size 2]
>>> y.sparse_mask(z)
SparseFloatTensor of size 2 with indices:
 0  0
[torch.LongTensor of size 1x2]
and values:
 3
 3
[torch.FloatTensor of size 2]
```
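The failure mode is easy to model: a masked read gathers the dense values at each mask index, so an uncoalesced mask with a duplicated index gathers the same dense entry twice, losing the distinct values (3 and 2 above). A stdlib sketch, with `sparse_mask` here as a toy 1-d stand-in rather than the real implementation:

```python
# Toy 1-d sparse_mask: gather the dense values at each mask index.
# With an uncoalesced mask (index 0 appears twice), both output slots
# read dense[0], reproducing the duplicated values seen above.
def sparse_mask(dense, mask_indices):
    return [dense[i] for i in mask_indices]

dense = [3.0, 2.0]           # the dense tensor y
mask_indices = [0, 0]        # uncoalesced mask z: index 0 appears twice
print(sparse_mask(dense, mask_indices))  # [3.0, 3.0]
```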
Yikes! Can you do the following to get this in working order:

- fix `sparseMask` (both in CPU and GPU code)
- add `p.grad.data = p.grad.data.coalesce()` at the beginning of the sparse adagrad code, or that will be broken too

(unless you have another plan for rationalizing all this stuff)
Funnily enough, the current behavior of `sparse_mask` is documented in the test suite (in the hard-coded input/output test cases), which I noticed when I "fixed" this bug. Which led to a second problem: the output of `sparse_mask` isn't really well defined if we internally coalesce the indices before doing the mask, because usually `sparse_mask` is intended to be used in conjunction with `s.indices()`, which would still be in uncoalesced form.

It seems to me a better solution is to assert as a precondition that the input to `sparse_mask` must be coalesced, and raise an error if it is not.
EDIT: But this doesn't work either, because `is_coalesced` is not a very accurate indicator of coalescedness:
```
>>> x = torch.sparse.FloatTensor(i, v)
>>> x
FloatTensor with indices:
 3  5
[torch.LongTensor of size 1x2]
and values:
 0
 2
[torch.FloatTensor of size 2]
>>> x.is_coalesced()
False
>>> y = x.coalesce()
>>> y.indices().fill_(0)
 0  0
[torch.LongTensor of size 1x2]
>>> y.is_coalesced()
True
>>> y
FloatTensor with indices:
 0  0
[torch.LongTensor of size 1x2]
and values:
 0
 2
[torch.FloatTensor of size 2]
```
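The flag goes stale because it records the state at `coalesce()` time rather than rechecking the current contents, so an in-place mutation of the index storage is invisible to it. A stdlib sketch of that failure mode (the `ToySparse` class is a hypothetical model, not the torch implementation):

```python
# Toy model: is_coalesced() is a cached flag set by coalesce(), so an
# in-place mutation of the index storage afterwards leaves it stale.
class ToySparse:
    def __init__(self, indices, values):
        self.indices = indices
        self.values = values
        self._coalesced = False

    def coalesce(self):
        # Merge duplicate indices (summing values), then mark coalesced.
        merged = {}
        for i, v in zip(self.indices, self.values):
            merged[i] = merged.get(i, 0) + v
        out = ToySparse(sorted(merged), [merged[i] for i in sorted(merged)])
        out._coalesced = True
        return out

    def is_coalesced(self):
        return self._coalesced  # a cached claim, not a recheck

x = ToySparse([3, 5], [0.0, 2.0])
y = x.coalesce()
y.indices[:] = [0, 0]        # in-place mutation: duplicates reappear
print(y.is_coalesced())      # still True -- the flag is stale
```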
I had a discussion with @soumith about this on Friday, but I wanted to record things down here so that I don't forget.
Background. Imagine that you are @martinraison and you are implementing sparse Adagrad. You end up writing this code:
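The snippet itself did not survive in this copy. As a stand-in, here is a pure-stdlib model of the sparse-versus-dense Adagrad update, built only from the operations discussed below (`make_sparse`, `grad_values.pow(2)`, `sparse_mask`, `std_values`); sparse gradients are modeled as `{index: value}` dicts and dense tensors as lists, so this is an illustrative sketch, not the original torch code:

```python
import math

# Stdlib model of the sparse Adagrad step: sparse gradients are
# {index: value} dicts, dense tensors are plain lists. The names
# (make_sparse, grad_values, std_values) mirror the identifiers
# discussed below; this is illustrative, not the original torch code.
def adagrad_step(p, grad, state_sum, clr, eps=1e-10):
    if isinstance(grad, dict):                     # sparse branch
        grad_indices = list(grad)
        grad_values = [grad[i] for i in grad_indices]

        def make_sparse(values):
            return dict(zip(grad_indices, values))

        # state['sum'] += grad ** 2, only on the gradient's support
        for i, v in make_sparse([v * v for v in grad_values]).items():
            state_sum[i] += v
        # std = state['sum'].sparse_mask(grad), then sqrt and add eps
        std_values = [math.sqrt(state_sum[i]) + eps for i in grad_indices]
        for i, g, s in zip(grad_indices, grad_values, std_values):
            p[i] += -clr * g / s
    else:                                          # dense branch
        for i, g in enumerate(grad):
            state_sum[i] += g * g
            p[i] += -clr * g / (math.sqrt(state_sum[i]) + eps)

p = [1.0, 1.0]
state_sum = [0.0, 0.0]
adagrad_step(p, {0: 2.0}, state_sum, clr=0.1)
print(p)  # only index 0 was updated
```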
So, as you can see, the sparse version of this code is four times as long as the non-sparse version. Why? There are a few contributory factors:

1. `addcmul_` doesn't understand how to handle sparse arguments, so we have to compute the sparse tensor we want to add, and then add it in. In the code above, this is done by projecting out the underlying values, performing the necessary operations, and then constructing a new sparse tensor (using `make_sparse`).
2. The line `state['sum'].sqrt().add_(1e-10)` works in both the sparse and dense settings, because `state['sum']` is always a dense tensor. However, we can see that we will subsequently use this to divide the gradient, which means we don't actually need to compute all of the entries, only those which will divide non-zero coefficients of the gradient. Once again, this is done by pulling out the values you need, doing the necessary ops, and then turning it back into a sparse vector.

OK, so what can we do about this?
Proposal 1: Nothing is wrong, carry on. The code above is wordy, but some of it can be simplified by `clone()`'ing the sparse tensor, and then performing the updates on values in place. (@soumith suggested this to me.) For example, it's an easy matter to rewrite `state['sum'].add_(make_sparse(grad_values.pow(2)))` into an in-place update on the cloned values, which avoids making use of the `make_sparse` function.

There are some downsides to this approach:

- You have to be careful to make sure your sparse tensor is coalesced; an uncoalesced tensor may have multiple entries for an index in its values array, and if we naively apply an operation to the values tensor of an uncoalesced tensor, that is bad. Indeed, the adagrad code initially got this wrong, and it was fixed in 1018b238acf2ecfc836484f3153c2864e9f4e963.
- There are two minor performance hits to doing things this way. By running `grad.clone()` you are forced to make a copy of values and indices. The values copy is wasted, because you're going to immediately do an in-place update; you should have just directly written in the updated values. Depending on what you do to the sparse tensor, the index copy may also be wasted, because you may not ever actually change the shape of the tensor (as far as I can tell, we don't have any copy-on-write).
- I get really nervous when performing mutations on tensors which are part of the internal representations of others, e.g., as occurs in `s.values().pow_()`. Yes, it is pretty clear what the intent is in this code snippet, but I think when there is an indirection, it's easier to accidentally end up sharing state when you don't want to. Then eek!

Proposal 2: Add lots of methods. If we want to get users to avoid mucking about with `values()` manually, the next obvious thing to do is add methods to sparse tensors to let them do the things they want to do. This is a clear win for well-defined, easy-to-compute operations on sparse tensors like `pow(scalar)` and `sqrt`. We could implement them in Python or hand-write kernels for them (my personal inclination is to implement them in Python, since that reduces the TCB, but @soumith tells me there are good reasons to implement kernels for each of them).

However, there are some pitfalls. Consider `std_values = std.values().sqrt_().add_(1e-10)`. It is tempting to say that we should replace this line with `std.sqrt_().add_(1e-10)`, implementing appropriate sparse versions of sqrt and add. But now there is a problem: the "sparse" version of add... isn't an add at all! If you add a scalar to a sparse tensor, the result isn't going to be sparse anymore! The operation you really wanted here was "add scalar if we actually have an entry for it in the sparse vector", which is a kind of weird operation, and in any case really shouldn't be called add. (To add insult to injury, it's not even "add scalar if entry is nonzero" because, prior to coalescing, we might very well have redundant zero entries in our sparse tensor!) So really, if we add a method for this operation, we should not call it add; perhaps we should name it `sadd` ("sparse add"). This highlights a merit of Proposal 1, which is that we didn't have to come up with new names for the operations.

Let's look more closely at `grad_values / std_values`. We can do this division because we know that the indexes of grad and std are the same. If we are defining a (sparse!) division between two sparse tensors, we do not necessarily know if this is the case: the sparse tensors might be on completely different indexes. (Actually, I don't even know what the intended semantics of division should be in this case.) Nor can I think of a cheap way of testing that this is the case, since if you were using `clone()`, object equality no longer holds between the indexes.

Proposal 3: A lifting operator (or two). One final possibility is to define a lifting function, which takes a method invocation on dense tensors and lifts it into a sparse variant. For example, we might rewrite `std_values = std.values().sqrt_().add_(1e-10)` as `std.lift_sparse(lambda x: x.sqrt_().add_(1e-10))`. Similarly, `torch.lift_sparse(torch.div, std, grad)` would take division on dense tensors and lift it to apply on sparse vectors, handling the necessary unwrapping and wrapping, and ensuring that the inputs are coalesced. @soumith didn't really like this idea, because when you provide an operator like this, users may end up using it in ways that you don't expect, leading to strange errors on their end. I know lifting operators like this work quite nicely in Haskell, but it is certainly true that Python is not Haskell.
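For concreteness, here is a stdlib sketch of what such lifting operators might look like over toy sparse tensors represented as `(indices, values)` pairs; `lift_sparse` and `lift_sparse2` are hypothetical, following the semantics described above (apply the dense op to the values, and require a shared support for binary ops):

```python
import math

# Hypothetical lifting operators over toy sparse tensors, represented as
# (indices, values) tuples with indices assumed sorted and deduplicated
# (i.e. already coalesced).
def lift_sparse(fn, t):
    indices, values = t
    return (indices, fn(values))    # apply the dense op to the values

def lift_sparse2(fn, a, b):
    (ia, va), (ib, vb) = a, b
    if ia != ib:                    # binary ops require a shared support
        raise ValueError("operands have different sparsity patterns")
    return (ia, [fn(x, y) for x, y in zip(va, vb)])

std  = ([0, 3], [4.0, 9.0])
grad = ([0, 3], [2.0, 3.0])
print(lift_sparse(lambda vs: [math.sqrt(v) + 1e-10 for v in vs], std))
print(lift_sparse2(lambda x, y: x / y, grad, std))
```

The explicit support check in `lift_sparse2` is exactly the cost discussed earlier: cheap if index storage can be compared by pointer, an element-wise scan otherwise.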
Thanks for reading. Interested to hear your thoughts.