Open maxwellzh opened 3 years ago
Hello @maxwellzh,
Thank you for posting your question. Could you please detail the expected output of the operation you want to perform?
Thank you, Christian
I also posted on the PyTorch forum, which might be clearer. https://discuss.pytorch.org/t/efficient-way-to-multiply-two-tensor-with-different-lengths/132029
Oh I see, so your code snippet is
A = tensor([1., 2., 3., 1., 2., 1., 2., 3., 4.])
# len_A indicates the real length of each element in A
len_A = [3, 2, 4]
weight_A = tensor([0.2, 0.3, 0.5])
# how can I efficiently make the calculation without a loop
# 3, 5, 9 are computed with len_A
A[:3] *= weight_A[0]
A[3:5] *= weight_A[1]
A[5:9] *= weight_A[2]
So with NestedTensor you could achieve this using the constructor via
result_nt = (nestedtensor.nested_tensor([A[:3], A[3:5], A[5:9]]) * weight_A)
result = result_nt.flatten() # possibly need to use a different operator
However, this would incur a copy. You'd probably rather want an API similar to as_strided, to which you can pass a flat buffer and your lengths.
Do you need autograd for this operation? Please keep in mind that NestedTensor currently is inference only.
Thank you for your reply!
Autograd is not required. But I wonder if there is any way to avoid the slicing and copying of A?
This might be somewhat off the nestedtensor topic, but I want to discuss a bit more: one possible way is to expand weight_A according to len_A with torch.repeat_interleave (or torch.gather), so that weight_A becomes [0.2, 0.2, 0.2, 0.3, 0.3, 0.5, 0.5, 0.5, 0.5] and can then be multiplied with A directly.
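As a minimal sketch of this expansion (using the numbers from the snippet above; only real PyTorch ops are used here):
import torch

A = torch.tensor([1., 2., 3., 1., 2., 1., 2., 3., 4.])
len_A = torch.tensor([3, 2, 4])
weight_A = torch.tensor([0.2, 0.3, 0.5])

# expand each weight to the length of its segment:
# [0.2, 0.2, 0.2, 0.3, 0.3, 0.5, 0.5, 0.5, 0.5]
expanded = torch.repeat_interleave(weight_A, len_A)

result = A * expanded  # plain element-wise multiply, but `expanded` costs O(S) extra memory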
However, ideally my expectation is to make this multiplication of A by weight_A as efficient as multiplying an array by a scalar with broadcasting, since a broadcast array operation should be more efficient. See https://numpy.org/doc/stable/user/theory.broadcasting.html#array-broadcasting-in-numpy
@maxwellzh - So, effectively you need a segmented multiplication to which you can pass lengths, in order to avoid expanding memory. NestedTensor could do that for you, maybe via a function such as this:
nt = nestedtensor.as_nested(A, [torch.Size([3]), torch.Size([1]), torch.Size([4])])
This will give you a NestedTensor that looks like this
NestedTensor([
[1, 2, 3],
[2],
[1, 2, 3, 4]
])
You then multiply this by weight_A after reshaping, so
nt * weight_A.reshape(3, 1)
This will give you
NestedTensor([
[0.2, 0.4, 0.6],
[0.6],
[0.5, 1.0, 1.5, 2.0]
])
You can then flatten this and get the result as
[0.2, 0.4, 0.6, 0.6, 0.5, 1.0, 1.5, 2.0]
Could this resolve your issue?
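For comparison, here's a rough sketch of the same computation done with the eager ops available today (torch.split plus a Python-level loop and a final torch.cat), which is roughly what the proposed path would replace:
import torch

A = torch.tensor([1., 2., 3., 2., 1., 2., 3., 4.])
lengths = [3, 1, 4]
weight_A = torch.tensor([0.2, 0.3, 0.5])

# torch.split returns views into A, so no copy happens here
segments = torch.split(A, lengths)

# multiply each segment by its weight and re-concatenate (torch.cat copies)
result = torch.cat([seg * w for seg, w in zip(segments, weight_A)])
# tensor([0.2, 0.4, 0.6, 0.6, 0.5, 1.0, 1.5, 2.0])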
Note that we'd need to implement the following
- as_nested
- NestedTensor.flatten
- An efficient kernel for this segmented multiplication
Do you want to run this only on CPU, or on GPU also?
Thanks, Christian
Could I also ask you for your use case for this operation?
Hi @cpuhrsch,
If the memory layout of A and nt is the same, so that allocation and copying are avoided, that's exactly what I want.
About the NestedTensor.flatten op, I think an in-place NestedTensor.flatten_ is also required.
As for my real use case, it's a speech recognition system implementation. The inputs to my model (e.g. the audio frames) are of variable length. So generally, we pad the short sequences with zeros to match the longest sequence in a mini-batch:
A_pad=[
[1, 2, 3, 0],
[2, 0, 0, 0],
[1, 2, 3, 4]]
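As a sketch, the padded layout above would typically be built with torch.nn.utils.rnn.pad_sequence, while the compact layout just concatenates the sequences:
import torch
from torch.nn.utils.rnn import pad_sequence

A = [torch.tensor([1., 2., 3.]), torch.tensor([2.]), torch.tensor([1., 2., 3., 4.])]

# padded layout: shape (N, T_max) = (3, 4), zeros fill out the short sequences
A_pad = pad_sequence(A, batch_first=True)

# compact layout: shape (S,) = (8,), no padding at all
A_cat = torch.cat(A)
len_A = torch.tensor([len(a) for a in A])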
However, padding leads to extra memory overhead. In most speech recognition systems this waste of memory is acceptable, but for a specific system called the RNN-Transducer, a very large tensor of size (N, T, U, V), which includes two-dimensional padding along T and U, is introduced in the computation.
Such a tensor with shape (16, 512, 128, 1024) takes 16×512×128×1024×4 bytes ≈ 4295 MB of storage in single-precision float representation, so we have to avoid any unnecessary memory copies of this huge tensor.
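To make the saving concrete, a quick back-of-the-envelope check (the per-sequence lengths below are made up purely for illustration; only the padded total matches the numbers above):
# padded joint tensor: N x T_max x U_max x V, single-precision floats
N, T_max, U_max, V = 16, 512, 128, 1024
padded_bytes = N * T_max * U_max * V * 4           # = 4_294_967_296 bytes ≈ 4295 MB

# compact layout only stores sum(T_i * (U_i + 1)) rows of length V;
# with hypothetical real lengths of roughly half the maxima:
T = [256] * N
U = [64] * N
S = sum(t * (u + 1) for t, u in zip(T, U))
compact_bytes = S * V * 4                          # ≈ 1090 MB in this made-up case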
Back to our topic: to reduce the memory footprint, I remove the padding and flatten the tensor to shape (S, V), where S denotes the total length of the real values across the N batch entries. And here comes the problem: how do I multiply this concatenated tensor by a weight of shape (N,)?
Edit: the figure in this discussion might help with understanding: https://github.com/1ytic/warp-rnnt/pull/26. And currently, GPU-only is fine in my case.
Ok great, this makes sense! In terms of timing I don't currently have much time available to work on this specifically, but I'll try to squeeze it in over the coming weeks (no promises!). For your case specifically, it seems that the hardest part is writing an efficient kernel and modifying the RNN-T code. Aside from maybe a nicer input interface as opposed to passing data and lengths separately, what else would you like to see from this? Or do you plan to use this for the entire acoustic model and then feed it into the loss?
Specifically, the RNN-T loss with the compact layout computes costs with respect to each sequence, so costs is of shape (N,).
In the backward pass, we need to multiply by grad_costs to get the gradients according to the chain rule:
grad: (S, V), also represented as (T0*(U0+1)+ T1*(U1+1)+..., V)
grad_costs: (N, )
grad_output = grad_costs ?* grad <- the somehow multiply
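A minimal sketch of that backward multiply using the expansion idea from earlier (the segment lengths here are hypothetical; in the real code they would be T_i*(U_i+1) computed from the length tensors):
import torch

N, V = 3, 5
seg_lens = torch.tensor([6, 2, 12])          # hypothetical T_i*(U_i+1) per sequence
S = int(seg_lens.sum())

grad = torch.randn(S, V)                     # compact-layout gradient, shape (S, V)
grad_costs = torch.randn(N)                  # one scalar per sequence, shape (N,)

# expand grad_costs to one scalar per row of the compact layout, then broadcast over V;
# this still materializes an (S,) tensor, but not an (S, V) one
per_row = torch.repeat_interleave(grad_costs, seg_lens)   # shape (S,)
grad_output = grad * per_row.unsqueeze(1)                 # shape (S, V)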
Currently, I can only think of this usage. I also recently learned that PyTorch implements the _foreach_* APIs in C++; maybe they could help with the efficient kernel?
Edit: just take your time!
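On _foreach_*: as far as I understand these are private/undocumented APIs, but as a rough sketch, an in-place segmented multiply could be expressed over the views returned by torch.split, so no copy of the data is made:
import torch

grad = torch.randn(20, 5)                    # compact (S, V) gradient
seg_lens = [6, 2, 12]                        # hypothetical per-sequence lengths
grad_costs = [0.2, 0.3, 0.5]                 # per-sequence scalars as Python floats

# torch.split returns views, so the in-place multiply below writes straight into `grad`
segments = list(torch.split(grad, seg_lens, dim=0))

# private API; availability/behavior may differ across PyTorch versions
torch._foreach_mul_(segments, grad_costs)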
@maxwellzh - hm, we do already have operators that take lengths as an argument and also return correct gradients (the user would have to pass in a 1d vector with the data flattened). Take a look at ctc_loss or EmbeddingBag in PyTorch.
Just correct me if I misunderstand. Aren't the lengths always required in this case? Without lengths, it's impossible to tell the start position of each sequence in the compact layout.
we do already have operators that take lengths as an argument and also return correct gradients
I looked at ctc_loss and EmbeddingBag, but didn't find such an operator. Can you give more details?
For example, ctc_loss has input arguments input_lengths and target_lengths where the targets can be a packed 1d input (similar to the data layout NestedTensor would provide). EmbeddingBag has an offsets input, which you can use to segment the 1d input into relevant pieces, so you can also pass a compact or packed input. It's just that in each case you explicitly have to pass in the metadata that encodes the segment lengths or offsets etc.
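For instance, a minimal EmbeddingBag sketch with a packed 1d input plus offsets metadata (the sizes here are made up):
import torch
import torch.nn as nn

bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="sum")

# three "bags" of lengths 3, 1, 4, packed into a single 1d index tensor
indices = torch.tensor([1, 2, 3, 2, 1, 2, 3, 4])
offsets = torch.tensor([0, 3, 4])   # start position of each bag in `indices`

out = bag(indices, offsets)         # shape (3, 4): one pooled embedding per bag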
Say the original data is A = [tensor([1., 2., 3.]), tensor([2.]), tensor([1., 2., 3., 4.])]. For the convenience of some other operations, I concatenate the data into A_cat = tensor([1., 2., 3., 2., 1., 2., 3., 4.]).
Now my question is: how can I multiply A_cat with another tensor like B = tensor([5., 3., 2.]), so that each element of B multiplies the corresponding tensor in A? (B has the same length as A.) Is there any efficient way to do this with native pytorch or the nestedtensor lib? Thanks.