pytorch / nestedtensor

[Prototype] Tools for the concurrent manipulation of variably sized Tensors.
BSD 3-Clause "New" or "Revised" License

How can I use nested tensor with a concatenated tensor? #453

Open maxwellzh opened 3 years ago

maxwellzh commented 3 years ago

Say the original data is A=[tensor([1., 2., 3.]), tensor([2.]), tensor([1., 2., 3., 4.])]. For the convenience of some other operations, I concatenate the data into A_cat = tensor([1., 2., 3., 2., 1., 2., 3., 4.]).

Now my question is: how can I multiply A_cat with another tensor like B=tensor([5., 3., 2.]), so that each element of B multiplies the corresponding tensor in A? (B has the same length as A.) Is there an efficient way to do this with native PyTorch or the nestedtensor lib?

Thanks.

cpuhrsch commented 2 years ago

Hello @maxwellzh,

Thank you for posting your question. Could you please detail the expected output of the operation you want to perform?

Thank you, Christian

maxwellzh commented 2 years ago

I also posted on the PyTorch forum, where this might be clearer: https://discuss.pytorch.org/t/efficient-way-to-multiply-two-tensor-with-different-lengths/132029

cpuhrsch commented 2 years ago

Oh I see, so your code snippet is

import torch

A = torch.tensor([1., 2., 3., 1., 2., 1., 2., 3., 4.])
# len_A indicates the real length of each element in A
len_A = [3, 2, 4]
weight_A = torch.tensor([0.2, 0.3, 0.5])
# How can I efficiently do this calculation without a loop?
# The boundaries 3, 5, 9 are computed from len_A.
A[:3] *= weight_A[0]
A[3:5] *= weight_A[1]
A[5:9] *= weight_A[2]

So with NestedTensor you could achieve this using the constructor via

result_nt = nestedtensor.nested_tensor([A[:3], A[3:5], A[5:9]]) * weight_A
result = result_nt.flatten()  # possibly need to use a different operator

However, this would incur a copy. You'd probably rather want an API similar to as_strided, to which you can pass a flat buffer and your lengths.

Do you need autograd for this operation? Please keep in mind that NestedTensor is currently inference-only.

maxwellzh commented 2 years ago

Thank you for your reply! Autograd is not required. But I wonder if there is any way to avoid the slicing and copying of A?

This might be somewhat off the nestedtensor topic, but I want to discuss a bit more. One possible way is to expand weight_A according to len_A with functions like torch.repeat_interleave and torch.gather; weight_A then becomes [0.2, 0.2, 0.2, 0.3, 0.3, 0.5, 0.5, 0.5, 0.5] and can be multiplied with A directly. A minimal sketch of that expansion is below.
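For reference, a minimal sketch of this expansion workaround using torch.repeat_interleave, with the values from the snippet above:

import torch

A = torch.tensor([1., 2., 3., 1., 2., 1., 2., 3., 4.])
len_A = torch.tensor([3, 2, 4])
weight_A = torch.tensor([0.2, 0.3, 0.5])

# Repeat each weight once per element of its segment, then multiply elementwise.
expanded = torch.repeat_interleave(weight_A, len_A)
# expanded == tensor([0.2, 0.2, 0.2, 0.3, 0.3, 0.5, 0.5, 0.5, 0.5])
result = A * expanded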

However, ideally my expectation is to make this multiplication of A by weight_A as efficient as multiplying an array by a scalar with broadcasting; a broadcast array operation avoids materializing the expanded weights and so might be more efficient.

See: https://numpy.org/doc/stable/user/theory.broadcasting.html#array-broadcasting-in-numpy

cpuhrsch commented 2 years ago

@maxwellzh - So, effectively you need a segmented multiplication to which you can pass lengths, to avoid expanding memory. NestedTensor could do that for you, maybe via a function such as this:

nt = nestedtensor.as_nested(A, [torch.Size([3]), torch.Size([1]), torch.Size([4])])

This will give you a NestedTensor that looks like this:

NestedTensor([
  [1, 2, 3],
  [2],
  [1, 2, 3, 4]
])

You then multiply this by weight_A after reshaping, like so:

nt * weight_A.reshape(3, 1)

This will give you

NestedTensor([
    [0.2, 0.4, 0.6],
    [0.6],
    [0.5, 1.0, 1.5, 2.0]
])

You can then flatten this to get the result [0.2, 0.4, 0.6, 0.6, 0.5, 1.0, 1.5, 2.0].
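Until as_nested exists, here is a rough sketch that emulates the same flow with the copying nested_tensor constructor from the earlier comment (the segment boundaries 3, 4, 8 follow from the lengths [3, 1, 4], and the broadcasting behavior is assumed to work as described above):

import torch
import nestedtensor

A = torch.tensor([1., 2., 3., 2., 1., 2., 3., 4.])
weight_A = torch.tensor([0.2, 0.3, 0.5])

# Copying emulation of the proposed zero-copy as_nested constructor.
nt = nestedtensor.nested_tensor([A[:3], A[3:4], A[4:8]])
result_nt = nt * weight_A.reshape(3, 1)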

Could this resolve your issue?

Note that we'd need to implement the following

  1. as_nested
  2. NestedTensor.flatten
  3. An efficient kernel for this segmented multiplication

Do you want to run this only on CPU, or on GPU as well?

Thanks, Christian

cpuhrsch commented 2 years ago

Could I also ask about your use case for this operation?

maxwellzh commented 2 years ago

Hi @cpuhrsch, if the memory layout of A and nt is the same, so that allocation and copying are avoided, then that's exactly what I want.

About the NestedTensor.flatten op, I think an in-place NestedTensor.flatten_ would also be required.

As for my real use case, it's a speech recognition system implementation. The inputs to my model (e.g. the audio frames) are of variable length, so generally we pad the short sequences with zeros to match the longest sequence in the mini-batch:

A_pad = [
  [1, 2, 3, 0],
  [2, 0, 0, 0],
  [1, 2, 3, 4]]

However, padding leads to extra memory overhead. In most speech recognition systems this waste of memory is acceptable, but one specific system, the RNN-Transducer, introduces into the computation a very large tensor of size (N, T, U, V) with two-dimensional padding along T and U. Such a tensor with shape (16, 512, 128, 1024) takes 16×512×128×1024×4 bytes = 4295 MB of storage in single-precision float representation, so we have to avoid any unnecessary copy of this huge tensor. Back to our topic: to reduce memory usage, I remove the padding and flatten the tensor to shape (S, V), where S denotes the total length of the real values across the N batch entries. And here comes the problem: I then want to multiply this concatenated tensor by a weight of shape (N,).
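As a quick sanity check of that storage figure (plain arithmetic, nothing library-specific):

# Padded RNN-T tensor: (N, T, U, V) in float32, i.e. 4 bytes per element.
N, T, U, V = 16, 512, 128, 1024
bytes_padded = N * T * U * V * 4
print(bytes_padded / 1e6)  # ~4295 MB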

Edit: the figure in this discussion might help with understanding: https://github.com/1ytic/warp-rnnt/pull/26. And currently, GPU-only is fine in my case.

cpuhrsch commented 2 years ago

Ok great, this makes sense! In terms of timing, I don't currently have much time available to work on this specifically, but I'll try to squeeze it in over the coming weeks (no promises!). For your case specifically, it seems the hardest parts are writing an efficient kernel and modifying the RNN-T code. Aside from maybe a nicer input interface, as opposed to passing data and lengths separately, what else would you like to see from this? Or do you plan to use this for the entire acoustic model and then feed it into the loss?

maxwellzh commented 2 years ago

Specifically, the RNN-T loss with a compact layout computes a cost for each sequence, so costs has shape (N,). In the backward pass, we need to multiply by grad_costs to get the gradients according to the chain rule:

grad:        shape (S, V), i.e. (T0*(U0+1) + T1*(U1+1) + ..., V)
grad_costs:  shape (N,)
grad_output = grad_costs ?* grad  # <- the "somehow" multiply
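For concreteness, a sketch of this backward scaling using the torch.repeat_interleave workaround from earlier in the thread (the segment lengths and shapes below are illustrative placeholders, not values from the actual RNN-T implementation):

import torch

N, V = 3, 8
seg_lens = torch.tensor([12, 20, 16])   # T_i * (U_i + 1) for each sequence
S = int(seg_lens.sum())
grad = torch.randn(S, V)                # gradients in the compact layout
grad_costs = torch.randn(N)             # one scale factor per sequence

# Expand grad_costs to one value per row of the compact layout,
# then broadcast across the vocabulary dimension V.
scale = torch.repeat_interleave(grad_costs, seg_lens)  # shape (S,)
grad_output = grad * scale.unsqueeze(1)                # shape (S, V)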

Currently, I can only think of this usage. I also recently learned that PyTorch implements the _foreach_* APIs in C++; maybe they could help with the efficient kernel?

Edit: just take your time!

cpuhrsch commented 2 years ago

@maxwellzh - hm, we do already have operators that take lengths as an argument and also return correct gradients (the user would have to pass in a 1d vector with the data flattened). Take a look at ctc_loss or EmbeddingBag in PyTorch.

maxwellzh commented 2 years ago

Just correct me if I misunderstand: aren't the lengths always required in this case? Without the lengths, it's impossible to tell the start position of each sequence in the compact layout.

we do already have operators that take lengths as an argument and also return correct gradients

I looked at ctc_loss and EmbeddingBag, but didn't find such an operator. Can you give more details?

cpuhrsch commented 2 years ago

For example, ctc_loss has input arguments input_lengths and target_lengths, where the targets can be a packed 1d input (similar to the data layout NestedTensor would provide). EmbeddingBag has an offsets input, which you can use to segment the 1d input into the relevant pieces, so there too you can pass a compact or packed input. It's just that in each case you explicitly have to pass in the metadata that encodes the segment lengths, offsets, etc. A minimal sketch of both follows.
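To illustrate these two packed-input APIs, a minimal sketch (all shapes and sizes here are arbitrary):

import torch
import torch.nn.functional as F

# ctc_loss: targets packed into a single 1-d tensor, segmented by target_lengths.
T, N, C = 50, 3, 20                        # time steps, batch size, classes
log_probs = torch.randn(T, N, C).log_softmax(2)
targets = torch.randint(1, C, (24,))       # all targets concatenated; no blanks (0)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([10, 6, 8])  # sums to 24
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)

# EmbeddingBag: a packed 1-d index input, segmented by offsets.
bag = torch.nn.EmbeddingBag(num_embeddings=100, embedding_dim=16)
indices = torch.tensor([4, 7, 9, 1, 3])    # two bags packed together
offsets = torch.tensor([0, 3])             # bag 0 = indices[0:3], bag 1 = indices[3:]
out = bag(indices, offsets)                # shape (2, 16)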