pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Successive frames growing, but why? #1379

Closed. hrbigelow closed this issue 4 years ago.

hrbigelow commented 4 years ago

❓ Questions and Help

In the attached report below, I see successive frames growing by ~30 lines each. The relevant code is below. The approach I used was to load all of the training data (about 300 MB) into memory as two tensors (data_source.snd_data and data_source.mel_data) and then, at each training step, fill the batch with a different slice of those tensors. I thought the varying slices at each iteration were causing graph recompilation, but in the code below I replace that step with the same hard-coded slice and the problem remains.

Would anyone have any insights into this problem?

Any help would be greatly appreciated!

    def set(self, b, sample_slice, data_source):
        ss = sample_slice
        # self.voice_index[b] = ss.voice_index
        wo = ss.wav_offset
        mo = ss.mel_offset
        dws = ss.dec_wav_slice
        mis = ss.mel_in_slice

        self.lcond_slice[b] = ss.lcond_slice 
        self.loss_wav_slice[b] = ss.loss_wav_slice 
        # self.wav_input[b,...] = data_source.snd_data[wo + dws[0]:wo + dws[1]] 
        # self.mel_input[b,...] = data_source.mel_data[mo + mis[0]:mo +
        #         mis[1],:].transpose(1, 0)

        self.wav_input[b,...] = data_source.snd_data[3184397:3186543]
        self.mel_input[b,...] = \
                data_source.mel_data[19855:19899,:].transpose(1, 0)

xla.report.618294e.txt xla_metrics.618294e.txt

dlibenzi commented 4 years ago

That is generating different graphs: the b argument becomes a PyTorch/ATen scalar that gets baked into the graph, and since that graph is the input to the forward+backward, each new value of b produces a different graph. I suggest instead having a dataloader on the CPU that creates the proper batch tensors, and then wrapping that dataloader with a ParallelLoader (I assume you are using pytorch/xla multi-processing).
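
For concreteness, a minimal sketch of that pattern (the dataset name and batch shapes here are placeholders, not from this issue):

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

device = xm.xla_device()

# Plain CPU DataLoader that already yields fixed-shape batch tensors.
train_loader = torch.utils.data.DataLoader(my_dataset, batch_size=16,
                                           shuffle=True, drop_last=True)

# ParallelLoader uploads the CPU batches to the TPU in the background.
para_loader = pl.ParallelLoader(train_loader, [device])
for wav_input, mel_input in para_loader.per_device_loader(device):
    # wav_input / mel_input are XLA tensors with constant shapes, so the
    # traced graph stays the same across steps.
    ...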

hrbigelow commented 4 years ago

Ahh, very cool, Davide! No, I am not using multi-processing, since I don't have access to multiple TPUs. Is it beneficial to use ParallelLoader even in the case of running the model on a single TPU?

I see, so any time a CPU value is used as an index for a tensor operation, that induces a different graph, even if the shape of the resulting tensor is the same?

hrbigelow commented 4 years ago

Thinking about this some more: although the DataLoader will probably solve this particular problem, the larger problem is the one I mentioned in the other issue here, namely that I am trimming intermediate tensors using values computed on the CPU.

I'm wondering: if I first calculate all of those values on the CPU, put them in a tensor, and send that tensor over to the TPU at the beginning of the step, would that avoid graph recompilation?

Thanks again Davide!

Henry

dlibenzi commented 4 years ago

The TPU you have in Colab has 8 cores 😉

https://cloud.google.com/tpu/docs/system-architecture#hardware_architecture

It is like having 8 GPUs, so you can take advantage of replicated training with multi-processing. I think the new Colab we have in contrib/ already shows the use of multi-processing. If the number of samples you have is small, I suggest you stick with a single core (as you are probably doing now). Actually, until you have good single-core performance/accuracy, I suggest you stay with a single core, as it is easier to debug.
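
Still, for reference, a rough sketch of the multi-processing entry point (assuming the xla_multiprocessing API from this era of pytorch/xla; _mp_fn and its body are placeholders):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # One process per TPU core; each process gets its own XLA device.
    device = xm.xla_device()
    # ... build the model and dataloader here and run the training loop ...

if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8)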

If you create the tensors on the CPU (with the same shape) and send them to the TPU, it should be OK.

dlibenzi commented 4 years ago

I'm wondering: if I first calculate all of those values on the CPU, put them in a tensor, and send that tensor over to the TPU at the beginning of the step, would that avoid graph recompilation?

Yes. In general the input pipeline processing is done on the CPU, and then the batches are sent to the device. Note that by doing this manual send to the device, you will not take advantage of the ParallelLoader background sender, which is able to overlap TPU computation with data upload. This is why a DataLoader will give you a more forward-looking implementation (you can then wrap that DataLoader with a ParallelLoader once you move to multi-processing).
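
A tiny sketch of the manual variant described above (the shapes and names are illustrative only):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# Batch assembled on the CPU with a fixed shape ...
wav_batch = torch.empty(16, 2146)
# ... then explicitly uploaded each step; unlike ParallelLoader, this does
# not overlap the data upload with TPU computation.
wav_batch = wav_batch.to(device)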

hrbigelow commented 4 years ago

Oh wow! So, if the 8 TPU cores are fully used, would you say it's roughly 8x more powerful than a P100 GPU?

Also, for the more general problem I linked, is there any primitive which would allow the following sort of "trimming" operation to be done on TPU?

Point taken about the DataLoader and background sending. I hadn't appreciated that before.

As for the problem I linked before: it probably cannot be solved by the DataLoader, because it is an intermediate calculation during the forward pass. Is there some function like the torch.trim below?

# slices and x are both tensors on the TPU
slices = [ [3, 12], [0, 9], [1, 10], [2, 11] ]
x = ... # shape = [4, 12]

# x_trimmed.shape = [4, 9]
x_trimmed = torch.trim(input=x, dim=1, slices=slices)

I suppose I could do something like:

xf = x.flatten()
# flattened, expanded version of [ [3, 12], [0, 9], [1, 10], [2, 11] ]
slice_inds = ... # [ 3, 4, 5, ..., 11, 12, 13, ..., 20, 25, 26, ..., 33, 38, 39, ..., 46 ]
xf_trim = torch.index_select(xf, 0, slice_inds)
x_trim = xf_trim.reshape(4, 9)

Or is there a better way?
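
For what it's worth, a small self-contained sketch of that index_select idea, building slice_inds on the device from the slice starts (assuming, as in the example, that every slice has the same width and x has shape [4, 12]):

import torch

x = torch.arange(48.0).reshape(4, 12)              # stand-in for the intermediate tensor
starts = torch.tensor([3, 0, 1, 2])                # from slices = [[3, 12], [0, 9], [1, 10], [2, 11]]
width = 9

# Flat indices into x.flatten(): row_offset + start + [0, width)
row_offsets = torch.arange(x.size(0)) * x.size(1)  # [0, 12, 24, 36]
slice_inds = ((row_offsets + starts).unsqueeze(1)
              + torch.arange(width)).reshape(-1)

xf_trim = torch.index_select(x.flatten(), 0, slice_inds)
x_trim = xf_trim.reshape(4, width)                 # shape [4, 9]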

dlibenzi commented 4 years ago

Oh wow! So, if the 8 TPU cores are fully used, would you say it's roughly 8x more powerful than a P100 GPU?

Comparing per-core performance might not be very meaningful. You can have core A that is 5x the performance of core B, but at 10x the price. On Google Cloud, one TPU v3-8 (an 8-core TPU v3) costs about the same as a machine with 4 V100s. Performance also depends a lot on the model architecture, so it is hard to say. You can look at the MLPerf numbers to get an idea.

Also, for the more general problem I linked, is there any primitive which would allow the following sort of "trimming" operation to be done on TPU?

Point taken about the DataLoader and background sending. I hadn't appreciated that before.

As for the problem I linked before: it probably cannot be solved by the DataLoader, because it is an intermediate calculation during the forward pass. Is there some function like the torch.trim below?

# slices and x are both tensors on the TPU
slices = [ [3, 12], [0, 9], [1, 10], [2, 11] ]
x = ... # shape = [4, 12]

# x_trimmed.shape = [4, 9]
x_trimmed = torch.trim(input=x, dim=1, slices=slices)

I suppose I could do something like:

xf = x.flatten()
# flattened, expanded version of [ [3, 12], [0, 9], [1, 10], [2, 11] ]
slice_inds = ... # [ 3, 4, 5, ..., 11, 12, 13, ..., 20, 25, 26, ..., 33, 38, 39, ..., 46 ]
xf_trim = torch.index_select(xf, 0, slice_inds)
x_trim = xf_trim.reshape(4, 9)

Or is there a better way?

If you have a TPU tensor holding the slice window (start, end) and you are able to update it in place, it should be fine (pseudocode):

w = tensor([0, 10])
for ...:
  i = t.slice(w)
  w += 10
  out = model(i)
  ...
hrbigelow commented 4 years ago

Hi Davide,

If you have a TPU tensor holding the slice window (start, end) and you are able to update it in place, it should be fine (pseudocode):

w = tensor([0, 10])
for ...:
  i = t.slice(w)
  w += 10
  out = model(i)
  ...

Ahh, very cool, I will try that. One follow-up question though: given that the above pseudocode requires a for-loop, it must be run on the CPU. I realize that this will produce tensor operations that can then be built and cached, but will it play well with the ParallelLoader background-sender logic? That is, can this for-loop be executed concurrently for the next step while the current step is running on the TPU?

Thanks again!

dlibenzi commented 4 years ago

Sorry ... by the for loop I meant the per-step loop (the one which enumerates the samples and runs the model on them).

hrbigelow commented 4 years ago

Hi Davide,

Ahh, I see. The problem is that the trimming operation I need is for an intermediate calculation inside the model. I do know the value of slices below at the start of each training step, but x in the example below is an intermediate calculation. I think what I need is torch.take, unless there is some GPU/TPU operation that emulates a looping construct? I looked at torch.apply_, but it only works with CPU tensors. Basically, I want to apply the slice operation to each element in the batch, but do it directly on the TPU.

# slices and x are both tensors on the TPU
slices = [ [3, 12], [0, 9], [1, 10], [2, 11] ]
x = ... # shape = [4, 12]

# x_trimmed.shape = [4, 9]
x_trimmed = torch.trim(input=x, dim=1, slices=slices)
dlibenzi commented 4 years ago

Do you have to process all the slices of x within a single step, or a single slice (sliding window?) of x for each step?

hrbigelow commented 4 years ago

I have to trim all of the slices of x in a single training step.

For context, x is the output of a convolutional layer, and x_trim is the input to the next layer. The dimensions of x are [n_batch, n_timesteps], and the dimensions of x_trim have the same semantics. The reason the trimming is necessary is that each individual batch channel x[b,:] represents an intermediate calculation, ultimately arising from a random window taken from a different wav file. But, because the stack of convolutions involves both upsampling and downsampling layers before they get to x, a phasing phenomenon occurs such that different batch channels have a different phase relative to the full input.

Note, though, that the different trimming ranges are known at the beginning of each training step, so I'd like to take advantage of that somehow.

dlibenzi commented 4 years ago

Are the slices fixed for the training, or their content (count and window size) change at every training step?

hrbigelow commented 4 years ago

Are the slices fixed for the training, or their content (count and window size) change at every training step?

Something in between. The shape of x stays the same from one training step to the next, and so does the shape of x_trim (which is smaller than x). But which elements are trimmed from each batch channel varies between channels and over training steps. In the example below, the content of the slices tensor can be calculated at the beginning of the training step, even though x is an intermediate tensor deep in the model.

# '*' means retained element of x
# '-' means element that is trimmed
# Assume batch size of 4.  The second dimension is the time, or 'window' dimension.

# training step 1
# slices = [ [ 2, 14], [0, 12], [3, 15], [1, 13] ]
# x.shape = [4, 15],  x_trim.shape = [4, 12]
# contents of x
--************-
************---
---************
-************--

# training step 2
# slices = [ [2, 14], [1, 13], [0, 12], [3, 15] ]
# x.shape = [4, 15],  x_trim.shape = [4, 12]
# contents of x
--************-
-************--
************---
---************

# training step 3
# slices = [ [0, 12], [1, 13], [2, 14], [3, 15] ]
# x.shape = [4, 15], x_trim.shape = [4, 12]
************---
-************--
--************-
---************
dlibenzi commented 4 years ago

I am honestly not sure how to do that fast (without going back to the CPU) with pytorch/xla. The issue is that we need a slicing op whose bounds are dynamic (the bounds come in via a tensor, not via Python scalars). XLA has a dynamic slice op, which could have helped in that case, but it is not exposed to PyTorch directly:

https://www.tensorflow.org/xla/operation_semantics#dynamicslice

hrbigelow commented 4 years ago

Is torch.take one of the ops implemented in XLA?

dlibenzi commented 4 years ago

Is torch.take one of the ops implemented in XLA?

That might work. We are adding an XLA lowering.
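
As a sketch of how take() could express the same trimming (reusing the flat-index construction from the earlier index_select example): torch.take indexes the flattened input and returns a tensor with the shape of the index tensor, so no explicit flatten/reshape is needed.

import torch

x = torch.arange(48.0).reshape(4, 12)
starts = torch.tensor([3, 0, 1, 2])
width = 9

flat_idx = ((torch.arange(x.size(0)) * x.size(1) + starts).unsqueeze(1)
            + torch.arange(width))                 # shape [4, 9]
x_trim = torch.take(x, flat_idx)                   # shape [4, 9]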

dlibenzi commented 4 years ago

Though the indices might get too dense using that...

hrbigelow commented 4 years ago

Yeah, I am not sure how efficient it would be relative to transferring to the CPU, slicing in a for-loop, and then transferring back to the GPU/TPU. The other goal here, though, is to avoid graph recompilation.

How hard is it to write a new op in torch/CUDA vs. lowering an existing op to XLA? Does the XLA version get to reuse any of the CUDA code somehow?

dlibenzi commented 4 years ago

Let's see how it goes. Our take(), if the indices are dense as in your case, would go via our XLA dense gather, which might not be too bad.

Adding a new PyTorch op needs to be negotiated with the PyTorch team. We can't just add an op; we need backward and autograd integration. We also don't want to have XLA-specific models, though for certain ops we might ask for one, if it would result in performance gains.

hrbigelow commented 4 years ago

Sounds great Davide! I saw your commit and am eager to try it out. Thanks so much.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.