ibeltagy opened this issue 4 years ago
Thanks for reporting @ibeltagy , we will take a look.
Thanks, @dlibenzi.
While-loop-based patch extraction is likely slower than convolution tricks:
@dlibenzi, sorry, I am not sure I follow how this link is related to `unfold`.
I see. This is the C++ TensorFlow version of `torch.unfold`. But this is not something that can be called from pytorch-xla?
It cannot be called, but we can use the same idea (convolutions with kernels picking up one element at a time) for the forward.
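For anyone reading along, here is a rough sketch of the convolution idea (not the actual XLA lowering): emulate a 1-D `unfold` with a convolution whose kernels are one-hot vectors, so each output channel picks up exactly one element of the window. The helper name and the plain-PyTorch setting are only for illustration.

```python
import torch
import torch.nn.functional as F

def unfold_via_conv1d(t, size, step):
    """Emulate t.unfold(1, size, step) for a [batch, length] tensor."""
    # One output channel per window position; kernel i is a one-hot vector
    # that selects element i of the window.
    weight = torch.eye(size, dtype=t.dtype).unsqueeze(1)   # [size, 1, size]
    out = F.conv1d(t.unsqueeze(1), weight, stride=step)    # [batch, size, n_windows]
    return out.transpose(1, 2)                              # [batch, n_windows, size]

t = torch.randn(12, 2048)
assert torch.allclose(unfold_via_conv1d(t, 512, 256), t.unfold(1, 512, 256))
```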
Hey @JackCaoG, I am just curious if there are any updates here.
Hi @ibeltagy, I am working on the lowering part, but it is a bit tricky. You will see the PR linked in this issue when it is ready 😄.
Thanks, @JackCaoG, for the forward function in your PR here. I ran your code and I successfully get `Counter: xla::unfold` and still get `Counter: aten::unfold_backward`, as expected. There are a few issues though:
The following `unfold` takes close to an hour to compile the first time, but it gets faster afterward. I know the first step is slow, but is it expected to take an hour? Note that this is the trivial case where seqlen == window size, so the output contains only one slice and no data copying is needed:
t.unfold(1, 512, 256) # t.shape = torch.Size([12, 512, 64]), t.unfold.shape = torch.Size([12, 1, 64, 512])
The following `unfold` operation OOMs even though it doesn't copy that much data. Is this expected? In comparison, a 12GB GPU has enough memory to process this:
t.unfold(1, 512, 256) # t.shape = torch.Size([12, 2048, 64]), t.unfold.shape = torch.Size([12, 7, 64, 512])
Thanks
Hi @ibeltagy, I am not sure if 1 hour is too long; it really depends on your model size. Do you remember how much time it took prior to the `unfold` change?
For the second question I think I have an idea. During the lowering of `unfold`, for an input with shape [12, 2048, 64], size=512, step=256, it will generate two iota vectors of size [12 2048 64 - 512, 1, 12 2048 64 - 512] and a filter of the same size. It will then use `slice` to shrink the filter size with the step. You can easily see that this filter and the two intermediate vectors are huge. I might be able to find a way to perform the `slice` on the iota vectors instead; that should save us some space, but the filter itself is still huge.
I chose this lowering because the convolution trick is likely much faster than the loop-based approach. For native PyTorch on GPU, `unfold` just plays with the pointer and the stride, but for XLA we actually need to compute the output and store it (`unfold` is not a view op on XLA). This is the downside of not being able to access the storage. Does this OOM issue block you from using XLA on this model?
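To make the view point concrete, here is a quick check on native (non-XLA) PyTorch showing that `unfold` there only manipulates sizes and strides over the same storage, which is exactly what XLA cannot do without materializing the result:

```python
import torch

t = torch.randn(12, 2048, 64)
u = t.unfold(1, 512, 256)             # shape [12, 7, 64, 512]

print(u.shape)                        # torch.Size([12, 7, 64, 512])
print(u.data_ptr() == t.data_ptr())   # True -- no data was copied, it is a view
print(u.stride())                     # strides into the original buffer
```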
Do you remember how much time it took prior to the `unfold` change?
around 5 minutes
[12 2048 64 - 512, 1, 12 2048 64 -512 ]
Yeah, this is huge and won't work.
the convolution trick is likely much faster than the loop-based approach.
Can you elaborate on what the loop-based approach is? Is it a loop with multiple `slice` operations? If I implement this on the PyTorch side, is it going to be as fast/slow as implementing it on the C++ side?
Does this OOM issue block you from using XLA on this model?
Yes, and the actual input is even larger, something like [16, 4096, 64], size=1024, step=512. The 4096 dimension is the sequence length, which is very long for Longformer.
Hi @ibeltagy
Going from 5 minutes to 1 hour seems like a big jump. One possibility is that `unfold` was not lowered prior to this change, and it is a pretty complex lowering (transpose + iota*2 + eq + a couple of reshapes + convolution + transpose). If the metrics suggest that there is only one compile, then most likely the time is from the `unfold`. If you can dump the HLO graph I can double-check that.
For the loop-based approach, yes, I was thinking about multiple `slice` operations. PyTorch's slice is created as a view here. If this is implemented in C++ and you don't need the view property of `unfold`, I think using `xla::Slice` directly here will be faster (I haven't tested it, but maintaining a ViewInfo is pretty complex).
If it is possible to implement `unfold` as a view, that would be the ideal solution because it won't waste any memory, which is the bottleneck in the Longformer model.
Do you mind trying out the idea of splitting the tensor before the unfold and concatenating the results afterward? Something like:
>>> torch.arange(12).reshape([2,2,3]).unfold(1, 2, 1)
tensor([[[[ 0, 3],
[ 1, 4],
[ 2, 5]]],
[[[ 6, 9],
[ 7, 10],
[ 8, 11]]]])
>>> torch.arange(12).reshape([2,2,3]).split(1)[0].unfold(1, 2, 1)
tensor([[[[0, 3],
[1, 4],
[2, 5]]]])
>>> torch.arange(12).reshape([2,2,3]).split(1)[1].unfold(1, 2, 1)
tensor([[[[ 6, 9],
[ 7, 10],
[ 8, 11]]]])
I will try to see if I can reduce the memory usage of the current implementation and think a bit more about the slice approach.
I pushed a new change to the unfold PR; the peak memory usage should be reduced to 1/3 when step > 3.
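For reference, the split-then-unfold-then-cat suggestion above generalizes to something like the sketch below; the helper and its defaults are hypothetical. Splitting along a dimension other than the one being unfolded keeps each intermediate produced by the lowering small, and the concat restores the full result.

```python
import torch

def unfold_in_chunks(t, dim, size, step, split_size=1, split_dim=0):
    # Unfold each slice along `split_dim` separately so the intermediate
    # buffers created by the lowering stay small, then concatenate.
    pieces = [p.unfold(dim, size, step) for p in t.split(split_size, dim=split_dim)]
    return torch.cat(pieces, dim=split_dim)

t = torch.arange(12).reshape([2, 2, 3])
assert torch.equal(unfold_in_chunks(t, 1, 2, 1), t.unfold(1, 2, 1))
```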
Will try both and let you know. Thanks.
If you guys can post a simple repro and dump the HLO graph, we could see what is going on:
print(torch_xla._XLAC._get_xla_tensors_hlo([unfold_result]))
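A minimal repro along those lines might look like the following; it assumes an XLA device is available and uses only the HLO-dump helper quoted above:

```python
import torch
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()
t = torch.randn(12, 512, 64, device=device)
unfold_result = t.unfold(1, 512, 256)

# Dump the HLO for the pending computation that produces `unfold_result`.
print(torch_xla._XLAC._get_xla_tensors_hlo([unfold_result]))
```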
I tried the iterative slicing that you suggested and found it to work well. The memory usage is low enough that I can run the model on long sequences, and the model is fast enough (1.7x slower than a GPU that uses `as_strided`) that it is usable. Therefore, I don't think I will need the current lowering of `unfold`, especially since it is memory-expensive.
Here's another thing that can use your help, and please let me know if I should move it to a separate issue. Right now the model is 1.7x slower than GPU. If you guys have any insights on how to make it faster, that would be great. And, I don't think the iterative unfold vs. as_strided is the main contributor to the slowdown. I tried the model with this part of the code removed and it was still slower than on a GPU.
The model code is here. It is the same as RoBERTa, with the only difference being the self-attention operation. In particular, the two matrix multiplications here and here are replaced with the two functions `_sliding_chunks_matmul_qk` and `_sliding_chunks_matmul_pv`. I am also attaching the debug output, which has a dump of the HLO graph: debug.tar.gz.
Hi @ibeltagy, glad to hear that you got `unfold` working. Let's keep this thread about `unfold` and open a new issue for the performance optimization 😄.
Sure. I will move the model optimization to a separate issue. One thing that's still relevant here is finding out if `unfold` can be lowered as a view without additional memory. The iterative unfold that I am using is just a temporary hack.
For sure, we still want `unfold` to be lowered in a way that is usable for you. We are a small team and have to pick tasks carefully; since this is not a blocker for you, it is likely to be lower priority (compared to your optimization, for example). I will keep this thread alive and keep you updated.
Hi, @ibeltagy , I have similar issues when using unfold. Do you mind elaborating on how iterative slicing works? Maybe via an example?
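ibeltagy's exact code isn't shown in this thread, but one plausible reading of "iterative slicing" is to replace `unfold` with a loop of `narrow` calls plus a `stack`; a hypothetical helper that matches `torch.unfold` semantics:

```python
import torch

def unfold_via_slices(t, dim, size, step):
    """Drop-in replacement for t.unfold(dim, size, step) built from narrow + stack."""
    starts = range(0, t.size(dim) - size + 1, step)
    windows = [t.narrow(dim, s, size) for s in starts]  # one slice per window
    out = torch.stack(windows, dim=dim)                  # new window axis at `dim`
    return out.movedim(dim + 1, -1)                      # window elements become the last axis

t = torch.arange(12).reshape([2, 2, 3])
assert torch.equal(unfold_via_slices(t, 1, 2, 1), t.unfold(1, 2, 1))
```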
Hi, is `aten::unfold` lowered? I am getting the error below; is there a workaround?
UserWarning: 0The operator aten::unfold appears to be a view operator, but it has no implementation for the backend "xla:0". View operators don't support falling back to run on the CPU, since the tensor's storage cannot be shared across devices. (Triggered internally at ../aten/src/ATen/native/CPUFallback.cpp:175.)
@coleridge72 I think `unfold` now gets dispatched to `im2col`, but we don't have a lowering for that yet either. You can follow up in https://github.com/pytorch/xla/issues/2932. This message is somewhat OK: you will get the right result, but we fall back to CPU to execute `unfold`, which creates a speed penalty.
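One way to check whether an op such as `aten::unfold` or `aten::im2col` is falling back to CPU is to print the torch_xla metrics report after running a step; any `aten::` counters it lists correspond to ops that were not lowered:

```python
import torch_xla.debug.metrics as met

# Run the model for a step first; aten::... counters in this report are
# ops that fell back to CPU (e.g. aten::unfold / aten::im2col).
print(met.metrics_report())
```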
🚀 Feature
Add a lowering for `unfold`.

Motivation

I want to run Longformer (model code on the HF repo) on pytorch-xla, and this requires an overlapping sliding window operation, which needs a lowering for `unfold`.

Pitch

Add a lowering for `unfold`.

Alternatives

Use `as_strided`, but the current implementation is limited as discussed in this issue.

Additional context

Below is the metric report for the forward pass of Longformer with `unfold`. It has entries for `aten::unfold`.
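For context on the `as_strided` alternative, the sliding window that `unfold` performs is just a shape/stride trick on native backends. A sketch of the equivalent `as_strided` call (hypothetical helper, and subject to the XLA `as_strided` limitations discussed in the linked issue):

```python
import torch

def unfold_with_as_strided(t, dim, size, step):
    """Equivalent of t.unfold(dim, size, step) expressed via as_strided."""
    n_windows = (t.size(dim) - size) // step + 1
    new_shape = list(t.shape)
    new_shape[dim] = n_windows
    new_shape.append(size)                  # window elements go to a new last dim
    new_stride = list(t.stride())
    new_stride[dim] = t.stride(dim) * step  # hop `step` elements between windows
    new_stride.append(t.stride(dim))        # consecutive elements within a window
    return t.as_strided(new_shape, new_stride)

t = torch.randn(12, 2048, 64)
assert torch.equal(unfold_with_as_strided(t, 1, 512, 256), t.unfold(1, 512, 256))
```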