pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Help with Debug basics? (Crashing with custom WaveFlow code) #2218

Closed CookiePPP closed 4 years ago

CookiePPP commented 4 years ago

❓ Questions and Help

Example notebook: https://colab.research.google.com/drive/1DuPUufKGg0xqkuqzunr1YgGE4_d1vbu3?usp=sharing

I'd like to ask for help with how to debug which line/operation is causing the crash.

I've read Troubleshooting from the readme however I still don't understand how to go about debugging specific operations/this issue.
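For reference, the only hooks I've picked up from the Troubleshooting guide so far are roughly the ones below (environment flags plus the metrics report); I may well be using them wrong.

# Run with the debug flags from the Troubleshooting guide, e.g.:
#   XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 XLA_GET_TENSORS_OPBYOP=1 python3 train.py -c config.json
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()

# ... run a training step or two on `device` ...

# Counters starting with "aten::" are ops that fell back to CPU; CompileTime /
# ExecuteTime show how often compilation and execution are happening.
print(met.metrics_report())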

Error Copy/Paste into Pastebin


Last few lines of the error


2020-06-13 12:38:56.612042: E tensorflow/compiler/xla/xla_client/xla_util.cc:76] StackTrace:
2020-06-13 12:38:56.612053: E tensorflow/compiler/xla/xla_client/xla_util.cc:76] *** Begin stack trace ***
2020-06-13 12:38:56.612065: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    tensorflow::CurrentStackTrace[abi:cxx11]()
2020-06-13 12:38:56.612078: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    xla::util::ReportComputationError(tensorflow::Status const&, absl::Span<xla::XlaComputation const* const>, absl::Span<xla::Shape const* const>)
2020-06-13 12:38:56.612092: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    xla::XrtComputationClient::CheckCompileStatus(tensorflow::Status const&, std::vector<xla::ComputationClient::CompileInstance, std::allocator<xla::ComputationClient::CompileInstance> > const&, xla::XrtComputationClient::SessionWork const&)
2020-06-13 12:38:56.612106: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    
2020-06-13 12:38:56.612118: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    
2020-06-13 12:38:56.612129: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    
2020-06-13 12:38:56.612141: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    
2020-06-13 12:38:56.612153: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    
2020-06-13 12:38:56.612166: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    clone
2020-06-13 12:38:56.612178: E tensorflow/compiler/xla/xla_client/xla_util.cc:76] *** End stack trace ***
2020-06-13 12:38:56.612190: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]
2020-06-13 12:38:56.612201: E tensorflow/compiler/xla/xla_client/xla_util.cc:76] Status: Not found: From /job:tpu_worker/replica:0/task:0:
2020-06-13 12:38:56.612213: E tensorflow/compiler/xla/xla_client/xla_util.cc:76] Could not find registered platform with name: "Jellyfish"
2020-06-13 12:38:56.612225: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]     [[{{node XRTCompile_5}}]]

Edit: Can anybody confirm whether F.interpolate works? I don't see any operations after that line.

Edit 2: F.interpolate does work in an isolated environment. The search continues...

dlibenzi commented 4 years ago

Can you try to select nightly from the versions menu?

CookiePPP commented 4 years ago

@dlibenzi Can you provide an example of how to install nightly? I used the example from 'Getting Started with PyTorch on Cloud TPUs' to install in that notebook.

edit: I'm giving a presentation for the next 2 hours. Thanks for the pic; I'll reply when I'm back.

dlibenzi commented 4 years ago

[image]
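(For anyone else reading: that menu presumably just sets the VERSION parameter fed to env-setup.py; from memory, the cell in the Getting Started notebook looks roughly like this, though the exact script URL and options may have changed since.)

VERSION = "nightly"  #@param ["1.5", "20200325", "nightly"]
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION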

CookiePPP commented 4 years ago

@dlibenzi Alright, sorry that took so long.

This should be the error with nightly.

And this is while running with the command XLA_GET_TENSORS_OPBYOP=1 XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 XLA_USE_BF16=1 python3 train.py -c config.json:

2020-06-13 16:43:29.489064: E tensorflow/compiler/xla/xla_client/xla_util.cc:76] 
2020-06-13 16:43:29.489076: E tensorflow/compiler/xla/xla_client/xla_util.cc:76] 
2020-06-13 16:43:29.489088: E tensorflow/compiler/xla/xla_client/xla_util.cc:76] OutputShape: (bf16[1,256,41]{1,2,0})
2020-06-13 16:43:29.489100: E tensorflow/compiler/xla/xla_client/xla_util.cc:76] 
2020-06-13 16:43:29.489112: E tensorflow/compiler/xla/xla_client/xla_util.cc:76] StackTrace:
2020-06-13 16:43:29.489124: E tensorflow/compiler/xla/xla_client/xla_util.cc:76] *** Begin stack trace ***
2020-06-13 16:43:29.489137: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    tensorflow::CurrentStackTrace[abi:cxx11]()
2020-06-13 16:43:29.489153: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    xla::util::ReportComputationError(tensorflow::Status const&, absl::lts_2020_02_25::Span<xla::XlaComputation const* const>, absl::lts_2020_02_25::Span<xla::Shape const* const>)
2020-06-13 16:43:29.489168: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    xla::XrtComputationClient::CheckCompileStatus(tensorflow::Status const&, std::vector<xla::ComputationClient::CompileInstance, std::allocator<xla::ComputationClient::CompileInstance> > const&, xla::XrtComputationClient::SessionWork const&)
2020-06-13 16:43:29.489182: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    
2020-06-13 16:43:29.489205: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    
2020-06-13 16:43:29.489219: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    
2020-06-13 16:43:29.489231: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    
2020-06-13 16:43:29.489244: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    
2020-06-13 16:43:29.489256: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]    clone
2020-06-13 16:43:29.489269: E tensorflow/compiler/xla/xla_client/xla_util.cc:76] *** End stack trace ***
2020-06-13 16:43:29.489281: E tensorflow/compiler/xla/xla_client/xla_util.cc:76] 
2020-06-13 16:43:29.489294: E tensorflow/compiler/xla/xla_client/xla_util.cc:76] Status: Not found: From /job:tpu_worker/replica:0/task:0:
2020-06-13 16:43:29.489307: E tensorflow/compiler/xla/xla_client/xla_util.cc:76] Could not find registered platform with name: "Jellyfish"
2020-06-13 16:43:29.489321: E tensorflow/compiler/xla/xla_client/xla_util.cc:76]     [[{{node XRTCompile_196}}]]
Traceback (most recent call last):
  File "train.py", line 626, in <module>
    train(num_gpus, args.rank, args.group_name, **train_config)
  File "train.py", line 465, in train
    outputs = model(mel, audio, speaker_ids)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 609, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/codedump/waveglow_TPU/efficient_model_ax.py", line 175, in forward
    audio, log_s = affine_coup(audio, cond, speaker_ids=speaker_ids)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 609, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/codedump/waveglow_TPU/efficient_modules.py", line 132, in forward
    log_s, t = self.WN(audio_0, spect, speaker_ids)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 609, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/codedump/waveglow_TPU/glow_ax.py", line 251, in forward
    spect = self._upsample_mels(spect, audio.shape)# [B, n_channels*n_layers, T//hop_length] -> [B, n_channels*n_layers, T//n_group]
  File "/content/codedump/waveglow_TPU/glow_ax.py", line 230, in _upsample_mels
    cond = F.interpolate(cond, size=audio_size[3], mode=self.upsample_mode, align_corners=True if self.upsample_mode == 'linear' else None)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 3154, in interpolate
    return torch._C._nn.upsample_linear1d(input, output_size, align_corners, sfl[0])
RuntimeError: Not found: From /job:tpu_worker/replica:0/task:0:
Could not find registered platform with name: "Jellyfish"
     [[{{node XRTCompile_196}}]]

I can see that the issue is the upsampling. If you have another suggestion I'd be happy to hear it; otherwise I'll try transposed convolutions or nearest-neighbor upsampling.
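For reference, the transposed-conv alternative I have in mind is roughly the sketch below (hypothetical channel counts and hop length, not the actual WaveFlow config):

import torch
import torch.nn as nn

# Hypothetical replacement for the F.interpolate upsampling: a learned
# ConvTranspose1d with stride equal to the hop length, so the conditioning
# goes from [B, C, T // hop_length] to [B, C, T] without calling interpolate.
hop_length = 256                          # assumed value, not the real config
upsample = nn.ConvTranspose1d(
    in_channels=256, out_channels=256,
    kernel_size=2 * hop_length, stride=hop_length, padding=hop_length // 2)

cond = torch.randn(1, 256, 41)            # matches the OutputShape in the log above
out = upsample(cond)                      # -> [1, 256, 41 * hop_length]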

dlibenzi commented 4 years ago

I can't make much sense of that Colab. Where is the training loop? Where are the XLA devices requested? Where are the XLA data loaders created?

CookiePPP commented 4 years ago

@dlibenzi It's all in the "codedump" folder then "waveglow_TPU".

I'm having trouble recreating the issue in a simpler example, but I'll try to provide an easy-to-read notebook as soon as I figure out what's gone wrong.

dlibenzi commented 4 years ago

The code is a bit messy, but from a quick look, you cannot do things like:

if rank == 0:
  do_op_which_uses_tensors(...)
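Tensor operations conditioned on the rank make the per-core graphs diverge. The general shape to aim for is something like the sketch below (compute_loss and optimizer are placeholders for your own code); only host-side side effects should be rank-gated:

import torch_xla.core.xla_model as xm

loss = compute_loss(batch)        # placeholder; tensor ops must run on every core
loss.backward()
xm.optimizer_step(optimizer)      # also on every core

# Host-side side effects (logging, saving artifacts, ...) can be gated:
if xm.is_master_ordinal():
    print('finished step')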
RuskinManku commented 4 years ago

@dlibenzi I'm facing the same issue in Colab but with a different function.

File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 3958, in multi_head_attention_forward
    if torch.equal(query, key) and torch.equal(key, value):
RuntimeError: Not found: From /job:tpu_worker/replica:0/task:0:
Could not find registered platform with name: "Jellyfish"
     [[{{node XRTCompile}}]]

The code works completely fine on GPUs (I use .to(device) everywhere and only changed the device when switching from GPU to TPU), so I don't think there's an error in the code. I have tried all three version options in Colab: "1.5", "20200325", and "nightly", and the same error persists in all of them.

CookiePPP commented 4 years ago

@RuskinManku I did the same thing: converted all .cuda() calls to .to(xm.xla_device()) and used Google Colab TPUs to test the code. The GPU version of the code has been running fine for the past few weeks. :man_shrugging:

I stopped looking into my code quite quickly (it's a mess at the moment and I'm not in a rush to use TPUs). If you figure something out I'd love to know and I'll test out anything you find on my side as well.


edit: I also used xm.optimizer_step(optimizer, barrier=True) for the optimizer step, as shown in the docs.
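My loop is roughly the shape of the sketch below (simplified; criterion is a stand-in for the real loss, not the actual train.py):

import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = model.to(device)                      # instead of model.cuda()

for mel, audio, speaker_ids in train_loader:
    mel = mel.to(device)
    audio = audio.to(device)
    speaker_ids = speaker_ids.to(device)

    optimizer.zero_grad()
    outputs = model(mel, audio, speaker_ids)
    loss = criterion(outputs)                 # stand-in for the real loss
    loss.backward()
    # barrier=True makes each step execute the pending XLA graph
    # (needed when not using ParallelLoader)
    xm.optimizer_step(optimizer, barrier=True)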

RuskinManku commented 4 years ago

@CookiePPP @dlibenzi Are you switching to TensorFlow V1 (1.15.2) using the %tensorflow_version 1.x call? I was doing that, but now I'm working with the default TensorFlow version (2.2.0) and something weird is happening.

I was running a TransformerEncoder module, which wraps TransformerEncoderLayer, which in turn calls MultiheadAttention and then multi_head_attention_forward in functional.py, and that's where the error was occurring. Now that I don't switch the version, no error shows up, but the code doesn't move forward either; it just keeps running without throwing anything. I think it's getting stuck somewhere. Maybe it's just too slow, but I don't think that's the case, because it hasn't finished a single forward pass of batch size 11 in the past 45 minutes, while a P100 GPU takes less than a second. Since I'm calling the TransformerEncoder module directly, I can't tell exactly where inside it it gets stuck (or spends the time), but it's definitely after the call into TransformerEncoder.

I'm not explicitly using any TensorFlow in this code, so I guess it comes down to how XLA interacts with TensorFlow in the backend. Needless to say, I tried this a couple of times with exactly the same code: version 1.x throws the error, and 2.x gets stuck (or it's working but is just incredibly slow on the TPU).

CookiePPP commented 4 years ago

@RuskinManku

Are you switching to TensorFlow V1 (1.15.2) using the %tensorflow_version 1.x call?

Yep. I just removed %tensorflow_version 1.x and the code is running correctly (though running 5 times slower than an RTX 2080 Ti seems off to me).


I'm not familiar with TPUs. Is each xm.xla_device() meant to be fast, or are TPUs useful because you can distribute across hundreds? Or is my code still messed up somewhere?
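(From skimming the docs, it looks like the intended pattern is one process per core via xmp.spawn plus ParallelLoader, roughly like the sketch below, rather than a single xm.xla_device(). I haven't tried it yet; build_model, build_optimizer, build_loader and train_step are stand-ins for my own code.)

import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()
    model = build_model().to(device)          # stand-in
    optimizer = build_optimizer(model)        # stand-in
    loader = build_loader()                   # stand-in for the DataLoader
    para_loader = pl.ParallelLoader(loader, [device])
    for batch in para_loader.per_device_loader(device):
        loss = train_step(model, batch)       # stand-in; forward + backward
        xm.optimizer_step(optimizer)          # no barrier needed with ParallelLoader

if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='fork')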

RuskinManku commented 4 years ago

@CookiePPP I'm glad it worked for you. I guess it was just some dependency thing that works with TF 2.x and not 1.x. The code is working for me too, but it's just incredibly slow, which defeats the purpose of using a TPU, though I didn't expect huge performance gains in the first place. I don't know much about TPUs either, but they work well with large batch sizes and CNNs. Maybe try a big batch size and you might get performance comparable to the RTX 2080 Ti.

edit: I'll increase my batch size too and see if it works better, because right now it's just too slow.

CookiePPP commented 4 years ago

@RuskinManku During training, WaveFlow is almost entirely made up of large conv operations. I can increase the batch size, but I'd need gradient checkpointing, and at that point performance will probably only get worse.
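(By gradient checkpointing I mean the usual torch.utils.checkpoint trick; a toy sketch, not the actual model:)

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Recompute a block's activations during backward instead of storing them,
# trading compute for memory so a larger batch fits.
block = nn.Sequential(nn.Conv1d(256, 256, 3, padding=1), nn.ReLU())
x = torch.randn(8, 256, 1024, requires_grad=True)
y = checkpoint(block, x)      # same result as block(x), lower activation memory
y.sum().backward()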

The only thing I can think of is that the dilations on the conv2d layers are slowing things down, or maybe Google Colab TPUs are an older variant?

I've still got my TensorFlow Research Cloud project, so I should be able to use some v3 TPUs if I can figure out GCP again. (Don't wait on me; GCP confuses the hell out of me.)

dlibenzi commented 4 years ago

I have gotten a bit lost in this thread, and I am failing to understand what you are trying to do with TensorFlow (you should not mix it with PyTorch/XLA).

In general, you cannot expect someone to dig through thousands of lines of code for a repro. The Colab repro has a lot of input preprocessing stuff, which just gets in the way of understanding the issue. It mounts Drive folders, etc. We just do not have the bandwidth to debug and propose changes on stuff like that.

CookiePPP commented 4 years ago

I am failing to understand what you are trying to do with TensorFlow

%tensorflow_version 1.x is just leftover from when I was using the old hparams module. https://github.com/NVIDIA/tacotron2/issues/278

It is no longer needed, so I'll leave TensorFlow in whatever version Colab defaults to.


The Colab repro has a lot of input preprocessing stuff, which just messes up the understanding of the issue. It mount Drive folders, etc... We just do not have the bandwidth to debug and proposed changes on stuff like that.

Yes, sorry about that. The training code has been on my local machine for a good few months; I'm not really set up for running online, so everything in the notebook is still under-construction code. (This is why the title of the issue is "Help with Debug basics?": I don't think it's reasonable to make you look through this mess, so I just asked how I can do the debugging myself.)

jysohn23 commented 4 years ago

Regarding the %tensorflow_version magic: you don't want to call it in pytorch/xla notebooks, since the env-setup.py script already sets up the proper wheel and TPU versions. If you call that magic in your notebook, it will override that setup and can cause a version mismatch.