pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Runs on GPU, error on TPU: Computation requires more parameters (546) than supported (limit 236) #1963

Closed hrbigelow closed 4 years ago

hrbigelow commented 4 years ago

❓ Questions and Help

Hi all,

Could anyone give a clue as to what might be going wrong? I have run this commit, from this colab, which produced this output: debug run

Some lines from it are:

Exception in device=TPU:0: Invalid argument: From /job:tpu_worker/replica:0/task:0:
Computation requires more parameters (546) than supported (limit 236).
         [[{{node XRTCompile}}]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "ae-wavenet/train.py", line 56, in _mp_fn
    m.train(index)
  File "/content/ae-wavenet/chassis.py", line 127, in train
    loss = self.optim_step_fn()
  File "/content/ae-wavenet/chassis.py", line 95, in <lambda>
    optimizer_args={'closure': self.loss_fn}))
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 538, in optimizer_step
    loss = optimizer.step(**optimizer_args)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/adam.py", line 62, in step
    loss = closure()
  File "/content/ae-wavenet/chassis.py", line 178, in loss_fn
    self.run_batch()
  File "/content/ae-wavenet/chassis.py", line 170, in run_batch
    batch = next(self.data_iter)
  File "/content/ae-wavenet/chassis.py", line 34, in __next__
    vb = self.per_dev_loader.__next__()[0]
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 31, in __next__
    return self.next()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 34, in next
    xm.mark_step()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 477, in mark_step
    wait=xu.getenv_as('XLA_SYNC_WAIT', bool, False))
RuntimeError: Invalid argument: From /job:tpu_worker/replica:0/task:0:
Computation requires more parameters (546) than supported (limit 236).
         [[{{node XRTCompile}}]]
Writing run results to /tmp/debug_run-eef90b0a0f8e-root-0
XLA Environment:
  XRT_TPU_CONFIG=tpu_worker;0;10.74.90.234:8470
  TF_FORCE_GPU_ALLOW_GROWTH=true
  XLA_IR_DEBUG=1
  XLA_HLO_DEBUG=1
  TF_CPP_LOG_THREAD_ID=1
  TF_CPP_VMODULE=tensor=5,computation_client=5,xrt_computation_client=5,aten_xla_type=1
  XLA_SAVE_TENSORS_FILE=/tmp/debug_run-eef90b0a0f8e-root-0/graphs
  XLA_METRICS_FILE=/tmp/debug_run-eef90b0a0f8e-root-0/metrics

The same code has run successfully on my GTX1070 Max-Q laptop environment with PyTorch version 1.3.1

I've never seen this error before (but it has been several months since I last used torch_xla).

Thanks in advance!

dlibenzi commented 4 years ago

@jysohn23 this keeps cropping up. It seems like continuations are enabled in the next-gen executor. Did we ever find the cause?

jysohn23 commented 4 years ago

Hey @hrbigelow, I don't have access to the datasets you're using from your colab gdrive, but it looks like you're using an old colab template. Use the following in your first setup cell instead, which should get you onto our new runtime:

VERSION = "20200325"  #@param ["1.5" , "20200325", "nightly"]
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION

And yes @dlibenzi, similar to that case the command to update the runtime was targeting the old runtime, but as long as this setup script is used it should correctly update to the new runtime.
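
Once that setup cell has run, a minimal sanity check along these lines (generic torch_xla calls, nothing specific to your repo) should confirm that the new runtime is active and a TPU device can be acquired:

import torch
import torch_xla
import torch_xla.core.xla_model as xm

print(torch_xla.__version__)             # should report the build requested by VERSION
device = xm.xla_device()                 # acquires a TPU core via XRT_TPU_CONFIG
print(torch.ones(2, 2, device=device))   # forces a tiny end-to-end XLA computation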

hrbigelow commented 4 years ago

Hi @jysohn23 @dlibenzi

Thanks very much - it works with that new preamble.

By the way, for future reference: if I wanted to make the colab runnable for you, what else would I need to do? One issue is that I have to re-mount my gdrive each time I reconnect, so I'm not sure that part would be reproducible for you. Is there a better place to host data files for use with Colab, so that others can run it?

Thanks again,

Henry

jysohn23 commented 4 years ago

I think if you could reproduce the error with some fake data generator that'd be ideal; if not, putting the data up temporarily somewhere like a GCS bucket would work for us too. Davide may have other opinions.
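
Something along these lines would do, for example (a rough sketch; FakeWavDataset and the tensor shapes are placeholders, not taken from your ae-wavenet code):

import torch
from torch.utils.data import Dataset, DataLoader
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

class FakeWavDataset(Dataset):
    # Random tensors shaped like the real batches, so no gdrive mount is needed.
    def __init__(self, n_items=1024, wav_len=16000):
        self.n_items, self.wav_len = n_items, wav_len

    def __len__(self):
        return self.n_items

    def __getitem__(self, idx):
        return torch.randn(self.wav_len), torch.randint(0, 256, (self.wav_len,))

def make_loader(batch_size=4):
    loader = DataLoader(FakeWavDataset(), batch_size=batch_size, drop_last=True)
    device = xm.xla_device()
    # same per-device iteration pattern the traceback above goes through
    return pl.ParallelLoader(loader, [device]).per_device_loader(device)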

hrbigelow commented 4 years ago

Ahh good idea. And thanks for the sleek preamble, much cleaner.

vihari commented 4 years ago

I was running on GCP with a Jupyter notebook and hit the exact same problem.
In my case, it turned out to be because I had not set the TPU software version to pytorch-1.6. The mismatch between my conda env and the TPU software version caused this cryptic error.
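
A quick, illustrative way to catch this kind of mismatch up front (the TPU software version itself is chosen when the TPU node is created, e.g. in the Cloud Console):

import torch
import torch_xla

# both should correspond to the TPU software version selected for the node
# (pytorch-1.6 in my case)
print("torch:", torch.__version__)
print("torch_xla:", torch_xla.__version__)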

qsh-zh commented 3 years ago

I encountered a similar problem when running a simple CIFAR classification task; it raises the error after about 2000 iterations:

Exception in device=TPU:7: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
  (0) Invalid argument: Computation requires more parameters (6096) than supported (limit 3306).
     [[{{node XRTCompile}}]]
  (1) Invalid argument: Computation requires more parameters (6096) than supported (limit 3306).
     [[{{node XRTCompile}}]]
     [[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/content/jam/jamtorch/xla/utils.py", line 78, in new_fn
    value = func(config)
  File "/content/jam/example/jamtorch/tpuddp/main.py", line 32, in run
    trainer.train()
  File "/content/jam/jamtorch/trainer/genetic_trainer.py", line 212, in train
    self.train_step(batch)
  File "/content/jam/jamtorch/trainer/genetic_trainer.py", line 240, in train_step
    if self.loss_backward(loss):
  File "/content/jam/jamtorch/trainer/genetic_trainer.py", line 258, in loss_backward
    loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
  (0) Invalid argument: Computation requires more parameters (6096) than supported (limit 3306).
     [[{{node XRTCompile}}]]
  (1) Invalid argument: Computation requires more parameters (6096) than supported (limit 3306).
     [[{{node XRTCompile}}]]
     [[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.

I used the environment provided by the "PyTorch on Cloud TPUs: MultiCore Training AlexNet on Fashion MNIST" colab. The error persists after I switch to a gcloud TPU.

JackCaoG commented 3 years ago

Yeah, it is a hard limit on the number of parameters right now. I will work on a change on the pt/xla side to pass the parameters as a tuple, which should solve this error.
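
As a rough way to see how many device buffers a model contributes to the compiled computation, counting its parameter tensors gives an illustrative lower bound (using torchvision's AlexNet as a stand-in; the exact XRT count also includes gradients, optimizer state, inputs and any other live XLA tensors):

import torchvision.models as models

model = models.alexnet()
n_weights = sum(1 for _ in model.parameters())
print("weight tensors:", n_weights)
# with Adam, each weight typically also contributes its gradient plus two state
# tensors (exp_avg, exp_avg_sq) as separate device buffers
print("rough estimate with Adam state:", 4 * n_weights)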

rwightman commented 2 years ago

@JackCaoG I just ran into this error with the same limit as the last poster (limit 3306)... I'm working with a larger model (but not an unreasonably large one) that should fit in memory fine. I was actually training it on a TPU v4 while I had alpha access and it was going well, but I cannot get it working on a v3 TPU-VM to finish training. Using PyTorch XLA 1.10.

JackCaoG commented 2 years ago

I dropped this project to work on something else last year. @rwightman My guess is that on v4 we have larger smem, hence more parameters can fit. We are working on a runtime migration which will solve this issue permanently; the target timeline is the end of June. We could potentially fix this issue on the existing runtime (my estimate is it would take ~2 weeks), but that work would likely be wasted in ~6 months when we do the switch. How urgent is your use case?

rwightman commented 2 years ago

@JackCaoG I'm running through some larger candidate vision models for medium-to-large scale CLIP / LiT / etc. image-text model pretraining. I hope to include a script with working hparams for reproducing such training on TPU, GPU, (maybe IPU) with PyTorch... so the models are fairly large, and I hope to go a bit larger still, but so far I've kept within what I thought would be reasonable to test on a single v3-8. I can resume training this one on a 4x GPU machine, so there's no urgency there.

Once I sort out the rest of the setup and get further along with the runs on the larger dataset, I will likely run into this limit. I'm not sure how long all that will take, but I can probably work around this for a bit. It does appear that it would be easy to hit in any scenario where pod use is needed (models too large to fit decent batch sizes on a single accelerator), so I'm surprised more people haven't hit it.

JackCaoG commented 2 years ago

Sounds good, I will keep you updated regarding this issue.