hrbigelow closed this issue 4 years ago
@jysohn23 this stuff keeps cropping up. Seems like continuations are enabled in the next-gen executor. Did we ever find the cause?
Hey @hrbigelow I don't have access to the datasets you're using on your colab gdrive, but it looks like you're using an old colab template. Use this in your first setup cell instead, which should lead you to our new runtime:
VERSION = "20200325" #@param ["1.5", "20200325", "nightly"]
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION
And yes @dlibenzi, similar to this case, the command to update the runtime was targeting the old runtime, but as long as we use this setup script it should correctly update to the new runtime.
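As a quick sanity check that the new runtime came up after running the setup cell above, something like the following should work (a minimal sketch; it only assumes torch_xla was installed by env-setup.py):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()              # acquire a TPU core as an XLA device
t = torch.randn(2, 2, device=device)  # allocate a small tensor on the TPU
print(t.device)                       # expect something like "xla:1"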
Hi @jysohn23 @dlibenzi
Thanks very much - it works with that new preamble.
So by the way for future reference, if I wanted to make the colab runnable for you, what else would I need to do? One thing is, I have to re-mount my gdrive each time I reconnect, so I'm not sure if that part would be reproducible for you. Is there a better place to host and store data files for use with Colab, so that I can allow others to run it?
Thanks again,
Henry
I think if you could reproduce the error with some fake data generator that'd be ideal; if not, putting it up temporarily somewhere like a GCS bucket would work for us too. Davide may have some other opinions.
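For reference, staging the data in a GCS bucket can be as simple as the following shell commands (the bucket name, region, and file name are placeholders, not from this thread):

gsutil mb -l us-central1 gs://my-xla-debug-data     # create the bucket
gsutil cp dataset.tar.gz gs://my-xla-debug-data/    # upload the data
gsutil ls gs://my-xla-debug-data/                   # verify the upload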
Ahh good idea. And thanks for the sleek preamble, much cleaner.
I was running on GCP with a Jupyter Notebook and faced the exact same problem.
In my case, it turns out it was because I did not set the TPU software version to pytorch-1.6. The mismatch between my conda env and the TPU software version caused this cryptic error.
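For anyone hitting the same mismatch: the TPU software version is chosen when the TPU node is created, so it can be pinned to match the conda environment. A sketch, with the node name, zone, and accelerator type as placeholders:

gcloud compute tpus create my-tpu-node \
    --zone=us-central1-f \
    --accelerator-type=v3-8 \
    --version=pytorch-1.6    # must match the torch/torch_xla version in the conda env
gcloud compute tpus list --zone=us-central1-f    # confirm the version column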
I encountered a similar problem when running a simple CIFAR classification task; it raises the error after about 2000 iterations:
Exception in device=TPU:7: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
(0) Invalid argument: Computation requires more parameters (6096) than supported (limit 3306).
[[{{node XRTCompile}}]]
(1) Invalid argument: Computation requires more parameters (6096) than supported (limit 3306).
[[{{node XRTCompile}}]]
[[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
fn(gindex, *args)
File "/content/jam/jamtorch/xla/utils.py", line 78, in new_fn
value = func(config)
File "/content/jam/example/jamtorch/tpuddp/main.py", line 32, in run
trainer.train()
File "/content/jam/jamtorch/trainer/genetic_trainer.py", line 212, in train
self.train_step(batch)
File "/content/jam/jamtorch/trainer/genetic_trainer.py", line 240, in train_step
if self.loss_backward(loss):
File "/content/jam/jamtorch/trainer/genetic_trainer.py", line 258, in loss_backward
loss.backward()
File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
(0) Invalid argument: Computation requires more parameters (6096) than supported (limit 3306).
[[{{node XRTCompile}}]]
(1) Invalid argument: Computation requires more parameters (6096) than supported (limit 3306).
[[{{node XRTCompile}}]]
[[XRTCompile_G3]]
0 successful operations.
0 derived errors ignored.
I used the environment provided by the "PyTorch on Cloud TPUs: MultiCore Training AlexNet on Fashion MNIST" colab. The error persists after I switch to a gcloud TPU.
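For context, the traceback above goes through the standard xmp.spawn multi-core entry point. A minimal sketch of that pattern (placeholder model and data, not the poster's jamtorch code):

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()
    model = torch.nn.Linear(3 * 32 * 32, 10).to(device)          # stand-in model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for step in range(10):
        data = torch.randn(8, 3 * 32 * 32, device=device)        # fake batch
        target = torch.randint(0, 10, (8,), device=device)
        loss = torch.nn.functional.cross_entropy(model(data), target)
        optimizer.zero_grad()
        loss.backward()
        xm.optimizer_step(optimizer)  # all-reduce gradients and step across cores

if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='fork')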
Yeah, it is a hard limit on the number of parameters right now. I will work on a change on the pt/xla side to pass the parameters as a tuple, which should solve this error.
@JackCaoG I just ran into this error with the same limit as the last poster (limit 3306)... working with a larger model (but not unreasonably large) that should fit in memory fine. I was actually training it on a TPU v4 while I had alpha access and it was going well, but I cannot get it working on a v3 TPU-VM to finish training. Using PyTorch XLA 1.10.
I dropped this project to work on something else last year.
@rwightman My guess is that on v4 we have larger smem, hence more parameters can fit. We are working on a runtime migration which will solve this issue permanently; the target timeline is the end of June. We could potentially fix this issue on the existing runtime (my estimate is it would take ~2 weeks), but that work would likely be wasted in ~6 months when we do the switch. How urgent is your use case?
@JackCaoG I'm running through some larger candidate vision models for medium-to-large scale CLIP / LiT / etc. image-text model pretraining. I hope to include a script with working hparams for reproducing such training on TPU, GPU, (maybe IPU) with PyTorch... so the models are fairly large, and I hope to go a bit larger still... but so far I've kept within what I thought would be reasonable to test on a single v3-8. I can resume training this one on a 4x GPU machine, so no urgency there.
Once I sort out the rest of the setup and get further along with runs on a larger dataset, I will likely run into this limit. Not sure how long all that will take, but I can probably work around this for a bit. It does appear that it'd be easy to hit in any scenario where pod use is needed (models too large to fit decent batch sizes on a single accelerator), so I'm surprised more people haven't hit it.
Sounds good, I will keep you updated regarding this issue.
❓ Questions and Help
Hi all,
Could anyone give a clue about what might be going wrong? I have run this commit, from this colab, which has produced this output: debug run.
Some lines from it are:
The same code has run successfully in my GTX 1070 Max-Q laptop environment with PyTorch version 1.3.1.
I've never seen this error before (but it has been several months since I last used torch_xla).
Thanks in advance!