tensorflow / swift-models

Models and examples built with Swift for TensorFlow

Transformer model requires more parameters than supported on TPU #638

Open BradLarson opened 4 years ago

BradLarson commented 4 years ago

Wojtek Czarnowski pointed out that in specific cases the Transformer model (or components used within it) can trigger a compilation error in X10 on TPU:

2020-07-16 22:51:03.077357: F tensorflow/compiler/xla/xla_client/xla_util.cc:90] Invalid argument: From /job:tpu_worker/replica:0/task:0:
Computation requires more parameters (333) than supported (limit 237).
     [[{{node XRTCompile}}]]
Current stack trace:
    frame #17: 0x00007f6da8c0ceb2 $__lldb_expr102`partial apply for closure #1 in update(model:using:for:) at <Cell 14>:12:9
    frame #23: 0x00007f6da8c0c268 $__lldb_expr102`update(model=<unavailable>, optimizer=<unavailable>, batch=<unavailable>) at <Cell 14>:4:18
    frame #24: 0x00007f6d5000a483 $__lldb_expr132`closure #1 in  at <Cell 19>:20:31
    frame #25: 0x00007f6da48245b7 libjupyterInstalledPackages.so`time(repeating=1, f=0x00007f6d50009230 $__lldb_expr132`closure #1 () -> () in __lldb_expr_131 at <Cell 19>:4) at timing.swift:15:9 [opt]
    frame #26: 0x00007f6d5000914b $__lldb_expr132`main at <Cell 19>:4:1

He provided a reproducer notebook which can be opened and run in Colab. Choosing a GPU-backed instance lets this succeed, but running this notebook with a TPU-backed instance triggers the above crash.

texasmichelle commented 4 years ago

I ran into this running WordSeg on TPU as well:

Attempting to fetch value instead of handling error Invalid argument: Computation requires more parameters (5370) than supported (limit 4035).

brettkoonce commented 4 years ago

Starting training...
2020-10-05 23:42:13.702416: F tensorflow/compiler/xla/xla_client/xla_util.cc:90] Invalid argument: From /job:tpu_worker/replica:0/task:0:
2 root error(s) found.
  (0) Invalid argument: Computation requires more parameters (548) than supported (limit 236).
         [[{{node XRTCompile}}]]
  (1) Invalid argument: Computation requires more parameters (548) than supported (limit 236).
         [[{{node XRTCompile}}]]
         [[XRTCompile_G20]]
0 successful operations.
0 derived errors ignored.
Aborted (core dumped)

I can reproduce this by using TensorFlow 2.3.1 as the base for the TPU. Moving to a nightly build makes the crash go away:

[Image: Screen Shot 2020-10-05 at 6 54 01 PM]

I would suggest that bumping the base image Colab uses should fix this.

See also https://github.com/pytorch/xla/issues/1963