pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Computation requires more parameters (3311) than supported (limit 3305). #3453

Open ikergarcia1996 opened 2 years ago

ikergarcia1996 commented 2 years ago

🐛 Bug

I am trying to train a model from this repository: https://github.com/ikergarcia1996/Self-Driving-Car-in-Video-Games using a TPU v3-8 VM. Even when I train a tiny 9M-parameter model with a small batch size, I get the following error:

2022-03-28 11:57:16.078552: E tensorflow/core/tpu/kernels/tpu_compilation_cache_external.cc:113] Computation requires more parameters (3311) than supported (limit 3305).

The error always seems to happen at step 9. The model works as expected when running on a GPU/CPU.

I found that other people have also hit this issue with large models (#1963), but in my case it happens with models of any size.

To Reproduce

I am trying to train this model: https://github.com/ikergarcia1996/Self-Driving-Car-in-Video-Games/blob/master/model.py#L778 from this repository: https://github.com/ikergarcia1996/Self-Driving-Car-in-Video-Games

Here is a small Colab notebook to reproduce the issue. No TPUs are available in Colab right now, so I cannot test whether I get the same error there. I am using a TPU v3-8 VM: https://colab.research.google.com/drive/1nVbJooUMvMMc8V9F6ioqrYkhyuiveN1i?usp=sharing

Expected behavior

The model works fine when training with a GPU/CPU

Environment

tensorflow 2.9.0 (tf-nightly; with the stable release I get a weird error: "DefaultDeviceShapeRepresentation not available in this library")
torch 1.11.0
torch-xla 1.11
pytorch-lightning==1.6.0rc1 (installed from source; wandb crashes with the stable release)
cloud-tpu-client==0.10

TPU

image

Full environment

absl-py==1.0.0
aiohttp==3.8.1
aiosignal==1.2.0
astunparse==1.6.3
async-timeout==4.0.2
attrs==19.3.0
Automat==0.8.0
blinker==1.4
cachetools==5.0.0
certifi==2021.10.8
chardet==3.0.4
charset-normalizer==2.0.12
Click==7.0
cloud-init==22.1
cloud-tpu-client==0.10
colorama==0.4.3
command-not-found==0.3
configobj==5.0.6
constantly==15.1.0
cryptography==2.8
Cython==0.29.14
dbus-python==1.2.16
distlib==0.3.4
distro==1.4.0
distro-info===0.23ubuntu1
docker-pycreds==0.4.0
entrypoints==0.3
filelock==3.6.0
flatbuffers==1.12
frozenlist==1.3.0
fsspec==2022.2.0
future==0.18.2
gast==0.4.0
gitdb==4.0.9
GitPython==3.1.27
google-api-core==2.7.1
google-api-python-client==2.42.0
google-auth==2.6.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
googleapis-common-protos==1.55.0
grpcio==1.44.0
h5py==3.6.0
httplib2==0.20.4
hyperlink==19.0.0
idna==3.3
imageio==2.16.1
importlib-metadata==4.11.3
incremental==16.10.1
intel-openmp==2022.0.2
Jinja2==2.10.1
jsonpatch==1.22
jsonpointer==2.0
jsonschema==3.2.0
keras==2.8.0
Keras-Applications==1.0.8
keras-nightly==2.9.0.dev2022032707
Keras-Preprocessing==1.1.2
keyring==18.0.1
language-selector==0.1
launchpadlib==1.10.13
lazr.restfulclient==0.14.2
lazr.uri==1.0.3
libclang==13.0.0
libtpu-nightly==0.1.dev20220303
Markdown==3.3.6
MarkupSafe==1.1.0
mkl==2022.0.2
mkl-include==2022.0.2
mock==4.0.3
more-itertools==4.2.0
multidict==6.0.2
netifaces==0.10.4
networkx==2.7.1
numpy==1.22.3
oauth2client==4.1.3
oauthlib==3.1.0
opencv-python==4.5.5.64
opt-einsum==3.3.0
packaging==21.3
pathtools==0.1.2
pbr==5.8.1
pexpect==4.6.0
Pillow==9.0.1
platformdirs==2.5.1
promise==2.3
protobuf==3.19.4
psutil==5.9.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyDeprecate==0.3.1
PyGObject==3.36.0
PyHamcrest==1.9.0
PyJWT==1.7.1
pymacaroons==0.13.0
PyNaCl==1.3.0
pyOpenSSL==19.0.0
pyparsing==3.0.7
pyrsistent==0.15.5
pyserial==3.4
python-apt==2.0.0+ubuntu0.20.4.7
python-dateutil==2.8.2
python-debian===0.1.36ubuntu1
pytorch-lightning==1.6.0rc1
pytz==2021.3
PyWavelets==1.3.0
PyYAML==5.4.1
requests==2.27.1
requests-oauthlib==1.3.1
requests-unixsocket==0.2.0
rsa==4.8
scikit-image==0.19.2
scipy==1.8.0
SecretStorage==2.3.1
sentry-sdk==1.5.8
service-identity==18.1.0
setproctitle==1.2.2
shortuuid==1.0.8
simplejson==3.16.0
six==1.16.0
smmap==5.0.0
sos==4.3
ssh-import-id==5.10
systemd-python==234
tabulate==0.8.9
tb-nightly==2.9.0a20220326
tbb==2021.5.1
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.9.0
tensorflow-estimator==2.8.0
tensorflow-io-gcs-filesystem==0.24.0
termcolor==1.1.0
testresources==2.0.1
tf-estimator-nightly==2.9.0.dev2022032708
tf-nightly==2.9.0.dev20220327
tifffile==2022.3.25
torch==1.11.0
torch-xla==1.11
torchmetrics==0.7.3
torchvision==0.12.0
tqdm==4.63.1
Twisted==18.9.0
typing-extensions==4.1.1
ubuntu-advantage-tools==27.6
ufw==0.36
unattended-upgrades==0.1
uritemplate==3.0.1
urllib3==1.26.8
virtualenv==20.13.3
wadllib==1.3.3
wandb==0.12.11
Werkzeug==2.0.3
wrapt==1.14.0
yarl==1.7.2
yaspin==2.1.0
zipp==1.0.0
zope.interface==4.7.1

Additional context

Full Traceback

GPU available: False, used: False
TPU available: True, using: 8 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2022-03-28 11:51:19.576529: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:51:19.576621: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-03-28 11:51:42.822620: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:51:42.822684: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-03-28 11:51:44.296726: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:51:44.296816: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-03-28 11:51:45.496880: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:51:45.496949: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-03-28 11:51:47.001208: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:51:47.001272: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-03-28 11:51:47.413516: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:51:47.413576: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-03-28 11:51:47.849663: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:51:47.849722: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-03-28 11:51:49.076428: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:51:49.076495: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
Total training samples: 1270669.
Total training samples: 1270669.
Total validation samples: 4038.
Total validation samples: 4038.
Total training samples: 1270669.
Total training samples: 1270669.
Total training samples: 1270669.
Total training samples: 1270669.
Total validation samples: 4038.
Total validation samples: 4038.
Total validation samples: 4038.
Total validation samples: 4038.
Total training samples: 1270669.
Total validation samples: 4038.
Total training samples: 1270669.
Total validation samples: 4038.

   | Name                         | Type                | Params
----------------------------------------------------------------------
0  | model                        | TEDD1104Transformer | 9.3 M
1  | train_accuracy               | Accuracy            | 0
2  | test_accuracy_k1_macro       | Accuracy            | 0
3  | test_accuracy_k3_micro       | Accuracy            | 0
4  | validation_accuracy_k1_micro | Accuracy            | 0
5  | validation_accuracy_k3_micro | Accuracy            | 0
6  | validation_accuracy_k1_macro | Accuracy            | 0
7  | validation_accuracy_k3_macro | Accuracy            | 0
8  | test_accuracy_k1_micro       | Accuracy            | 0
9  | test_accuracy_k3_macro       | Accuracy            | 0
10 | validation_distance          | MeanSquaredError    | 0
11 | criterion                    | WeightedMseLoss     | 0
12 | Controller2Keyboard          | Controller2Keyboard | 0
----------------------------------------------------------------------
9.3 M     Trainable params
0         Non-trainable params
9.3 M     Total params
37.374    Total estimated model params size (MB)
Epoch 0:   0%|          | 8/39963 [04:34<380:50:43, 34.31s/it, loss=0.43, v_num=base]
2022-03-28 11:57:16.078552: E tensorflow/core/tpu/kernels/tpu_compilation_cache_external.cc:113] Computation requires more parameters (3311) than supported (limit 3305).
2022-03-28 11:57:16.078644: F tensorflow/core/tpu/kernels/tpu_program_group.cc:86] Check failed: xla_tpu_programs.size() > 0 (0 vs. 0)
https://symbolize.stripped_domain/r/?trace=7f19fc95103b,7f19fc9510bf,7f18fed31bcf,7f18f93a6922,7f18f9364ebd,7f18f93b4db0,7f18f93b48ae,7f18f5216ed3,7f18fa8581b8,7f18fe7e38a0,7f18fe7e5633,7f18fecfacb1,7f18fecfa4e0,7f18fece28cb,7f19fc8f3608&map=b5462df73b9bb298b2bca5d2f02176eed80a2e90:7f18f08d2000-7f1901bc7e30 
*** SIGABRT received by PID 714187 (TID 714993) on cpu 24 from PID 714187; stack trace: ***
PC: @     0x7f19fc95103b  (unknown)  raise
    @     0x7f18efd34cda        992  (unknown)
    @     0x7f19fc9510c0       3968  (unknown)
    @     0x7f18fed31bd0         16  tensorflow::internal::LogMessageFatal::~LogMessageFatal()
    @     0x7f18f93a6923        592  tensorflow::tpu::TpuProgramGroup::Initialize()
    @     0x7f18f9364ebe       1488  tensorflow::tpu::TpuCompilationCacheExternal::InitializeEntry()
    @     0x7f18f93b4db1        800  tensorflow::tpu::TpuCompilationCacheInterface::CompileIfKeyAbsentHelper()
    @     0x7f18f93b48af        496  tensorflow::tpu::TpuCompilationCacheInterface::CompileIfKeyAbsent()
    @     0x7f18f5216ed4        912  tensorflow::XRTCompileOp::Compute()
    @     0x7f18fa8581b9        432  tensorflow::XlaDevice::Compute()
    @     0x7f18fe7e38a1       2128  tensorflow::(anonymous namespace)::ExecutorState<>::Process()
    @     0x7f18fe7e5634         48  std::_Function_handler<>::_M_invoke()
    @     0x7f18fecfacb2        128  Eigen::ThreadPoolTempl<>::WorkerLoop()
    @     0x7f18fecfa4e1         48  tensorflow::thread::EigenEnvironment::CreateThread()::{lambda()#1}::operator()()  
    @     0x7f18fece28cc         80  tensorflow::(anonymous namespace)::PThread::ThreadFn()
    @     0x7f19fc8f3609  (unknown)  start_thread
https://symbolize.stripped_domain/r/?trace=7f19fc95103b,7f18efd34cd9,7f19fc9510bf,7f18fed31bcf,7f18f93a6922,7f18f9364ebd,7f18f93b4db0,7f18f93b48ae,7f18f5216ed3,7f18fa8581b8,7f18fe7e38a0,7f18fe7e5633,7f18fecfacb1,7f18fecfa4e0,7f18fece28cb,7f19fc8f3608&map=b5462df73b9bb298b2bca5d2f02176eed80a2e90:7f18f08d2000-7f1901bc7e30,50c831e765011c7eb7163b7f3cb5c4b6:7f18e158a000-7f18f00a2f00 
E0328 11:57:16.307495  714993 coredump_hook.cc:365] RAW: Remote crash data gathering hook invoked.
E0328 11:57:16.307515  714993 coredump_hook.cc:411] RAW: Skipping coredump since rlimit was 0 at process start.
E0328 11:57:16.307529  714993 client.cc:222] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0328 11:57:16.307539  714993 coredump_hook.cc:473] RAW: Sending fingerprint to remote end.
E0328 11:57:16.307550  714993 coredump_socket.cc:124] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0328 11:57:16.307559  714993 coredump_hook.cc:477] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0328 11:57:16.307565  714993 coredump_hook.cc:550] RAW: Discarding core.
E0328 11:57:16.732287  714993 process_state.cc:771] RAW: Raising signal 6 with default behavior
2022-03-28 11:57:17.327159: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1648468637.326935100","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.529565: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1648468642.529380655","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.529851: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1648468642.529622009","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.530091: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1648468642.529921516","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.574131: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1648468642.573979797","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.574730: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1648468642.574629478","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.574755: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1648468642.574701121","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.574797: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1648468642.574746230","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.574778: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1648468642.574625986","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.574723: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1648468642.574472508","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.574892: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1648468642.574855850","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f19fc9510bf,0&map=
*** SIGTERM received by PID 714282 (TID 714282) on cpu 63 from PID 711980; stack trace: ***
PC: @     0x7f19fc8fa376  (unknown)  pthread_cond_wait@@GLIBC_2.3.2
    @     0x7f18efd34cda        992  (unknown)
    @     0x7f19fc9510c0  (unknown)  (unknown)
    @                0x1  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f18efd34cd9,7f19fc9510bf,0&map=50c831e765011c7eb7163b7f3cb5c4b6:7f18e158a000-7f18f00a2f00
E0328 11:57:24.459454  714282 coredump_hook.cc:320] RAW: Remote crash gathering disabled for SIGTERM.
E0328 11:57:24.487866  714282 process_state.cc:771] RAW: Raising signal 15 with default behavior
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f19fc9510bf,0&map=
*** SIGTERM received by PID 714288 (TID 714288) on cpu 72 from PID 711980; stack trace: ***
PC: @     0x7f19fc8fa376  (unknown)  pthread_cond_wait@@GLIBC_2.3.2
    @     0x7f18efd34cda        992  (unknown)
    @     0x7f19fc9510c0  (unknown)  (unknown)
    @                0x1  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f18efd34cd9,7f19fc9510bf,0&map=50c831e765011c7eb7163b7f3cb5c4b6:7f18e158a000-7f18f00a2f00
E0328 11:57:24.659237  714288 coredump_hook.cc:320] RAW: Remote crash gathering disabled for SIGTERM.
E0328 11:57:24.687005  714288 process_state.cc:771] RAW: Raising signal 15 with default behavior
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f19fc9510bf,0&map=
*** SIGTERM received by PID 714292 (TID 714292) on cpu 80 from PID 711980; stack trace: ***
PC: @     0x7f19fc8fa376  (unknown)  pthread_cond_wait@@GLIBC_2.3.2
    @     0x7f18efd34cda        992  (unknown)
    @     0x7f19fc9510c0  (unknown)  (unknown)
    @                0x1  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f18efd34cd9,7f19fc9510bf,0&map=50c831e765011c7eb7163b7f3cb5c4b6:7f18e158a000-7f18f00a2f00
E0328 11:57:24.788845  714292 coredump_hook.cc:320] RAW: Remote crash gathering disabled for SIGTERM.
E0328 11:57:24.817159  714292 process_state.cc:771] RAW: Raising signal 15 with default behavior
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f19fc9510bf,0&map=
*** SIGTERM received by PID 714296 (TID 714296) on cpu 63 from PID 711980; stack trace: ***
PC: @     0x7f19fc8fa376  (unknown)  pthread_cond_wait@@GLIBC_2.3.2
    @     0x7f18efd34cda        992  (unknown)
    @     0x7f19fc9510c0  (unknown)  (unknown)
    @                0x1  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f18efd34cd9,7f19fc9510bf,0&map=50c831e765011c7eb7163b7f3cb5c4b6:7f18e158a000-7f18f00a2f00
E0328 11:57:24.933581  714296 coredump_hook.cc:320] RAW: Remote crash gathering disabled for SIGTERM.
E0328 11:57:24.962191  714296 process_state.cc:771] RAW: Raising signal 15 with default behavior
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f19fc9510bf,0&map=
*** SIGTERM received by PID 714300 (TID 714300) on cpu 87 from PID 711980; stack trace: ***
PC: @     0x7f19fc8fa376  (unknown)  pthread_cond_wait@@GLIBC_2.3.2
    @     0x7f18efd34cda        992  (unknown)
    @     0x7f19fc9510c0  (unknown)  (unknown)
    @                0x1  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f18efd34cd9,7f19fc9510bf,0&map=50c831e765011c7eb7163b7f3cb5c4b6:7f18e158a000-7f18f00a2f00
E0328 11:57:25.137587  714300 coredump_hook.cc:320] RAW: Remote crash gathering disabled for SIGTERM.
E0328 11:57:25.166103  714300 process_state.cc:771] RAW: Raising signal 15 with default behavior
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f19fc9510bf,0&map=
*** SIGTERM received by PID 714307 (TID 714307) on cpu 79 from PID 711980; stack trace: ***
PC: @     0x7f19fc8fa376  (unknown)  pthread_cond_wait@@GLIBC_2.3.2
    @     0x7f18efd34cda        992  (unknown)
    @     0x7f19fc9510c0  (unknown)  (unknown)
    @                0x1  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f18efd34cd9,7f19fc9510bf,0&map=50c831e765011c7eb7163b7f3cb5c4b6:7f18e158a000-7f18f00a2f00
E0328 11:57:25.288173  714307 coredump_hook.cc:320] RAW: Remote crash gathering disabled for SIGTERM.
E0328 11:57:25.316177  714307 process_state.cc:771] RAW: Raising signal 15 with default behavior
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f19fc9510bf,0&map=
*** SIGTERM received by PID 714311 (TID 714311) on cpu 26 from PID 711980; stack trace: ***
PC: @     0x7f19fc8fa376  (unknown)  pthread_cond_wait@@GLIBC_2.3.2
    @     0x7f18efd34cda        992  (unknown)
    @     0x7f19fc9510c0  (unknown)  (unknown)
    @                0x1  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f18efd34cd9,7f19fc9510bf,0&map=50c831e765011c7eb7163b7f3cb5c4b6:7f18e158a000-7f18f00a2f00
E0328 11:57:25.438707  714311 coredump_hook.cc:320] RAW: Remote crash gathering disabled for SIGTERM.
E0328 11:57:25.466978  714311 process_state.cc:771] RAW: Raising signal 15 with default behavior
2022-03-28 11:57:25.597429: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:57:25.597487: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
Traceback (most recent call last):
  File "train.py", line 655, in <module>
    train_new_model(
  File "train.py", line 238, in train_new_model
    train(
  File "train.py", line 107, in train
    trainer.fit(model, datamodule=data)
  File "/home/ikergarcia/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 773, in fit
    self._call_and_handle_interrupt(
  File "/home/ikergarcia/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 724, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/ikergarcia/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/xla_spawn.py", line 76, in launch
    xmp.spawn(
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 389, in spawn
    return torch.multiprocessing.start_processes(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT
Traceback (most recent call last):
  File "train.py", line 655, in <module>
    train_new_model(
  File "train.py", line 238, in train_new_model
    train(
  File "train.py", line 107, in train
    trainer.fit(model, datamodule=data)
  File "/home/ikergarcia/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 773, in fit
    self._call_and_handle_interrupt(
  File "/home/ikergarcia/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 724, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/ikergarcia/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/xla_spawn.py", line 76, in launch
    xmp.spawn(
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 389, in spawn
    return torch.multiprocessing.start_processes(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT
Traceback (most recent call last):
  File "train.py", line 655, in <module>
    train_new_model(
  File "train.py", line 238, in train_new_model
    train(
  File "train.py", line 107, in train
    trainer.fit(model, datamodule=data)
  File "/home/ikergarcia/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 773, in fit
    self._call_and_handle_interrupt(
  File "/home/ikergarcia/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 724, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/ikergarcia/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/xla_spawn.py", line 76, in launch
    xmp.spawn(
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 389, in spawn
    return torch.multiprocessing.start_processes(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT

shauheen commented 2 years ago

Thanks for filing this issue @ikergarcia1996

miladm commented 2 years ago

@thisisalbertliang can you please take a look at this issue?

JackCaoG commented 2 years ago

Hey @ikergarcia1996. This is a known problem and we are working on a long-term fix (hopefully to land by Q3). One workaround I can think of is to explicitly insert xm.mark_step() between steps.

My guess is that you are currently using parallel_loader (if you are not, you should, since it uploads the data asynchronously). parallel_loader calls mark_step for you to cut the graph every x batches and lets the XLA compiler compile and execute the resulting graph.
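To make the "cut the graph every x batches" idea concrete, here is a minimal sketch. It is a simplified stand-in for the behavior described above, not torch_xla's actual implementation: `cut_every` is a hypothetical helper, and `mark_step` is passed in as a plain callable (in real use it would be `xm.mark_step` from `torch_xla.core.xla_model`).

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def cut_every(batches: Iterable[T],
              mark_step: Callable[[], None],
              batches_per_execution: int = 1) -> Iterator[T]:
    """Yield batches and call mark_step() after every N consumed batches.

    Hypothetical helper for illustration only. The code after `yield`
    runs once the caller has finished working on the batch and asks for
    the next one, so the cut lands after that batch's computation.
    """
    if batches_per_execution < 1:
        raise ValueError("batches_per_execution must be >= 1")
    for count, batch in enumerate(batches, start=1):
        yield batch
        if count % batches_per_execution == 0:
            mark_step()  # ends the current graph; a new one starts afterwards
```

A smaller `batches_per_execution` means more frequent cuts and therefore smaller graphs, at the cost of more executions.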

The issue here is that the number of parameters exceeds the TPU smem limit. One option to avoid this is to cut the graph more frequently, which results in smaller graphs with fewer parameters. If batches_per_execution in your setting is already 1, you can consider adding xm.mark_step() between layers, or calling it after the forward pass so that forward and backward become two graphs instead of one.
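The "forward and backward as two graphs" variant could look roughly like the sketch below. It assumes a hand-written training step rather than the PyTorch Lightning loop used in the report, and `train_step` is a hypothetical function. `mark_step` is injected as a callable so the sketch stays self-contained; in real use you would pass `xm.mark_step` after `import torch_xla.core.xla_model as xm`.

```python
from typing import Any, Callable


def train_step(model: Callable[[Any], Any],
               loss_fn: Callable[[Any, Any], Any],
               optimizer: Any,
               inputs: Any,
               targets: Any,
               mark_step: Callable[[], None]) -> Any:
    """Run one training step split into two graphs (illustrative sketch)."""
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    mark_step()       # cut 1: everything traced so far (forward pass) runs as one graph
    loss.backward()
    optimizer.step()
    mark_step()       # cut 2: backward pass and optimizer update run as a second graph
    return loss
```

Whether this actually brings the parameter count under the limit depends on the model; the same pattern can be applied between layers if a single cut is not enough.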

JackCaoG commented 2 years ago

Another thing that is weird is that you run into this issue at step 9. This seems to indicate that the number of parameters increases as you train the model (since steps 1-8 ran without error). This is unexpected, since the graph should be the same for every step; otherwise it will keep recompiling. One quick way to check is to set the PT_XLA_DEBUG=1 env var, which prints some auto-debug messages. Please check out https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#perform-a-auto-metrics-analysis
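One way to turn the debug output on from Python is to launch the training script in a child process with the variable set. This is only a sketch: `xla_debug_env` and `run_with_xla_debug` are hypothetical helper names, and `train.py` is the script name taken from the traceback above.

```python
import os
import subprocess
import sys
from typing import Dict, Mapping, Optional


def xla_debug_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with PT_XLA_DEBUG=1 added."""
    env = dict(os.environ if base is None else base)
    env["PT_XLA_DEBUG"] = "1"
    return env


def run_with_xla_debug(script: str = "train.py") -> int:
    """Run the training script with PT_XLA_DEBUG=1 and return its exit code."""
    return subprocess.run([sys.executable, script], env=xla_debug_env()).returncode
```

Setting the variable in the launching shell has the same effect; the point is that it should be in place before the training process starts.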

tsuga commented 2 years ago

I have the same issue

ronghanghu commented 2 years ago

We also encountered this issue when trying to scale to large transformer models with FSDP in #3431:

Computation requires more parameters (4748) than supported (limit 3304).

Looking forward to the long-term fix on this!

JackCaoG commented 2 years ago

The fix (runtime migration) is WIP; I will update here when it is ready.

JackCaoG commented 2 years ago

FYI @will-cromar: PJRT has an option to tuplify the inputs during compilation and execution; we should (maybe optionally) use https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/pjrt/pjrt_client.h#L222 when we switch to PJRT.