pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Training is sooooooooo slow #1261

Closed Kyeongpil closed 4 years ago

Kyeongpil commented 4 years ago

Hello, I'm trying to train a transformer model based on PyTorch's nn.Transformer, using multiprocessing across 8 TPU cores.

However, the forward pass is extremely slow, even in the first iteration and with a batch size of 10. Each iteration takes a few minutes.

Here are the logs:

===============================================

2019-10-30 07:44:07.324500: I 29379 torch_xla/csrc/aten_xla_type.cpp:86] PyTorch GIT revision: c89340f06877024a5a81393db35637a10b10568f
2019-10-30 07:44:07.324563: I 29379 torch_xla/csrc/aten_xla_type.cpp:87] XLA GIT revision: 2ddfcffd3cf91e7105bae61f7defa4591cdcb900
2019-10-30 07:44:09.314788: I 29430 torch_xla/csrc/aten_xla_type.cpp:86] PyTorch GIT revision: c89340f06877024a5a81393db35637a10b10568f
2019-10-30 07:44:09.314850: I 29430 torch_xla/csrc/aten_xla_type.cpp:87] XLA GIT revision: 2ddfcffd3cf91e7105bae61f7defa4591cdcb900
2019-10-30 07:44:09.317559: I 29427 torch_xla/csrc/aten_xla_type.cpp:86] PyTorch GIT revision: c89340f06877024a5a81393db35637a10b10568f
2019-10-30 07:44:09.317618: I 29427 torch_xla/csrc/aten_xla_type.cpp:87] XLA GIT revision: 2ddfcffd3cf91e7105bae61f7defa4591cdcb900
2019-10-30 07:44:09.336030: I 29431 torch_xla/csrc/aten_xla_type.cpp:86] PyTorch GIT revision: c89340f06877024a5a81393db35637a10b10568f
2019-10-30 07:44:09.336093: I 29431 torch_xla/csrc/aten_xla_type.cpp:87] XLA GIT revision: 2ddfcffd3cf91e7105bae61f7defa4591cdcb900
2019-10-30 07:44:09.338445: I 29433 torch_xla/csrc/aten_xla_type.cpp:86] PyTorch GIT revision: c89340f06877024a5a81393db35637a10b10568f
2019-10-30 07:44:09.338496: I 29433 torch_xla/csrc/aten_xla_type.cpp:87] XLA GIT revision: 2ddfcffd3cf91e7105bae61f7defa4591cdcb900
2019-10-30 07:44:09.343114: I 29430 tensorflow/compiler/xla/xla_client/mesh_service.cc:168] Waiting to connect to client mesh master (300 seconds) localhost:43255
2019-10-30 07:44:09.347678: I 29427 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (LOCAL) CPU:0 -> /job:tpu_worker/replica:0/task:0/device:XLA_CPU:0
2019-10-30 07:44:09.347753: I 29427 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (LOCAL) TPU:0 -> /job:tpu_worker/replica:0/task:0/device:TPU:0
2019-10-30 07:44:09.347762: I 29427 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:1 -> /job:tpu_worker/replica:0/task:0/device:TPU:1
2019-10-30 07:44:09.347769: I 29427 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:2 -> /job:tpu_worker/replica:0/task:0/device:TPU:2
2019-10-30 07:44:09.347776: I 29427 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:3 -> /job:tpu_worker/replica:0/task:0/device:TPU:3
2019-10-30 07:44:09.347783: I 29427 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:4 -> /job:tpu_worker/replica:0/task:0/device:TPU:4
2019-10-30 07:44:09.347789: I 29427 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:5 -> /job:tpu_worker/replica:0/task:0/device:TPU:5
2019-10-30 07:44:09.347797: I 29427 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:6 -> /job:tpu_worker/replica:0/task:0/device:TPU:6
2019-10-30 07:44:09.347814: I 29427 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:7 -> /job:tpu_worker/replica:0/task:0/device:TPU:7
2019-10-30 07:44:09.348037: I 29427 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:221] Worker grpc://10.0.101.2:8470 for /job:tpu_worker/replica:0/task:0
2019-10-30 07:44:09.348103: I 29427 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:225] XRT default device: TPU:0
2019-10-30 07:44:09.348133: I 29427 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1114] Configuring TPU for master worker tpu_worker:0 at grpc://10.0.101.2:8470
2019-10-30 07:44:09.348722: I 29434 torch_xla/csrc/aten_xla_type.cpp:86] PyTorch GIT revision: c89340f06877024a5a81393db35637a10b10568f
2019-10-30 07:44:09.348768: I 29434 torch_xla/csrc/aten_xla_type.cpp:87] XLA GIT revision: 2ddfcffd3cf91e7105bae61f7defa4591cdcb900
2019-10-30 07:44:09.361452: I 29428 torch_xla/csrc/aten_xla_type.cpp:86] PyTorch GIT revision: c89340f06877024a5a81393db35637a10b10568f
2019-10-30 07:44:09.361517: I 29428 torch_xla/csrc/aten_xla_type.cpp:87] XLA GIT revision: 2ddfcffd3cf91e7105bae61f7defa4591cdcb900
2019-10-30 07:44:09.367103: I 29431 tensorflow/compiler/xla/xla_client/mesh_service.cc:168] Waiting to connect to client mesh master (300 seconds) localhost:43255
2019-10-30 07:44:09.371809: I 29433 tensorflow/compiler/xla/xla_client/mesh_service.cc:168] Waiting to connect to client mesh master (300 seconds) localhost:43255
2019-10-30 07:44:09.381630: I 29434 tensorflow/compiler/xla/xla_client/mesh_service.cc:168] Waiting to connect to client mesh master (300 seconds) localhost:43255
2019-10-30 07:44:09.383934: I 29429 torch_xla/csrc/aten_xla_type.cpp:86] PyTorch GIT revision: c89340f06877024a5a81393db35637a10b10568f
2019-10-30 07:44:09.383985: I 29429 torch_xla/csrc/aten_xla_type.cpp:87] XLA GIT revision: 2ddfcffd3cf91e7105bae61f7defa4591cdcb900
2019-10-30 07:44:09.392137: I 29428 tensorflow/compiler/xla/xla_client/mesh_service.cc:168] Waiting to connect to client mesh master (300 seconds) localhost:43255
2019-10-30 07:44:09.412761: I 29429 tensorflow/compiler/xla/xla_client/mesh_service.cc:168] Waiting to connect to client mesh master (300 seconds) localhost:43255
2019-10-30 07:44:09.430583: I 29432 torch_xla/csrc/aten_xla_type.cpp:86] PyTorch GIT revision: c89340f06877024a5a81393db35637a10b10568f
2019-10-30 07:44:09.430648: I 29432 torch_xla/csrc/aten_xla_type.cpp:87] XLA GIT revision: 2ddfcffd3cf91e7105bae61f7defa4591cdcb900
2019-10-30 07:44:09.459201: I 29432 tensorflow/compiler/xla/xla_client/mesh_service.cc:168] Waiting to connect to client mesh master (300 seconds) localhost:43255
2019-10-30 07:44:12.129402: I 29427 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1125] TPU topology: mesh_shape: 2 mesh_shape: 2 mesh_shape: 2 num_tasks: 1 num_tpu_devices_per_task: 8 device_coordinates: 0 device_coordinates: 0 device_coordinates: 0 device_coordinates: 0 device_coordinates: 0 device_coordinates: 1 device_coordinates: 0 device_coordinates: 1 device_coordinates: 0 device_coordinates: 0 device_coordinates: 1 device_coordinates: 1 device_coordinates: 1 device_coordinates: 0 device_coordinates: 0 device_coordinates: 1 device_coordinates: 0 device_coordinates: 1 device_coordinates: 1 device_coordinates: 1 device_coordinates: 0 device_coordinates: 1 device_coordinates: 1 device_coordinates: 1
2019-10-30 07:44:12.129501: I 29427 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1194] Creating mesh service bound to localhost:43255
Rank: 0 Load data
2019-10-30 07:44:12.182163: I 29430 tensorflow/compiler/xla/xla_client/computation_client.cc:195] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:43255
2019-10-30 07:44:12.182913: I 29430 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) CPU:0 -> /job:tpu_worker/replica:0/task:0/device:XLA_CPU:0
2019-10-30 07:44:12.182948: I 29430 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:0 -> /job:tpu_worker/replica:0/task:0/device:TPU:0
2019-10-30 07:44:12.182956: I 29430 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:1 -> /job:tpu_worker/replica:0/task:0/device:TPU:1
2019-10-30 07:44:12.182962: I 29430 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:2 -> /job:tpu_worker/replica:0/task:0/device:TPU:2
2019-10-30 07:44:12.182969: I 29430 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (LOCAL) TPU:3 -> /job:tpu_worker/replica:0/task:0/device:TPU:3
2019-10-30 07:44:12.182974: I 29430 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:4 -> /job:tpu_worker/replica:0/task:0/device:TPU:4
2019-10-30 07:44:12.182981: I 29430 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:5 -> /job:tpu_worker/replica:0/task:0/device:TPU:5
2019-10-30 07:44:12.182987: I 29430 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:6 -> /job:tpu_worker/replica:0/task:0/device:TPU:6
2019-10-30 07:44:12.182993: I 29430 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:7 -> /job:tpu_worker/replica:0/task:0/device:TPU:7
2019-10-30 07:44:12.183000: I 29430 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:221] Worker grpc://10.0.101.2:8470 for /job:tpu_worker/replica:0/task:0
2019-10-30 07:44:12.183007: I 29430 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:225] XRT default device: TPU:3
Rank: 3 Load data
2019-10-30 07:44:12.223982: I 29431 tensorflow/compiler/xla/xla_client/computation_client.cc:195] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:43255
2019-10-30 07:44:12.224456: I 29431 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) CPU:0 -> /job:tpu_worker/replica:0/task:0/device:XLA_CPU:0
2019-10-30 07:44:12.224487: I 29431 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:0 -> /job:tpu_worker/replica:0/task:0/device:TPU:0
2019-10-30 07:44:12.224496: I 29431 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:1 -> /job:tpu_worker/replica:0/task:0/device:TPU:1
2019-10-30 07:44:12.224516: I 29431 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:2 -> /job:tpu_worker/replica:0/task:0/device:TPU:2
2019-10-30 07:44:12.224530: I 29431 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:3 -> /job:tpu_worker/replica:0/task:0/device:TPU:3
2019-10-30 07:44:12.224536: I 29431 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (LOCAL) TPU:4 -> /job:tpu_worker/replica:0/task:0/device:TPU:4
2019-10-30 07:44:12.224707: I 29431 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:5 -> /job:tpu_worker/replica:0/task:0/device:TPU:5
2019-10-30 07:44:12.224801: I 29431 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:6 -> /job:tpu_worker/replica:0/task:0/device:TPU:6
2019-10-30 07:44:12.224836: I 29431 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:7 -> /job:tpu_worker/replica:0/task:0/device:TPU:7
2019-10-30 07:44:12.224846: I 29431 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:221] Worker grpc://10.0.101.2:8470 for /job:tpu_worker/replica:0/task:0
2019-10-30 07:44:12.224856: I 29431 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:225] XRT default device: TPU:4
Rank: 4 Load data
2019-10-30 07:44:12.268021: I 29429 tensorflow/compiler/xla/xla_client/computation_client.cc:195] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:43255
2019-10-30 07:44:12.268547: I 29429 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) CPU:0 -> /job:tpu_worker/replica:0/task:0/device:XLA_CPU:0
2019-10-30 07:44:12.268584: I 29429 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:0 -> /job:tpu_worker/replica:0/task:0/device:TPU:0
2019-10-30 07:44:12.268594: I 29429 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:1 -> /job:tpu_worker/replica:0/task:0/device:TPU:1
2019-10-30 07:44:12.268600: I 29429 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (LOCAL) TPU:2 -> /job:tpu_worker/replica:0/task:0/device:TPU:2
2019-10-30 07:44:12.268607: I 29429 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:3 -> /job:tpu_worker/replica:0/task:0/device:TPU:3
2019-10-30 07:44:12.268623: I 29429 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:4 -> /job:tpu_worker/replica:0/task:0/device:TPU:4
2019-10-30 07:44:12.268631: I 29429 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:5 -> /job:tpu_worker/replica:0/task:0/device:TPU:5
2019-10-30 07:44:12.268661: I 29429 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:6 -> /job:tpu_worker/replica:0/task:0/device:TPU:6
2019-10-30 07:44:12.268669: I 29429 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:7 -> /job:tpu_worker/replica:0/task:0/device:TPU:7
2019-10-30 07:44:12.268680: I 29429 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:221] Worker grpc://10.0.101.2:8470 for /job:tpu_worker/replica:0/task:0
2019-10-30 07:44:12.268690: I 29429 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:225] XRT default device: TPU:2
Rank: 2 Load data
2019-10-30 07:44:14.294320: I 29432 tensorflow/compiler/xla/xla_client/computation_client.cc:195] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:43255
2019-10-30 07:44:14.294793: I 29432 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) CPU:0 -> /job:tpu_worker/replica:0/task:0/device:XLA_CPU:0
2019-10-30 07:44:14.294829: I 29432 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:0 -> /job:tpu_worker/replica:0/task:0/device:TPU:0
2019-10-30 07:44:14.294839: I 29432 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:1 -> /job:tpu_worker/replica:0/task:0/device:TPU:1
2019-10-30 07:44:14.294855: I 29432 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:2 -> /job:tpu_worker/replica:0/task:0/device:TPU:2
2019-10-30 07:44:14.294867: I 29432 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:3 -> /job:tpu_worker/replica:0/task:0/device:TPU:3
2019-10-30 07:44:14.294874: I 29432 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:4 -> /job:tpu_worker/replica:0/task:0/device:TPU:4
2019-10-30 07:44:14.294881: I 29432 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (LOCAL) TPU:5 -> /job:tpu_worker/replica:0/task:0/device:TPU:5
2019-10-30 07:44:14.294891: I 29432 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:6 -> /job:tpu_worker/replica:0/task:0/device:TPU:6
2019-10-30 07:44:14.294901: I 29432 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:7 -> /job:tpu_worker/replica:0/task:0/device:TPU:7
2019-10-30 07:44:14.294912: I 29432 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:221] Worker grpc://10.0.101.2:8470 for /job:tpu_worker/replica:0/task:0
2019-10-30 07:44:14.294921: I 29432 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:225] XRT default device: TPU:5
Rank: 5 Load data
2019-10-30 07:44:14.478214: I 29428 tensorflow/compiler/xla/xla_client/computation_client.cc:195] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:43255
2019-10-30 07:44:14.478892: I 29428 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) CPU:0 -> /job:tpu_worker/replica:0/task:0/device:XLA_CPU:0
2019-10-30 07:44:14.478929: I 29428 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:0 -> /job:tpu_worker/replica:0/task:0/device:TPU:0
2019-10-30 07:44:14.478937: I 29428 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (LOCAL) TPU:1 -> /job:tpu_worker/replica:0/task:0/device:TPU:1
2019-10-30 07:44:14.478944: I 29428 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:2 -> /job:tpu_worker/replica:0/task:0/device:TPU:2
2019-10-30 07:44:14.478950: I 29428 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:3 -> /job:tpu_worker/replica:0/task:0/device:TPU:3
2019-10-30 07:44:14.478956: I 29428 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:4 -> /job:tpu_worker/replica:0/task:0/device:TPU:4
2019-10-30 07:44:14.478972: I 29428 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:5 -> /job:tpu_worker/replica:0/task:0/device:TPU:5
2019-10-30 07:44:14.478985: I 29428 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:6 -> /job:tpu_worker/replica:0/task:0/device:TPU:6
2019-10-30 07:44:14.478993: I 29428 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:7 -> /job:tpu_worker/replica:0/task:0/device:TPU:7
2019-10-30 07:44:14.479008: I 29428 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:221] Worker grpc://10.0.101.2:8470 for /job:tpu_worker/replica:0/task:0
2019-10-30 07:44:14.479014: I 29428 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:225] XRT default device: TPU:1
Rank: 1 Load data
2019-10-30 07:44:14.601763: I 29434 tensorflow/compiler/xla/xla_client/computation_client.cc:195] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:43255
2019-10-30 07:44:14.602318: I 29434 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) CPU:0 -> /job:tpu_worker/replica:0/task:0/device:XLA_CPU:0
2019-10-30 07:44:14.602360: I 29434 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:0 -> /job:tpu_worker/replica:0/task:0/device:TPU:0
2019-10-30 07:44:14.602368: I 29434 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:1 -> /job:tpu_worker/replica:0/task:0/device:TPU:1
2019-10-30 07:44:14.602375: I 29434 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:2 -> /job:tpu_worker/replica:0/task:0/device:TPU:2
2019-10-30 07:44:14.602381: I 29434 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:3 -> /job:tpu_worker/replica:0/task:0/device:TPU:3
2019-10-30 07:44:14.602397: I 29434 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:4 -> /job:tpu_worker/replica:0/task:0/device:TPU:4
2019-10-30 07:44:14.602417: I 29434 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:5 -> /job:tpu_worker/replica:0/task:0/device:TPU:5
2019-10-30 07:44:14.602424: I 29434 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:6 -> /job:tpu_worker/replica:0/task:0/device:TPU:6
2019-10-30 07:44:14.602430: I 29434 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (LOCAL) TPU:7 -> /job:tpu_worker/replica:0/task:0/device:TPU:7
2019-10-30 07:44:14.602438: I 29434 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:221] Worker grpc://10.0.101.2:8470 for /job:tpu_worker/replica:0/task:0
2019-10-30 07:44:14.602447: I 29434 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:225] XRT default device: TPU:7
Rank: 7 Load data
2019-10-30 07:44:14.850202: I 29433 tensorflow/compiler/xla/xla_client/computation_client.cc:195] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:43255
2019-10-30 07:44:14.850964: I 29433 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) CPU:0 -> /job:tpu_worker/replica:0/task:0/device:XLA_CPU:0
2019-10-30 07:44:14.851015: I 29433 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:0 -> /job:tpu_worker/replica:0/task:0/device:TPU:0
2019-10-30 07:44:14.851025: I 29433 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:1 -> /job:tpu_worker/replica:0/task:0/device:TPU:1
2019-10-30 07:44:14.851033: I 29433 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:2 -> /job:tpu_worker/replica:0/task:0/device:TPU:2
2019-10-30 07:44:14.851042: I 29433 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:3 -> /job:tpu_worker/replica:0/task:0/device:TPU:3
2019-10-30 07:44:14.851049: I 29433 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:4 -> /job:tpu_worker/replica:0/task:0/device:TPU:4
2019-10-30 07:44:14.851058: I 29433 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:5 -> /job:tpu_worker/replica:0/task:0/device:TPU:5
2019-10-30 07:44:14.851065: I 29433 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (LOCAL) TPU:6 -> /job:tpu_worker/replica:0/task:0/device:TPU:6
2019-10-30 07:44:14.851074: I 29433 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:217] XRT device (REMOTE) TPU:7 -> /job:tpu_worker/replica:0/task:0/device:TPU:7
2019-10-30 07:44:14.851094: I 29433 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:221] Worker grpc://10.0.101.2:8470 for /job:tpu_worker/replica:0/task:0
2019-10-30 07:44:14.851106: I 29433 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:225] XRT default device: TPU:6
Rank: 6 Load data
Build model
Build model
Total Parameters: 86205440
Total Parameters: 86205440
Training start - Total iter: 3506
Build model
Training start - Total iter: 3506
Build model
Total Parameters: 86205440
Training start - Total iter: 3506
Total Parameters: 86205440
Training start - Total iter: 3506
Build model
Build model
Build model
Build model
Total Parameters: 86205440
Training start - Total iter: 3506
Total Parameters: 86205440
Total Parameters: 86205440
Training start - Total iter: 3506
Total Parameters: 86205440
Training start - Total iter: 3506
Training start - Total iter: 3506
2019-10-30 07:45:10.059324: I 30786 torch_xla/csrc/aten_xla_type.cpp:86] PyTorch GIT revision: c89340f06877024a5a81393db35637a10b10568f
2019-10-30 07:45:10.059376: I 30786 torch_xla/csrc/aten_xla_type.cpp:87] XLA GIT revision: 2ddfcffd3cf91e7105bae61f7defa4591cdcb900
2019-10-30 07:45:10.232121: I 30787 torch_xla/csrc/aten_xla_type.cpp:86] PyTorch GIT revision: c89340f06877024a5a81393db35637a10b10568f
2019-10-30 07:45:10.232176: I 30787 torch_xla/csrc/aten_xla_type.cpp:87] XLA GIT revision: 2ddfcffd3cf91e7105bae61f7defa4591cdcb900
2019-10-30 07:45:11.010605: I 30950 torch_xla/csrc/aten_xla_type.cpp:86] PyTorch GIT revision: c89340f06877024a5a81393db35637a10b10568f
2019-10-30 07:45:11.010664: I 30950 torch_xla/csrc/aten_xla_type.cpp:87] XLA GIT revision: 2ddfcffd3cf91e7105bae61f7defa4591cdcb900
2019-10-30 07:45:11.117959: I 30951 torch_xla/csrc/aten_xla_type.cpp:86] PyTorch GIT revision: c89340f06877024a5a81393db35637a10b10568f
2019-10-30 07:45:11.118026: I 30951 torch_xla/csrc/aten_xla_type.cpp:87] XLA GIT revision: 2ddfcffd3cf91e7105bae61f7defa4591cdcb900
2019-10-30 07:45:13.167248: I 31034 torch_xla/csrc/aten_xla_type.cpp:86] PyTorch GIT revision: c89340f06877024a5a81393db35637a10b10568f
2019-10-30 07:45:13.167308: I 31034 torch_xla/csrc/aten_xla_type.cpp:87] XLA GIT revision: 2ddfcffd3cf91e7105bae61f7defa4591cdcb900
2019-10-30 07:45:13.341512: I 31035 torch_xla/csrc/aten_xla_type.cpp:86] PyTorch GIT revision: c89340f06877024a5a81393db35637a10b10568f
2019-10-30 07:45:13.341574: I 31035 torch_xla/csrc/aten_xla_type.cpp:87] XLA GIT revision: 2ddfcffd3cf91e7105bae61f7defa4591cdcb900
2019-10-30 07:45:13.549634: I 31037 torch_xla/csrc/aten_xla_type.cpp:86] PyTorch GIT revision: c89340f06877024a5a81393db35637a10b10568f
2019-10-30 07:45:13.549699: I 31037 torch_xla/csrc/aten_xla_type.cpp:87] XLA GIT revision: 2ddfcffd3cf91e7105bae61f7defa4591cdcb900
2019-10-30 07:45:13.578042: I 31070 torch_xla/csrc/aten_xla_type.cpp:86] PyTorch GIT revision: c89340f06877024a5a81393db35637a10b10568f
2019-10-30 07:45:13.578114: I 31070 torch_xla/csrc/aten_xla_type.cpp:87] XLA GIT revision: 2ddfcffd3cf91e7105bae61f7defa4591cdcb900

===============================================

There are no errors other than these logs.

Do you have any idea how to fix this slow training?

Kyeongpil commented 4 years ago

I found that the forward/backward pass of PyTorch's nn.Transformer is very slow. How can I solve this problem?

taylanbil commented 4 years ago

Hello,

Several different issues can cause a model training job to be slow on TPUs. Please refer to our API guide and troubleshooting page to get an idea of what it could be and how to start debugging.

About nn.Transformer: as far as I know, nobody on the TPU team has tried it. However, the fairseq version of the transformer is well tested and runs fast. You can refer to the tutorial here for an example of how to train a transformer on TPUs.

dlibenzi commented 4 years ago

Can you try to enable tensor core logging?

export TF_CPP_VMODULE=tensor=5

If compilation does not stabilize after a few steps, it means the model might have dynamic shapes.
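One common way to limit recompilation from dynamic shapes is to pad every batch to one of a few fixed lengths, so the compiler only ever sees a handful of distinct input shapes. The sketch below is a minimal, framework-free illustration of that idea, not the fix used in this thread; the bucket sizes, `PAD_ID`, and the helper names `bucket_length` and `pad_batch` are all hypothetical.

```python
from typing import List, Sequence

# Hypothetical bucket sizes; pick values that fit your own data.
BUCKETS = (32, 64, 128, 256)
PAD_ID = 0  # assumed padding token id


def bucket_length(length: int, buckets: Sequence[int] = BUCKETS) -> int:
    """Return the smallest bucket that can hold `length` tokens."""
    for size in buckets:
        if length <= size:
            return size
    raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")


def pad_batch(
    batch: List[List[int]],
    buckets: Sequence[int] = BUCKETS,
    pad_id: int = PAD_ID,
) -> List[List[int]]:
    """Pad every sequence in the batch to one shared bucket length."""
    target = bucket_length(max(len(seq) for seq in batch), buckets)
    return [seq + [pad_id] * (target - len(seq)) for seq in batch]
```

With this scheme a batch whose longest sequence has 3 tokens and one whose longest has 30 tokens both come out with length 32, so they share one compiled graph. The batch size would need to be kept constant as well (for example by dropping the last incomplete batch).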

Kyeongpil commented 4 years ago

I finally trained my model on V100 GPUs...

However, the cause of this problem might be this issue: https://github.com/pytorch/xla/issues/1269

dlibenzi commented 4 years ago

Printing a metrics report at each step would have helped tell why the model was slow. It could have been GELU or other ops; the report (which shows aten::* counters) would have revealed that.
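As a rough sketch of how such a report could be inspected, the helper below pulls the aten::* counters out of a metrics report string. It assumes the report lists each counter as a `Counter: <name>` line followed by a `Value: <n>` line; the function name `aten_counters` is hypothetical, and `SAMPLE_REPORT` is made-up illustrative text, not output from this issue.

```python
import re
from typing import Dict

# Assumed report layout: "Counter: <name>" followed by "  Value: <n>".
_ATEN_COUNTER = re.compile(r"Counter:\s*(aten::\S+)\s*\n\s*Value:\s*(\d+)")


def aten_counters(report: str) -> Dict[str, int]:
    """Extract aten::* counters (ops without an XLA lowering) from a report."""
    return {name: int(value) for name, value in _ATEN_COUNTER.findall(report)}


# Made-up example text for illustration only.
SAMPLE_REPORT = """\
Counter: CachedCompile
  Value: 12
Counter: aten::gelu
  Value: 48
Counter: aten::_local_scalar_dense
  Value: 3
"""

if __name__ == "__main__":
    for name, count in sorted(aten_counters(SAMPLE_REPORT).items()):
        print(f"{name}: {count}")
```

In a real training loop the input would come from `met.metrics_report()`, with `import torch_xla.debug.metrics as met`, printed or parsed once per step. Counters whose values keep growing from step to step point at ops that fall back to the CPU.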

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.