pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

SIGSEGV: Segmentation Fault Memory error while checkpointing with transformers trainer on a v5-litepod-8 Google Cloud TPU #6620

Open shub-kris opened 7 months ago

shub-kris commented 7 months ago

πŸ› Bug

concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending. root@t1v-n-108b165f-w-0:/workspace# /usr/local/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '

To Reproduce

Create and SSH into a Google Cloud TPU VM:

gcloud alpha compute tpus tpu-vm create tpu-vm --zone=us-west4-a --accelerator-type=v5litepod-8 --version v2-alpha-tpuv5-lite
gcloud alpha compute tpus tpu-vm ssh tpu-vm --zone=us-west4-a

Install the packages:

pip install torch~=2.1.0 torch_xla[tpu]~=2.1.0 -f https://storage.googleapis.com/libtpu-releases/index.html
pip install transformers==4.37.2 accelerate==0.27.0 datasets==2.16.1 

Run test-transformers-trainer.py with:

export PJRT_DEVICE=TPU
python test-transformers-trainer.py --save_steps 100 --no_gradient_checkpointing

Entire Stack Trace

``` WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU. WARNING:root:Unsupported nprocs (8), ignoring... WARNING: All log messages before absl::InitializeLog() is called are written to STDERR I0000 00:00:1708967984.563769 202 tpu_initializer_framework_helper.cc:78] Libtpu path is: /usr/local/lib/python3.10/site-packages/torch_xla/lib/libtpu.so WARNING: All log messages before absl::InitializeLog() is called are written to STDERR I0000 00:00:1708967984.599691 206 tpu_initializer_framework_helper.cc:78] Libtpu path is: /usr/local/lib/python3.10/site-packages/torch_xla/lib/libtpu.so WARNING: All log messages before absl::InitializeLog() is called are written to STDERR I0000 00:00:1708967984.611669 205 tpu_initializer_framework_helper.cc:78] Libtpu path is: /usr/local/lib/python3.10/site-packages/torch_xla/lib/libtpu.so WARNING: All log messages before absl::InitializeLog() is called are written to STDERR I0000 00:00:1708967984.672515 208 tpu_initializer_framework_helper.cc:78] Libtpu path is: /usr/local/lib/python3.10/site-packages/torch_xla/lib/libtpu.so WARNING: All log messages before absl::InitializeLog() is called are written to STDERR I0000 00:00:1708967984.780619 211 tpu_initializer_framework_helper.cc:78] Libtpu path is: /usr/local/lib/python3.10/site-packages/torch_xla/lib/libtpu.so WARNING: All log messages before absl::InitializeLog() is called are written to STDERR I0000 00:00:1708967984.946702 210 tpu_initializer_framework_helper.cc:78] Libtpu path is: /usr/local/lib/python3.10/site-packages/torch_xla/lib/libtpu.so WARNING: All log messages before absl::InitializeLog() is called are written to STDERR I0000 00:00:1708967984.953963 207 tpu_initializer_framework_helper.cc:78] Libtpu path is: /usr/local/lib/python3.10/site-packages/torch_xla/lib/libtpu.so WARNING: All log messages before absl::InitializeLog() is 
called are written to STDERR I0000 00:00:1708967985.202261 209 tpu_initializer_framework_helper.cc:78] Libtpu path is: /usr/local/lib/python3.10/site-packages/torch_xla/lib/libtpu.so I0000 00:00:1708967998.689161 206 pjrt_c_api_client.cc:110] PjRtCApiClient created. I0000 00:00:1708967998.754534 211 pjrt_c_api_client.cc:110] PjRtCApiClient created. I0000 00:00:1708967998.778158 205 pjrt_c_api_client.cc:110] PjRtCApiClient created. I0000 00:00:1708967998.852075 210 pjrt_c_api_client.cc:110] PjRtCApiClient created. I0000 00:00:1708967998.912706 209 pjrt_c_api_client.cc:110] PjRtCApiClient created. I0000 00:00:1708967998.963000 202 pjrt_c_api_client.cc:110] PjRtCApiClient created. I0000 00:00:1708967998.963753 208 pjrt_c_api_client.cc:110] PjRtCApiClient created. I0000 00:00:1708967998.980441 207 pjrt_c_api_client.cc:110] PjRtCApiClient created. Downloading readme: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 8.20k/8.20k [00:00<00:00, 26.9MB/s] Downloading data: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 13.1M/13.1M [00:00<00:00, 17.7MB/s] Generating train split: 15011 examples [00:00, 424303.49 examples/s] Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15011/15011 [00:00<00:00, 16078.72 examples/s] Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15011/15011 [00:00<00:00, 16078.77 examples/s] Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15011/15011 [00:01<00:00, 14403.63 examples/s] tokenizer_config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 685/685 [00:00<00:00, 5.92MB/s] config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 651/651 [00:00<00:00, 
5.21MB/s] Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15011/15011 [00:01<00:00, 11107.38 examples/s] Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15011/15011 [00:01<00:00, 10320.80 examples/s] Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15011/15011 [00:01<00:00, 10428.84 examples/s] Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15011/15011 [00:01<00:00, 9895.84 examples/s] Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15011/15011 [00:01<00:00, 9751.13 examples/s] vocab.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 899k/899k [00:00<00:00, 6.72MB/s] merges.txt: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 456k/456k [00:00<00:00, 1.17MB/s] special_tokens_map.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 441/441 [00:00<00:00, 4.28MB/s] Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15011/15011 [00:09<00:00, 1660.92 examples/s] Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15011/15011 [00:09<00:00, 1549.74 examples/s] Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15011/15011 [00:09<00:00, 1516.41 examples/s] Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15011/15011 [00:10<00:00, 1403.17 examples/s] Map: 
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15011/15011 [00:11<00:00, 1343.44 examples/s] Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15011/15011 [00:11<00:00, 1343.12 examples/s] Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15011/15011 [00:11<00:00, 1338.69 examples/s] Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15011/15011 [00:11<00:00, 1343.30 examples/s] pytorch_model.bin: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 251M/251M [00:01<00:00, 128MB/s] /usr/local/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() return self.fget.__get__(instance, owner)() /usr/local/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() return self.fget.__get__(instance, owner)() /usr/local/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() return self.fget.__get__(instance, owner)() /usr/local/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() return self.fget.__get__(instance, owner)() /usr/local/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() return self.fget.__get__(instance, owner)() /usr/local/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() return self.fget.__get__(instance, owner)() /usr/local/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() return self.fget.__get__(instance, owner)() /usr/local/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() return self.fget.__get__(instance, owner)() generation_config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 137/137 [00:00<00:00, 368kB/s] {'loss': 7.3, 'learning_rate': 0.0007774011299435029, 'epoch': 0.08} {'loss': 6.275, 'learning_rate': 0.0007548022598870056, 'epoch': 0.17} {'loss': 5.775, 'learning_rate': 0.0007322033898305085, 'epoch': 0.25} {'loss': 5.225, 'learning_rate': 0.0007096045197740113, 'epoch': 0.34} {'loss': 5.05, 'learning_rate': 0.0006870056497175141, 'epoch': 0.42} {'loss': 4.825, 'learning_rate': 0.0006644067796610169, 'epoch': 0.51} {'loss': 4.7, 'learning_rate': 0.0006418079096045198, 'epoch': 0.59} {'loss': 4.6, 'learning_rate': 0.0006192090395480226, 'epoch': 0.68} {'loss': 4.5, 'learning_rate': 0.0005966101694915254, 'epoch': 0.76} {'loss': 4.5, 'learning_rate': 0.0005740112994350283, 'epoch': 0.85} 28%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 100/354 [01:52<02:07, 1.99it/s]https://symbolize.stripped_domain/r/?trace=7f0a62182524,7f0d36c36d5f,7f0a6206407e,7f0a62053a2d,7f0d36f4e6ad,7f0d370fdfff&map=bdcf1f91b8790c8a971d2904a194674945111543:7f0a5d943000-7f0a6bf45ac0 *** SIGSEGV (@(nil)), see gl__________60#s15 received by PID 205 (TID 5162) on cpu 27; stack trace: *** https://symbolize.stripped_domain/r/?trace=7fcf3e047524,7fd212afbd5f,7fcf3df2907e,7fcf3df18a2d,7fd212e136ad,7fd212fc2fff&map=bdcf1f91b8790c8a971d2904a194674945111543:7fcf39808000-7fcf47e0aac0 *** SIGSEGV (@(nil)), see gl__________60#s15 received by PID 206 (TID 5122) on cpu 11; stack trace: *** 
https://symbolize.stripped_domain/r/?trace=7f8e73790524,7f9148244d5f,7f8e7367207e,7f8e73661a2d,7f914855c6ad,7f914870bfff&map=https://symbolize.stripped_domain/r/?trace=https://symbolize.stripped_domain/r/?trace=7fe0d83ee524,7fc56e9e8524,7fe3acea2d5f,https://symbolize.stripped_domain/r/?trace=7fc84349cd5f,7f57575b9524,7fe0d82d007e,bdcf1f91b8790c8a971d2904a194674945111543:7f8e6ef51000-7f8e7d553ac07fc56e8ca07e,7f5a2c06dd5f,7fe0d82bfa2d,7fc56e8b9a2d,7f575749b07e,7fe3ad1ba6ad,7fc8437b46ad,7f575748aa2d,7fe3ad369fff7fc843963fff 7f5a2c3856ad,&map=&map=*** SIGSEGV (@(nil)), see gl__________60#s15 received by PID 211 (TID 5143) on cpu 87; stack trace: *** 7f5a2c534fff&map=bdcf1f91b8790c8a971d2904a194674945111543:7f5752d7a000-7f576137cac0bdcf1f91b8790c8a971d2904a194674945111543:7fe0d3baf000-7fe0e21b1ac0bdcf1f91b8790c8a971d2904a194674945111543:7fc56a1a9000-7fc5787abac0 *** SIGSEGV (@(nil)), see gl__________60#s15 received by PID 202 (TID 5238) on cpu 165; stack trace: *** *** SIGSEGV (@(nil)), see gl__________60#s15 received by PID 208 (TID 5244) on cpu 184; stack trace: *** *** SIGSEGV (@(nil)), see gl__________60#s15 received by PID 209 (TID 5204) on cpu 0; stack trace: *** https://symbolize.stripped_domain/r/?trace=7f68745ea524,7f6b4909ed5f,7f68744cc07e,7f68744bba2d,7f6b493b66ad,7f6b49565fff&map=bdcf1f91b8790c8a971d2904a194674945111543:7f686fdab000-7f687e3adac0 *** SIGSEGV (@(nil)), see gl__________60#s15 received by PID 210 (TID 5182) on cpu 56; stack trace: *** https://symbolize.stripped_domain/r/?trace=7f9495bd1524,7f976a685d5f,7f9495ab307e,7f9495aa2a2d,7f976a99d6ad,7f976ab4cfff&map=bdcf1f91b8790c8a971d2904a194674945111543:7f9491392000-7f949f994ac0 *** SIGSEGV (@(nil)), see gl__________60#s15 received by PID 207 (TID 5259) on cpu 38; stack trace: *** PC: @ 0x7f0a62182524 (unknown) torch_xla::tensor_methods::all_reduce() @ 0x7f09ab75621a 1152 (unknown) @ 0x7f0d36c36d60 1648 (unknown) PC: @ 0x7fcf3e047524 (unknown) torch_xla::tensor_methods::all_reduce() @ 0x7fce8761921a 
1152 (unknown) @ 0x7fd212afbd60 1648 (unknown) PC: @ 0x7fc56e9e8524 (unknown) torch_xla::tensor_methods::all_reduce() @ 0x7fc4b7fbd21a 1152 (unknown) @ 0x7fc84349cd60 1648 (unknown) PC: @ 0x7f57575b9524 (unknown) torch_xla::tensor_methods::all_reduce() @ 0x7f56a0b8821a 1152 (unknown) @ 0x7f5a2c06dd60 1648 (unknown) PC: @ 0x7f8e73790524 (unknown) torch_xla::tensor_methods::all_reduce() PC: @ 0x7fe0d83ee524 (unknown) torch_xla::tensor_methods::all_reduce() @ 0x7f8dbcd6021a 1152 (unknown) @ 0x7fe0219c421a 1152 (unknown) @ 0x7f9148244d60 1648 (unknown) @ 0x7fe3acea2d60 1648 (unknown) PC: @ 0x7f9495bd1524 (unknown) torch_xla::tensor_methods::all_reduce() @ 0x7f93df19e21a 1152 (unknown) @ 0x7f976a685d60 1648 (unknown) PC: @ 0x7f68745ea524 (unknown) torch_xla::tensor_methods::all_reduce() @ 0x7f67bdbbb21a 1152 (unknown) @ 0x7f6b4909ed60 1648 (unknown) @ 0x7f0a6206407f 256 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN() @ 0x7fcf3df2907f 256 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN() @ 0x7fc56e8ca07f 256 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN() @ 0x7f575749b07f 256 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN() @ 0x7fe0d82d007f 256 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN() @ 0x7f9495ab307f 256 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN() @ 0x7f68744cc07f 256 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN() @ 0x7f8e7367207f 256 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN() @ 0x7f0a62053a2e 512 pybind11::cpp_function::dispatcher() @ 0x7f0d36f4e6ae (unknown) cfunction_call @ 0x7f0d370fe000 (unknown) (unknown) https://symbolize.stripped_domain/r/?trace=7f0a62182524,7f09ab756219,7f0d36c36d5f,7f0a6206407e,7f0a62053a2d,7f0d36f4e6ad,7f0d370fdfff&map=bdcf1f91b8790c8a971d2904a194674945111543:7f0a5d943000-7f0a6bf45ac0,bd189fb7b9de62cf44fe27cae177f396:7f099e14c000-7f09ab967670 E0226 17:22:24.766103 5162 coredump_hook.cc:447] RAW: Remote crash data gathering 
hook invoked. E0226 17:22:24.766116 5162 client.cc:270] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec. E0226 17:22:24.766120 5162 coredump_hook.cc:542] RAW: Sending fingerprint to remote end. E0226 17:22:24.766159 5162 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory E0226 17:22:24.766165 5162 coredump_hook.cc:603] RAW: Dumping core locally. @ 0x7fcf3df18a2e 512 pybind11::cpp_function::dispatcher() @ 0x7fd212e136ae (unknown) cfunction_call @ 0x7fd212fc3000 (unknown) (unknown) https://symbolize.stripped_domain/r/?trace=7fcf3e047524,7fce87619219,7fd212afbd5f,7fcf3df2907e,7fcf3df18a2d,7fd212e136ad,7fd212fc2fff&map=bdcf1f91b8790c8a971d2904a194674945111543:7fcf39808000-7fcf47e0aac0,bd189fb7b9de62cf44fe27cae177f396:7fce7a00f000-7fce8782a670 E0226 17:22:24.778616 5122 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked. E0226 17:22:24.778630 5122 client.cc:270] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec. E0226 17:22:24.778634 5122 coredump_hook.cc:542] RAW: Sending fingerprint to remote end. E0226 17:22:24.778661 5122 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory E0226 17:22:24.778667 5122 coredump_hook.cc:603] RAW: Dumping core locally. 
@ 0x7fc56e8b9a2e 512 pybind11::cpp_function::dispatcher() @ 0x7fc8437b46ae (unknown) cfunction_call @ 0x7f575748aa2e 512 pybind11::cpp_function::dispatcher() @ 0x7fe0d82bfa2e 512 pybind11::cpp_function::dispatcher() @ 0x7f9495aa2a2e 512 pybind11::cpp_function::dispatcher() @ 0x7f5a2c3856ae 516988624 cfunction_call @ 0x7f68744bba2e 512 pybind11::cpp_function::dispatcher() @ 0x7fc843964000 (unknown) (unknown) https://symbolize.stripped_domain/r/?trace=7fc56e9e8524,7fc4b7fbd219,7fc84349cd5f,7fc56e8ca07e,7fc56e8b9a2d,7fc8437b46ad,7fc843963fff&map=bdcf1f91b8790c8a971d2904a194674945111543:7fc56a1a9000-7fc5787abac0,bd189fb7b9de62cf44fe27cae177f396:7fc4aa9b3000-7fc4b81ce670 E0226 17:22:24.800109 5204 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked. E0226 17:22:24.800122 5204 client.cc:270] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec. E0226 17:22:24.800126 5204 coredump_hook.cc:542] RAW: Sending fingerprint to remote end. E0226 17:22:24.800155 5204 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory E0226 17:22:24.800160 5204 coredump_hook.cc:603] RAW: Dumping core locally. @ 0x7f8e73661a2e 512 pybind11::cpp_function::dispatcher() @ 0x7fe3ad1ba6ae (unknown) cfunction_call @ 0x7f976a99d6ae (unknown) cfunction_call @ 0x7f6b493b66ae (unknown) cfunction_call @ 0x7f914855c6ae (unknown) cfunction_call @ 0x7f5a2c535000 (unknown) (unknown) https://symbolize.stripped_domain/r/?trace=7f57575b9524,7f56a0b88219,7f5a2c06dd5f,7f575749b07e,7f575748aa2d,7f5a2c3856ad,7f5a2c534fff&map=bdcf1f91b8790c8a971d2904a194674945111543:7f5752d7a000-7f576137cac0,bd189fb7b9de62cf44fe27cae177f396:7f569357e000-7f56a0d99670 E0226 17:22:24.802639 5238 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked. 
E0226 17:22:24.802654 5238 client.cc:270] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec. E0226 17:22:24.802658 5238 coredump_hook.cc:542] RAW: Sending fingerprint to remote end. E0226 17:22:24.802683 5238 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory E0226 17:22:24.802689 5238 coredump_hook.cc:603] RAW: Dumping core locally. @ 0x7fe3ad36a000 (unknown) (unknown) https://symbolize.stripped_domain/r/?trace=7fe0d83ee524,7fe0219c4219,7fe3acea2d5f,7fe0d82d007e,7fe0d82bfa2d,7fe3ad1ba6ad,7fe3ad369fff&map=bdcf1f91b8790c8a971d2904a194674945111543:7fe0d3baf000-7fe0e21b1ac0,bd189fb7b9de62cf44fe27cae177f396:7fe0143ba000-7fe021bd5670 E0226 17:22:24.804879 5244 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked. E0226 17:22:24.804890 5244 client.cc:270] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec. E0226 17:22:24.804894 5244 coredump_hook.cc:542] RAW: Sending fingerprint to remote end. E0226 17:22:24.804918 5244 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory E0226 17:22:24.804924 5244 coredump_hook.cc:603] RAW: Dumping core locally. @ 0x7f976ab4d000 (unknown) (unknown) https://symbolize.stripped_domain/r/?trace=7f9495bd1524,7f93df19e219,7f976a685d5f,7f9495ab307e,7f9495aa2a2d,7f976a99d6ad,7f976ab4cfff&map=bdcf1f91b8790c8a971d2904a194674945111543:7f9491392000-7f949f994ac0,bd189fb7b9de62cf44fe27cae177f396:7f93d1b94000-7f93df3af670 E0226 17:22:24.804975 5259 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked. E0226 17:22:24.804988 5259 client.cc:270] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec. 
E0226 17:22:24.804992 5259 coredump_hook.cc:542] RAW: Sending fingerprint to remote end. E0226 17:22:24.805019 5259 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory E0226 17:22:24.805025 5259 coredump_hook.cc:603] RAW: Dumping core locally. @ 0x7f6b49566000 (unknown) (unknown) https://symbolize.stripped_domain/r/?trace=7f68745ea524,7f67bdbbb219,7f6b4909ed5f,7f68744cc07e,7f68744bba2d,7f6b493b66ad,7f6b49565fff&map=bdcf1f91b8790c8a971d2904a194674945111543:7f686fdab000-7f687e3adac0,bd189fb7b9de62cf44fe27cae177f396:7f67b05b1000-7f67bddcc670 E0226 17:22:24.805818 5182 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked. E0226 17:22:24.805833 5182 client.cc:270] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec. E0226 17:22:24.805838 5182 coredump_hook.cc:542] RAW: Sending fingerprint to remote end. E0226 17:22:24.805860 5182 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory E0226 17:22:24.805867 5182 coredump_hook.cc:603] RAW: Dumping core locally. @ 0x7f914870c000 (unknown) (unknown) https://symbolize.stripped_domain/r/?trace=7f8e73790524,7f8dbcd60219,7f9148244d5f,7f8e7367207e,7f8e73661a2d,7f914855c6ad,7f914870bfff&map=bdcf1f91b8790c8a971d2904a194674945111543:7f8e6ef51000-7f8e7d553ac0,bd189fb7b9de62cf44fe27cae177f396:7f8daf756000-7f8dbcf71670 E0226 17:22:24.806177 5143 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked. E0226 17:22:24.806190 5143 client.cc:270] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec. E0226 17:22:24.806195 5143 coredump_hook.cc:542] RAW: Sending fingerprint to remote end. 
E0226 17:22:24.806215 5143 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory E0226 17:22:24.806220 5143 coredump_hook.cc:603] RAW: Dumping core locally. E0226 17:24:12.684363 5122 process_state.cc:783] RAW: Raising signal 11 with default behavior https://symbolize.stripped_domain/r/?trace=7f0d36bea7b2,7f0d36c36d5f,1bd&map= E0226 17:24:19.102171 9260 process_state.cc:1073] RAW: Signal 15 raised at PC: 0x7f0d36bea7b2 while already in FailureSignalHandler! E0226 17:24:19.102188 9260 process_state.cc:1077] RAW: tid: 9260 raised new signal (old_tid: 5162) https://symbolize.stripped_domain/r/?trace=7f91481f87b2,7f9148244d5f&map= E0226 17:24:19.104284 1072 process_state.cc:1073] RAW: Signal 15 raised at PC: 0x7f91481f87b2 while already in FailureSignalHandler! E0226 17:24:19.104320 1072 process_state.cc:1077] RAW: tid: 1072 raised new signal (old_tid: 5143) https://symbolize.stripped_domain/r/?trace=7f6b490527b2,7f6b4909ed5f,1bd&map= E0226 17:24:19.104958 8369 process_state.cc:1073] RAW: Signal 15 raised at PC: 0x7f6b490527b2 while already in FailureSignalHandler! E0226 17:24:19.104998 8369 process_state.cc:1077] RAW: tid: 8369 raised new signal (old_tid: 5182) https://symbolize.stripped_domain/r/?trace=7f976a6397b2,7f976a685d5f&map= E0226 17:24:19.105265 373 process_state.cc:1073] RAW: Signal 15 raised at PC: 0x7f976a6397b2 while already in FailureSignalHandler! 
E0226 17:24:19.105296 373 process_state.cc:1077] RAW: tid: 373 raised new signal (old_tid: 5259) https://symbolize.stripped_domain/r/?trace=7fe021a31cbb,7fe3acea2d5f,7fe0219c67a7,7fe021a34a33,7fe0214874d3,7fe021486fab,7fe021486979,7fe0218f309e,7fe3ace4fea6&map=bd189fb7b9de62cf44fe27cae177f396:7fe0143ba000-7fe021bd5670 https://symbolize.stripped_domain/r/?trace=E0226 17:24:19.105780 10617 process_state.cc:1073] RAW: Signal 15 raised at PC: 0x7fe021a31cbb while already in FailureSignalHandler! E0226 17:24:19.105823 10617 process_state.cc:1077] RAW: tid: 10617 raised new signal (old_tid: 5244) 7f56a0bf5cbb,7f5a2c06dd5f,7f56a0b8a7a7,7f56a0bf8a33,7f56a064b4d3,7f56a064afab,7f56a064a979,7f56a0ab709e,7f5a2c01aea6&map=bd189fb7b9de62cf44fe27cae177f396:7f569357e000-7f56a0d99670 E0226 17:24:19.105857 10728 process_state.cc:1073] RAW: Signal 15 raised at PC: 0x7f56a0bf5cbb while already in FailureSignalHandler! E0226 17:24:19.105896 10728 process_state.cc:1077] RAW: tid: 10728 raised new signal (old_tid: 5238) https://symbolize.stripped_domain/r/?trace=7fc4b802acbb,7fc84349cd5f,7fc4b7fbf7a7,7fc4b802da33,7fc4b7a804d3,7fc4b7a7ffab,7fc4b7a7f979,7fc4b7eec09e,7fc843449ea6&map=bd189fb7b9de62cf44fe27cae177f396:7fc4aa9b3000-7fc4b81ce670 E0226 17:24:19.107415 11461 process_state.cc:1073] RAW: Signal 15 raised at PC: 0x7fc4b802acbb while already in FailureSignalHandler! 
E0226 17:24:19.107458 11461 process_state.cc:1077] RAW: tid: 11461 raised new signal (old_tid: 5204) E0226 17:24:20.037362 5162 process_state.cc:783] RAW: Raising signal 11 with default behavior E0226 17:24:20.085641 5259 process_state.cc:783] RAW: Raising signal 11 with default behavior E0226 17:24:20.102202 5143 process_state.cc:783] RAW: Raising signal 11 with default behavior E0226 17:24:20.105324 5182 process_state.cc:783] RAW: Raising signal 11 with default behavior E0226 17:24:20.105619 5244 process_state.cc:783] RAW: Raising signal 11 with default behavior E0226 17:24:20.108065 5204 process_state.cc:783] RAW: Raising signal 11 with default behavior E0226 17:24:20.113516 4802 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked. E0226 17:24:20.113546 4802 coredump_hook.cc:486] RAW: Called via ReportEvent and disabled coredump E0226 17:24:20.113552 4802 client.cc:270] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec. E0226 17:24:20.113555 4802 coredump_hook.cc:542] RAW: Sending fingerprint to remote end. E0226 17:24:20.113584 4802 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory E0226 17:24:20.113589 4802 coredump_hook.cc:603] RAW: Dumping core locally. 
E0226 17:24:20.119348 5238 process_state.cc:783] RAW: Raising signal 11 with default behavior Traceback (most recent call last): File "/workspace/dolly-clm.py", line 102, in xmp.spawn(_mp_fn, args=(args,), nprocs=args.num_cores) File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 83, in wrapper return fn(*args, **kwargs) File "/usr/local/lib/python3.10/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 38, in spawn return pjrt.spawn(fn, nprocs, start_method, args) File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 202, in spawn run_multiprocess(spawn_fn, start_method=start_method) File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 83, in wrapper return fn(*args, **kwargs) File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 159, in run_multiprocess replica_results = list( File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 160, in itertools.chain.from_iterable( File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 575, in _chain_from_iterable_of_lists for element in iterable: File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator yield _result_or_cancel(fs.pop()) File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel return fut.result(timeout) File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.__get_result() File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result raise self._exception concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending. 
root@t1v-n-108b165f-w-0:/workspace# /usr/local/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d ' ```

Expected behavior

The code should save the checkpoints successfully.

Environment

JackCaoG commented 7 months ago

Seems like it crashed in https://github.com/pytorch/xla/blob/cb4983e93d70319db56440872567e2dc98d0ce1f/torch_xla/csrc/tensor_methods.cpp#L354-L370 ...

@will-cromar can you take a look?

shub-kris commented 7 months ago

@alanwaketan can you please also have a look here?

will-cromar commented 7 months ago

@alanwaketan do you normally use the HuggingFace Trainer? I remember people have had issues using it with XLA before. I ran through two of the example tutorials last week while working on #6584, and the Trainer-based examples had issues on TPU, but the accelerate-based examples did work fine.

I tried to reproduce your crash on v4-8 with torch and torch_xla built from head and got a different crash: RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space vmem. Used 32.45M of 16.00M vmem. Exceeded vmem capacity by 16.45M.

alanwaketan commented 7 months ago

I do believe the normal torch.save should be compatible with FSDP. cc @jonb377 who is our ckpt expert.
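For reference, the usual multiprocess checkpointing pattern on XLA funnels the save through `xm.save`, which moves XLA tensors to CPU and writes only from the master ordinal while keeping replicas synchronized. This is a sketch of that pattern (the helper name is made up), not the Trainer's actual checkpoint code:

```python
def save_on_tpu(model, path):
    # Deferred import so the sketch stays importable without a TPU.
    import torch_xla.core.xla_model as xm

    # xm.save transfers the state dict to CPU and, with the default
    # master_only=True, writes from ordinal 0 only; it also synchronizes
    # the replicas so all workers reach this point before training resumes.
    xm.save(model.state_dict(), path)
```

If one replica enters a collective (like the `all_reduce` in the stack trace) while others are busy writing a checkpoint, the workers can desynchronize, which is one plausible shape for this kind of crash.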

alanwaketan commented 7 months ago

> @alanwaketan do you normally use the HuggingFace Trainer? I remember people have had issues using it with XLA before. I ran through two of the example tutorials last week, and the Trainer-based ones had issues on TPU, but the accelerate-based examples and scripts did work fine.
>
> I tried to reproduce your crash on v4-8 with torch and torch_xla built from head and got a different crash: RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space vmem. Used 32.45M of 16.00M vmem. Exceeded vmem capacity by 16.45M.

Yea, I do. All the Llama and Gemma work is done with the HF Trainer, but I don't recall hitting this issue before.

alanwaketan commented 7 months ago

Okay, I just scanned through the script, and it looks like it has nothing to do with SPMD, @jonb377. It's probably just simple DP. I have no idea why this crashes, but we probably won't be able to spend much time debugging it, given that mp is about to be deprecated.

Mon-ius commented 6 months ago

It also crashes with Phi-2, and even SD, tested on a TPU v4-8 :(

alanwaketan commented 6 months ago

> It also crashes with Phi-2, and even SD, tested on a TPU v4-8 :(

Do you use DP or FSDP?
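For context on the DP-vs-FSDP distinction being asked about: plain data parallelism keeps a full model replica per device and all-reduces gradients, while FSDP on XLA shards parameters by wrapping the model, which changes which collectives run (including at checkpoint time). A sketch of the wrapping step, with the import deferred so it parses off-TPU:

```python
def wrap_with_xla_fsdp(model):
    # Deferred import so this sketch parses without torch_xla installed.
    from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP

    # FSDP shards parameters and gradients across replicas instead of
    # keeping a full copy per device, so its save path differs from DP's.
    return FSDP(model)
```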

Mon-ius commented 6 months ago

hi @alanwaketan

I think it is highly related to the HF accelerate lib; I will continue verifying.

himekifee commented 4 months ago

> @alanwaketan do you normally use the HuggingFace Trainer? I remember people have had issues using it with XLA before. I ran through two of the example tutorials last week while working on #6584, and the Trainer-based examples had issues on TPU, but the accelerate-based examples did work fine.
>
> I tried to reproduce your crash on v4-8 with torch and torch_xla built from head and got a different crash: RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space vmem. Used 32.45M of 16.00M vmem. Exceeded vmem capacity by 16.45M.

Hi. I encountered the exact same issue as you did; even the vmem numbers are exactly the same, and I tested different LLMs with generate(), all hitting the same issue. Have you found a way to solve it?

baoleai commented 4 months ago

Hello @shub-kris, I encountered a similar issue and have fixed it in https://github.com/huggingface/transformers/pull/31264. Could you check whether your issue has been resolved?