huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate

NLP example script not working with num_processes=8 #1302

Closed mo-soliman closed 1 year ago

mo-soliman commented 1 year ago

System Info

- `Accelerate` version: 0.18.0.dev0
- Platform: Linux-5.13.0-1027-gcp-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.24.2
- PyTorch version (GPU?): 2.0.0+cu117 (False)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: TPU
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

Tasks

Reproduction

I created a TPU VM v2-8 (from Google). When I run the example script https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py (the exact commands I use are sketched below, after the environment variables), it fails with this error:

https://symbolize.stripped_domain/r/?trace=https://symbolize.stripped_domain/r/?trace=7f4fbf0fc873,7f356bff2873,7f505752408f7f360441a08f&map=&map= 
*** SIGSEGV (@0x8), see gl__________46#s15 received by PID 5397 (TID 7308) on cpu 82; stack trace: ***

*** SIGSEGV (@0x87), see gl__________46#s15 received by PID 5400 (TID 7955) on cpu 45; stack trace: ***
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/parallel_loader.py", line 140, in _loader_worker

PC: @     0x7f356bff2873  (unknown)  torch_xla::XLAGraphExecutor::SyncTensorsGraphInternal()
PC: @     0x7f4fbf0fc873  (unknown)  torch_xla::XLAGraphExecutor::SyncTensorsGraphInternal()
    @     0x7f34b5adaa1a       1152  (unknown)
    @     0x7f4f08be4a1a       1152  (unknown)
    @     0x7f360441a090  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f356bff2873,7f34b5adaa19,    @     0x7f5057524090  (unknown)  (unknown)
7f360441a08f&map=https://symbolize.stripped_domain/r/?trace=ceee8fa20ddf9c34af43f587221e91de:7f34a8bb2000-7f34b5cf18407f4fbf0fc873,7f4f08be4a19,7f505752408f&map= 
ceee8fa20ddf9c34af43f587221e91de:7f4efbcbc000-7f4f08dfb840E0408 07:37:46.418757    7308 coredump_hook.cc:414] RAW: Remote crash data gathering hook invoked.

E0408 07:37:46.418781    7308 coredump_hook.cc:453] RAW: Skipping coredump since rlimit was 0 at process start.
E0408 07:37:46.418809    7308 client.cc:278] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0408 07:37:46.418802    7955 coredump_hook.cc:414] RAW: Remote crash data gathering hook invoked.
E0408 07:37:46.418820    7308 coredump_hook.cc:512] RAW: Sending fingerprint to remote end.
E0408 07:37:46.418827    7308 coredump_socket.cc:120] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0408 07:37:46.418834    7308 coredump_hook.cc:518] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0408 07:37:46.418837    7955 coredump_hook.cc:453] RAW: Skipping coredump since rlimit was 0 at process start.
E0408 07:37:46.418843    7308 coredump_hook.cc:580] RAW: Dumping core locally.
E0408 07:37:46.418852    7955 client.cc:278] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0408 07:37:46.418864    7955 coredump_hook.cc:512] RAW: Sending fingerprint to remote end.
E0408 07:37:46.418877    7955 coredump_socket.cc:120] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0408 07:37:46.418901    7955 coredump_hook.cc:518] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0408 07:37:46.418908    7955 coredump_hook.cc:580] RAW: Dumping core locally.

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/parallel_loader.py", line 140, in _loader_worker
    _, data = next(data_iter)
  File "/home/soliman/.local/lib/python3.8/site-packages/accelerate/data_loader.py", line 369, in __iter__
    synchronize_rng_states(self.rng_types, self.synchronized_generator)
  File "/home/soliman/.local/lib/python3.8/site-packages/accelerate/utils/random.py", line 89, in synchronize_rng_states
    synchronize_rng_state(RNGType(rng_type), generator=generator)
  File "/home/soliman/.local/lib/python3.8/site-packages/accelerate/utils/random.py", line 68, in synchronize_rng_state
    rng_state = xm.mesh_reduce("random_seed", rng_state, lambda x: x[0])
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 1160, in mesh_reduce
    xdata = rendezvous(tag, bio.getvalue())
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 1110, in rendezvous
    return pjrt.rendezvous(tag, payload, replicas or None)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/experimental/pjrt.py", line 419, in rendezvous
    xm.mark_step()
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 949, in mark_step
    torch_xla._XLAC._xla_step_marker(
RuntimeError: /pytorch/xla/torch_xla/csrc/xla_graph_executor.cpp:523 : Check failed: tensor_data 
*** Begin stack trace ***
        tsl::CurrentStackTrace()
        torch_xla::XLAGraphExecutor::CollectSyncTensors(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > > const&, torch::lazy::LazyGraphExecutor::SyncTensorsConfig const&)
        torch_xla::XLAGraphExecutor::SyncTensorsGraphInternal(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > >*, absl::lts_20220623::Span<std::string const>, torch::lazy::LazyGraphExecutor::SyncTensorsConfig const&, bool)
        torch_xla::XLAGraphExecutor::SyncTensorsGraph(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > >*, absl::lts_20220623::Span<std::string const>, bool, bool, bool)
        torch_xla::XLAGraphExecutor::SyncLiveTensorsGraph(torch::lazy::BackendDevice const*, c10::ArrayRef<std::string>, bool)

        PyCFunction_Call
        _PyObject_MakeTpCall
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyFunction_Vectorcall
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_Vectorcall
        _PyEval_EvalFrameDefault
        _PyFunction_Vectorcall
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_Vectorcall
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_Vectorcall
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyFunction_Vectorcall

        PyObject_Call
        _PyEval_EvalFrameDefault
        _PyFunction_Vectorcall
        _PyEval_EvalFrameDefault
        _PyFunction_Vectorcall
        _PyEval_EvalFrameDefault
        _PyFunction_Vectorcall

        PyObject_Call

        clone
*** End stack trace ***

E0408 07:37:46.664046    7308 process_state.cc:784] RAW: Raising signal 11 with default behavior
E0408 07:37:46.721980    7955 process_state.cc:784] RAW: Raising signal 11 with default behavior
https://symbolize.stripped_domain/r/?trace=7fa983018454,7fa98306c08f&map= 
*** SIGTERM received by PID 5399 (TID 5399) on cpu 38 from PID 5214; stack trace: ***
https://symbolize.stripped_domain/r/?trace=7feafccfa454,7feafcd4e08f&map= 
*** SIGTERM received by PID 5398 (TID 5398) on cpu 95 from PID 5214; stack trace: ***
PC: @     0x7fa983018454  (unknown)  do_futex_wait.constprop.0
    @     0x7fa83472ca1a       1152  (unknown)
PC: @     0x7feafccfa454  (unknown)  do_futex_wait.constprop.0
    @     0x7fe9ae40ea1a       1152  (unknown)
    @     0x7fa98306c090  (unknown)  (unknown)
    @ ... and at least 1 more frames
https://symbolize.stripped_domain/r/?trace=7fa983018454,7fa83472ca19,7fa98306c08f&map=ceee8fa20ddf9c34af43f587221e91de:7fa827804000-7fa834943840 
E0408 07:37:48.470860    5399 coredump_hook.cc:360] RAW: Remote crash gathering disabled for SIGTERM.
    @     0x7feafcd4e090  (unknown)  (unknown)
    @ ... and at least 1 more frames
https://symbolize.stripped_domain/r/?trace=7feafccfa454,7fe9ae40ea19,7feafcd4e08f&map=ceee8fa20ddf9c34af43f587221e91de:7fe9a14e6000-7fe9ae625840 
E0408 07:37:48.470932    5398 coredump_hook.cc:360] RAW: Remote crash gathering disabled for SIGTERM.
E0408 07:37:48.655386    5398 process_state.cc:784] RAW: Raising signal 15 with default behavior
E0408 07:37:48.672290    5399 process_state.cc:784] RAW: Raising signal 15 with default behavior
Traceback (most recent call last):
  File "/home/soliman/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/soliman/.local/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/soliman/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 930, in launch_command
    tpu_launcher(args)
  File "/home/soliman/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 694, in tpu_launcher
    xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=args.num_processes)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 386, in spawn
    return pjrt.spawn(fn, nprocs, start_method, args)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/experimental/pjrt.py", line 365, in spawn
    _run_multiprocess(spawn_fn, start_method=start_method)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/experimental/pjrt.py", line 92, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/experimental/pjrt.py", line 322, in _run_multiprocess
    replica_results = list(
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/experimental/pjrt.py", line 323, in <genexpr>
    itertools.chain.from_iterable(
  File "/usr/lib/python3.8/concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

It's worth noting:

1) When running the same script with num_processes=1 in the TPU configuration, it works normally (see the launch sketch below).
2) When running another example from the Google documentation (training a ResNet with num_processes=8), it works normally. You can find it here: https://cloud.google.com/tpu/docs/pytorch-xla-ug-tpu-vm#changing_pytorch_version

git clone --recursive https://github.com/pytorch/xla.git
python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1

Setting these environment variables didn't help:

export XRT_TPU_CONFIG="localservice;0;localhost:51011"
export PJRT_DEVICE="TPU"
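
For completeness, this is roughly how I launch the script. Treat it as a sketch: I have Accelerate 0.18.0.dev0 installed from source, the config shown in the system info above is saved as my default, and the --num_processes flag is just a shorthand here for editing num_processes in the config.

# clone only to get the example script; the library itself is already installed
git clone https://github.com/huggingface/accelerate.git
# fails with the SIGSEGV / "Check failed: tensor_data" error above (num_processes: 8 from my config)
accelerate launch accelerate/examples/nlp_example.py
# finishes normally with a single process
accelerate launch --num_processes 1 accelerate/examples/nlp_example.py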

What is the cause of this error, and how can it be fixed? Thanks

Expected behavior

Training with no errors

neel04 commented 1 year ago

Similar problem here: https://discuss.huggingface.co/t/kaggle-tpuvm-doesnt-allow-setting-nprocs-1/35999/2 @muellerzr

Again on TPUs. This can be reproduced really easily in Kaggle kernels just with `accelerate test --config_file ...`, so it is not model dependent (a rough reproduction sketch is below). When using my own model and training script I get the same error, so the problem is definitely with XLA or Accelerate.
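
In case it helps anyone reproduce: on a Kaggle TPU VM I run roughly the following. The file name tpu_config.yaml is just an example, and its values mirror the config from the issue description above.

# hypothetical file name; the values mirror the config posted in the issue description
cat > tpu_config.yaml << 'EOF'
compute_environment: LOCAL_MACHINE
distributed_type: TPU
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
EOF
# runs accelerate's built-in test script against that config and crashes the same way
accelerate test --config_file tpu_config.yaml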

neel04 commented 1 year ago

Any updates? I can't really train my scripts in the meantime :(

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.