huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate

Error running notebook launcher in google Colab #3126

Open · emusiienko opened this issue 1 month ago

emusiienko commented 1 month ago

System Info

- `Accelerate` version: 0.35.0.dev0
- Platform: Linux-6.1.85+-x86_64-with-glibc2.35
- `accelerate` bash location: /usr/local/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.4.0+cpu (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 334.56 GB
- `Accelerate` default config:
    Not found

Reproduction

Steps:

  1. Open the notebook at https://colab.research.google.com/drive/1IfpvFhqwQdveKEzFraKzbTtQE1a0FEYs?usp=drive_link (it should be shared) in a Google Colab TPU runtime. It contains the code for training a causal language model from the HF course https://huggingface.co/learn/nlp-course/chapter7/6?fw=pt, adapted for TPU.
  2. Run it. It's best to run the cells one by one until you reach the cell with notebook_launcher.
  3. Run the cell with notebook_launcher.

Result: The function starts working, shows some initial progress, and then crashes. I'm not sure whether this is a bug or a misconfiguration. I tried setting environment variables like this:

```python
os.environ["TPU_NAME"] = "dummy"
os.environ["PJRT_DEVICE"] = "TPU"
os.environ["TPU_NUM_DEVICES"] = "8"
# make the TPU available as an accelerator to torch-xla
os.environ["XRT_TPU_CONFIG"] = "localservice;0;localhost:51011"
```

However, it doesn't seem to have any effect.
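
For context, stripped of the GPT-2 specifics the failing cell follows the standard notebook_launcher pattern, roughly as in the sketch below (the tiny placeholder model and random data are mine, not the notebook's actual code; the real `training_function` takes the model and tokenized datasets prepared in earlier cells). The crash always happens at the `accelerator.backward(loss)` call.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, notebook_launcher

def training_function():
    accelerator = Accelerator()
    model = torch.nn.Linear(10, 2)                      # placeholder for the GPT-2 model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
    dataloader = DataLoader(dataset, batch_size=8)
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
    model.train()
    for x, y in dataloader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        accelerator.backward(loss)                      # the original notebook crashes here
        optimizer.step()

# On a TPU runtime this spawns one worker per TPU core.
notebook_launcher(training_function, (), mixed_precision="bf16")
```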

The crash log:

```
WARNING:root:Unsupported nprocs (8), ignoring...
Launching a training on 8 TPU cores.
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
(the same FutureWarning is printed by three more worker processes)
  0%  1/205 [00:00<01:07,  3.03it/s]
  0%  1/205 [00:00<01:05,  3.14it/s]
  1%  2/205 [00:24<48:52, 14.44s/it]
  1%  2/205 [00:23<46:08, 13.64s/it]
  0%  1/205 [00:00<01:05,  3.14it/s]
  1%  2/205 [00:25<50:14, 14.85s/it]
  0%  1/205 [00:00<01:08,  2.96it/s]
  1%  2/205 [00:23<46:45, 13.82s/it]

_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback:
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/usr/local/lib/python3.10/dist-packages/torch_xla/runtime.py", line 95, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch_xla/_internal/pjrt.py", line 78, in _run_thread_per_device
    replica_results = list(
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch_xla/_internal/pjrt.py", line 71, in _thread_fn
    return fn()
  File "/usr/local/lib/python3.10/dist-packages/torch_xla/_internal/pjrt.py", line 190, in __call__
    self.fn(runtime.global_ordinal(), *self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/launch.py", line 674, in __call__
    self.launcher(*args)
  File "/content/train_func_2.py", line 65, in training_function
    accelerator.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2237, in backward
    loss.backward(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 521, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 289, in backward
    _engine_run_backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: torch_xla/csrc/tensor.cpp:191 : Check failed: data()->tensor_data
Begin stack trace
  tsl::CurrentStackTrace()
  torch_xla::XLATensor::shape() const
  torch_xla::XLATensorImpl::SetupSizeProperties()
  torch_xla::XLATensorImpl::sym_sizes_custom() const
  at::FunctionalTensorWrapper::sym_sizes_custom() const
  at::_ops::add__Tensor::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, c10::Scalar const&)
  at::_ops::add__Tensor::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, c10::Scalar const&)
  at::_ops::add__Tensor::call(at::Tensor&, at::Tensor const&, c10::Scalar const&)
  torch::autograd::AccumulateGrad::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&)
  torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&)
  torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&)
  torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)
  torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)
End stack trace
```

The above exception was the direct cause of the following exception:

```
RuntimeError                              Traceback (most recent call last)
in <cell line: 13>()
     11 #os.environ["XRT_TPU_CONFIG"]="localservice;0;localhost:51011"
     12
---> 13 notebook_launcher(training_function, (model, tokenized_datasets), mixed_precision="bf16")

11 frames
/usr/lib/python3.10/concurrent/futures/_base.py in __get_result(self)
    401         if self._exception:
    402             try:
--> 403                 raise self._exception
    404             finally:
    405                 # Break a reference cycle with the exception in self._exception

RuntimeError: torch_xla/csrc/tensor.cpp:191 : Check failed: data()->tensor_data
Begin stack trace
  tsl::CurrentStackTrace()
  torch_xla::XLATensor::shape() const
  torch_xla::XLATensorImpl::SetupSizeProperties()
  torch_xla::XLATensorImpl::sym_sizes_custom() const
  at::FunctionalTensorWrapper::sym_sizes_custom() const
  at::_ops::add__Tensor::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, c10::Scalar const&)
  at::_ops::add__Tensor::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, c10::Scalar const&)
  at::_ops::add__Tensor::call(at::Tensor&, at::Tensor const&, c10::Scalar const&)
  torch::autograd::AccumulateGrad::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&)
  torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&)
  torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&)
  torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)
  torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)
End stack trace
```

Expected behavior

The model trains without crashing, or accelerate shows a more informative error message about the misconfiguration.

BenjaminBossan commented 1 month ago

I have very little experience with Google Colab or XLA, but to me this looks like a PyTorch/XLA error and not something specific to the accelerate notebook launcher, or even to accelerate in general. You probably don't have an easy way to check this without accelerate? If you do, that would help confirm it quickly. Otherwise, could you try a different model architecture than GPT-2 and see if the same error occurs?
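
To illustrate what I mean by checking without accelerate: a bare torch_xla loop along the lines of the sketch below, run as a standalone script (the tiny model and random data are only placeholders for your GPT-2 setup, and I'm assuming torch_xla's usual `xmp.spawn`/`MpDeviceLoader` API here). If that also crashes in `loss.backward()`, the problem is almost certainly in PyTorch/XLA rather than in accelerate.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

def train_fn(index):
    device = xm.xla_device()                         # XLA device for this worker
    model = torch.nn.Linear(10, 2).to(device)        # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
    loader = pl.MpDeviceLoader(DataLoader(dataset, batch_size=8), device)
    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        xm.optimizer_step(optimizer)                 # all-reduce grads + step across cores

if __name__ == "__main__":
    xmp.spawn(train_fn)                              # one worker per TPU core under PJRT
```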

emusiienko commented 1 month ago

Hello, you're right, I don't have experience with PyTorch directly, so writing the same logic in plain PyTorch/XLA is outside my expertise. And, of course, I don't have a TPU at home.

I'll try another model; the HF tutorials also use a BERT variant. I'll try it and let you know.

emusiienko commented 1 month ago

Hi @BenjaminBossan

I tried another model (here is the notebook: https://colab.research.google.com/drive/14_ylDC_0ptZhw8VQYFavhyazPR-Jx7CR?usp=sharing ).

The tutorial I followed is https://github.com/huggingface/notebooks/blob/main/examples/accelerate_examples/simple_nlp_example.ipynb

This example isn't working either, but it fails with several different, seemingly random errors.

Predominantly it fails with:

  1. "A process in the process pool was terminated abruptly while the future was running or pending.". (Also after few first successful steps).

Other errors:

  1. torch_xla/csrc/tensor.cpp:191 : Check failed: data()->tensor_data (the same as in the original example)
  2. RuntimeError: torch_xla/csrc/runtime/pjrt_computation_client.cc:721 : Check failed: pjrt_device == pjrt_data->buffer->device()
  3. RuntimeError: Function AddcmulBackward0 returned an invalid gradient at index 1 - expected device xla:1 but got xla:0

To sum up, it seems to me that there is some synchronisation issue that produces this kind of race condition, either in accelerate or in torch, I can't say for sure. (Or the TPU examples aren't up to date and the setup should be done or configured in another way.)
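
For what it's worth, to narrow down the device-mismatch errors (xla:1 vs xla:0) I'm thinking of dropping a small diagnostic like the one below into the training function, right after `accelerator.prepare(...)`. This is just my own sketch using standard Accelerate/torch_xla calls, not code from the tutorials; each worker should report a single ordinal and device, and the prepared model's parameters should live on exactly that device.

```python
import torch_xla.core.xla_model as xm

# inside the training function, after accelerator.prepare(...):
print(
    f"ordinal {xm.get_ordinal()}: "
    f"accelerator.device={accelerator.device}, "
    f"xm.xla_device()={xm.xla_device()}, "
    f"model params on {next(model.parameters()).device}"
)
```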

Regards

BenjaminBossan commented 1 month ago

Thanks for testing again. I agree that it's strange that the errors are random and that this could be caused by a race condition. I asked internally if there is anyone with XLA experience who could take a look, as I'm out of my depth here.

martin-gorner commented 1 month ago

Hi

The TPUs in Colab are a bit out of date (TPU v2). Would you be able to try this on a Kaggle TPU (also available for free), which is a more modern "TPU VM v3-8", and report the results? I'm not saying it will work, but it will provide more useful debug info.

emusiienko commented 1 month ago

Hi @martin-gorner, I'll try this weekend and let you know.

emusiienko commented 1 month ago

@martin-gorner, the same behaviour occurs in Kaggle. Here is the link to the notebook:

https://www.kaggle.com/code/eugenemusiienko/notebook7bffa26536

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

emusiienko commented 2 weeks ago

Up