mosaicml / examples

Fast and flexible reference benchmarks
Apache License 2.0
424 stars · 122 forks

CUDA out of memory #371

Closed mscherrmann closed 10 months ago

mscherrmann commented 1 year ago

Hi,

I tried to replicate the mosaic-BERT training on the C4 dataset. I followed your guidelines step by step. The dataset preparation worked well. However, during BERT training with the main.py file, I got a CUDA out of memory error. I did not change any hyperparameters in the respective yaml (mosaic-bert-base-uncased.yaml), except for the path of the data. I trained the model on 8 A100 80 GB GPUs.

Here is the trace:

```
Traceback (most recent call last):
  File "/examples/examples/benchmarks/bert/main.py", line 269, in <module>
    main(cfg)
  File "/examples/examples/benchmarks/bert/main.py", line 256, in main
    trainer.fit()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1766, in fit
    self._train_loop()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1993, in _train_loop
    self._run_evaluators(Event.BATCH_END)
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2071, in _run_evaluators
    self._eval_loop(
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2724, in _eval_loop
    self._original_model.update_metric(
  File "/usr/lib/python3/dist-packages/composer/models/huggingface.py", line 395, in update_metric
    metric.update(outputs, self.labels)  # pyright: ignore [reportGeneralTypeIssues]
  File "/usr/lib/python3/dist-packages/torchmetrics/metric.py", line 399, in wrapped_func
    raise err
  File "/usr/lib/python3/dist-packages/torchmetrics/metric.py", line 389, in wrapped_func
    update(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/composer/metrics/nlp.py", line 123, in update
    losses = self.loss_fn(logits, target)
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/lib/python3/dist-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/usr/lib/python3/dist-packages/torch/nn/functional.py", line 3026, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 59.62 GiB (GPU 0; 79.15 GiB total capacity; 68.06 GiB already allocated; 6.73 GiB free; 71.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
ERROR:composer.cli.launcher:Global rank 0 (PID 1861369) exited with code 1
```
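For what it's worth, the 59.62 GiB allocation in the error is consistent with materializing the fp32 logits for an entire eval batch at once inside the cross-entropy call shown in the traceback. A quick back-of-the-envelope check, assuming the default batch size 4096 and sequence length 128 from the yaml and the bert-base-uncased vocabulary size of 30522:

```python
# Sanity check: does the failed allocation match the logits tensor for one
# full eval batch pushed through F.cross_entropy in fp32?
eval_batch_size = 4096   # assumed: default global batch size in the yaml
seq_len = 128            # assumed: default max sequence length
vocab_size = 30522       # bert-base-uncased vocabulary size
bytes_per_float = 4      # fp32

logits_bytes = eval_batch_size * seq_len * vocab_size * bytes_per_float
print(f"{logits_bytes / 2**30:.2f} GiB")  # -> 59.61 GiB, matching the error
```

This points at the evaluation metric (not the training step) as the source of the spike, which matches the traceback going through `_eval_loop` and `update_metric`.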

Furthermore, the training up to that point took quite long:

```
[sample=8192000/286720000]:
  Train time/batch: 1999
  Train time/sample: 8187904
  Train time/batch_in_epoch: 1999
  Train time/sample_in_epoch: 8187904
  Train time/token: 1048051712
  Train time/token_in_epoch: 1048051712
  Train trainer/device_train_microbatch_size: 128
  Train loss/train/total: 3.7229
  Train metrics/train/LanguageCrossEntropy: 3.7233
  Train metrics/train/MaskedAccuracy: 0.3915
  Train throughput/batches_per_sec: 0.2469
  Train throughput/samples_per_sec: 1011.2352
  Train throughput/device/batches_per_sec: 0.2469
  Train throughput/device/samples_per_sec: 1011.2352
  Train throughput/tokens_per_sec: 129438.1079
  Train throughput/device/tokens_per_sec: 129438.1079
  Train time/train: 2.2606
  Train time/val: 0.0000
  Train time/total: 2.2606
  Train lr-DecoupledAdamW/group0: 0.0002
```
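The counters in that log line are internally consistent, but one detail stands out: the aggregate throughput and the per-device throughput are identical. If I read Composer's SpeedMonitor output correctly (this is an assumption worth verifying), the aggregate figure on a healthy 8-GPU run should be roughly 8x the per-device figure, so this could indicate that only a single rank is actually training:

```python
# Cross-check the logged counters from the progress line above.
batches = 1999
global_batch_size = 4096   # assumed from the default yaml
samples = 8187904
tokens = 1048051712
seq_len = 128

assert batches * global_batch_size == samples  # 1999 * 4096 = 8187904
assert samples * seq_len == tokens             # 8187904 * 128 = 1048051712

# Aggregate vs. per-device throughput, taken verbatim from the log:
samples_per_sec = 1011.2352
device_samples_per_sec = 1011.2352
print(samples_per_sec / device_samples_per_sec)  # -> 1.0 (expected ~8.0 on 8 GPUs)
```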

I am a bit confused as you said that a key feature of mosaic-BERT is its training speed. Do you have any idea what I am doing wrong?

Thank you in advance for your help!

Update: I noticed that the out of memory issue occurs when the model is evaluated (after 2000 batches by default). I already tried reducing the global_train_batch_size from 4096 to 2048, without success.
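That fits the traceback: the crash is in the eval loop, so shrinking the train batch alone would not change how many sequences are scored per evaluation forward pass. A hedged sketch of what one might try in mosaic-bert-base-uncased.yaml instead (the exact key names vary across composer/examples versions; `global_eval_batch_size` and `device_eval_microbatch_size` here are assumptions to check against your installed version):

```yaml
# Hypothetical adjustment -- key names are assumptions, verify them
# against your composer version before use.
global_eval_batch_size: 256        # well below the 4096 train batch
device_eval_microbatch_size: 32    # cap sequences per eval forward pass
```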

jacobfulano commented 1 year ago

Hey @FinTexIFB, where is your C4 data stored? Were there any other changes you made to the environment setup etc.?

mscherrmann commented 1 year ago

Hi,

the C4 data is stored in the default location (./my-copy-c4). I did not change anything in the environment setup.

amishparekh commented 2 weeks ago

@mscherrmann Did you find a solution for this? I am running into the same issue.