lavis-nlp / jerex

PyTorch code for JEREX: Joint Entity-Level Relation Extractor
MIT License
63 stars · 15 forks

RuntimeError: [enforce fail at CPUAllocator.cpp:67] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 42472464384 bytes. Error code 12 (Cannot allocate memory) #3

Closed raphael10-collab closed 3 years ago

raphael10-collab commented 3 years ago

During the training process I'm getting this error:

RuntimeError: [enforce fail at CPUAllocator.cpp:67] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 42472464384 bytes. Error code 12 (Cannot allocate memory)

Epoch 5:  98%|███████████████████████████████████████████████████████████████████████████████████████████████▎ | 3250/3308 [1:10:30<01:15,  1.30s/it, loss=0.165, v_num=0_0]
Traceback (most recent call last):
  File "./jerex_train.py", line 20, in train
    model.train(cfg)
  File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/model.py", line 341, in train
    trainer.fit(model, datamodule=data_module)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
    self.train_loop.run_training_epoch()
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 576, in run_training_epoch
    self.trainer.run_evaluation(on_epoch=True)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 962, in run_evaluation
    output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 174, in evaluation_step
    output = self.trainer.accelerator.validation_step(args)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 226, in validation_step
    return self.training_type_plugin.validation_step(*args)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
  File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/model.py", line 126, in validation_step
    return self._inference(batch, batch_idx)
  File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/model.py", line 176, in _inference
    output = self(**batch, inference=True)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/model.py", line 106, in forward
    max_rel_pairs=max_rel_pairs, inference=inference)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/models/joint_models.py", line 144, in forward
    return self._forward_inference(*args, **kwargs)
  File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/models/joint_models.py", line 233, in _forward_inference
    max_pairs=max_rel_pairs)
  File "/home/marco/.pyenv/versions/PyTorch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/models/modules/relation_classification_multi_instance.py", line 49, in forward
    chunk_rel_sentence_distances, mention_reprs, chunk_h)
  File "/home/marco/PyTorchMatters/EntitiesRelationsExtraction/jerex/jerex/models/modules/relation_classification_multi_instance.py", line 73, in _create_mention_pair_representations
    rel_ctx = m + h
RuntimeError: [enforce fail at CPUAllocator.cpp:67] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 42472464384 bytes. Error code 12 (Cannot allocate memory)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 5:  98%|███████████████████████████████████████████████████████████████████████████████████████████████▎ | 3250/3308 [1:10:48<01:15,  1.31s/it, loss=0.165, v_num=0_0]

This is the memory footprint:

 (PyTorch1.7) (base) marco@pc:~/PyTorchMatters/EntitiesRelationsExtraction/jerex$ free -m                                                                                     
              total        used        free      shared  buff/cache   available
Mem:          32059        1670       20587         110        9802       29827
Swap:           979           0         979

How much memory is required, and what are the minimum requirements (memory, CPU, storage, ...) for running the training process? Which Google Cloud architecture would be best suited? https://cloud.google.com/tpu/docs/tpus#when_to_use_tpus Do you think Google's TPUs are a good fit for the JEREX model's tensor shapes and dimensions? https://cloud.google.com/tpu/docs/tpus#shapes
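For context, a quick back-of-the-envelope check (plain Python, just arithmetic on the numbers from the error message and `free -m`) shows the single failed allocation already exceeds the machine's 32 GB of RAM:

```python
# Size of the allocation that failed, copied from the error message.
failed_bytes = 42_472_464_384

# Total RAM reported by `free -m` above, in bytes.
total_ram_bytes = 32_059 * 2**20

print(f"{failed_bytes / 2**30:.1f} GiB requested")   # ~39.6 GiB
print(f"{failed_bytes // 4:,} float32 elements")     # ~10.6 billion
print(failed_bytes > total_ram_bytes)                # True: one tensor alone exceeds RAM
```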

markus-eberts commented 3 years ago

To be honest, I thought the model (and training) would fit into 32 GB of memory. Are you sure no other memory-consuming process was running on the system (since the model crashes in the 5th epoch)?

Nevertheless, we have a setting in place that can lower memory consumption: you can reduce the maximum number of spans and mention pairs that are processed simultaneously (in single tensor operations) during training/inference. Just set the following in configs/docred_joint/train.yaml for both training and inference:

training:
(... other settings ...)
max_spans: 200
max_rel_pairs: 50

inference:
(... other settings ...)
max_spans: 200
max_rel_pairs: 50

The settings above work well on an 11 GB GPU (up to about 4 GB of CPU memory is also occupied). Lowering these maximums should reduce memory consumption, but it will also slow down training/inference. You may tinker with these settings to fit your system.
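The idea behind max_spans/max_rel_pairs can be sketched roughly as follows (a simplified NumPy illustration, not JEREX's actual code; the function and variable names are hypothetical): by processing candidate pairs in fixed-size chunks, only a (max_pairs, hidden) slice is ever materialized at once, instead of one huge (num_pairs, hidden) tensor.

```python
import numpy as np

def pair_representations_chunked(pair_idx, mention_reprs, max_pairs=50):
    """Build pair representations in chunks of at most `max_pairs` pairs.

    pair_idx: (num_pairs, 2) array of mention indices per candidate pair.
    mention_reprs: (num_mentions, hidden) array of mention vectors.
    """
    chunks = []
    for start in range(0, len(pair_idx), max_pairs):
        idx = pair_idx[start:start + max_pairs]
        # Hypothetical pair representation: sum of the two mention vectors
        # (loosely mirroring the `rel_ctx = m + h` line in the traceback).
        chunks.append(mention_reprs[idx[:, 0]] + mention_reprs[idx[:, 1]])
    return np.concatenate(chunks, axis=0)
```

The result is identical to the unchunked version; only the peak size of the intermediate tensor changes, which is why lowering max_rel_pairs trades speed for memory.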

Regarding TPUs: I would definitely recommend training the model on a GPU. We get a speedup of more than 10x when training on a single GPU compared to a CPU. I have no experience with TPUs, but I would also expect a speedup there.

Regarding shapes: since we extract positive/negative samples from documents, it is hard to guarantee equal shapes across batches. This is not ideal speed-wise (as the Google document you referenced also notes), but you should still see significant speedups when training on GPUs (and probably TPUs as well).
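For completeness, the usual workaround for variable shapes on TPUs/XLA is to pad batches to a small set of fixed bucket sizes, so the compiler only sees a few distinct shapes. A minimal sketch (illustrative only, not part of JEREX):

```python
import numpy as np

def pad_to_bucket(seqs, bucket_sizes=(64, 128, 256), pad_value=0):
    """Pad variable-length integer sequences to the smallest bucket size
    that fits the longest sequence, yielding a fixed-shape batch."""
    longest = max(len(s) for s in seqs)
    target = next(b for b in bucket_sizes if b >= longest)
    batch = np.full((len(seqs), target), pad_value, dtype=np.int64)
    for i, s in enumerate(seqs):
        batch[i, :len(s)] = s
    return batch
```

With a handful of buckets, most batches reuse an already-compiled kernel shape at the cost of some padding overhead.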

raphael10-collab commented 3 years ago

After modifying the training/inference settings, training finally succeeded (it took a while: around 26 hours).

Thank you!!!