Closed — raphael10-collab closed this issue 3 years ago
I thought that the model (and training) should fit into 32 GB of memory, to be honest. Are you sure there wasn't another memory-consuming process running on the system (since the model crashes in the 5th epoch)?
Nevertheless, we have settings in place that can lower memory consumption. You can reduce the maximum number of spans and mention pairs that are processed simultaneously (in single tensor operations) during training/inference. Just set the following in configs/docred_joint/train.yaml for both training and inference:
```yaml
training:
  # (... other settings ...)
  max_spans: 200
  max_rel_pairs: 50

inference:
  # (... other settings ...)
  max_spans: 200
  max_rel_pairs: 50
```
The settings above work well on an 11 GB GPU (up to about 4 GB of CPU memory is also occupied). Lowering these maximums reduces memory consumption, but also training/inference speed. You may tinker with these settings to fit your system.
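The effect of these caps can be illustrated with a small sketch (this is not JEREX's actual code; `score_spans` is a hypothetical stand-in for the model's span scorer): instead of running one huge tensor operation over all candidates at once, candidates are processed in chunks of at most `max_spans`, which bounds the peak memory of each operation.

```python
# Illustrative sketch, not JEREX's implementation: cap the number of
# candidates handled per tensor operation by processing them in chunks.

def chunked_scores(spans, max_spans, score_spans):
    """Score `spans` in chunks of at most `max_spans` candidates at a time."""
    scores = []
    for start in range(0, len(spans), max_spans):
        chunk = spans[start:start + max_spans]
        # One bounded operation per chunk: peak memory now scales with
        # `max_spans`, not with the total number of candidates.
        scores.extend(score_spans(chunk))
    return scores

# Example: 450 candidate spans with max_spans=200 -> 3 chunks (200, 200, 50).
spans = list(range(450))
result = chunked_scores(spans, 200, lambda chunk: [s * 2 for s in chunk])
```

The trade-off mentioned above follows directly: smaller chunks mean less memory per operation but more operations, hence slower training/inference.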
Regarding TPUs: I would definitely recommend training the model on a GPU. We get a speedup of more than 10x when training on a single GPU compared to a CPU. I have no experience with TPUs, but I would expect a speedup there as well.
Regarding shapes: since we extract positive/negative samples from documents, it is hard to guarantee equal tensor shapes across batches. This is not ideal speed-wise (as the Google document you referenced also notes), but you should still gain significant speedups when training on GPUs (and probably also on TPUs).
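Why shapes vary can be seen in a minimal padding sketch (an assumption about the general batching pattern, not JEREX's exact code): each batch is padded to its own longest sample, so different batches end up with different tensor shapes, which is what TPU/XLA compilation dislikes.

```python
# Hedged sketch: samples of varying length are padded to the longest
# sample *within each batch*, so the padded shape differs across batches.

def pad_batch(samples, pad_value=0):
    """Pad variable-length samples to equal length; return padded data + mask."""
    max_len = max(len(s) for s in samples)
    padded = [s + [pad_value] * (max_len - len(s)) for s in samples]
    # Mask marks real tokens (1) vs. padding (0) so losses can ignore padding.
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in samples]
    return padded, mask

padded, mask = pad_batch([[1, 2, 3], [4]])
# A different batch, e.g. pad_batch([[5, 6], [7, 8]]), pads to length 2
# instead of 3 -> a different shape, triggering TPU recompilation.
```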
After modifying the training/inference settings, training finally succeeded (it took a while: around 26 hours).
Thank you!!!
During the training process I'm getting this error:
```
RuntimeError: [enforce fail at CPUAllocator.cpp:67] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 42472464384 bytes. Error code 12 (Cannot allocate memory)
```
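A quick sanity check on the error message: the single failed allocation alone is roughly 39.6 GiB, which already exceeds a 32 GB machine's total RAM, so the crash is expected without the `max_spans`/`max_rel_pairs` limits described below.

```python
# Convert the failed allocation from the error message into GiB.
failed_bytes = 42_472_464_384
gib = failed_bytes / 2**30
print(f"{gib:.1f} GiB")  # prints 39.6 GiB
```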
This was the memory footprint at the time of the crash (screenshot not shown).
How much memory is required, and what are the minimum requirements (memory, CPU, storage, ...) for running the training process? Which Google Cloud architecture would be better suited? https://cloud.google.com/tpu/docs/tpus#when_to_use_tpus Do you think Google's TPUs are a good fit for the Jerex model's tensor shapes and dimensions? https://cloud.google.com/tpu/docs/tpus#shapes