google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Apache License 2.0

Exceeding Memory #137

Open xiamengzhou opened 4 years ago

xiamengzhou commented 4 years ago

INFO:tensorflow:out/model.ckpt-0 is not in all_model_checkpoint_paths. Manually adding it.
I0125 21:17:40.027305 139845646956352 checkpoint_management.py:95] out/model.ckpt-0 is not in all_model_checkpoint_paths. Manually adding it.
slurmstepd: error: Job 247071 exceeded memory limit (212183588 > 209715200), being killed

I was fine-tuning an ALBERT large model on the RACE dataset on a Slurm server, but the job was always killed for exceeding its memory limit. I have already raised the memory allocation to 200 GB, but it still didn't work. Does anyone have an idea what might have gone wrong here?
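For reference, the two numbers in the slurmstepd message can be converted to more readable units. The sketch below assumes the figures are reported in KiB, which is consistent with the 200g allocation mentioned above (209715200 KiB is exactly 200 GiB). The helper name `parse_memory_limit_error` is made up for illustration and is not part of Slurm or ALBERT.

```python
import re

KIB_PER_GIB = 1024 * 1024  # assumption: slurmstepd reports these values in KiB


def parse_memory_limit_error(line):
    """Return (used_kib, limit_kib) from an 'exceeded memory limit' line, or None."""
    match = re.search(r"exceeded memory limit \((\d+) > (\d+)\)", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The error line from the log above.
line = (
    "slurmstepd: error: Job 247071 exceeded memory limit "
    "(212183588 > 209715200), being killed"
)

used_kib, limit_kib = parse_memory_limit_error(line)
used_gib = used_kib / KIB_PER_GIB
limit_gib = limit_kib / KIB_PER_GIB

print(f"limit: {limit_gib:.2f} GiB")
print(f"used:  {used_gib:.2f} GiB")
print(f"over by {used_gib - limit_gib:.2f} GiB")
```

Under that assumption, the job was killed at roughly 202 GiB against a 200 GiB limit. That figure is only the usage at the moment the job was killed, so it does not show how much memory the job would have needed to finish.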

urextra commented 4 years ago

I ran into the same problem when fine-tuning on the SQuAD 2.0 dataset, but fortunately it did not stop me from getting the results (the ckpt file).