INFO:tensorflow:out/model.ckpt-0 is not in all_model_checkpoint_paths. Manually adding it.
I0125 21:17:40.027305 139845646956352 checkpoint_management.py:95] out/model.ckpt-0 is not in all_model_checkpoint_paths. Manually adding it.
slurmstepd: error: Job 247071 exceeded memory limit (212183588 > 209715200), being killed
I was fine-tuning an ALBERT large model on the RACE dataset on a Slurm server, but the job is always killed for exceeding the memory limit. I already increased the memory request to 200g, but it still fails. Does anyone have an idea what might be going wrong here?
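To make the numbers in the slurmstepd message easier to read, here is a small sketch that converts them to GiB. It assumes the two values are reported in KiB, which is consistent with the 200g request (200 * 1024 * 1024 = 209715200). The helper name `parse_slurm_oom` is hypothetical and only for illustration.

```python
import re

# Assumption: slurmstepd reports both values in KiB.
KIB_PER_GIB = 1024 * 1024


def parse_slurm_oom(line):
    """Return (used_gib, limit_gib) parsed from an 'exceeded memory limit' line."""
    match = re.search(r"exceeded memory limit \((\d+) > (\d+)\)", line)
    if match is None:
        raise ValueError("not a slurmstepd memory-limit message")
    used_kib, limit_kib = (int(group) for group in match.groups())
    return used_kib / KIB_PER_GIB, limit_kib / KIB_PER_GIB


line = ("slurmstepd: error: Job 247071 exceeded memory limit "
        "(212183588 > 209715200), being killed")
used, limit = parse_slurm_oom(line)
print(f"used {used:.2f} GiB, limit {limit:.2f} GiB, over by {used - limit:.2f} GiB")
```

Under that assumption the job was using a little over 202 GiB against a 200 GiB limit when it was killed, right after the step-0 checkpoint was written.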