Hi,
Thanks for making the code available. I recently encountered an error while fine-tuning Singularity-Temporal on my own dataset. The fine-tuning ran successfully in a trial experiment on a subset of the dataset, but it failed at around epoch 6 on the full dataset without an informative error message (the batch size was the same in both experiments).
```
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104856 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104857 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104858 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104859 closing signal SIGHUP
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:{ "message": { "message": "SignalException: Process 3104850 got signal: 1",
```
This seems to be a GPU memory leak.
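In case anyone wants to check whether they are hitting the same thing: a generic way to spot it (not part of the repo's code, just plain PyTorch) is to log allocated GPU memory once per epoch; if the number keeps growing across epochs instead of plateauing, references to old batch tensors are likely being kept alive.

```python
import torch

def log_gpu_memory(epoch):
    # Print per-epoch GPU memory stats; a steady increase across epochs
    # suggests tensors from earlier iterations are still referenced.
    if torch.cuda.is_available():
        allocated_mb = torch.cuda.memory_allocated() / 1024**2
        reserved_mb = torch.cuda.memory_reserved() / 1024**2
        print(f"epoch {epoch}: allocated={allocated_mb:.1f} MiB, reserved={reserved_mb:.1f} MiB")
```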
Adding `del question_input, image, answer_input` at the end of the training and evaluation loops in vqa.py helped me resolve the issue (rough sketch below).
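For reference, here is roughly where the `del` goes. This is a minimal sketch, not the actual vqa.py code: the dataloader, model call, and optimizer usage are illustrative assumptions; only the trailing `del` reflects the change I made.

```python
def train_one_epoch(model, train_loader, optimizer, device):
    model.train()
    for image, question_input, answer_input in train_loader:
        image = image.to(device, non_blocking=True)

        loss = model(image, question_input, answer_input)  # hypothetical call signature
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Drop the Python references to the batch at the end of the iteration
        # so the tensors' GPU memory can be reclaimed before the next batch.
        del question_input, image, answer_input
```

The same `del` goes at the end of the evaluation loop body.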
PS: I haven't tried reproducing this on the reported datasets, only on my custom dataset. Posting the issue just in case anyone else is in the same boat.
Thanks!