Hi,
Thanks for making the code available. I recently encountered an error while fine-tuning Singularity-Temporal on my own dataset. The fine-tuning ran successfully in a trial experiment on a subset of the dataset, but it failed at around epoch 6 on the full dataset without an informative error message (the batch size was the same in both experiments).
```
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104856 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104857 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104858 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104859 closing signal SIGHUP
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:{ "message": { "message": "SignalException: Process 3104850 got signal: 1",
```
This seems to be a GPU memory leak.
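In case anyone wants to check whether they are hitting the same thing: a generic way to spot it (not part of the repo's code, just plain PyTorch) is to log allocated GPU memory once per epoch; if the number keeps growing across epochs instead of plateauing, references to old batch tensors are likely being kept alive.

```python
import torch

def log_gpu_memory(epoch):
    # Print per-epoch GPU memory stats; a steady increase across epochs
    # suggests tensors from earlier iterations are still referenced.
    if torch.cuda.is_available():
        allocated_mb = torch.cuda.memory_allocated() / 1024**2
        reserved_mb = torch.cuda.memory_reserved() / 1024**2
        print(f"epoch {epoch}: allocated={allocated_mb:.1f} MiB, reserved={reserved_mb:.1f} MiB")
```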
Adding `del question_input, image, answer_input` at the end of the training and evaluation loops in vqa.py helped me resolve the issue (rough sketch below).
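For reference, here is roughly where the `del` goes. This is a minimal sketch, not the actual vqa.py code: the dataloader, model call, and optimizer usage are illustrative assumptions; only the trailing `del` reflects the change I made.

```python
def train_one_epoch(model, train_loader, optimizer, device):
    model.train()
    for image, question_input, answer_input in train_loader:
        image = image.to(device, non_blocking=True)

        loss = model(image, question_input, answer_input)  # hypothetical call signature
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Drop the Python references to the batch at the end of the iteration
        # so the tensors' GPU memory can be reclaimed before the next batch.
        del question_input, image, answer_input
```

The same `del` goes at the end of the evaluation loop body.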
PS: I haven't tried reproducing this on the reported datasets, only on my custom dataset. Posting the issue just in case anyone else is in the same boat.
Thanks!