This is frustrating: after training for about 1654ba (batches), the run crashed and failed to save the checkpoint. I have tried twice with the same result.
Error as follows:
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=39739, OpType=ALLREDUCE, Timeout(ms)=300000) ran for 302714 milliseconds before timing out.
train 4%|▉
/home/anaconda3/envs/control/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 3 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
----------End global rank 3 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 35121) has still not exited; return exit code 1.
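
To make the timeout easier to read, here is a minimal sketch that pulls the fields out of the watchdog line above using only the Python standard library. The helper name `parse_watchdog` and the regular expression are my own and only illustrative; they are not part of Composer or PyTorch, and the pattern is only known to match the exact line format shown in this log.

```python
import re

# Watchdog line copied verbatim from the log above.
LINE = (
    "[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=39739, OpType=ALLREDUCE, Timeout(ms)=300000) "
    "ran for 302714 milliseconds before timing out."
)

# Hypothetical pattern: written against the single line above, not a general NCCL log parser.
PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\].*?"
    r"SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), Timeout\(ms\)=(?P<timeout_ms>\d+)\)"
    r" ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog(line):
    """Return the rank, sequence number, op type, and timings, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "rank": int(fields["rank"]),
        "seq": int(fields["seq"]),
        "op": fields["op"],
        "timeout_ms": int(fields["timeout_ms"]),
        "elapsed_ms": int(fields["elapsed_ms"]),
    }


info = parse_watchdog(LINE)
# How far past the configured timeout the collective ran, in milliseconds.
overshoot_ms = info["elapsed_ms"] - info["timeout_ms"]
print(info, overshoot_ms)
```

Read this way, rank 3 was waiting in an ALLREDUCE for slightly longer than the 300000 ms (300 s) limit. That would be consistent with another rank being busy for more than five minutes around the checkpoint save, but the log alone does not confirm that.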