mosaicml / diffusion

Apache License 2.0

leaked shared_memory #50

Open s5248 opened 1 year ago

s5248 commented 1 year ago

Frustrating: training crashed at about 1654/ba and failed to save the checkpoint. I tried twice and it happened both times. The error is as follows:

[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=39739, OpType=ALLREDUCE, Timeout(ms)=300000) ran for 302714 milliseconds before timing out.
train 4%|▉
/home/anaconda3/envs/control/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 3 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
----------End global rank 3 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 35121) has still not exited; return exit code 1.

mvpatel2000 commented 1 year ago

Can you please provide the full trace? Happy to help out :)