steventrouble / EfficientZero

Fork of EfficientZero to use newer libraries and to fix a few runtime bugs. Also includes pretrained models!
GNU General Public License v3.0

GPU Worker crashes with no error or stack dump #2

Closed. steventrouble closed this issue 2 years ago

steventrouble commented 2 years ago

Whenever the GPU worker starts training, it crashes immediately without printing an error or stack trace. The only output is:

2022-07-07 19:05:41,609 WARNING worker.py:1404 -- A worker died or was killed
while executing a task by an unexpected system error. To troubleshoot the
problem, check the logs for the dead worker. 
RayTask ID: ffffffffffffffff2aeefb9774b8f9463ffdfd8101000000
Worker ID: 21602417afb8b58af6db10cb511242afac87db0eb5b09f5606320616 
Node ID: 51e55bca411e9e811bf5c67089cbf9867f5f8374c2fce4a8370c987c 
Worker IP address: *
Worker port: *
Worker PID: 6218

I grepped through all the logs and stdout, but couldn't find any information about what the error was or where it occurred.
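For anyone repeating that search, here is a minimal sketch of it in Python. It assumes Ray's default log directory, /tmp/ray/session_latest/logs, and that per-worker log files carry the worker ID and PID in their file names; both may differ if Ray was started with a custom temp directory. The helper name find_log_mentions is hypothetical and not part of Ray or this repository.

```python
from pathlib import Path


def find_log_mentions(log_dir, needle, max_hits=50):
    """Return (file name, line number, text) tuples for log entries mentioning `needle`.

    A line number of 0 means the file name itself contains `needle`.
    """
    root = Path(log_dir)
    if not root.is_dir():
        return []

    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if needle in path.name:
            hits.append((path.name, 0, "<file name match>"))
        try:
            with path.open(errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if needle in line:
                        hits.append((path.name, number, line.rstrip()))
        except OSError:
            # Skip files that vanish or cannot be read while scanning.
            continue
        if len(hits) >= max_hits:
            break
    return hits[:max_hits]


if __name__ == "__main__":
    # Assumed default log location; 6218 is the Worker PID from the warning above.
    for name, number, text in find_log_mentions("/tmp/ray/session_latest/logs", "6218"):
        print(f"{name}:{number}: {text}")
```

Searching for the Worker ID from the warning instead of the PID works the same way. If neither turns up anything, the process may have been killed from outside Python, which would explain the missing stack trace; that is a guess, not something this thread confirms.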

steventrouble commented 2 years ago

I managed to work around this by adding num_restarts=10 to the GPU and CPU workers. It looks like the error was transient.
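The workaround amounts to a restart-on-crash policy. For Ray actors, the documented option for this is max_restarts (with max_task_retries for retrying in-flight method calls). The name num_restarts is kept here as written in the comment, and this thread does not show exactly where it was set. As a library-independent sketch of what such a policy does, the snippet below restarts a crashed child process up to a fixed limit. The function run_with_restarts is hypothetical and is not Ray's implementation.

```python
import subprocess
import sys


def run_with_restarts(cmd, max_restarts=10):
    """Run `cmd`, restarting it after a non-zero exit, at most `max_restarts` times.

    Returns (exit_code, restarts_used).
    """
    restarts = 0
    while True:
        exit_code = subprocess.run(cmd).returncode
        if exit_code == 0 or restarts >= max_restarts:
            return exit_code, restarts
        restarts += 1
        print(
            f"worker exited with code {exit_code}; restart {restarts}/{max_restarts}",
            file=sys.stderr,
        )


if __name__ == "__main__":
    # Trivial stand-in for a worker that succeeds on the first attempt.
    code, used = run_with_restarts([sys.executable, "-c", "print('worker ok')"])
    print(f"exit code {code} after {used} restarts")
```

A restart limit hides a transient failure rather than explaining it, so it is worth logging each restart (as above) to notice if the crash becomes frequent.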