microsoft / SoftTeacher

Semi-Supervised Learning, Object Detection, ICCV2021
MIT License

Error in training #218

Open linc4ekk opened 2 years ago

linc4ekk commented 2 years ago

Hi, everybody! I am getting the following error while training a model on a single GPU on Ubuntu 22.04. I am new to Linux and to training on a local GPU. I run the training in Docker with the following command: ./train.sh

This is the error I receive:

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_qgm2o66y/none_xv9bcqsf/attempt_3/0/error.json

After this is printed, the process hangs for a couple of minutes. Then the epoch starts and fails with a second error saying the input tensor is empty, although the tensor is not empty when I print it. My guess is that the second error is caused by the first one. Can anybody help me with this?
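To help read the output quoted above, here is a minimal sketch that splits each line into its parts. It assumes the lines follow the default format of Python's `logging` module (`LEVEL:logger name:message`). `parse_log_line` is a hypothetical helper written for this post, not part of SoftTeacher or PyTorch.

```python
def parse_log_line(line):
    """Split a 'LEVEL:logger.name:message' line into its three parts.

    Assumption: the line uses Python logging's default format, so only the
    first two colons are separators; later colons belong to the message.
    """
    level, logger_name, message = line.split(":", 2)
    return level, logger_name, message.strip()


# The two lines quoted in this post, copied verbatim.
lines = [
    "INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group",
    "INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: "
    "/tmp/torchelastic_qgm2o66y/none_xv9bcqsf/attempt_3/0/error.json",
]

parsed = [parse_log_line(line) for line in lines]
for level, logger_name, message in parsed:
    print(level, "|", logger_name, "|", message)
```

Under that assumption, both lines parse as `INFO` level, so they may be status messages from the launcher rather than the failure itself. The traceback of the later "input tensor is empty" error is probably the more useful thing to post.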