Error in training

Hi, everybody! I am getting the following error while training a model on a single GPU on Ubuntu 22.04. I am new to Linux and to training on a local GPU. I run the training in Docker with the following command: ./train.sh

This is the error I receive:

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_qgm2o66y/none_xv9bcqsf/attempt_3/0/error.json

After this is printed, the process hangs for a couple of minutes, then the epoch starts and training fails with another error (input tensor is empty, although it is not empty when I print it). I guess the second error is caused by the first one. Can anybody help me with this?

microsoft / SoftTeacher

Error in training #218