Open avigupta2798 opened 3 months ago
Is it possible that the directory “data/ade” doesn’t exist? Try creating this directory manually, and the npy files should be automatically generated afterward.
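For anyone hitting the same thing, here is a minimal sketch of creating the directory from Python instead of by hand. The helper name ensure_dir is made up for illustration and is not part of the repository; only the path data/ade comes from the reply above.

```python
from pathlib import Path


def ensure_dir(path):
    """Create `path` (and any missing parents) if it does not exist yet."""
    p = Path(path)
    # exist_ok=True makes the call safe to repeat on an existing directory.
    p.mkdir(parents=True, exist_ok=True)
    return p


if __name__ == "__main__":
    # Path taken from the reply above; run this from the repository root.
    ensure_dir("data/ade")
```

Running it twice is harmless, so it can sit at the top of a launch script.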
Thank you for your reply. That worked perfectly. However, there seems to be another issue. Could you please help with this one as well?
Regards,
Which of these two errors occurred first?
The “file not found” error occurs at the end of every step, while the CUDA error occurs at the very start. The missing .pth file is reported after every step, in every setting (100-10, 100-50, etc.).
The “file not found” error is likely due to the failure of step 0, as each subsequent step requires loading the model from the previous step, leading to a chain of errors. Therefore, resolving the NCCL error should also fix the “file not found” issue.
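To see where the chain breaks, one option is to look for the first step whose checkpoint is missing. This is only a sketch: the function name, the directory argument, and the step_{}.pth naming pattern are all hypothetical, since the thread does not say how the repository names its checkpoint files, so adjust them to whatever the scripts actually write.

```python
from pathlib import Path


def first_missing_checkpoint(ckpt_dir, num_steps, pattern="step_{}.pth"):
    """Return the index of the first step without a checkpoint, or None.

    `pattern` is a hypothetical naming scheme, not the repository's own.
    """
    for step in range(num_steps):
        if not (Path(ckpt_dir) / pattern.format(step)).is_file():
            return step
    return None
```

If this returns 0, step 0 never saved a model, which matches the explanation above that the later “file not found” errors are a consequence rather than a separate bug.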
I haven’t encountered this NCCL error before, so it’s challenging to give you a precise answer. However, based on the error message, it seems to be related to communication between GPUs. Could you please check if nvcc -V displays correctly (I recommend installing CUDA 11.3), if the GPU has sufficient memory, and if the PyTorch version is 1.12.1? If all these are in order, you might want to try training other open-source code to see if it works smoothly.
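The checks above can be gathered in one place with a short script. This is a sketch rather than anything from the repository: the function name environment_report and the layout of the returned dictionary are my own choices. It degrades gracefully if nvcc or PyTorch is missing.

```python
import shutil
import subprocess


def environment_report():
    """Collect the nvcc output, PyTorch/CUDA versions, and per-GPU memory."""
    report = {"nvcc": None, "torch": None, "torch_cuda": None, "gpus": []}

    # Raw output of `nvcc -V`, if the CUDA toolkit is on PATH.
    if shutil.which("nvcc"):
        try:
            out = subprocess.run(
                ["nvcc", "-V"], capture_output=True, text=True, timeout=30
            )
            report["nvcc"] = out.stdout.strip() or None
        except (OSError, subprocess.SubprocessError):
            pass

    try:
        import torch
    except (ImportError, OSError):
        return report  # PyTorch not installed or failed to load

    report["torch"] = torch.__version__
    # CUDA version PyTorch was built against (None for CPU-only builds).
    report["torch_cuda"] = torch.version.cuda
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            # total_memory is in bytes; convert to GiB for readability.
            report["gpus"].append((props.name, round(props.total_memory / 2**30, 1)))
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Comparing the toolkit version reported by nvcc with torch_cuda is a quick way to spot a mismatch such as a system CUDA 12.x next to a PyTorch build made for 11.x.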
If other open-source code can train smoothly using multiple GPUs, then I’ll reconsider if there might be an issue somewhere. If other open-source code also fails to train in a multi-GPU setup, then it’s more likely to be a hardware or driver issue.
Thank you for clarifying the chain of errors. I will look into what is causing it. I am currently using CUDA 12.3. There must be some driver issue. One more question: is it possible to run this code without multi-GPU support, on a single GPU only?
If you’re using the VOC dataset, a single 24GB GPU should be sufficient. However, for the ADE dataset, I used two 24GB GPUs, so if you have a single card with more than 48GB, it should also work (perhaps 32GB might be enough, but I’m not certain).
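As a rough way to apply the ADE guidance above to a given machine, the rule of thumb can be written down as a small function. This is a sketch under assumptions: the function name and the three verdict strings are invented here, the sizes are nominal card sizes in GB (a "24GB" card reports slightly less when queried), and "more than 48GB" is read loosely as 48GB or more.

```python
def ade_memory_verdict(gpu_mem_gb):
    """Classify a list of nominal per-GPU memory sizes (in GB) for ADE training."""
    sizes = sorted(gpu_mem_gb, reverse=True)
    if len(sizes) >= 2 and sizes[1] >= 24:
        return "ok"  # at least two 24GB cards, the setup reported to work
    if sizes and sizes[0] >= 48:
        return "ok"  # one large card, per the reply above
    if sizes and sizes[0] >= 32:
        return "uncertain"  # 32GB "might be enough", not confirmed
    return "insufficient"


if __name__ == "__main__":
    print(ade_memory_verdict([24, 24]))
    print(ade_memory_verdict([32]))
```

A single 24GB card falls into "insufficient" for ADE here, although the same reply says it should be enough for VOC.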
Okay, thank you for your time and replies.
Hi, thank you for your work. I was trying to run the training file from the scripts folder and encountered an error. I may have done something wrong on my part. Could you tell me what this error is about? I have attached a screenshot below. Thanks,