jinpeng0528 / STAR

Code release for "Saving 100x Storage: Prototype Replay for Reconstructing Training Sample Distribution in Class-Incremental Semantic Segmentation" (NeurIPS 2023)

npy file not found #3

Open avigupta2798 opened 3 months ago

avigupta2798 commented 3 months ago

Hi, thank you for your work. I was trying to run the training script from the scripts folder and encountered an error. I may have done something wrong on my end. Could you tell me what this error is about? I have attached a screenshot below. Thanks. [Screenshot from 2024-08-26 12-13-24]

jinpeng0528 commented 3 months ago

Is it possible that the directory “data/ade” doesn’t exist? Try creating this directory manually, and the npy files should be automatically generated afterward.
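A minimal sketch of creating that directory from Python rather than by hand. It assumes the working directory is the repository root; the helper name `ensure_dir` is hypothetical and not part of the STAR code.

```python
import os


def ensure_dir(path):
    """Create `path` and any missing parent directories; return True if it exists afterwards."""
    # exist_ok=True makes the call safe to repeat when the directory is already there.
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)


if __name__ == "__main__":
    # "data/ade" is the directory named in the reply above, relative to the repository root.
    print(ensure_dir(os.path.join("data", "ade")))
```

Running `mkdir -p data/ade` from the repository root would have the same effect.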

avigupta2798 commented 3 months ago

Thank you for your reply. It worked perfectly. However, there seems to be another issue. Could you please help with this one as well? [Screenshot from 2024-08-26 14-09-38] [Screenshot from 2024-08-26 14-09-23]

Regards,

jinpeng0528 commented 3 months ago

Which of these two errors occurred first?

avigupta2798 commented 3 months ago

The "file not found" error occurs at the end of every step, while the CUDA error occurs at the start. The missing .pth file is reported after every step, in every setting (100-10, 100-50, etc.).

jinpeng0528 commented 3 months ago

The “file not found” error is likely due to the failure of step 0, as each subsequent step requires loading the model from the previous step, leading to a chain of errors. Therefore, resolving the NCCL error should also fix the “file not found” issue.
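To confirm where the chain breaks, one option is to check which step's checkpoint is the first one missing. The sketch below is only an illustration: the function name and the example paths are hypothetical, and the real checkpoint locations depend on how the training scripts are configured.

```python
import os


def first_missing_checkpoint(checkpoint_paths):
    """Return the first path (in step order) that is not a file on disk, or None if all exist."""
    for path in checkpoint_paths:
        if not os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    # Hypothetical paths, listed in step order; replace them with the paths your run actually uses.
    expected = ["checkpoints/step_0.pth", "checkpoints/step_1.pth"]
    print(first_missing_checkpoint(expected))
```

If the step 0 checkpoint is the first one missing, the later "file not found" errors are a consequence of that failure rather than separate problems.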

I haven’t encountered this NCCL error before, so it’s challenging to give you a precise answer. However, based on the error message, it seems to be related to communication between GPUs. Could you please check if nvcc -V displays correctly (I recommend installing CUDA 11.3), if the GPU has sufficient memory, and if the PyTorch version is 1.12.1? If all these are in order, you might want to try training other open-source code to see if it works smoothly.
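The checks above can be gathered in one place with a short script. This is a rough sketch, not part of the repository: `environment_report` is a hypothetical helper, it reports total (not free) GPU memory, and it simply leaves fields empty when `nvcc` or PyTorch is not installed.

```python
import shutil
import subprocess


def environment_report():
    """Collect the nvcc output, the PyTorch version, its CUDA build, and per-GPU total memory."""
    report = {"nvcc": None, "torch_version": None, "torch_cuda": None, "gpus": []}

    # Equivalent to running `nvcc -V` in a shell, but only if nvcc is on PATH.
    nvcc = shutil.which("nvcc")
    if nvcc is not None:
        result = subprocess.run([nvcc, "-V"], capture_output=True, text=True)
        report["nvcc"] = result.stdout.strip()

    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment; return what we have.
        return report

    report["torch_version"] = torch.__version__  # the reply above suggests 1.12.1
    report["torch_cuda"] = torch.version.cuda    # CUDA version PyTorch was built against
    if torch.cuda.is_available():
        for index in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(index)
            # Total memory in GiB; this is capacity, not the amount currently free.
            report["gpus"].append((props.name, round(props.total_memory / 1024 ** 3, 1)))
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

A mismatch between the toolkit version printed by `nvcc` and the CUDA build reported by PyTorch is worth noting when reporting back.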

jinpeng0528 commented 3 months ago

If other open-source code can train smoothly using multiple GPUs, then I’ll reconsider if there might be an issue somewhere. If other open-source code also fails to train in a multi-GPU setup, then it’s more likely to be a hardware or driver issue.

avigupta2798 commented 3 months ago

Thank you for clarifying the chain of errors. I will look into what is causing it. I am currently using CUDA 12.3, so there must be some driver issue. One more question: is it possible to run this code without multi-GPU support, on a single GPU only?

jinpeng0528 commented 3 months ago

If you’re using the VOC dataset, a single 24GB GPU should be sufficient. However, for the ADE dataset, I used two 24GB GPUs, so if you have a single card with more than 48GB, it should also work (perhaps 32GB might be enough, but I’m not certain).

avigupta2798 commented 3 months ago

Okay, thank you for your time and replies.