danieldimit closed 5 years ago
Update: I figured out when the NaN problem appears. It happens when the object in the dataset is too close to the camera. Notice that in the right picture the object is much closer. I am not sure why this happens, but I guess it has something to do with the data augmentation: the method scales the image, and when the object already fills most of the frame, zooming in pushes it out of bounds, which leads to NaN.
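One way to test the out-of-bounds guess is to scan the label values for keypoints that are non-finite or fall outside the image. The snippet below is only a sketch: it assumes each label line starts with a class id followed by normalized x, y keypoint pairs in the range [0, 1], and the helper names `parse_label_line` and `find_bad_keypoints` are hypothetical, not part of the repository.

```python
import math


def parse_label_line(line):
    """Split one whitespace-separated label line into floats."""
    return [float(tok) for tok in line.split()]


def find_bad_keypoints(values, num_keypoints=9):
    """Return indices of keypoints that are non-finite or outside [0, 1].

    Assumed layout (not confirmed by the thread): values[0] is the class id,
    followed by num_keypoints normalized (x, y) pairs.
    """
    coords = values[1:1 + 2 * num_keypoints]
    bad = []
    # Step through complete (x, y) pairs only.
    for i in range(0, len(coords) - 1, 2):
        x, y = coords[i], coords[i + 1]
        finite = math.isfinite(x) and math.isfinite(y)
        if not finite or not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            bad.append(i // 2)
    return bad


if __name__ == "__main__":
    # Toy label with 4 keypoints: the third is out of bounds, the fourth is NaN.
    label = "0 0.5 0.5 0.1 0.2 1.3 0.4 nan 0.5"
    print(find_bad_keypoints(parse_label_line(label), num_keypoints=4))
```

Running the same check on the keypoints after augmentation, rather than only on the files on disk, would show whether the scaling step is what moves them out of range.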
I've now created datasets where the object is further away: one with a PSP (1200 images) and one with a GUITAR (1200 images). Both of them learn, but not nearly as fast as the synthetic APE dataset. The APE dataset reached 5px acc 86% at epoch 100. The PSP dataset has 5px acc 13% at epoch 590. The GUITAR dataset has 5px acc 8% at epoch 350. I will leave the PSP and GUITAR datasets training over the weekend to see how much the accuracy improves.
Hi @danieldimit, thanks for your question and for reporting the convergence issue that appears when the object is so large that it takes up most of the image.
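For readers comparing these numbers, here is a rough sketch of how a "5px acc" figure could be computed. It assumes the metric counts a sample as correct when the mean 2D distance between predicted and ground-truth projected points is below 5 pixels; the function names are illustrative and the repository's own evaluation code may differ in detail.

```python
import math


def mean_reprojection_error(pred_pts, gt_pts):
    """Mean Euclidean distance in pixels between matching 2D points."""
    dists = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred_pts, gt_pts)
    ]
    return sum(dists) / len(dists)


def pixel_accuracy(pred_batch, gt_batch, threshold_px=5.0):
    """Percentage of samples whose mean reprojection error is below the threshold."""
    errors = [mean_reprojection_error(p, g) for p, g in zip(pred_batch, gt_batch)]
    correct = sum(1 for e in errors if e < threshold_px)
    return 100.0 * correct / len(errors)


if __name__ == "__main__":
    # Two toy samples with two points each (pixel coordinates).
    pred = [[(0, 0), (10, 10)], [(0, 0), (10, 10)]]
    gt = [[(3, 4), (10, 10)], [(0, 0), (20, 10)]]
    print(pixel_accuracy(pred, gt))
```

Because the threshold is in pixels, an object that appears smaller in the image tolerates proportionally less pose error, which is worth keeping in mind when comparing datasets captured at different distances.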
About the accuracy numbers:
Hi @btekin, I've solved the problem. It wasn't connected to your method. It turned out that something was wrong with the GPU CUDA setup on the servers at work (probably a mismatch between the GPU, the CUDA version and the driver version). After trying for a month and getting inconsistent results (the problem could occur after 10 or after 10000 iterations), I decided to try it on a Google Cloud GPU and bingo - with the same code, the same dataset and the same training configuration it worked like a charm on the first try.
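Since the failure showed up at unpredictable iterations, recording the first step at which the loss stops being finite makes runs on different machines easier to compare. This is a minimal sketch with a hypothetical helper, `first_non_finite`, applied to a plain list of logged loss values; it is not taken from the training script.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


if __name__ == "__main__":
    logged = [0.91, 0.52, float("nan"), 0.10]
    step = first_non_finite(logged)
    if step is not None:
        print(f"loss became non-finite at iteration {step}")
```

If the reported step varies widely between otherwise identical runs on one machine but never triggers on another, that points toward the environment rather than the data or the model.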
I tried setting the confidence regularization parameter to 0, but it didn't work. So for my experiments I trained it without pretraining. This is the confidence regularization parameter, right?
Yes, that is the confidence regularization parameter. With pre-training, the convergence might be a bit faster, but without it, it should also be fine.
Great to hear that you resolved the initial problem by switching to another compute platform!
Hi there, I've been trying to train my model on a custom object for a while now. After failing many times, I decided to do it incrementally.
Here is a comparison between the 2 datasets that I've created:
And the worst part is that the PSP training stops because of this exception:
Has anyone else had this problem? What could be the potential causes, in your opinion? I would be glad to hear from anyone. Thanks in advance!
P.S. I've used the correct diameter in the *.data file, so that shouldn't be the problem. My label files should also be correct; otherwise the dataset from step 2 wouldn't have worked. You can see the numbering in the picture.