ravikantb opened this issue 7 years ago
Hi @ravikantb,
A loss going to NaN usually means the learning rate is too high. However, I will check the implementation of the layers again to make sure there is no bug.
Can you paste the Caffe output log (like you have above) for all iterations from Iteration 0 - Iteration 20 or so? Also, can you paste your training prototxt file here?
Please note that I am currently not working on the project due to other work and hence may be delayed in responding.
Thanks, Jay
Hi @ravikantb,
When training Mask-RCNN with the ROIAlign layer, I do not get nan in the loss.
I0707 14:32:50.012135 3002 solver.cpp:229] Train net output #0: accuarcy = 0
I0707 14:32:50.012141 3002 solver.cpp:229] Train net output #1: loss_bbox = 6.75187 (* 1 = 6.75187 loss)
I0707 14:32:50.012145 3002 solver.cpp:229] Train net output #2: loss_cls = 4.39446 (* 1 = 4.39446 loss)
I0707 14:32:50.012151 3002 solver.cpp:229] Train net output #3: loss_mask = 2.45052e+20 (* 1 = 2.45052e+20 loss)
I have also checked the forward and backward passes of the ROIAlign layer ( I used the caffe gradient checker for the backward pass). I can have a look at your output logs, solver.prototxt, and train.prototxt.
Hi @jasjeetIM ,
Thanks for your detailed response, and apologies for the delay in mine. Please find attached a zip file containing the solver.prototxt, the training prototxt, and sample logs for 100 iterations. I have tried learning rates ranging from 0.001 to 0.000001, but the loss always becomes NaN after some time.
To give you a bit more detail about my implementation: I took your ROIAlign layer implementation (both CPU and GPU) and added it to the Caffe framework per the instructions in these two links: https://github.com/BVLC/caffe/wiki/Development https://github.com/BVLC/caffe/wiki/Simple-Example:-Sin-Layer
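(Besides the .hpp/.cpp/.cu files, those wiki steps also require declaring a parameter message in caffe.proto. Mine looks roughly like the sketch below; the message layout mirrors ROIPoolingParameter, and the field numbers here are placeholders that must not collide with IDs already in use in your caffe.proto.)

```protobuf
// Sketch of the caffe.proto addition for the new layer.
message ROIAlignParameter {
  // Pooled output height and width of the RoI grid.
  optional uint32 pooled_h = 1 [default = 0];
  optional uint32 pooled_w = 2 [default = 0];
  // Multiplicative factor mapping RoI coordinates to feature-map scale,
  // e.g. 0.0625 (= 1/16) for conv5 of VGG16.
  optional float spatial_scale = 3 [default = 1];
}
```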
After that I replaced the ROIPooling layers in my Faster-RCNN prototxts with ROIAlign. Since I am using the alternating optimization technique there, I have sent you the prototxt used for training the Fast-RCNN component of Faster-RCNN (stage 2: in this stage, output proposals from the RPN are converted to fixed-length vectors using the ROIAlign/ROIPool layers).
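(For concreteness, the swap in the train.prototxt looked roughly like the fragment below. The layer and blob names are the usual py-faster-rcnn ones, and `roi_align_param` is my reading of the new parameter name; adjust both to your own files.)

```
layer {
  name: "roi_align5"
  type: "ROIAlign"          # was: type: "ROIPooling"
  bottom: "conv5_3"
  bottom: "rois"
  top: "pool5"
  roi_align_param {         # was: roi_pooling_param
    pooled_w: 7
    pooled_h: 7
    spatial_scale: 0.0625   # 1/16 for VGG16 conv5
  }
}
```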
I really appreciate you going back and checking your implementation for this issue. Please have a look at the provided documents and let me know if you find something wrong there. solver_prototxt_logs.zip
Thanks, Ravikant
Please change the following values in your solver.prototxt and send logs again:
display: 1
base_lr: 0.0001
clip_gradients: 100
debug_info: true
Please make sure the logs contain at least 100 iterations of training. Also, provide logs for the same solver.prototxt with a ROIPooling layer used instead of the ROIAlign layer, on the same training set, for 100 iterations. That gives two training logs: 1) ROIAlign, 2) ROIPooling, where both trainings use the solver.prototxt values pasted above.
Lastly, do you have your code available on an online repo?
Thanks
Hi @jasjeetIM
Please find attached logs for both the runs with the changes you suggested. logs.zip
Our code base is not yet public, as it is our organization's property, but I shall talk to my team and see if I can give you access to it. I will revert on this soon.
Honestly, I had not worked with Caffe's debug mode before, as I found it too verbose, but it does seem to output useful information in this case. I shall try to find the root cause using these logs. Meanwhile, if you get time to look at them and find anything useful, please let me know.
Thanks, Ravikant
Hi @ravikantb,
Okay, thanks. Can you do the following to troubleshoot:
1) Run Caffe in CPU mode.
2) Modify the code in the roi_align_layer.cpp file (https://github.com/jasjeetIM/Mask-RCNN/blob/master/external/caffe/src/caffe/layers/roi_align_layer.cpp) as follows:
Add the following after line 164:
LOG(INFO) << "(h_idx, w_idx, h_idx_n, w_idx_n) = (" << h_idx << "," << w_idx << "," << h_idx_n << "," << w_idx_n << ")";
LOG(INFO) << "Multiplier = " << multiplier[counter];
LOG(INFO) << "Data value = " << batch_data[b_index_curr[counter]];
LOG(INFO) << "Current Pooled value = " << bisampled[smp/2];
3) Recompile Caffe.
4) Run the same experiment as you did in https://github.com/jasjeetIM/Mask-RCNN/issues/2#issuecomment-314138508. However, you only need to run it for 10 iterations, as the output will be very verbose and noisy.
5) Once done, please send the logs to me again.
This will help me look at the boundary condition that may be causing 'inf' as one of the pooled values.
Hi @jasjeetIM ,
Thanks for all the help. Due to some unforeseen circumstances, I have to step away from this project for 2-3 days. I will get back to you with the required data after that. Please keep this issue open until then.
Thanks, Ravikant
Hi @ravikantb,
Could you share how you added the ROIAlign layer to include/caffe/layers/fast_rcnn_layers.hpp (step 1 in https://github.com/BVLC/caffe/wiki/Development)?
Many thanks in advance.
@MartinPlantinga: I added the following code to the 'fast_rcnn_layers.hpp' file. Hope it helps.
(P.S.: I didn't use GitHub's code formatting as it was messing with my code snippet.)
/* ROIAlignLayer - Region of Interest Align Layer */
template <typename Dtype>
class ROIAlignLayer : public Layer<Dtype> {
 public:
  explicit ROIAlignLayer(const LayerParameter& param) : Layer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "ROIAlign"; }
  virtual inline int MinBottomBlobs() const { return 2; }
  virtual inline int MaxBottomBlobs() const { return 2; }
  virtual inline int MinTopBlobs() const { return 1; }
  virtual inline int MaxTopBlobs() const { return 1; }

 protected:
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top);
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top);
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top, const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top, const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  int channels_;
  int height_;
  int width_;
  int pooled_height_;
  int pooled_width_;
  Dtype spatial_scale_;
  Blob<int> max_idx_;
};
Thanks @ravikantb !!
Hi @jasjeetIM ,
I took your ROIAlign layers and integrated them into my py-faster-rcnn code base, replacing the ROIPool layers. But the bounding-box loss has been NaN for all iterations so far while training Fast-RCNN (stage 2) of the alternating training optimization. Did you face any such problem while training? Please let me know. I shall dig deeper and get back if I find anything worth sharing. Below are my sample logs of the loss for your reference.
Thanks