jasjeetIM / Mask-RCNN

Implementation of Mask-RCNN in Caffe https://arxiv.org/pdf/1703.06870.pdf

Getting bounding box loss as nan #2

Open ravikantb opened 7 years ago

ravikantb commented 7 years ago

Hi @jasjeetIM ,

I took your ROIAlign layers and integrated them into my py-Faster-RCNN code base to replace the ROIPooling layers. However, I am getting the bounding-box loss as NaN for all iterations so far while training Fast R-CNN (stage 2) of the alternating-optimization scheme. Did you also face this problem while training? Please let me know. I will dig deeper and get back if I find anything worth sharing. Below are sample loss logs for your reference.

I0707 10:38:07.833065 10773 solver.cpp:228] Iteration 1940, loss = nan
I0707 10:38:07.833112 10773 solver.cpp:244]     Train net output #0: loss_bbox = nan (* 1 = nan loss)
I0707 10:38:07.833133 10773 solver.cpp:244]     Train net output #1: loss_cls = 87.3365 (* 1 = 87.3365 loss)

Thanks

jasjeetIM commented 7 years ago

Hi @ravikantb,

Loss going to nan usually happens when the learning rate is too high. However, I will check the implementation of the layers again to make sure there is no bug.

Can you paste the Caffe output log (like you have above) for all iterations from Iteration 0 - Iteration 20 or so? Also, can you paste your training prototxt file here?

Please note that I am currently not working on the project due to other work, and hence may be delayed in responding.

Thanks, Jay

jasjeetIM commented 7 years ago

Hi @ravikantb,

When training Mask-RCNN with the ROIAlign layer, I do not get nan in the loss.

I0707 14:32:50.012135 3002 solver.cpp:229]     Train net output #0: accuarcy = 0
I0707 14:32:50.012141 3002 solver.cpp:229]     Train net output #1: loss_bbox = 6.75187 (* 1 = 6.75187 loss)
I0707 14:32:50.012145 3002 solver.cpp:229]     Train net output #2: loss_cls = 4.39446 (* 1 = 4.39446 loss)
I0707 14:32:50.012151 3002 solver.cpp:229]     Train net output #3: loss_mask = 2.45052e+20 (* 1 = 2.45052e+20 loss)

I have also checked the forward and backward passes of the ROIAlign layer (I used the Caffe gradient checker for the backward pass). I can have a look at your output logs, solver.prototxt, and train.prototxt.

ravikantb commented 7 years ago

Hi @jasjeetIM ,

Thanks for your detailed response, and apologies for the delay in mine. Please find attached a zip file containing the solver.prototxt, the training prototxt, and sample logs for 100 iterations. I have tried learning rates ranging from 0.001 down to 0.000001, but the loss always becomes NaN after some time.

Just to give you a bit more detail about my implementation: I took your implementation of the ROIAlign layers (both CPU and GPU) and added them to the Caffe framework as per the instructions in these two links:
https://github.com/BVLC/caffe/wiki/Development
https://github.com/BVLC/caffe/wiki/Simple-Example:-Sin-Layer

After that, I replaced the ROIPooling layers in my Faster R-CNN prototxts with ROIAlign. Since I am using the alternating optimization scheme, I have sent you the prototxt used to train the Fast R-CNN component of Faster R-CNN (stage 2: the stage in which the proposals output by the RPN are converted to fixed-length feature vectors by the ROIAlign/ROIPooling layer).
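For anyone looking at the same swap: in the train.prototxt it amounts to changing the layer type, roughly as in the sketch below. The layer and blob names follow the py-faster-rcnn VGG16 defaults, and the roi_align_param block name is an assumption that simply mirrors roi_pooling_param from the original layer.

layer {
  name: "roi_align5"
  type: "ROIAlign"          # was type: "ROIPooling"
  bottom: "conv5_3"         # VGG16 conv feature map
  bottom: "rois"            # proposals fed in during stage-2 Fast R-CNN training
  top: "pool5"
  roi_align_param {         # assumed parameter block name
    pooled_w: 7
    pooled_h: 7
    spatial_scale: 0.0625   # 1/16 for conv5_3
  }
}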

I really appreciate you going back and checking your implementation for this issue. Please have a look at the provided documents and let me know if you find something wrong there. solver_prototxt_logs.zip

Thanks, Ravikant

jasjeetIM commented 7 years ago

Please change the values listed below in your solver.prototxt and send the logs again:

display: 1
base_lr: 0.0001
clip_gradients: 100
debug_info: true

Please make sure the logs contain training for at least 100 iterations. Also, provide logs for the above solver.prototxt with the ROIPooling layer used instead of the ROIAlign layer, on the same training set, for 100 iterations. So there should be two training logs: 1) ROIAlign and 2) ROIPooling, where both runs use the solver.prototxt values listed above.

Lastly, do you have your code available on an online repo?

Thanks

ravikantb commented 7 years ago

Hi @jasjeetIM

Please find attached logs for both the runs with the changes you suggested. logs.zip

Our code base is not yet public, as it is our organization's property, but I will talk to my team and see if I can give you access to it. I will get back to you on this soon.

Honestly, I had not worked with Caffe's debug mode before, as I found it too verbose, but it does seem to output useful information in this case. I will try to find the root cause using these logs. Meanwhile, if you get time to look at them and find anything useful, please let me know.

Thanks, Ravikant

jasjeetIM commented 7 years ago

Hi @ravikantb,

Okay, thanks. Can you do the following to troubleshoot:

1) Run Caffe in CPU mode.

2) Modify the code in the roi_align_layer.cpp file (https://github.com/jasjeetIM/Mask-RCNN/blob/master/external/caffe/src/caffe/layers/roi_align_layer.cpp) as follows:

Add the following after line 164:

LOG(INFO) << "(h_idx, w_idx, h_idx_n, w_idx_n) = (" << h_idx << "," << w_idx << "," << h_idx_n << "," << w_idx_n << ")";
LOG(INFO) << "Multiplier = " << multiplier[counter];
LOG(INFO) << "Data value = " << batch_data[b_index_curr[counter]];
LOG(INFO) << "Current Pooled value = " << bisampled[smp/2];

3) Recompile Caffe.

4) Run the same experiment as you did here: https://github.com/jasjeetIM/Mask-RCNN/issues/2#issuecomment-314138508. However, you only need to run it for 10 iterations, as the output will be very verbose and noisy.

5) Once done, please send the logs to me again.

This will help me look at the boundary condition that may be causing 'inf' as one of the pooled values.
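(If scanning that very verbose output becomes unwieldy, a small guard at the same insertion point can surface the problem directly. The following is only a sketch that reuses the variable names from the LOG statements above, not code from the repository, and it assumes <cmath> is available in that file.)

// Sketch: flag a non-finite pooled value as soon as it appears,
// using the same variables as the LOG(INFO) statements above.
if (!std::isfinite(static_cast<double>(bisampled[smp / 2]))) {
  LOG(ERROR) << "Non-finite pooled value at (h_idx, w_idx) = ("
             << h_idx << "," << w_idx << "), multiplier = "
             << multiplier[counter]
             << ", data = " << batch_data[b_index_curr[counter]];
}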

ravikantb commented 7 years ago

Hi @jasjeetIM ,

Thanks for all the help. Due to some unforeseen circumstances, I have to step away from this project for 2-3 days. I will get back to you with the required data after that. Please keep this issue open till then.

Thanks, Ravikant

MartinPlantinga commented 7 years ago

Hi @ravikantb,

Could you share how you added the RoIAlign layer to the include/caffe/layers/fast-rcnn-layers.hpp file (step 1 in https://github.com/BVLC/caffe/wiki/Development)?

Many thanks in advance.

ravikantb commented 6 years ago

@MartinPlantinga: I added the following code to the 'fast_rcnn_layers.hpp' file to do this. Hope it helps.

(P.S.: I didn't use GitHub's code formatting, as it was messing with my code snippet.)

/* ROIAlignLayer - Region of Interest Align Layer */
template <typename Dtype>
class ROIAlignLayer : public Layer<Dtype> {
 public:
  explicit ROIAlignLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "ROIAlign"; }

  virtual inline int MinBottomBlobs() const { return 2; }
  virtual inline int MaxBottomBlobs() const { return 2; }
  virtual inline int MinTopBlobs() const { return 1; }
  virtual inline int MaxTopBlobs() const { return 1; }

 protected:
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  int channels_;
  int height_;
  int width_;
  int pooled_height_;
  int pooled_width_;
  Dtype spatial_scale_;
  Blob<int> max_idx_;
  Blob<Dtype> max_mult_;
  Blob<int> max_pts_;
};
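(For completeness: besides the declaration in the header, the layer also has to be registered with Caffe's layer factory so it can be constructed from the "ROIAlign" type string used in the prototxt. A minimal sketch of that registration, using the standard Caffe macros, as it would typically appear at the end of roi_align_layer.cpp:)

// Instantiate the template for float/double and register the layer
// under the type string returned by type() above.
INSTANTIATE_CLASS(ROIAlignLayer);
REGISTER_LAYER_CLASS(ROIAlign);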

MartinPlantinga commented 6 years ago

Thanks @ravikantb !!