matlab-deep-learning / mask-rcnn

Mask-RCNN training and prediction in MATLAB for Instance Segmentation
https://uk.mathworks.com/help/vision/ug/getting-started-with-mask-r-cnn-for-instance-segmentation.html

Error using nnet.internal.cnn.dlnetwork/forward (line 254) #2

Closed Chunli-Dai closed 3 years ago

Chunli-Dai commented 3 years ago

Dear Anchit,

Thank you so much for sharing your awesome Mask R-CNN code on GitHub. I am trying to run the example file (MaskRCNNTrainingExample.mlx), and I get the following error message during the training step.

```
Error using nnet.internal.cnn.dlnetwork/forward (line 254)
Layer 'bn2a_branch2a': Invalid input data. The value of 'Variance' is invalid. Expected input to be positive.

Error in nnet.internal.cnn.dlnetwork/CodegenOptimizationStrategy/propagateWithFallback (line 103)
[varargout{1:nargout}] = fcn(net, X, layerIndices, layerOutputIndices);

Error in nnet.internal.cnn.dlnetwork/CodegenOptimizationStrategy/forward (line 52)
[varargout{1:nargout}] = propagateWithFallback(strategy, functionSlot, @forward, net, X, layerIndices, layerOutputIndices);

Error in dlnetwork/forward (line 347)
[varargout{1:nargout}] = net.EvaluationStrategy.forward(net.PrivateNetwork, x, layerIndices, layerOutputIndices);

Error in networkGradients (line 21)
[YRPNRegDeltas, proposal, YRCNNClass, YRCNNReg, YRPNClass, YMask, state] = forward(...

Error in deep.internal.dlfeval (line 18)
[varargout{1:nout}] = fun(x{:});

Error in dlfeval (line 41)
[varargout{1:nout}] = deep.internal.dlfeval(fun,varargin{:});
```

I'd appreciate your insights!

Thank you so much for your time and patience!

Sincerely, Chunli

maohong30 commented 3 years ago

Hi. I am having the same error as above after the second iteration.

```
|=========================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |     Loss     |      Rate       |
|=========================================================================|
|    1    |      1      |    00:01:51    |    3.2167    |     0.0100      |
|    1    |      2      |    00:04:00    |    1.4978    |     0.0100      |
```

I hope we can find a solution soon. Thank you.

Regards, Cheng

akshaymehra commented 3 years ago

@anchitdharmw, I had the same issue. I've posted a solution on the MathWorks website.

The text of that answer is reproduced below:

It seems that "TrainedVariance" values sometimes become very small negative numbers (usually because they start off as very small positive numbers!).

A (rather inelegant) solution, placed immediately above the `dlnet.State = state;` line, is:

```matlab
% Clamp any non-positive "TrainedVariance" entries to a small positive floor.
isVariance = strcmp(state.Parameter, "TrainedVariance");
state.Value(isVariance) = cellfun(@(x) max(x, 1e-10), ...
    state.Value(isVariance), 'UniformOutput', false);
```

Essentially, this checks the 'TrainedVariance' values and forces any that fall at or below a very small positive threshold (i.e., zero or negative values) up to that threshold.

I'm not sure why variance goes negative, however. That is something I will have to dig into. Any ideas?
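For what it's worth, one plausible mechanism (an illustration only, not a confirmed diagnosis of MATLAB's batchNorm internals): when activations have a large mean, computing a variance in single precision as E[x²] − E[x]² cancels away nearly all significant digits, so the result can land far from the true value, and with some data can come out slightly negative. A NumPy sketch of the effect:

```python
import numpy as np

# Four float32 samples with a large mean and a tiny spread.
x = np.float32(1e7) + np.arange(4, dtype=np.float32)

# Naive single-precision variance: E[x^2] - E[x]^2.
mean = x.sum() / np.float32(4)
mean_sq = (x * x).sum() / np.float32(4)
naive_var = mean_sq - mean * mean

# Reference in double precision: exactly 1.25 for this data.
true_var = np.var(x.astype(np.float64))

# The float32 result is off by several orders of magnitude; with other
# data the same cancellation can just as easily produce a small negative
# number, which is what a batch-norm layer would then reject.
```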

anchitdharmw commented 3 years ago

Hi folks,

Thanks for reporting this and @akshaymehra, thanks for looking into this. I just got notified of this issue. I will investigate this and get back to you soon.

anchitdharmw commented 3 years ago

So, I've tried to reproduce this at my end, but haven't been successful yet. The environment I used was R2020b with GPU training. @akshaymehra - I also checked the "TrainedVariance" values after about 400 iterations and they are quite reasonable (>1).

Could you folks provide a bit more info about your running environments?

Note: the batchNorm layer did have a bug related to negative variance caused by precision issues, but that has been fixed in the latest updates of R2020a and R2020b. The workaround posted by @akshaymehra is reasonable. Here is the bug report: https://www.mathworks.com/support/bugreports/2273095
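The clamp generalizes to any running-statistics update. As a language-neutral illustration (NumPy, with made-up names — this is not MATLAB's internal batchNorm code), guarding the state right after each moving-average update keeps a variance that has drifted slightly negative from ever reaching the normalization step:

```python
import numpy as np

EPS = np.float32(1e-10)  # illustrative floor, mirroring the max(x, 1e-10) workaround

def update_running_var(running_var, batch_var, momentum=np.float32(0.9)):
    """Moving-average variance update with a positivity guard (illustrative)."""
    new_var = momentum * running_var + (np.float32(1) - momentum) * batch_var
    # Guard: rounding can leave tiny negative entries; clamp them to EPS.
    return np.maximum(new_var, EPS)

# A batch variance whose second entry has gone slightly negative via rounding.
running = np.float32([0.5, 1e-9])
batch = np.float32([0.4, -3e-8])
updated = update_running_var(running, batch)
assert (updated > 0).all()  # always safe to pass to normalization
```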

akshaymehra commented 3 years ago

@anchitdharmw sure! I'm running R2020b, CPU, minibatchsize = 2. I've got a 2070 8GB but the memory gets maxed out during training, so I've reverted to using the CPU. Thanks!

Chunli-Dai commented 3 years ago

@anchitdharmw @akshaymehra Thank you for providing a solution! I am also running MATLAB R2020b, CPU, minibatchSize=2.

Chunli-Dai commented 3 years ago

@anchitdharmw Hi Anchit, I tried the solution in this bug report: https://www.mathworks.com/support/bugreports/2273095. The error is now gone, but the training is taking more than a day. Here is a screenshot of the output:

[screenshot of the training progress output]

It seems to be converging, but very slowly. Is this expected? How long does it normally take to get results? I am running MATLAB R2020b on a Mac, CPU, minibatchSize=2.

anchitdharmw commented 3 years ago

@Chunli-Dai, training Mask R-CNN does take time, and it is highly recommended that you train on a GPU. I have updated the repo with 'resnet50' backbone support, which you can use to lower the memory footprint and speed up training.