Closed Chunli-Dai closed 3 years ago
Hi. I am having the same error as above after the second iteration.
|=========================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |     Loss     |      Rate       |
|=========================================================================|
|       1 |          1  |    00:01:51    |    3.2167    |     0.0100      |
|       1 |          2  |    00:04:00    |    1.4978    |     0.0100      |
I hope a solution can be found soon. Thank you.
Regards, Cheng
@anchitdharmw, I had the same issue. I've posted a solution over at mathworks website.
The text of that answer is reproduced below:
It seems that "TrainedVariance" values sometimes become very small negative numbers (usually since they start off as very small positive numbers!).
A (very inelegant) solution, placed right above the `dlnet.State = state;` line, is:
isVariance = strcmp(state.Parameter, "TrainedVariance");
state.Value(isVariance) = cellfun(@(x) max(x, 1e-10), state.Value(isVariance), 'UniformOutput', false);
Essentially, I check the 'TrainedVariance' values and clamp any that fall below a very small positive threshold (i.e., values that are zero or negative) up to that threshold.
I'm not sure why variance goes negative, however. That is something I will have to dig into. Any ideas?
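For context, here is a minimal sketch of where that clamp sits inside a typical custom training loop. The loop structure, the `modelGradients` function name, and the minibatchqueue variable `mbq` are illustrative assumptions, not the repo's exact code; only the two clamping lines are the actual workaround:

```matlab
for iteration = 1:numIterations
    [X, T] = next(mbq);  % draw the next mini-batch of training data
    [gradients, state, loss] = dlfeval(@modelGradients, dlnet, X, T);

    % Workaround: clamp any non-positive 'TrainedVariance' entries to a
    % small positive floor before writing the state back to the network,
    % so the next forward pass does not see an invalid variance.
    isVariance = strcmp(state.Parameter, "TrainedVariance");
    state.Value(isVariance) = cellfun(@(x) max(x, 1e-10), ...
        state.Value(isVariance), 'UniformOutput', false);

    dlnet.State = state;
    [dlnet, velocity] = sgdmupdate(dlnet, gradients, velocity, ...
        learnRate, momentum);
end
```

The clamp must run before `dlnet.State = state;`, because it is that assignment that propagates the (possibly negative) running variance back into the batch normalization layers.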
Hi folks,
Thanks for reporting this and @akshaymehra, thanks for looking into this. I just got notified of this issue. I will investigate this and get back to you soon.
So, I've tried to reproduce this at my end, but haven't been successful yet. The environment I used was R2020b with GPU training. @akshaymehra - I also checked the 'TrainedVariance' values after about 400 iterations and they are quite reasonable (>1).
Could you folks provide a bit more info about your running environments?
Note: the batchNorm layer did have a bug related to negative variance caused by precision issues, but that has been fixed in the latest updates of R2020a and R2020b. The workaround posted by @akshaymehra is reasonable. Here is the bug report - https://www.mathworks.com/support/bugreports/2273095
@anchitdharmw sure! I'm running R2020b, CPU, minibatchsize = 2. I've got a 2070 8GB but the memory gets maxed out during training, so I've reverted to using the CPU. Thanks!
@anchitdharmw @akshaymehra Thank you for providing a solution! I am also running with Matlab R2020b, CPU, minibatchSize=2.
@anchitdharmw Hi Anchit, I tried the solution in this bug report https://www.mathworks.com/support/bugreports/2273095. The error is now gone. But the training is taking more than a day, and here is the screen shot of the output:
It seems to be converging but very slowly. Is this what is expected? How long does it normally take to get the results? I am running with Matlab R2020b on Mac, CPU, minibatchSize=2.
@Chunli-Dai, the training of mask-rcnn does take time and is highly recommended to be done on a GPU. I have updated the repo with 'resnet50' backbone support which you could use to lower the memory footprint and speed up the training.
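Since GPU training is recommended, it can help to confirm up front that MATLAB actually sees a supported GPU before starting a long run. A small check along these lines (using the standard `canUseGPU` and `gpuDevice` functions):

```matlab
% Report whether training can run on a GPU, and how much memory it has.
if canUseGPU
    gpu = gpuDevice;
    fprintf("Training on GPU: %s (%.1f GB free)\n", ...
        gpu.Name, gpu.AvailableMemory / 1e9);
else
    disp("No supported GPU detected; training will run on the CPU.");
end
```

If the GPU's free memory is too small for the default backbone (as with the 8 GB card mentioned above), the 'resnet50' backbone or a smaller mini-batch size are the usual levers.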
Dear Anchit,
Thank you so much for sharing your awesome code for Mask R-CNN on GitHub. I am trying to run the example file (MaskRCNNTrainingExample.mlx). I got the following error message in the training step.
Error using nnet.internal.cnn.dlnetwork/forward (line 254)
Layer 'bn2a_branch2a': Invalid input data. The value of 'Variance' is invalid. Expected input to be positive.

Error in nnet.internal.cnn.dlnetwork/CodegenOptimizationStrategy/propagateWithFallback (line 103)
[varargout{1:nargout}] = fcn(net, X, layerIndices, layerOutputIndices);

Error in nnet.internal.cnn.dlnetwork/CodegenOptimizationStrategy/forward (line 52)
[varargout{1:nargout}] = propagateWithFallback(strategy, functionSlot, @forward, net, X, layerIndices, layerOutputIndices);

Error in dlnetwork/forward (line 347)
[varargout{1:nargout}] = net.EvaluationStrategy.forward(net.PrivateNetwork, x, layerIndices, layerOutputIndices);

Error in networkGradients (line 21)
[YRPNRegDeltas, proposal, YRCNNClass, YRCNNReg, YRPNClass, YMask, state] = forward(...

Error in deep.internal.dlfeval (line 18)
[varargout{1:nout}] = fun(x{:});

Error in dlfeval (line 41)
[varargout{1:nout}] = deep.internal.dlfeval(fun,varargin{:});
I'd appreciate your insights!
Thank you so much for your time and patience!
Sincerely, Chunli