vlfeat / matconvnet

MatConvNet: CNNs for MATLAB

vl_nnsoftmaxloss bug? #201

Open okvol opened 9 years ago

okvol commented 9 years ago

Error message received when using the GPU: Error using gpuArray/subsasgn. When assigning into a GPUArray, the subscripts must contain unique values. Subscript 1 contained repeated values.

Error in vl_nnsoftmaxloss (line 62) Y(c_) = Y(c_) - 1;

It seems c_ has repeated elements. Is that expected? With some version of vl_nnsoftmaxloss I got: c_ = 88 160 286 330 528 528 613 716 871 1032 ....

The X, c, dzdy reproduced the issue can be downloaded from: http://filenurse.com/download/5a6e4fc6f9dff241c4d3e098b46d7f5c.html

BTW, does c always have to start from 1? Is that correct?

Thanks a lot!

okvol commented 9 years ago

Also, I see beta12 and beta13 have somewhat different versions of vl_nnsoftmaxloss; what's the improvement?

ByungtaeAhn commented 9 years ago

Hi okvol, I have the same problem. Do you have any answers to your questions yet? @vedaldi @lenck, please give me some comments.

germanRos commented 9 years ago

Hi guys,

I believe the problem is in MATLAB's gpuArray, not in the softmax loss itself. If you move Y and c_ to the CPU it works. However, it does appear that there are repeated indices. I noticed a similar behaviour in code I wrote for keeping the pooling indices in CUDA. If I have more info I will let you know.
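
For reference, a minimal repro of the gpuArray restriction described here (a sketch assuming a CUDA-capable GPU and the Parallel Computing Toolbox; the values are illustrative, not from this issue):

Y = gpuArray(zeros(4,1)) ;
ci = [2 2 3] ;           % repeated subscript, like the duplicated 528 above
Y(ci) = Y(ci) - 1 ;      % errors on the GPU: subscripts must be unique
Yc = zeros(4,1) ;        % the same assignment on the CPU does not error,
Yc(ci) = Yc(ci) - 1 ;    % but applies the write only once: Yc(2) is -1, not -2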

vedaldi commented 9 years ago

Hi, thanks for exploring this issue.


peiyunh commented 9 years ago

Does anyone care to give an example of how c_ can fail to be unique, and why?

vedaldi commented 9 years ago

Hi, I would also like to have such an example, as it is not clear to me whether this can arise.


peiyunh commented 9 years ago

I just recalled that at some point I ran into this problem too. It happens when computing the indices:

c = c - 1 ;                         % make the class labels zero-based
c_ = 0:numel(c)-1 ;                 % linear offset of each spatial location
c_ = 1 + ...
  mod(c_, sz(1)*sz(2)) + ...                        % position within the image plane
  (sz(1)*sz(2)) * max(c(:), 0)' + ...               % offset of the labelled channel
  (sz(1)*sz(2)*sz(3)) * floor(c_/(sz(1)*sz(2))) ;   % offset of the image in the batch

If c is of single type and (sz(1)*sz(2)*sz(3)) * floor(c_/(sz(1)*sz(2))) is very large compared to mod(c_, sz(1)*sz(2)), MATLAB suffers an accuracy problem when adding them up: the small term gets rounded away, so two distinct locations can end up with the same linear index. This is more likely to happen when the input to the softmax loss is large while c is still single.

One quick fix is to cast c to double before computing the indices.
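
A minimal sketch of the rounding involved (the values are illustrative): single-precision floats represent integers exactly only up to 2^24, so once the channel/batch offset passes that, adding a small spatial offset can be a no-op.

bigOffset = single(2^24) ;          % e.g. a large (sz(1)*sz(2)*sz(3)) * image term
disp(bigOffset + 1 == bigOffset)    % prints 1 (true): the +1 is rounded away,
                                    % so two indices collapse into one value
disp(double(bigOffset) + 1 == double(bigOffset))   % prints 0 (false): exact in double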

arunmallya commented 8 years ago

This bug exists even in the latest release and manifests when the index space is very large, say during prediction of segmentation masks with 70+ classes. Typecasting c to double seems to work.

SyedGilani commented 8 years ago

Hi Vedaldi, I am running the beta15 release for a very deep FCN. For me the same error is generated in vl_nnloss.m: the subscripts ci in Y(ci) = Y(ci) - 1 are not unique. This is what comes up:

Error using gpuArray/subsasgn When assigning into a GPUArray, the subscripts must contain unique values. Subscript 1 contained repeated values.

Error in vl_nnloss (line 226) Y(ci) = Y(ci) - 1 ;

Name    Size    Bytes      Class       Attributes
Y       4-D     108        gpuArray
ci      4-D     5898240    single

K>> size(Y)
ans = 384 384 57 10
K>> size(ci)
ans = 384 384 1 10

ci is the segmentation ground truth, and in this case, with batchSize 100 and numSubBatches = 10, the 4th dimension is 10. The problem goes away if batchSize = numSubBatches, because then only one image is processed at a time. Hope this helps and that you can debug this issue.

grantlj commented 8 years ago

Same problem as SyedGilani, but not solved yet.

ghost commented 8 years ago

Fixed the problem by performing the operation on the CPU:

%Y(ci) = Y(ci) - 1;            % original GPU assignment that fails
Y_host = gather(Y);            % copy Y from the GPU to host memory
Y_host(ci) = Y_host(ci) - 1;   % repeated subscripts are allowed on the CPU
Y = gpuArray(Y_host);          % copy the result back to the GPU

The problem always arises if I am using more than 4 images per sub-batch.

Card & setting: GTX 960 (4 GB), MatConvNet beta16, vgg-f network.

svarjo commented 8 years ago

I ran into the same problem as SyedGilani, but in my case the error manifested after I changed the net output slightly: I tried to add a class to a system that previously had 10 classes. With the new 11-class net, training ran for a while, and I noticed that after a few iterations the new configuration begins to diverge (for the first ~10 iterations the system converges), the objective goes to NaN, and soon the error comes out (around 30 iterations):

Error using gpuArray/subsasgn When assigning into a GPUArray, the subscripts must contain unique values. Subscript 1 contained repeated values.

Error in vl_nnloss (line 251) Y(ci) = Y(ci) - 1 ;

The loss used here is 'softmaxlog' and the net in this case is not huge (about 10e+06 parameters). It cannot run even with a single image as batch size...

I'm using MatConvNet v1.0beta18, GTX 970, cudnn 4, Windows 10, Matlab R2015b

Hope this helps to narrow things down...

ghost commented 8 years ago

A batch size of one isn't a good idea. Try a size of 20 for both batch and sub-batch; I noticed that I have no problems when batch size = sub-batch size. Have you already tried lowering your learning rate?

jiahuei commented 8 years ago

Have you checked that the number of filters in your final softmaxloss or loss layer equals the number of classes? This is the mistake that tripped me up (silly me); after fixing that, the error went away.
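
For concreteness, a sketch of a consistent final layer in SimpleNN style (the 4096 input channels and the initialization are hypothetical; the point is that the filter count matches numClasses):

numClasses = 11 ;   % must match the labels fed to the loss
net.layers{end+1} = struct('type', 'conv', ...
  'weights', {{0.01*randn(1,1,4096,numClasses,'single'), ...
               zeros(1,numClasses,'single')}}, ...
  'stride', 1, 'pad', 0) ;
net.layers{end+1} = struct('type', 'softmaxloss') ;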

svarjo commented 8 years ago

I suppose that could have been the problem in my case too. If I recall correctly, I changed the net topology and was happy again...

Firenze11 commented 8 years ago

I got the same error, but I finally found it was because I had entered the wrong number of classes when initializing the network.

gd-zhang commented 8 years ago

Has anyone solved it?

SyedGilani commented 8 years ago

Oh yes, this problem was solved long ago by Legenbaer. Here is his solution:

Fixed the problem by performing the operation on the CPU:

Y_host = gather(Y) ;
Y_host(ci) = Y_host(ci) - 1 ;
Y = gpuArray(Y_host) ;


gd-zhang commented 8 years ago

@SyedGilani Why on the CPU? When I try that, I get:

Error using gpuArray An unexpected error occurred during CUDA execution. The CUDA error was: an illegal memory access was encountered

Error in vl_nnloss (line 260) Y = gpuArray(Y_host);

nightrome commented 8 years ago

Actually, doing it on the CPU is only a workaround: duplicate indices are allowed on the CPU but not on the GPU, so the results will be wrong in some cases. A much better solution is to use double-precision labels (as described above).

gd-zhang commented 8 years ago

@nightrome So just add c = double(c); in vl_nnloss?

nightrome commented 8 years ago

Yes, I'd even do it in your getBatch function already. That way it will be easier for you to update to a newer MatConvNet version when it is released.
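
A minimal sketch of that (modelled on the getBatch functions in the MatConvNet examples; your own data layout will differ):

function [im, labels] = getBatch(imdb, batch)
% Return images and labels for one batch; cast the labels to double
% here, once, so the index arithmetic in vl_nnloss stays exact.
im = imdb.images.data(:,:,:,batch) ;
labels = double(imdb.images.labels(1,batch)) ;
end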

irolaina commented 7 years ago

I have just encountered the same problem, and it still exists in beta21. In my case the problem arises from high-dimensional outputs when using softmaxlog FCN-style.

It is indeed a precision problem: reducing either the spatial resolution of the output or the batch size fixes it for me. The CPU workaround produces wrong results.

marksunpeng commented 7 years ago

I met the same problem with version 20 (not sure about v22, although vl_nnloss is almost the same). I have a dataset of 15 million samples with 12 labels.

I tried c = double(c); it doesn't help. I tried ci = unique(ci(:)); it works.

PS: just a tip, always make your labels contiguous, like label = 1:12, instead of having gaps (like 1:2:24).
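
If your labels do have gaps, one hypothetical way to compact them (variable names are illustrative) is via the third output of unique, which maps each element to its position in the sorted unique list, i.e. a contiguous 1:K relabelling:

rawLabels = [1 3 5 7 7 23] ;               % gapped labels
[classes, ~, labels] = unique(rawLabels) ; % labels is now in 1:numel(classes)
% classes = [1 3 5 7 23], labels = [1 2 3 4 4 5]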

daofeng2007 commented 7 years ago

@Legenbaer The default batchSize is 20. When I used this default setting, there was no problem; when I tried to increase batchSize to 100, the error showed up. Based on your suggestion, maybe I should change both batchSize and subbatch to 100? Where can I find the subbatch variable to make the change?

ghost commented 7 years ago

https://github.com/vlfeat/matconvnet-fcn/blob/master/fcnTrain.m#L27
