Closed: tverbele closed this issue 8 years ago
Can you give an example input size? It should be an easy fix once I have that.
I had the problem with a 32x32x64 tensor, so a total input size of 65536.
@colesbury: might we have to split the launch into two kernels, or is there a cleaner way?
I still don't understand this. Which batch norm module are you using (BatchNormalization, SpatialBatchNormalization, VolumetricBatchNormalization)? A 3D input doesn't seem valid for any of them. BN=2D, SBN=4D, VBN=5D.
Sorry for the confusion. I forgot the batch dimension, so I have 100x64x32x32.
I am actually using the THCUNN library directly, with my own framework on top. However, I notice that I reshape the input to a 2D tensor, in this case 100x65536. I guess I should just leave it as a 4D tensor?
If you're using THCUNN directly, the input and output should be 3D tensors. For SpatialBatchNorm:
100x64x1024
(i.e. don't fold the activations into the feature dimension)
Ok, thanks for the clarification! I think this issue can be closed then?
Another question: when I feed 3D tensors to THCUNN, i.e. a 100x64x1024 tensor, I run into cuda runtime error (77) : an illegal memory access was encountered
However, this error does not arise directly after the THCUNN batch norm call; it pops up in later calculations. Any idea what could be wrong? Any hint is appreciated!
To surface exact CUDA errors, run the program with: CUDA_LAUNCH_BLOCKING=1 th [yourscript.lua]
Otherwise, stack traces aren't accurate.
Thanks. It does indeed seem to go wrong in BatchNormalization: THCudaCheck FAIL file=BatchNormalization.cu line=257 error=77 : an illegal memory access was encountered
I suspect the problem is in this part (r188 - r194):
float mean = reduce<float>(SumOp(input), input, plane) * norm;
__syncthreads();
float varN = reduce<float>(VarOp(mean, input), input, plane);
float invStd = 0.0f;
if (varN != 0.0f || epsilon != 0.0f) {
invStd = 1 / sqrt(varN * norm + epsilon);
}
since the error does not occur if I comment this part out...
Any clue what I am doing wrong?
I have investigated this further: apparently I get the illegal memory access as soon as my feature size becomes larger than 256. For example:
- a batch of dims 100x32x16x16 is reshaped to a 3D tensor with dims 100x32x256, which works OK
- a batch of dims 100x32x17x17 is reshaped to a 3D tensor with dims 100x32x289, which fails
- a batch of dims 100x32x1x257 is reshaped to a 3D tensor with dims 100x32x257, which fails
FYI: I found what is going wrong. First of all, I was compiling for the wrong GPU architecture, which caused my build to skip the #if __CUDA_ARCH__ >= 300
branch (r60) and fall into the else branch. There a maximum block size of 256 is defined, while getNumThreads
may return 512, causing the memory error in the reduce function.
When I compile for the correct GPU architecture, the __shfl_xor
path is used and everything works as it should.
Oooh, that's a bug. MAX_BLOCK_SIZE should be 512.
When using BatchNormalization with large inputs I get an error:
THCudaCheck FAIL THCUNN/BatchNormalization.cu line=257 error=11 : invalid argument
I suspect this has a similar cause to https://github.com/torch/cunn/issues/284: it occurs when an input dimension is larger than 65535.