Closed: tverbele closed this issue 8 years ago
Can you give an example input size? It should be an easy fix once I have that.
I had the problem with a 32x32x64 tensor, so a total input size of 65536.
@colesbury: might we have to split the launch into two kernels, or is there a cleaner way?
I still don't understand this. Which batch norm module are you using (BatchNormalization, SpatialBatchNormalization, VolumetricBatchNormalization)? A 3D input doesn't seem valid for any of them. BN=2D, SBN=4D, VBN=5D.
Sorry for the confusion. I forgot the batch dimension, so I have 100x64x32x32.
I am actually using the THCUNN library directly, with my own framework on top. However, I notice that I reshape the input to a 2D tensor, in this case 100x65536. I guess I should just leave it as a 4D tensor?
If you're using THCUNN directly, the input and output should be 3D tensors. For SpatialBatchNorm:
100x64x1024
(i.e. don't fold the activations into the feature dimension)
Ok, thanks for the clarification! I think this issue can be closed then?
Another question: when I feed 3D tensors to THCUNN, i.e. a 100x64x1024 tensor, I run into cuda runtime error (77) : an illegal memory access was encountered
However, this error does not arise directly after the THCUNN batch norm call; it pops up in later calculations. Any idea what could be wrong? Any hint is appreciated!
To surface exact CUDA errors, run the program with: CUDA_LAUNCH_BLOCKING=1 th [yourscript.lua]
Otherwise, stack traces aren't accurate.
Thanks. It does indeed seem to go wrong in BatchNormalization: THCudaCheck FAIL file=BatchNormalization.cu line=257 error=77 : an illegal memory access was encountered
I suspect the problem is in this part (r188 - r194):
float mean = reduce<float>(SumOp(input), input, plane) * norm;
__syncthreads();
float varN = reduce<float>(VarOp(mean, input), input, plane);
float invStd = 0.0f;
if (varN != 0.0f || epsilon != 0.0f) {
invStd = 1 / sqrt(varN * norm + epsilon);
}
since the error does not occur if I comment this part out...
Any clue what I am doing wrong?
I have investigated this further: apparently I get the illegal memory access as soon as my feature size becomes larger than 256. For example:
- a batch of dims 100x32x16x16 is reshaped to a 3D tensor with dims 100x32x256, which works OK
- a batch of dims 100x32x17x17 is reshaped to a 3D tensor with dims 100x32x289, which fails
- a batch of dims 100x32x1x257 is reshaped to a 3D tensor with dims 100x32x257, which fails
FYI: I found what is going wrong. First of all, I was compiling for the wrong GPU architecture, which caused my build to skip the #if __CUDA_ARCH__ >= 300
branch (r60) and fall into the else branch. There a maximum block size of 256 is defined, while getNumThreads
may return 512, causing the memory error in the reduce function.
When I compile for the correct GPU architecture, the __shfl_xor
path is used and everything works as it should.
Oooh, that's a bug. MAX_BLOCK_SIZE should be 512.
When using BatchNormalization with large inputs I get an error:
THCudaCheck FAIL THCUNN/BatchNormalization.cu line=257 error=11 : invalid argument
I suspect this has a similar cause to https://github.com/torch/cunn/issues/284: it occurs when an input dimension is larger than 65535.