I experimented with your release_draft classification examples and I think there is an issue with the LayerNorm layer on CUDA.
In the tests I ran, the update capping seems to work fine across architectures on both CPU and GPU. The issue seems to originate from LayerNorm on GPU. Whether you run FNN_LAYERNORM or CNN_LAYERNORM, both architectures work well on CPU regardless of batch size. However, both fail on CUDA regardless of batch size. This is what leads me to think the issue is with LayerNorm on CUDA rather than with capping the updates.
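One way to confirm it is the CUDA LayerNorm path (and not the capping) would be to compare each backend's output against a plain reference implementation. Below is a minimal sketch; `layernorm_reference` is a hypothetical helper name, and it assumes the common formulation with gamma = 1, beta = 0, and a small epsilon:

```python
import math

def layernorm_reference(row, eps=1e-5):
    """Plain-Python LayerNorm over one feature vector (gamma=1, beta=0)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]

# The normalized output should have (near-)zero mean and unit variance;
# a backend whose output diverges from this is the suspect one.
out = layernorm_reference([1.0, 2.0, 3.0, 4.0])
print(abs(sum(out)) < 1e-6)
```

Running the same input through the CPU and CUDA layers and diffing both against this reference would show which backend deviates.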