I experimented with your release_draft classification examples and I think there is an issue with the LayerNorm layer on CUDA.
In the tests I ran, the update capping seems to work fine across architectures on both CPU and GPU. The issue seems to originate from LayerNorm on GPU. Whether you run FNN_LAYERNORM or CNN_LAYERNORM, both architectures work well on CPU regardless of batch size. However, both fail on CUDA regardless of batch size. This is what leads me to think the issue is with LayerNorm on CUDA rather than with capping the updates.
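One way to confirm it is the CUDA LayerNorm path (and not the capping) would be to compare each backend's output against a plain reference implementation. Below is a minimal sketch; `layernorm_reference` is a hypothetical helper name, and it assumes the common formulation with gamma = 1, beta = 0, and a small epsilon:

```python
import math

def layernorm_reference(row, eps=1e-5):
    """Plain-Python LayerNorm over one feature vector (gamma=1, beta=0)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]

# The normalized output should have (near-)zero mean and unit variance;
# a backend whose output diverges from this is the suspect one.
out = layernorm_reference([1.0, 2.0, 3.0, 4.0])
print(abs(sum(out)) < 1e-6)
```

Running the same input through the CPU and CUDA layers and diffing both against this reference would show which backend deviates.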