ucbrise / actnn

ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training
MIT License

Launch multiple kernels for large batch sizes (> 65535) & Use int64 index #4

Closed — merrymercy closed this 3 years ago

merrymercy commented 3 years ago

If the batch size is larger than 65535, a kernel launch exceeds CUDA's 65535 grid-dimension limit. This PR fixes the issue by splitting the work across multiple kernel launches, each covering at most 65535 rows, whenever the batch size exceeds 65535.
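The chunking strategy can be modeled in a few lines. This is a hypothetical sketch, not the actual ActNN CUDA launcher: `launch_kernel` stands in for a real kernel launch, and `MAX_GRID_DIM` is the 65535 grid-dimension limit the PR works around.

```python
# Hypothetical sketch of the multi-launch strategy: split a batch into
# chunks of at most 65535 rows and issue one kernel launch per chunk.
MAX_GRID_DIM = 65535  # CUDA grid-dimension limit

def launch_in_chunks(batch_size, launch_kernel):
    """Call launch_kernel(start_row, num_rows) once per chunk."""
    launches = []
    for start in range(0, batch_size, MAX_GRID_DIM):
        count = min(MAX_GRID_DIM, batch_size - start)
        launch_kernel(start, count)  # each launch fits the grid limit
        launches.append((start, count))
    return launches
```

For example, a batch of 70000 rows would be covered by two launches: one for rows 0–65534 and one for the remaining 4465 rows.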

This PR also converts all int32 indices to int64, preventing potential integer overflow when a tensor holds more than 2^31 - 1 elements.
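The overflow risk is easy to illustrate. Below is a small standalone demonstration (not ActNN code) that emulates 32-bit wraparound in pure Python: a flattened element index `row * row_stride` silently wraps to a negative value once it passes 2^31 - 1, while a 64-bit index stays correct. The shapes are made up for illustration.

```python
def to_int32(x):
    """Emulate C int32 wraparound for a Python integer."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

# Hypothetical shapes: 70000 rows, 50000 elements per row.
row, row_stride = 70000, 50000
flat_index = row * row_stride            # 3_500_000_000 > 2**31 - 1

bad_index = to_int32(flat_index)         # wraps to a negative value
good_index = flat_index                  # int64 holds it without loss
```

Any kernel that computes such a flat index in int32 would read or write the wrong address, which is why the PR moves every index to int64.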